factuality evaluation

62 papers

Co-occurring keywords

large language model (12755), hallucination detection (505), factual consistency (121), text generation (2903), retrieval-augmented generation (1459), abstractive summarization (631), text summarization (889), natural language inference (1278), language model (4573), evidence retrieval (229)

Papers

Factuality Evaluation Using Reasoning and World Modeling (AAAI 2026)

Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation (AAAI 2026)

Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources (AAAI 2026)

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models (AAAI 2026)

When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation (EACL 2026)

Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding (EMNLP 2025)

FaStFact: Faster, Stronger Long-Form Factuality Evaluations in LLMs (EMNLP 2025)

CaLMQA: Exploring culturally specific long-form question answering across 23 languages (ACL 2025)

Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics (ACL 2025)

FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation (ACL 2025)

VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts (EMNLP 2025)

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards (EMNLP 2025)

VeriFastScore: Speeding up long-form factuality evaluation (EMNLP 2025)

How Does Response Length Affect Long-Form Factuality (ACL 2025)

A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation (ACL 2025)

Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models (ACL 2025)

Improving Model Factuality with Fine-grained Critique-based Evaluator (ACL 2025)

LMU at PerAnsSumm 2025: LlaMA-in-the-loop at Perspective-Aware Healthcare Answer Summarization Task 2.2 Factuality (NAACL 2025)

See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models (ACL 2025)

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation (NAACL 2025)

Long-Form Information Alignment Evaluation Beyond Atomic Facts (EMNLP 2025)

Can Large Language Models Accurately Generate Answer Keys for Health-related Questions? (ACL 2025)

T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts (ACL 2025)

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models (ACL 2025)

HalluLens: LLM Hallucination Benchmark (ACL 2025)