Papers
EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing
Fan Gao, Dongyuan Li, Ding Xia et al.
Estimating Online Influence Needs Causal Modeling! Counterfactual Analysis of Misinformation Engagement on Social Media
Lin Tian, Marian-Andrei Rizoiu
Estimating the True Distribution of Data Collected with Randomized Response
Carlos Antonio Pinzón, Ehab ElSalamouny, Lucas Massot et al.
ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem
Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu et al.
EvalMuse-40K: A Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Alignment Evaluation
Shuhao Han, Haotian Fan, Jiachen Fu et al.
EvalQAG: A Framework for Automatic Complex QA Generation and a Benchmark QA Dataset for Policy Documents
Kirtan Brijeshbhai Soni, Krish Rupapara, Arpit Rana et al.
EvalSense: A Framework for Domain-Specific LLM (Meta-)Evaluation
Adam Dejl, Jonathan Pearson
Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
Aaron J. Li, Suraj Srinivas, Usha Bhalla et al.
Evaluating Cost-Efficiency of LLMs in a RAG Setup on Polish Wikipedia: Quality vs. Energy Consumption
Patrycja Smits, Tomasz Walkowiak
Evaluating Large Language Models on Lithuanian Grammatical Cases
Urtė Jakubauskaitė, Raquel G. Alhama
Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios
Sangyub Lee, Heedou Kim, Hyeoncheol Kim
Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features
Abishek Stephen, Jindřich Libovický
Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Benchmark
Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki et al.
Evaluating Native-Speaker Preferences on Machine Translation and Post-Edits for Five African Languages
Hiba El Oirghi, Tajuddeen Gwadabe, Marine Carpuat
Evaluating Online Moderation via LLM-Powered Counterfactual Simulations
Giacomo Fidone, Lucia Passaro, Riccardo Guidotti
Evaluating Retrieval-Augmented Generation for Medication Question Answering on Nigerian Drug Labels in Yorùbá
Aramide Adebesin, Zainab Tairu
Evaluating Sparse Autoencoders for Monosemantic Representation
Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju et al.
Evaluating, Synthesizing, and Enhancing for Customer Support Conversation
Jie Zhu, Huaixia Dou, Junhui Li et al.
Evaluating Text-to-Image and Text-to-Video Synthesis with a Conditional Fréchet Distance
Jaywon Koo, Jefferson Hernandez, Moayed Haji-Ali et al.
Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation
Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa
Evaluating the Effect of Retrieval Augmentation on Social Biases
Tianhui Zhang, Yi Zhou, Danushka Bollegala
Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources
Zhaoheng Huang, Yutao Zhu, Jirong Wen et al.
Evaluating the Impact of SAE-based Language Steering on LLM Performance
Sebastian Zwirner, Wentao Hu, Koshiro Aoki et al.