Papers
Evaluating NL2SQL via SQL2NL
Mohammadtaher Safarzadeh, Afshin Oroojlooy, Dan Roth
Evaluating Prompt Relevance in Arabic Automatic Essay Scoring: Insights from Synthetic and Real-World Data
Chatrine Qwaider, Kirill Chirkunov, Bashar Alhafni et al.
Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
Guanyu Hou, Jiaming He, Yinhang Zhou et al.
Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions
Luisa Geiger, Mareike Hartmann, Michael Sullivan et al.
Evaluating Step-by-step Reasoning Traces: A Survey
Jinu Lee, Julia Hockenmaier
Evaluating Taxonomy Free Character Role Labeling (TF-CRL) in News Stories using Large Language Models
David G Hobson, Derek Ruths, Andrew Piper
Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond
Yinghao Hu, Yaoyao Yu, Leilei Gan et al.
Evaluating Text Generation Quality Using Spectral Distances of Surprisal
Zhichen Liu, Yongyuan Li, Yang Xu et al.
Evaluating Textual and Visual Semantic Neighborhoods of Abstract and Concrete Concepts
Sven Naber, Diego Frassinelli, Sabine Schulte im Walde
Evaluating the Creativity of LLMs in Persian Literary Text Generation
Armin Tourajmehr, Mohammad Reza Modarres, Yadollah Yaghoobzadeh
Evaluating the Effectiveness and Scalability of LLM-Based Data Augmentation for Retrieval
Pranjal A Chitale, Bishal Santra, Yashoteja Prabhu et al.
Evaluating the Evaluators: Are readability metrics good measures of readability?
Isabel Cachola, Daniel Khashabi, Mark Dredze
Evaluating the Robustness and Accuracy of Text Watermarking Under Real-World Cross-Lingual Manipulations
Mansour Al Ghanim, Jiaqi Xue, Rochana Prih Hastuti et al.
Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks
Davide Romano, Jonathan Richard Schwarz, Daniele Giofrè
Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models
Kevin Zhou, Adam Dejl, Gabriel Freedman et al.
Evaluating WMT 2025 Metrics Shared Task Submissions on the SSA-MTE African Challenge Set
Senyu Li, Felermino Dario Mario Ali, Jiayi Wang et al.
Evaluation and Facilitation of Online Discussions in the LLM Era: A Survey
Katerina Korre, Dimitris Tsirmpas, Nikos Gkoumas et al.
Evaluation of LLM for English to Hindi Legal Domain Machine Translation Systems
Kshetrimayum Boynao Singh, Deepak Kumar, Asif Ekbal
Evaluation of QWEN-3 for English to Ukrainian Translation
Cristian Grozea, Oleg Verbitsky
Evaluation of Text-to-Image Generation from a Creativity Perspective
Xinhao Wang, Xinyu Ma, ShengYong Ding et al.
Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing
Israel Abebe Azime, Tadesse Destaw Belay, Atnafu Lambebo Tonja
EventRelBench: A Comprehensive Benchmark for Evaluating Event Relation Understanding in Large Language Models
Jie Gong, Biaoshuai Zheng, Qiwang Hu
E-Verify: A Paradigm Shift to Scalable Embedding-based Factuality Verification
Zeyang Liu, Jingfeng Xue, Xiuqi Yang et al.
EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint
Zhenhua Xu, Meng Han, Wenpeng Xing
Evil twins are not that evil: Qualitative insights into machine-generated prompts
Nathanaël Carraz Rakotonirina, Corentin Kervadec, Francesca Franzon et al.