Papers
The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
Yuji Zhang, Sha Li, Cheng Qian et al.
Gender Bias in Nepali-English Machine Translation: A Comparison of LLMs and Existing MT Systems
Supriya Khadka, Bijayan Bhattarai
Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation
Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro et al.
Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs
Elisa Forcada Rodríguez, Olatz Perez-de-Vinaspre, Jon Ander Campos et al.
Examining the Cultural Encoding of Gender Bias in LLMs for Low-Resourced African Languages
Abigail Oppong, Hellina Hailu Nigatu, Chinasa T. Okolo
Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context
Marion Bartl, Thomas Brendan Murphy, Susan Leavy
Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans
Javier Conde, Miguel González Saiz, María Grandury et al.
The Fellowship of the LLMs: Multi-Model Workflows for Synthetic Preference Optimization Dataset Generation
Samee Arif, Sualeha Farid, Abdul Hameed Azeemi et al.
Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
Isik Baran Sandan, Tu Anh Dinh, Jan Niehues
Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?
Evangelia Gogoulou, Shorouq Zahra, Liane Guillou et al.
Evaluating LLMs with Multiple Problems at once
Zhengxiang Wang, Jordan Kodner, Owen Rambow
Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs
Jing Yang Lee, Kong Aik Lee, Woon-Seng Gan
Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs
Minsuh Joo, Hyunsoo Cho
Clustering Zero-Shot Uncertainty Estimations to Assess LLM Response Accuracy for Yes/No Q&A
Christopher T. Franck, Amy Vennos, W. Graham Mueller et al.
Using LLM Judgements for Sanity Checking Results and Reproducibility of Human Evaluations in NLP
Rudali Huidrom, Anya Belz
HuGME: A benchmark system for evaluating Hungarian generative LLMs
Noémi Ligeti-Nagy, Gabor Madarasz, Flora Foldesi et al.
ELAB: Extensive LLM Alignment Benchmark in Persian Language
Zahra Pourbahman, Fatemeh Rajabi, Mohammadhossein Sadeghi et al.
Fine-Tune on the Format: First Improving Multiple-Choice Evaluation for Intermediate LLM Checkpoints
Alec Bunn, Sarah Wiegreffe, Ben Bogin
Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages
Christopher Toukmaji, Jeffrey Flanigan
From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks
Andreas Stephan, Dawei Zhu, Matthias Aßenmacher et al.
Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources
Joachim De Baer, A. Seza Doğruöz, Thomas Demeester et al.
SparQLe: Speech Queries to Text Translation Through LLMs
Amirbek Djanibekov, Hanan Aldarmaki
Prompting LLMs: Length Control for Isometric Machine Translation
Dávid Javorský, Ondřej Bojar, François Yvon
Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025
Dominik Macháček, Peter Polák