Papers
AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models
Alhanoof Althnian, Norah A. Alzahrani, Shaykhah Z. Alsubaie et al.
AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Aisha Alansari, Hamzah Luqman
AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering
Hassan Alhuzali, Walid Al-Eisawi, Muhammad Abdul-Mageed et al.
AraMinds at AraHealthQA 2025: A Retrieval-Augmented Generation System for Fine-Grained Classification and Answer Generation of Arabic Mental Health Q&A
Mohamed Zaytoon, Ahmed Mahmoud Salem, Ahmed Sakr et al.
AraMinds at MAHED 2025: Leveraging Vision-Language Models and Contrastive Multi-task Learning for Multimodal Hate Speech Detection
Mohamed Zaytoon, Ahmed Mahmoud Salem, Ahmed Sakr et al.
AraNLP at MAHED 2025 Shared Task: Using AraBERT for Text-based Hate and Hope Speech Classification
Wafaa S. El-Kassas, Enas A. Hakim Khalil
AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
Ahmed Abul Hasanaath, Aisha Alansari, Ahmed Ashraf et al.
AraS2P: Arabic Speech-to-Phonemes System
Bassam Mattar, Mohamed Fayed, Ayman Khalafallah
AraSafe: Benchmarking Safety in Arabic LLMs
Hamdy Mubarak, Abubakr Mohamed, Majd Hawasly
Archaeology at TSAR 2025 Shared Task: Teaching Small Models to do CEFR Simplifications
Rareş-Alexandru Roşcan, Sergiu Nisioi
A Reasoner for Real-World Event Detection: Scaling Reinforcement Learning via Adaptive Perplexity-Aware Sampling Strategy
Xiaoyun Zhang, Jingqing Ruan, Xing Ma et al.
Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
Raha Askari, Sina Zarrieß, Özge Alacam et al.
Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
Momoka Furuhashi, Kouta Nakayama, Takashi Kodama et al.
Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs
Manon Reusens, Bart Baesens, David Jurgens
Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability
Tu Anh Dinh, Jan Niehues
Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?
Xi Ai, Mahardika Krisna Ihsani, Min-Yen Kan
Are Language Models Consequentialist or Deontological Moral Reasoners?
Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita et al.
Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation
Yubo Xie, Chenkai Wang, Zongyang Ma et al.
Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Omer Nahum, Nitay Calderon, Orgad Keller et al.
Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning
Kush Juvekar, Arghya Bhattacharya, Sai Khadloya et al.
Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model’s Empathy
Ananya Malik, Nazanin Sabri, Melissa M. Karnaze et al.
Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
Seonil Son, Ju-Min Oh, Heegon Jin et al.
Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection?
Anthony Dubreuil, Antoine Gourru, Christine Largeron et al.
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
DongGeon Lee, Joonwon Jang, Jihae Jeong et al.