Papers
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models
Lindia Tjuatja, Graham Neubig
Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey
Ivan Vegner, Sydelle de Souza, Valentin Forch et al.
Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse
Anna Kołos, Katarzyna Lorenc, Emilia Wiśnios et al.
BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian
Maksim Aparovich, Volha Harytskaya, Vladislav Poritski et al.
Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines
Yunsu Kim, Ahmedelmogtaba Abdelaziz, Thiago Castro Ferreira et al.
BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering
Taolin Zhang, Dongyang Li, Qizhou Chen et al.
Bemba Speech Translation: Exploring a Low-Resource African Language
Muhammad Hazim Al Farouq, Aman Kassahun Wassie, Yasmin Moslem
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning
Yingjie Zhu, Xuefeng Bai, Kehai Chen et al.
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
Alperen Yildiz, Sin G Teo, Yiling Lou et al.
Benchmarking Long-Context Language Models on Long Code Understanding
Jia Li, Xuyuan Guo, Lei Li et al.
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains
Yurii Paniv, Artur Kiulian, Dmytro Chaplynskyi et al.
Benchmarking Multi-National Value Alignment for Large Language Models
Chengyi Ju, Weijie Shi, Chengzhong Liu et al.
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Kuofeng Gao, Shu-Tao Xia, Ke Xu et al.
Benchmarking Query-Conditioned Natural Language Inference
Marc E. Canby, Xinchi Chen, Xing Niu et al.
Benchmarking Table Extraction: Multimodal LLMs vs Traditional OCR
Guilherme Nunes, Vitor Rolla, Duarte Pereira et al.
Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks
Tom Calamai, Oana Balalau, Fabian M. Suchanek
Benchmarking zero-shot biomedical relation triplet extraction across language model architectures
Frederik Gade, Ole Lund, Marie Lisandra Mendoza
BenNumEval: A Benchmark to Assess LLMs’ Numerical Reasoning Capabilities in Bengali
Kawsar Ahmed, Md Osama, Omar Sharif et al.
BERTastic at SemEval-2025 Task 10: State-of-the-Art Accuracy in Coarse-Grained Entity Framing for Hindi News
Tarek Mahmoud, Zhuohan Xie, Preslav Nakov
BERT-like Models for Slavic Morpheme Segmentation
Dmitry Morozov, Lizaveta Astapenka, Anna Glazkova et al.
BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English
Dipankar Srirag, Aditya Joshi, Jordan Painter et al.
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Shanshan Xu, Santosh T.Y.S.S, Yanai Elazar et al.
Better Embeddings with Coupled Adam
Felix Stollenwerk, Tobias Stollenwerk
Better Process Supervision with Bi-directional Rewarding Signals
Wenxiang Chen, Wei He, Zhiheng Xi et al.
Better Red Teaming via Searching with Large Language Model
Yongkang Chen, Chongyang Zhao, Jianwen Tian et al.