Anya Belz

28 papers · 2020–2026 · 5 top CS/AI conferences

Achievements

🏃 Academic Marathon (5) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (5) 🐝 Cross-Pollinator (8)

🏆 Keyword Champion (5) 👥 Mega-Team (42) 🧬 Topic Evolution 🗃️ Keyword Collector (117) 💎 Century Club (27) 🔥 Unstoppable (6) ⚡ Prolific Year (8)

Conferences

ACL (14) EMNLP (6) EACL (4) COLING (3) NAACL (1)

Top co-authors

Ehud Reiter (7) Craig Thomson (7) Simon Mille (5) Michela Lorandi (4) Yufang Hou (4) Francesco Moramarco (3) Mark Perera (3) Alex Papadopoulos Korfiatis (3) Massimiliano Pronesti (3) Oisín Redmond (3)

Keywords

human evaluation (8) nlp evaluation (7) natural language processing (7) text generation (6) systematic review (5) evaluation methodology (5) large language model (5) reproducibility assessment (4) nlp research (3) multilingual nlp (3) experimental methodology (3) sentiment analysis (2) medical note generation (2) biomedical text mining (2) inter-annotator agreement (2) language model (2) prompt engineering (2) error annotation (2) natural language inference (2) controllable text generation (2)

Papers

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis · ACL 2026
Using LLM Judgements for Sanity Checking Results and Reproducibility of Human Evaluations in NLP · ACL 2025
Evolving Stances on Reproducibility: A Longitudinal Study of NLP and ML Researchers’ Views and Experience of Reproducibility · EMNLP 2025
Enhancing Study-Level Inference from Clinical Trial Papers via Reinforcement Learning-Based Numeric Reasoning · EMNLP 2025
Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges · ACL 2025
The 2025 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results · ACL 2025
Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies · ACL 2025
HEDS 3.0: The Human Evaluation Data Sheet Version 3.0 · ACL 2025
Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments · ACL 2025
The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results · COLING 2024
Beyond Abstracts: A New Dataset, Prompt Design Strategy and Method for Biomedical Synthesis Generation · ACL 2024
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques · COLING 2024
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models · EACL 2024
Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods · EACL 2024
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP · EACL 2023
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP · ACL 2023
Generating Irish Text with a Flexible Plug-and-Play Architecture · EMNLP 2023
Exploring Variation of Results from Different Experimental Conditions · ACL 2023
How to Control Sentiment in Text Generation: A Survey of the State-of-the-Art in Sentiment-Control Techniques · ACL 2023
On reporting scores and agreement for error annotation tasks · EMNLP 2022
A Survey of Recent Error Annotation Schemes for Automatically Generated Text · EMNLP 2022
Quantified Reproducibility Assessment of NLP Results · ACL 2022
The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP · ACL 2022
User-Driven Research of Medical Note Generation Software · NAACL 2022
Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation · ACL 2022
Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation · EMNLP 2022
A Systematic Review of Reproducibility Research in Natural Language Processing · EACL 2021
The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results · COLING 2020