Yilun Zhao
68 papers · 2022–2026 · 7 conferences · across top CS/AI conferences
Achievements
🌍 Conference Polyglot (7)
🐝 Cross-Pollinator (15)
🌉 Interdisciplinary Bridge
🧭 Keyword Pioneer
🌈 Renaissance Researcher (10)
🐣 Hot Topic Early Bird
🏠 Conference Loyalist (26)
👥 Mega-Team (35)
🤝 Dynamic Duo (41)
🔬 Deep Specialist (18)
🏆 Keyword Champion (4)
🗃️ Keyword Collector (220)
⚡ Prolific Year (8)
❓ The Questioner (6)
🔥 Unstoppable (5)
💎 Century Club (55)
Conferences
ACL (28)
EMNLP (26)
NAACL (7)
EACL (3)
ICLR (2)
CVPR (1)
NeurIPS (1)
Research topics
Keywords
large language model (32)
benchmark evaluation (13)
question answering (12)
retrieval-augmented generation (10)
information retrieval (6)
evaluation benchmark (5)
table reasoning (5)
scientific literature (4)
table question answering (4)
multimodal learning (4)
foundation model (4)
language model (4)
table-to-text generation (4)
instruction following (3)
llm evaluation (3)
code generation (3)
synthetic data generation (3)
information seeking (3)
few-shot learning (3)
mathematical reasoning (3)
Papers
MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application
ACL 2026
SciMDR: Advancing Scientific Multimodal Document Reasoning
ACL 2026
A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning
ACL 2026
A Survey of Reasoning-Intensive Retrieval: Progress and Challenges
ACL 2026
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
ACL 2026
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
EACL 2026
SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature
EACL 2026
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
ACL 2026
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
ACL 2026
MMSciCode: Real-world Evaluation of Multilingual Multi-Discipline Scientific Research Coding
ACL 2026
Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future
ACL 2026
Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA
ACL 2026
Anchor: Branch-Point Data Generation for GUI Agents
ACL 2026
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
CVPR 2025
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
NAACL 2025
Are Multimodal LLMs Robust Against Adversarial Perturbations? RoMMath: A Systematic Evaluation on Multimodal Math Reasoning
NAACL 2025
ReIFE: Re-evaluating Instruction-Following Evaluation
NAACL 2025
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
ACL 2025
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
ACL 2025
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
ACL 2025
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
ACL 2025
Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
ACL 2025
SportReason: Evaluating Retrieval-Augmented Reasoning across Tables and Text for Sports Question Answering
EMNLP 2025
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
EMNLP 2025
Table-R1: Inference-Time Scaling for Table Reasoning Tasks
EMNLP 2025
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
EMNLP 2025
LimRank: Less is More for Reasoning-Intensive Information Reranking
EMNLP 2025
SciSketch: An Open-source Framework for Automated Schematic Diagram Generation in Scientific Papers
EMNLP 2025
Z1: Efficient Test-time Scaling with Code
EMNLP 2025
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
EMNLP 2025
MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
EMNLP 2025
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
EMNLP 2025
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplification and Resistance in Multi-Agent Based LLM-as-Judge
EMNLP 2025
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
ICLR 2025
ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning
ICLR 2025
Physics: Benchmarking Foundation Models on University-Level Physics Problem Solving
ACL 2025
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task
ACL 2025
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
ACL 2025
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
EMNLP 2024
TaPERA: Enhancing Faithfulness and Interpretability in Long-Form Table QA by Content Planning and Execution-based Reasoning
ACL 2024
FinanceMATH: Knowledge-Intensive Math Reasoning in Finance Domains
ACL 2024
DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
ACL 2024
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
ACL 2024
Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation
ACL 2024
Revisiting Automated Evaluation for Long-form Table Question Answering
EMNLP 2024
FinDVer: Explainable Claim Verification over Long and Hybrid-content Financial Documents
EMNLP 2024
FOLIO: Natural Language Reasoning with First-Order Logic
EMNLP 2024
TAIL: A Toolkit for Automatic and Realistic Long-Context Large Language Model Evaluation
EMNLP 2024
OpenT2T: An Open-Source Toolkit for Table-to-Text Generation
EMNLP 2024
MIMIR: A Customizable Agent Tuning Platform for Enhanced Scientific Applications
EMNLP 2024
OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
EMNLP 2024
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
EMNLP 2024
Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in LLMs
NeurIPS 2024
Investigating Data Contamination in Modern Benchmarks for Large Language Models
NAACL 2024
Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data?
NAACL 2024
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
NAACL 2024
On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering
NAACL 2024
RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations
ACL 2023
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
ACL 2023
Enhancing Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies
EMNLP 2023
QTSumm: Query-Focused Summarization over Tabular Data
EMNLP 2023
Investigating Table-to-Text Generation Capabilities of Large Language Models in Real-World Information Seeking Scenarios
EMNLP 2023
Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation
EMNLP 2023
LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control
EACL 2023
OpenRT: An Open-source Framework for Reasoning Over Tabular Data
ACL 2023
MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data
ACL 2022
ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples
EMNLP 2022
R2D2: Robust Data-to-Text with Replacement Detection
EMNLP 2022