Yonatan Belinkov

101 papers · 2013–2026 · 14 top CS/AI conferences

Achievements

🌍 Conference Polyglot (14) 🏃 Academic Marathon (12) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🐣 Hot Topic Early Bird

🧭 Keyword Pioneer 🐝 Cross-Pollinator (7) 🏃 Academic Marathon (12) 🌟 Keyword Trendsetter Combo (5) 🏠 Conference Loyalist (24) 👥 Mega-Team (61) 🏆 Grand Slam 🤝 Dynamic Duo (14) 🌱 Topic Pioneer 🔬 Deep Specialist (28) 🧬 Topic Evolution 🏆 Keyword Champion ❓ The Questioner (4) 📈 Trend Setter 🗃️ Keyword Collector (328) 🔥 Unstoppable (11) 💎 Century Club (97) ⚡ Prolific Year (7) 🚀 Conference Pioneer

Conferences

ACL (28) EMNLP (17) ICLR (16) NAACL (13) NIPS (7) AAAI (6) EACL (3) IJCNLP (3) INTERSPEECH (2) SEMEVAL (2) COLING (1) ICCV (1) ICML (1) WACV (1)

Top co-authors

Hassan Sajjad (14) James Glass (14) Nadir Durrani (13) Fahim Dalvi (11) Hadas Orgad (11) Aaron Mueller (9) Dana Arad (8) David Bau (8) Stuart Shieber (6) Boaz Carmeli (5)

Research topics

Reasoning (1)

Keywords

representation learning (18) language model (13) attention mechanism (9) neural machine translation (9) neural network (9) natural language inference (8) bias mitigation (5) model editing (4) neuron analysis (4) emergent communication (4) diffusion model (4) mechanistic interpretability (4) large language model (4) out-of-distribution generalization (4) model interpretability (4) causal mediation analysis (4) domain adaptation (3) transfer learning (3) domain generalization (3) attention head (3)

Papers

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness ACL 2026
Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models ACL 2026
CRISP: Persistent Concept Unlearning via Sparse Autoencoders ACL 2026
Mechanisms of Prompt-Induced Hallucination in Vision–Language Models ACL 2026
CtD: Composition through Decomposition in Emergent Communication ICLR 2025
REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space ACL 2025
Position-aware Automatic Circuit Discovery ACL 2025
Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models EMNLP 2025
SAEs Are Good for Steering – If You Select the Right Features EMNLP 2025
Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer EMNLP 2025
Unsupervised Translation of Emergent Communication AAAI 2025
Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps EMNLP 2025
DEPTH: Discourse Education through Pre-Training Hierarchically NAACL 2025
Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models NAACL 2025
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models EMNLP 2025
BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection EMNLP 2025
Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions ICLR 2025
Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics ICLR 2025
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations ICLR 2025
Jamba: Hybrid Transformer-Mamba Language Models ICLR 2025
MIB: A Mechanistic Interpretability Benchmark ICML 2025
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models ICLR 2025
Unified Concept Editing in Diffusion Models WACV 2024
Linearity of Relation Decoding in Transformer Language Models ICLR 2024
Semantics and Spatiality of Emergent Communication NIPS 2024
Confidence Regulation Neurons in Language Models NIPS 2024
Accelerating the Global Aggregation of Local Explanations AAAI 2024
Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information NAACL 2024
ContraSim – Analyzing Neural Representations Based on Contrastive Learning NAACL 2024
ReFACT: Updating Text-to-Image Models by Editing the Text Encoder NAACL 2024
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines ACL 2024
Concept-Best-Matching: Evaluating Compositionality In Emergent Communication ACL 2024
Learning from Others: Similarity-based Regularization for Mitigating Dataset Bias ACL 2024
Generating Benchmarks for Factuality Evaluation of Language Models EACL 2024
A Dataset for Metaphor Detection in Early Medieval Hebrew Poetry EACL 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking ICLR 2024
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space EMNLP 2024
Fast Forwarding Low-Rank Training EMNLP 2024
VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers EMNLP 2023
BLIND: Bias Removal With No Demographics ACL 2023
Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection ACL 2023
Multiple sequence alignment as a sequence-to-sequence learning problem ICLR 2023
Mass-Editing Memory in a Transformer ICLR 2023
Editing Implicit Assumptions in Text-to-Image Diffusion Models ICCV 2023
Parallel Context Windows for Large Language Models ACL 2023
What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary ACL 2023
Emergent Quantized Communication AAAI 2023
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis EMNLP 2023
When Language Models Fall in Love: Animacy Processing in Transformer Language Models EMNLP 2023
How Gender Debiasing Affects Internal Model Representations, and Why It Matters NAACL 2022
Supervising Model Attention with Human Explanations for Robust Natural Language Inference AAAI 2022
IDANI: Inference-time Domain Adaptation via Neuron-level Interventions NAACL 2022
Choose Your Lenses: Flaws in Gender Bias Evaluation NAACL 2022
A Generative Approach for Mitigating Structural Biases in Natural Language Inference NAACL 2022
On the Pitfalls of Analyzing Individual Neurons in Language Models ICLR 2022
Measures of Information Reflect Memorization Patterns NIPS 2022
A Multilingual Perspective Towards the Evaluation of Attribution Methods in Natural Language Inference EMNLP 2022
Locating and Editing Factual Associations in GPT NIPS 2022
Learning from others' mistakes: Avoiding dataset biases without modeling them ICLR 2021
IRM—when it works and when it doesn't: A test case of natural language inference NIPS 2021
Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models ACL 2021
Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance? EACL 2021
Debiasing Methods in Natural Language Understanding Make Bias More Accessible EMNLP 2021
Variational Information Bottleneck for Effective Low-Resource Fine-Tuning ICLR 2021
Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models IJCNLP 2021
End-to-End Bias Mitigation by Modelling Biases in Corpora ACL 2020
Investigating Gender Bias in Language Models Using Causal Mediation Analysis NIPS 2020
Similarity Analysis of Contextual Word Representation Models ACL 2020
Findings of the WMT 2020 Shared Task on Machine Translation Robustness EMNLP 2020
Analyzing Redundancy in Pretrained Transformer Models EMNLP 2020
Analyzing Individual Neurons in Pre-trained Language Models EMNLP 2020
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations ACL 2020
A Constructive Prediction of the Generalization Error Across Scales ICLR 2020
Probing Neural Dialog Models for Conversational Understanding ACL 2020
Interpretability and Analysis in Neural NLP ACL 2020
Linguistic Knowledge and Transferability of Contextual Representations NAACL 2019
Identifying and Controlling Important Neurons in Neural Machine Translation ICLR 2019
Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition INTERSPEECH 2019
Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects NAACL 2019
One Size Does Not Fit All: Comparing NMT Representations of Different Granularities NAACL 2019
Findings of the First Shared Task on Machine Translation Robustness ACL 2019
Analyzing the Structure of Attention in a Transformer Language Model ACL 2019
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP ACL 2019
Improving Neural Language Models by Segmenting, Attending, and Predicting the Future ACL 2019
Don’t Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference ACL 2019
LSTM Networks Can Perform Dynamic Counting ACL 2019
NeuroX: A Toolkit for Analyzing Individual Neurons in Neural Networks AAAI 2019
What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models AAAI 2019
Synthetic and Natural Noise Both Break Neural Machine Translation ICLR 2018
On the Evaluation of Semantic Phenomena in Neural Machine Translation Using Natural Language Inference NAACL 2018
Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks IJCNLP 2017
QMDIS: QCRI-MIT Advanced Dialect Identification System INTERSPEECH 2017
What do Neural Machine Translation Models Learn about Morphology? ACL 2017
Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging ACL 2017
Understanding and Improving Morphological Learning in the Neural Machine Translation Decoder IJCNLP 2017
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems NIPS 2017
Neural Attention for Learning to Rank Questions in Community Question Answering COLING 2016
SLS at SemEval-2016 Task 3: Neural-based Approaches for Ranking in Community Question Answering SEMEVAL 2016
VectorSLU: A Continuous Word Vector Approach to Answer Selection in Community Question Answering Systems SEMEVAL 2015
Arabic Diacritization with Recurrent Neural Networks EMNLP 2015
Translating Dialectal Arabic to English ACL 2013