conftrace_

Fazl Barez

16 papers · 2023–2026 · 5 top CS/AI conferences

Achievements

🐝 Cross-Pollinator (6) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌍 Conference Polyglot (5) 🗺️ Taxonomy Completionist (25) 👥 Mega-Team (24) 👑 Triple Crown 📈 Trend Setter 💎 Century Club (15) 🗃️ Keyword Collector (66) ⚡ Prolific Year (7)

Conferences

EMNLP (6) ACL (4) ICML (3) ICLR (2) NIPS (1)

Top co-authors

Philip Torr (5) David Krueger (3) Clement Neo (3) Amir Abdullah (3) Mor Geva (2) Luke Marks (2) Shay B Cohen (2) Ioannis Konstas (2) Michael Lan (2) Narmeen Fatimah Oozeer (2)

Keywords

large language model (6) model editing (3) ai safety (2) neural network interpretability (2) mechanistic interpretability (2) transformer architecture (2) knowledge editing (1) attention mechanism (1) benchmark evaluation (1) model evaluation (1) code generation (1) model safety (1) prompt engineering (1) embedding space (1) reinforcement learning from human feedback (1) kl divergence (1) model interpretability (1) adversarial perturbation (1) language model (1) factual accuracy (1)

Papers

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing · ACL 2026
Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer · EMNLP 2025
Precise In-Parameter Concept Erasure in Large Language Models · EMNLP 2025
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness · EMNLP 2025
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models · EMNLP 2025
Towards Interpreting Visual Information Processing in Vision-Language Models · ICLR 2025
PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data · ICML 2025
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions · EMNLP 2024
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI · ICML 2024
Value-Evolutionary-Based Reinforcement Learning · ICML 2024
Understanding Addition in Transformers · ICLR 2024
Interpreting Learned Feedback Patterns in Large Language Models · NIPS 2024
Large Language Models Relearn Removed Concepts · ACL 2024
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models · EMNLP 2024
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python · ACL 2023
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark · ACL 2023