AI Safety

2972 directly classified papers

Papers

Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries (NAACL 2025)

Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training (ACL 2025)

Learn and Unlearn: Addressing Misinformation in Multilingual LLMs (EMNLP 2025)

RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution (EMNLP 2025)

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine (NAACL 2025)

Smaller Large Language Models Can Do Moral Self-Correction (NAACL 2025)

Knowledge Boundary of Large Language Models: A Survey (ACL 2025)

A Comprehensive Evaluation of Cognitive Biases in LLMs (NAACL 2025)

A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient (NAACL 2025)

Stepwise Reasoning Disruption Attack of LLMs (ACL 2025)

MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique (EMNLP 2025)

CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages (NAACL 2025)

SSNTrio@DravidianLangTech 2025: Identification of AI Generated Content in Dravidian Languages using Transformers (NAACL 2025)

Improve Safety Training of Large Language Models with Safety-Critical Singular Vectors Localization (ACL 2025)

Tongue-Tied: Breaking LLMs Safety Through New Language Learning (NAACL 2025)

Aligning to What? Limits to RLHF Based Alignment (NAACL 2025)

SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model (ACL 2025)

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models (EMNLP 2025)

Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model (EMNLP 2025)

R-TOFU: Unlearning in Large Reasoning Models (EMNLP 2025)

Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In (NAACL 2025)

A Practical Examination of AI-Generated Text Detectors for Large Language Models (NAACL 2025)

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs (ACL 2025)

WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response (NAACL 2025)

Improving Consistency in LLM Inference using Probabilistic Tokenization (NAACL 2025)