AI Safety

2972 directly classified papers

Papers

Improving Consistency in LLM Inference using Probabilistic Tokenization NAACL 2025

WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response NAACL 2025

A Practical Examination of AI-Generated Text Detectors for Large Language Models NAACL 2025

Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In NAACL 2025

Aligning to What? Limits to RLHF Based Alignment NAACL 2025

Tongue-Tied: Breaking LLMs Safety Through New Language Learning NAACL 2025

SSNTrio@DravidianLangTech 2025: Identification of AI Generated Content in Dravidian Languages using Transformers NAACL 2025

CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages NAACL 2025

A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient NAACL 2025

A Comprehensive Evaluation of Cognitive Biases in LLMs NAACL 2025

Smaller Large Language Models Can Do Moral Self-Correction NAACL 2025

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine NAACL 2025

Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries NAACL 2025

Multi-lingual Multi-turn Automated Red Teaming for LLMs NAACL 2025

Summary the Savior: Harmful Keyword and Query-based Summarization for LLM Jailbreak Defense NAACL 2025

Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models NAACL 2025

Automating Steering for Safe Multimodal Large Language Models EMNLP 2025

PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization EMNLP 2025

DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion ICCV 2025

Backdoor Mitigation by Distance-Driven Detoxification ICCV 2025

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency ICCV 2025

Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment ICCV 2025

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks ICCV 2025

Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion ICCV 2025

Adversarial Robust Memory-Based Continual Learner ICCV 2025