conftrace_

Safety

414 papers

Papers per year (chart; year labels not preserved): 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models ACL 2026

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs ACL 2026

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning ACL 2026

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment ACL 2026

Provably Safe Offline-to-Online RL: Decoupling Learning from Data-Driven Safety Enforcement ACL 2026

SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs ACL 2026

Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning ACL 2026

Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs ACL 2026

Can Factual Opinions Be Edited (Manipulated) in Large Language Models? ACL 2026

To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs ACL 2026

Hallucination Detection in LLMs with Topological Divergence on Attention Graphs ACL 2026

Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs ACL 2026

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG ACL 2026

In-Context Representation Hijacking ACL 2026

SAGE: Synergistic Adaptive Gating of Experts for Hateful Video Detection ACL 2026

Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations ACL 2026

Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking ACL 2026

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models ACL 2026

ReFL: Reflective Feedback Learning for Hallucination Detection of Large Language Models ACL 2026

DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping ACL 2026

Visual Inception: Compromising Long-term Planning in Agentic Recommenders via Multimodal Memory Poisoning ACL 2026

Probing the Safety Robustness of LLMs in Latent Space ACL 2026

USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models ACL 2026

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring ACL 2026

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces ACL 2026