AI Safety

2972 directly classified papers

Papers

STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions EMNLP 2024

SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales EMNLP 2024

ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context EMNLP 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis EMNLP 2024

Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models EMNLP 2024

Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment EMNLP 2024

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? EMNLP 2024

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? EMNLP 2024

Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models EMNLP 2024

Ranking Manipulation for Conversational Search Engines EMNLP 2024

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance EMNLP 2024

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models EMNLP 2024

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning EMNLP 2024

Red Teaming Language Models for Processing Contradictory Dialogues EMNLP 2024

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models EMNLP 2024

The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm EMNLP 2024

Let Me Teach You: Pedagogical Foundations of Feedback for Language Models EMNLP 2024

GuardBench: A Large-Scale Benchmark for Guardrail Models EMNLP 2024

Moral Foundations of Large Language Models EMNLP 2024

State-wise safe reinforcement learning with pixel observations L4DC 2024

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking EMNLP 2024

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference EMNLP 2024

RAFT: Realistic Attacks to Fool Text Detectors EMNLP 2024

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis EMNLP 2024

Distract Large Language Models for Automatic Jailbreak Attack EMNLP 2024