Safety

414 papers

Papers per year (bar chart, year labels not preserved): 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

CMD: a framework for Context-aware Model self-Detoxification EMNLP 2024

Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models EMNLP 2024

ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings EMNLP 2024

Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering EMNLP 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis EMNLP 2024

Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights EMNLP 2024

Red Teaming Language Models for Processing Contradictory Dialogues EMNLP 2024

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models EMNLP 2024

Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction EMNLP 2024

BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting EMNLP 2024

MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance EMNLP 2024

Distract Large Language Models for Automatic Jailbreak Attack EMNLP 2024

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference EMNLP 2024

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking EMNLP 2024

Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification EMNLP 2024

GuardBench: A Large-Scale Benchmark for Guardrail Models EMNLP 2024

Jailbreaking LLMs with Arabic Transliteration and Arabizi EMNLP 2024

Defending Jailbreak Prompts via In-Context Adversarial Game EMNLP 2024

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations EMNLP 2024

WebOlympus: An Open Platform for Web Agents on Live Websites EMNLP 2024

ULMR: Unlearning Large Language Models via Negative Response and Model Parameter Average EMNLP 2024

Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models EMNLP 2024

Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution EMNLP 2024

Athena: Safe Autonomous Agents with Verbal Contrastive Learning EMNLP 2024

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing EMNLP 2024