conftrace
Safety
414 papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 99
Papers
CMD: a framework for Context-aware Model self-Detoxification
EMNLP 2024
Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models
EMNLP 2024
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings
EMNLP 2024
Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering
EMNLP 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
EMNLP 2024
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights
EMNLP 2024
Red Teaming Language Models for Processing Contradictory Dialogues
EMNLP 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
EMNLP 2024
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
EMNLP 2024
BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting
EMNLP 2024
MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
EMNLP 2024
Distract Large Language Models for Automatic Jailbreak Attack
EMNLP 2024
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
EMNLP 2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
EMNLP 2024
Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification
EMNLP 2024
GuardBench: A Large-Scale Benchmark for Guardrail Models
EMNLP 2024
Jailbreaking LLMs with Arabic Transliteration and Arabizi
EMNLP 2024
Defending Jailbreak Prompts via In-Context Adversarial Game
EMNLP 2024
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
EMNLP 2024
WebOlympus: An Open Platform for Web Agents on Live Websites
EMNLP 2024
ULMR: Unlearning Large Language Models via Negative Response and Model Parameter Average
EMNLP 2024
Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models
EMNLP 2024
Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution
EMNLP 2024
Athena: Safe Autonomous Agents with Verbal Contrastive Learning
EMNLP 2024
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
EMNLP 2024