conftrace
AI Safety
2,972 papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Defending Against Repetitive Backdoor Attacks on Semi-Supervised Learning through Lens of Rate-Distortion-Perception Trade-Off
WACV 2025
Are Exemplar-Based Class Incremental Learning Models Victim of Black-Box Poison Attacks?
WACV 2025
Improving Deep Detector Robustness via Detection-Related Discriminant Maximization and Reorganization
WACV 2025
AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology
AACL 2025
Beyond Guardrails: Advanced Safety for Large Language Models — Monolingual, Multilingual and Multimodal Frontiers
AACL 2025
Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs
AACL 2025
Building Helpful-Only Large Language Models: A Complete Approach from Motivation to Evaluation
AACL 2025
Atomic Calibration of LLMs in Long-Form Generations
AACL 2025
LiteLMGuard: Seamless and Lightweight On-Device Guardrails for Small Language Models against Quantization Vulnerabilities
AACL 2025
Information-theoretic Distinctions Between Deception and Confusion
AACL 2025
R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
AACL 2025
Moral Self-correction is Not An Innate Capability in Language Models
AACL 2025
Illusions of Relevance: Arbitrary Content Injection Attacks Deceive Retrievers, Rerankers, and LLM Judges
AACL 2025
UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
AACL 2025
Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts
AACL 2025
GeoSAFE - A Novel Geospatial Artificial Intelligence Safety Assurance Framework and Evaluation for LLM Moderation
AACL 2025
Auditing Political Bias in Text Generation by GPT-4 using Sociocultural and Demographic Personas: Case of Bengali Ethnolinguistic Communities
AACL 2025
Mātṛkā: Multilingual Jailbreak Evaluation of Open-Source Large Language Models
AACL 2025
Efficient Adversarial Training in LLMs with Continuous Attacks
NIPS 2024
Provably Safe Neural Network Controllers via Differential Dynamic Logic
NIPS 2024
Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems
NIPS 2024
ReMoDetect: Reward Models Recognize Aligned LLM's Generations
NIPS 2024
LT-Defense: Searching-free Backdoor Defense via Exploiting the Long-tailed Effect
NIPS 2024
Fair Secretaries with Unfair Predictions
NIPS 2024
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
NIPS 2024