conftrace_

AI Safety

2,972 papers

Papers

Defending Against Repetitive Backdoor Attacks on Semi-Supervised Learning through Lens of Rate-Distortion-Perception Trade-Off WACV 2025

Are Exemplar-Based Class Incremental Learning Models Victim of Black-Box Poison Attacks? WACV 2025

Improving Deep Detector Robustness via Detection-Related Discriminant Maximization and Reorganization WACV 2025

AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology AACL 2025

Beyond Guardrails: Advanced Safety for Large Language Models — Monolingual, Multilingual and Multimodal Frontiers AACL 2025

Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs AACL 2025

Building Helpful-Only Large Language Models: A Complete Approach from Motivation to Evaluation AACL 2025

Atomic Calibration of LLMs in Long-Form Generations AACL 2025

LiteLMGuard: Seamless and Lightweight On-Device Guardrails for Small Language Models against Quantization Vulnerabilities AACL 2025

Information-theoretic Distinctions Between Deception and Confusion AACL 2025

R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs AACL 2025

Moral Self-correction is Not An Innate Capability in Language Models AACL 2025

Illusions of Relevance: Arbitrary Content Injection Attacks Deceive Retrievers, Rerankers, and LLM Judges AACL 2025

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases AACL 2025

Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts AACL 2025

GeoSAFE - A Novel Geospatial Artificial Intelligence Safety Assurance Framework and Evaluation for LLM Moderation AACL 2025

Auditing Political Bias in Text Generation by GPT-4 using Sociocultural and Demographic Personas: Case of Bengali Ethnolinguistic Communities AACL 2025

Mātṛkā: Multilingual Jailbreak Evaluation of Open-Source Large Language Models AACL 2025

Efficient Adversarial Training in LLMs with Continuous Attacks NIPS 2024

Provably Safe Neural Network Controllers via Differential Dynamic Logic NIPS 2024

Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems NIPS 2024

ReMoDetect: Reward Models Recognize Aligned LLM's Generations NIPS 2024

LT-Defense: Searching-free Backdoor Defense via Exploiting the Long-tailed Effect NIPS 2024

Fair Secretaries with Unfair Predictions NIPS 2024

Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models NIPS 2024