AI Safety

2,972 papers

Papers

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition NIPS 2024

Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues ACL 2024

Simplifying Constraint Inference with Inverse Reinforcement Learning NIPS 2024

The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning AAAI 2024

Efficient Adversarial Training in LLMs with Continuous Attacks NIPS 2024

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection NAACL 2024

Learning Safety Constraints from Demonstrations with Unknown Rewards AISTATS 2024

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents NIPS 2024

A theoretical case-study of Scalable Oversight in Hierarchical Reinforcement Learning NIPS 2024

Compos Mentis at SemEval2024 Task6: A Multi-Faceted Role-based Large Language Model Ensemble to Detect Hallucination SEMEVAL 2024

Towards a Unified Framework for Adaptable Problematic Content Detection via Continual Learning NAACL 2024

BadActs: A Universal Backdoor Defense in the Activation Space ACL 2024

Improving Alignment and Robustness with Circuit Breakers NIPS 2024

Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution EMNLP 2024

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts EMNLP 2024

DarkFed: A Data-Free Backdoor Attack in Federated Learning IJCAI 2024

SELF-GUARD: Empower the LLM to Safeguard Itself NAACL 2024

Conditional Backdoor Attack via JPEG Compression AAAI 2024

Fast Best-of-N Decoding via Speculative Rejection NIPS 2024

Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs NAACL 2024

LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning CVPR 2024

Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration NIPS 2024

Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications ACL 2024

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference EMNLP 2024

Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models NAACL 2024