Papers
Eguard: Defending LLM Embeddings Against Inversion Attacks via Text Mutual Information Optimization
Tiantian Liu, Hongwei Yao, Feng Lin et al.
Dynamic Deep Prompt Optimization for Defending Against Jailbreak Attacks on LLMs
Doniyorkhon Obidov, Honggang Yu, Xiaolong Guo et al.
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
Zixun Xiong, Gaoyi Wu, Qingyang Yu et al.
Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning
Chenyu Zhang, Lanjun Wang, Yiwen Ma et al.
HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Yaxin Zhao, Yu Zhang
Experiential Fairness: Bridging the Gap Between User Experience and Resource-Centric Fairness in Online LLM Services
Jiahua Huang, Wentai Wu, Yongheng Liu et al.
Breaking Model Lock-in: Cost-Efficient Zero-Shot LLM Routing via a Universal Latent Space
Cheng Yan, Wuyang Zhang, Zhiyuan Ning et al.
SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search
Yifan Zhang, Giridhar Ganapavarapu, Srideepika Jayaraman et al.
EoH-S: Evolution of Heuristic Set Using LLMs for Automated Heuristic Design
Fei Liu, Yilu Liu, Qingfu Zhang et al.
DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs
Oluwanifemi Bamgbose, Masoud Hashemi, Sathwik Tejaswi Madhusudhan et al.
A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
Trenton Chang, Tobias Schnabel, Adith Swaminathan et al.
MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs
Boyuan Chen, Minghao Shao, Abdul Basit et al.
A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses
Xiangxiang Dai, Yuejin Xie, Maoli Liu et al.
Resilience in Ambient Multi-Agent LLMs via Decentralized Bio-Autonomic Control and Immune-Inspired Anomaly Detection
Nastaran Darabi, Devashri Naik, Sina Tayebati et al.
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren, Shahar Katz, Lior Wolf
Silenced Biases: The Dark Side LLMs Learned to Refuse
Rom Himelstein, Amit LeVi, Brit Youngmann et al.
Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment
Shigeki Kusaka, Keita Saito, Mikoto Kudo et al.
Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park et al.
ARGH-Mark: Anchor-Synchronized Watermarking with Hamming Correction for Robust and Quality-Preserving LLM Attribution
He Li, Xiaojun Chen, Jingcheng He et al.
MRACL: Multi-Reward Space Guided Adaptive Curriculum Reinforcement Learning for LLMs
Wenxuan Liu, Liangyu Huo, Yi Jing et al.
Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment
Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang et al.
STACK: Adversarial Attacks on LLM Safeguard Pipelines
Ian R. McKenzie, Oskar John Hollinsworth, Tom Tseng et al.
AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess et al.
Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training
Jianfeng Si, Lin Sun, Zhewen Tan et al.