Papers
DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs
Oluwanifemi Bamgbose, Masoud Hashemi, Sathwik Tejaswi Madhusudhan et al.
A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
Trenton Chang, Tobias Schnabel, Adith Swaminathan et al.
MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs
Boyuan Chen, Minghao Shao, Abdul Basit et al.
A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses
Xiangxiang Dai, Yuejin Xie, Maoli Liu et al.
Resilience in Ambient Multi-Agent LLMs via Decentralized Bio-Autonomic Control and Immune-Inspired Anomaly Detection
Nastaran Darabi, Devashri Naik, Sina Tayebati et al.
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren, Shahar Katz, Lior Wolf
Silenced Biases: The Dark Side LLMs Learned to Refuse
Rom Himelstein, Amit LeVi, Brit Youngmann et al.
Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment
Shigeki Kusaka, Keita Saito, Mikoto Kudo et al.
Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park et al.
ARGH-Mark: Anchor-Synchronized Watermarking with Hamming Correction for Robust and Quality-Preserving LLM Attribution
He Li, Xiaojun Chen, Jingcheng He et al.
MRACL: Multi-Reward Space Guided Adaptive Curriculum Reinforcement Learning for LLMs
Wenxuan Liu, Liangyu Huo, Yi Jing et al.
Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment
Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang et al.
STACK: Adversarial Attacks on LLM Safeguard Pipelines
Ian R. McKenzie, Oskar John Hollinsworth, Tom Tseng et al.
AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess et al.
Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training
Jianfeng Si, Lin Sun, Zhewen Tan et al.
Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History
Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos et al.
Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding
Youze Wang, Zijun Chen, Ruoyu Chen et al.
STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
Zijun Wang, Haoqin Tu, Yuhan Wang et al.
CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing
Zixia Wang, Gaojie Jin, Jia Hu et al.
HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor
Zihui Wu, Haichang Gao, Jiacheng Luo et al.
MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text
Ronghao Xu, Zhen Huang, Yangbo Wei et al.
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment
Peng Zhang, Peijie Sun
GEM: Generative Entropy-Guided Preference Modeling for Few-Shot Alignment of LLMs
Yiyang Zhao, Huiyu Bai, Xuejiao Zhao
Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models
Tianyi Zhou, Johanne Medina, Sanjay Chawla
On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks
Ting Bi, Chenghang Ye, Zheyu Yang et al.