Research Explorer

DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs

Oluwanifemi Bamgbose, Masoud Hashemi, Sathwik Tejaswi Madhusudhan et al.

2026 AAAI

A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

Trenton Chang, Tobias Schnabel, Adith Swaminathan et al.

2026 AAAI

MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs

Boyuan Chen, Minghao Shao, Abdul Basit et al.

2026 AAAI

A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

Xiangxiang Dai, Yuejin Xie, Maoli Liu et al.

2026 AAAI

Resilience in Ambient Multi-Agent LLMs via Decentralized Bio-Autonomic Control and Immune-Inspired Anomaly Detection

Nastaran Darabi, Devashri Naik, Sina Tayebati et al.

2026 AAAI

AlignTree: Efficient Defense Against LLM Jailbreak Attacks

Gil Goren, Shahar Katz, Lior Wolf

2026 AAAI

Silenced Biases: The Dark Side LLMs Learned to Refuse

Rom Himelstein, Amit LeVi, Brit Youngmann et al.

2026 AAAI

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka, Keita Saito, Mikoto Kudo et al.

2026 AAAI

Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park et al.

2026 AAAI

ARGH-Mark: Anchor-Synchronized Watermarking with Hamming Correction for Robust and Quality-Preserving LLM Attribution

He Li, Xiaojun Chen, Jingcheng He et al.

2026 AAAI

MRACL: Multi-Reward Space Guided Adaptive Curriculum Reinforcement Learning for LLMs

Wenxuan Liu, Liangyu Huo, Yi Jing et al.

2026 AAAI

Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment

Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang et al.

2026 AAAI

STACK: Adversarial Attacks on LLM Safeguard Pipelines

Ian R. McKenzie, Oskar John Hollinsworth, Tom Tseng et al.

2026 AAAI

AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment

Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess et al.

2026 AAAI

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Jianfeng Si, Lin Sun, Zhewen Tan et al.

2026 AAAI

Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos et al.

2026 AAAI

Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding

Youze Wang, Zijun Chen, Ruoyu Chen et al.

2026 AAAI

STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Zijun Wang, Haoqin Tu, Yuhan Wang et al.

2026 AAAI

CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing

Zixia Wang, Gaojie Jin, Jia Hu et al.

2026 AAAI

HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor

Zihui Wu, Haichang Gao, Jiacheng Luo et al.

2026 AAAI

MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text

Ronghao Xu, Zhen Huang, Yangbo Wei et al.

2026 AAAI

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

Peng Zhang, Peijie Sun

2026 AAAI

GEM: Generative Entropy-Guided Preference Modeling for Few-Shot Alignment of LLMs

Yiyang Zhao, Huiyu Bai, Xuejiao Zhao

2026 AAAI

Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Tianyi Zhou, Johanne Medina, Sanjay Chawla

2026 AAAI

On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks

Ting Bi, Chenghang Ye, Zheyu Yang et al.

2026 AAAI
