AI Safety

2972 directly classified papers

Papers

FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts EMNLP 2025

Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation ACL 2025

NLP_CIMAT at SemEval-2025 Task 3: Just Ask GPT or look Inside. A prompt and Neural Networks Approach to Hallucination Detection SEMEVAL 2025

AILS-NTUA at SemEval-2025 Task 3: Leveraging Large Language Models and Translation Strategies for Multilingual Hallucination Detection SEMEVAL 2025

One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems EMNLP 2025

Internal Value Alignment in Large Language Models through Controlled Value Vector Activation ACL 2025

LLaVA-Critic: Learning to Evaluate Multimodal Models CVPR 2025

Team Cantharellus at SemEval-2025 Task 3: Hallucination Span Detection with Fine Tuning on Weakly Supervised Synthetic Data SEMEVAL 2025

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness CVPR 2025

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges ACL 2025

MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique EMNLP 2025

Mr. Snuffleupagus at SemEval-2025 Task 4: Unlearning Factual Knowledge from LLMs Using Adaptive RMU SEMEVAL 2025

AI Governance and Lessons Learned as an AI Policy Advisor in the United States Senate AAAI 2025

Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling ACL 2025

Towards Statistical Factuality Guarantee for Large Vision-Language Models EMNLP 2025

Computational Thinking with Computer Vision: Developing AI Competency in an Introductory Computer Science Course AAAI 2025

Lacuna Inc. at SemEval-2025 Task 4: LoRA-Enhanced Influence-Based Unlearning for LLMs SEMEVAL 2025

SDD: Self-Degraded Defense against Malicious Fine-tuning ACL 2025

Test-Time Backdoor Detection for Object Detection Models CVPR 2025

HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models EMNLP 2025

LORE: Continual Logit Rewriting Fosters Faithful Generation EMNLP 2025

A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive ACL 2025

Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone CVPR 2025

AGENTVIGIL: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents EMNLP 2025

ESF: Efficient Sensitive Fingerprinting for Black-Box Tamper Detection of Large Language Models ACL 2025