Responsible AI

1991 directly classified papers

Papers

WaterPool: A Language Model Watermark Mitigating Trade-Offs among Imperceptibility, Efficacy and Robustness NAACL 2025

PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines NAACL 2025

Diversity Helps Jailbreak Large Language Models NAACL 2025

My LLM might Mimic AAE - But When Should It? NAACL 2025

High-Dimension Human Value Representation in Large Language Models NAACL 2025

Interactive Human-Centric Bias Mitigation AAAI 2024

Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble EMNLP 2024

Identifying, Mitigating, and Anticipating Bias in Algorithmic Decisions AAAI 2024

Revisiting the Classics: A Study on Identifying and Rectifying Gender Stereotypes in Rhymes and Poems COLING 2024

Fair and Optimal Prediction via Post-Processing AAAI 2024

Aequitas Flow: Streamlining Fair ML Experimentation JMLR 2024

Fairness with Censorship: Bridging the Gap between Fairness Research and Real-World Deployment AAAI 2024

GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives NAACL 2024

Finding ε and δ of Traditional Disclosure Control Systems AAAI 2024

SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset NeurIPS 2024

CAVA: A Tool for Cultural Alignment Visualization & Analysis EMNLP 2024

KorSmishing Explainer: A Korean-centric LLM-based Framework for Smishing Detection and Explanation Generation EMNLP 2024

U-trustworthy Models. Reliability, Competence, and Confidence in Decision-Making AAAI 2024

Identifying Reasons for Bias: An Argumentation-Based Approach AAAI 2024

Safer-Instruct: Aligning Language Models with Automated Preference Data NAACL 2024

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation EMNLP 2024

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations EMNLP 2024

Efficient Lifelong Model Evaluation in an Era of Rapid Progress NeurIPS 2024

The Greatest Good Benchmark: Measuring LLMs’ Alignment with Utilitarian Moral Dilemmas EMNLP 2024

Context-aware Watermark with Semantic Balanced Green-red Lists for Large Language Models EMNLP 2024