Responsible AI

1991 directly classified papers

Papers

Can Language Model Moderators Improve the Health of Online Discourse? NAACL 2024

MisgenderMender: A Community-Informed Approach to Interventions for Misgendering NAACL 2024

Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation NAACL 2024

Investigating Data Contamination in Modern Benchmarks for Large Language Models NAACL 2024

Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Value NAACL 2024

Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information NAACL 2024

Composite Backdoor Attacks Against Large Language Models NAACL 2024

Adapting Fake News Detection to the Era of Large Language Models NAACL 2024

TagDebias: Entity and Concept Tagging for Social Bias Mitigation in Pretrained Language Models NAACL 2024

Discovering and Mitigating Indirect Bias in Attention-Based Model Explanations NAACL 2024

MICo: Preventative Detoxification of Large Language Models through Inhibition Control NAACL 2024

Ethos: Rectifying Language Models in Orthogonal Parameter Space NAACL 2024

Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models in Court Decisions NAACL 2024

Addressing Healthcare-related Racial and LGBTQ+ Biases in Pretrained Language Models NAACL 2024

CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models NAACL 2024

HalluSafe at SemEval-2024 Task 6: An NLI-based Approach to Make LLMs Safer by Better Detecting Hallucinations and Overgeneration Mistakes NAACL 2024

Halu-NLP at SemEval-2024 Task 6: MetaCheckGPT - A Multi-task Hallucination Detection using LLM uncertainty and meta-models NAACL 2024

Fine-tuning Language Models for AI vs Human Generated Text detection NAACL 2024

Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset NAACL 2024

On the Effectiveness of Adversarial Robustness for Abuse Mitigation with Counterspeech NAACL 2024

GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives NAACL 2024

Safer-Instruct: Aligning Language Models with Automated Preference Data NAACL 2024

“One-Size-Fits-All”? Examining Expectations around What Constitute “Fair” or “Good” NLG System Behaviors NAACL 2024

Enhancing Controlled Query Evaluation through Epistemic Policies IJCAI 2024

Data Ownership and Privacy in Personalized AI Models in Assistive Healthcare IJCAI 2024