Responsible AI

1991 directly classified papers

Papers

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection ACL 2024

MemeGuard: An LLM and VLM-based Framework for Advancing Content Moderation via Meme Intervention ACL 2024

Machine Unlearning of Pre-trained Large Language Models ACL 2024

Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback ACL 2024

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM ACL 2024

An Entropy-based Text Watermarking Detection Method ACL 2024

Don’t Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection ACL 2024

Mitigate Extrinsic Social Bias in Pre-trained Language Models via Continuous Prompts Adjustment EMNLP 2024

Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research EMNLP 2024

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks EMNLP 2024

Style-Specific Neurons for Steering LLMs in Text Style Transfer EMNLP 2024

Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination EMNLP 2024

Don’t Just Say “I don’t know”! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations EMNLP 2024

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark EMNLP 2024

Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective EMNLP 2024

BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs EMNLP 2024

Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation EMNLP 2024

CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation EMNLP 2024

Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding EMNLP 2024

BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting EMNLP 2024

Towards Measuring and Modeling “Culture” in LLMs: A Survey EMNLP 2024

Hate Personified: Investigating the role of LLMs in content moderation EMNLP 2024

Distract Large Language Models for Automatic Jailbreak Attack EMNLP 2024

How Susceptible are Large Language Models to Ideological Manipulation? EMNLP 2024

Granular Privacy Control for Geolocation with Vision Language Models EMNLP 2024