Responsible AI

1991 directly classified papers

Papers

One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications CVPR 2024

WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs NeurIPS 2024

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control NeurIPS 2024

Aligning Diffusion Models by Optimizing Human Utility NeurIPS 2024

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models NeurIPS 2024

Do Counterfactually Fair Image Classifiers Satisfy Group Fairness? -- A Theoretical and Empirical Study NeurIPS 2024

Fairness without Harm: An Influence-Guided Active Sampling Approach NeurIPS 2024

Safe LoRA: The Silver Lining of Reducing Safety Risks when Finetuning Large Language Models NeurIPS 2024

MoGU: A Framework for Enhancing Safety of LLMs While Preserving Their Usability NeurIPS 2024

PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action NeurIPS 2024

Smooth Lower Bounds for Differentially Private Algorithms via Padding-and-Permuting Fingerprinting Codes COLT 2024

HalluSafe at SemEval-2024 Task 6: An NLI-based Approach to Make LLMs Safer by Better Detecting Hallucinations and Overgeneration Mistakes SEMEVAL 2024

M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection EACL 2024

Uncovering Stereotypes in Large Language Models: A Task Complexity-based Approach EACL 2024

“It’s how you do things that matters”: Attending to Process to Better Serve Indigenous Communities with Language Technologies EACL 2024

HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs EACL 2024

A Thesis Proposal ClaimInspector Framework: A Hybrid Approach to Data Annotation using Fact-Checked Claims and LLMs EACL 2024

Do-Not-Answer: Evaluating Safeguards in LLMs EACL 2024

Explaining Language Model Predictions with High-Impact Concepts EACL 2024

Differentially Private Natural Language Models: Recent Advances and Future Directions EACL 2024

How should Conversational Agent systems respond to sexual harassment? EACL 2024

PreciseDebias: An Automatic Prompt Engineering Approach for Generative AI To Mitigate Image Demographic Biases WACV 2024

An Empirical Investigation Into Benchmarking Model Multiplicity for Trustworthy Machine Learning: A Case Study on Image Classification WACV 2024

ABLE: Agency-BeLiefs Embedding to Address Stereotypical Bias through Awareness Instead of Obliviousness COLING 2024

A Comparative Study of Explicit and Implicit Gender Biases in Large Language Models via Self-evaluation COLING 2024