dataset creation

115 papers

Co-occurring keywords

text classification (6776), large language model (12755), question answering (2904), low-resource language (2234), machine translation (2472), multilingual nlp (1423), hate speech detection (716), natural language processing (2027), natural language inference (1278), named entity recognition (2801)

Papers

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination CVPR 2025

MemeInterpret: Towards an All-in-One Dataset for Meme Understanding EMNLP 2025

ToVo: Toxicity Taxonomy via Voting NAACL 2025

The Kyrgyz Seed Dataset Submission to the WMT25 Open Language Data Initiative Shared Task EMNLP 2025

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia ACL 2025

Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election COLING 2025

DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset NAACL 2024

Explainable CED: A Dataset for Explainable Critical Error Detection in Machine Translation NAACL 2024

Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations EMNLP 2024

GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation EMNLP 2024

IndiSentiment140: Sentiment Analysis Dataset for Indian Languages with Emphasis on Low-Resource Languages using Machine Translation NAACL 2024

Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian EMNLP 2024

You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions EMNLP 2024

SyllabusQA: A Course Logistics Question Answering Dataset ACL 2024

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages ACL 2024

Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets EMNLP 2024

MCTS: A Multi-Reference Chinese Text Simplification Dataset COLING 2024

mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans ACL 2024

EEVEE: An Easy Annotation Tool for Natural Language Processing EACL 2024

Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models ACL 2024

LLM-REDIAL: A Large-Scale Dataset for Conversational Recommender Systems Created from User Behaviors with LLMs ACL 2024

SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation COLING 2024

Finding Spoken Identifications: Using GPT-4 Annotation for an Efficient and Fast Dataset Creation Pipeline COLING 2024

A Study on Scaling Up Multilingual News Framing Analysis NAACL 2024

Automating Dataset Production Using Generative Text and Image Models COLING 2024