Ranjay Krishna

77 papers · 2015–2026 · 13 conferences · across top CS/AI conferences

Achievements

🏃 Academic Marathon (10) 🌍 Conference Polyglot (12) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🐣 Hot Topic Early Bird

🌈 Renaissance Researcher (11) 🏠 Conference Loyalist (23) 👑 Triple Crown 🤝 Dynamic Duo (11) 👥 Mega-Team (50) 🏆 Keyword Champion (4) 🔬 Deep Specialist (29) 🏆 Grand Slam 🗃️ Keyword Collector (279) ❓ The Questioner (2) ⚡ Prolific Year (11) 💎 Century Club (76) 🔥 Unstoppable (9) 📈 Trend Setter

Conferences

CVPR (23) NIPS (15) ICCV (7) CORL (6) ECCV (6) EMNLP (6) ACL (4) ICLR (4) ICML (2) AAAI (1) IJCNLP (1) NAACL (1) RSS (1)

Top co-authors

Aniruddha Kembhavi (11) Cheng-Yu Hsieh (11) Jieyu Zhang (10) Zixian Ma (9) Ali Farhadi (9) Li Fei-fei (9) Jiafei Duan (8) Yushi Hu (7) Dieter Fox (7) Tanmay Gupta (6)

Keywords

vision-language model (14) large language model (9) compositional reasoning (7) multimodal learning (7) visual question answering (7) visual reasoning (6) benchmark evaluation (5) image captioning (4) multimodal language model (4) text-to-image generation (4) contrastive learning (4) language model (4) zero-shot learning (3) image understanding (3) image generation (3) knowledge distillation (3) image classification (3) question answering (3) active learning (3) video understanding (3)

Papers

Towards Acyclic Preference Evaluation of Language Models via Multiple Evaluators AAAI 2026
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation ICLR 2025
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation ACL 2025
Synthetic Visual Genome CVPR 2025
NVILA: Efficient Frontier Visual Language Models CVPR 2025
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation CVPR 2025
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models CVPR 2025
One Diffusion to Generate Them All CVPR 2025
RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations CVPR 2025
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model CVPR 2025
Semantic and Expressive Variations in Image Captions Across Languages CVPR 2025
Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment ICLR 2025
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory ICCV 2025
Contrastive Flow Matching ICCV 2025
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology ICCV 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models CVPR 2025
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback NAACL 2025
SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation ICML 2025
ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training CORL 2025
GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation CORL 2025
LATTE: Learning to Think with Vision Specialists EMNLP 2025
Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency EMNLP 2025
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models CVPR 2024
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better NIPS 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples NIPS 2024
Task Me Anything NIPS 2024
Multilingual Diversity Improves Vision-Language Representations NIPS 2024
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass NIPS 2024
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition NIPS 2024
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models NIPS 2024
I Can Tell What I am Doing: Toward Real-World Natural Language Grounding of Robot Experiences CORL 2024
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction in Robotics CORL 2024
Manipulate-Anything: Automating Real-World Robots using Vision-Language Models CORL 2024
Found in the middle: Calibrating Positional Attention Bias Improves Long Context Utilization ACL 2024
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use CVPR 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models CVPR 2024
Holodeck: Language Guided Generation of 3D Embodied AI Environments CVPR 2024
SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World CVPR 2024
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos CVPR 2024
m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks ECCV 2024
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion ECCV 2024
The Hard Positive Truth about Vision-Language Compositionality ECCV 2024
Efficient Inference of Vision Instruction-Following Models with Elastic Cache ECCV 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive ECCV 2024
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision ECCV 2024
ImageInWords: Unlocking Hyper-Detailed Image Descriptions EMNLP 2024
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps EMNLP 2024
Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning EMNLP 2024
Selective Visual Representations Improve Convergence and Generalization for Embodied AI ICLR 2024
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation ICLR 2024
Offline Training of Language Model Agents with Functions as Learnable Weights ICML 2024
THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation RSS 2024
AR2-D2: Training a Robot Without a Robot CORL 2023
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering ICCV 2023
CREPE: Can Vision-Language Foundation Models Reason Compositionally? CVPR 2023
OBJECT 3DIT: Language-guided 3D-aware Image Editing NIPS 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality NIPS 2023
DataComp: In search of the next generation of multimodal datasets NIPS 2023
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes ACL 2023
Quilt-1M: One Million Image-Text Pairs for Histopathology NIPS 2023
Cola: A Benchmark for Compositional Text-to-image Retrieval NIPS 2023
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias NIPS 2023
Agile Modeling: From Concept to Classifier in Minutes ICCV 2023
Measuring Compositional Consistency for Video Question Answering CVPR 2022
ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward NIPS 2022
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering IJCNLP 2021
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering ACL 2021
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning CVPR 2021
Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs CVPR 2020
Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning EMNLP 2020
HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models NIPS 2019
Scene Graph Prediction With Limited Labels ICCV 2019
Information Maximizing Visual Question Generation CVPR 2019
Referring Relationships CVPR 2018
Dense-Captioning Events in Videos ICCV 2017
A Hierarchical Approach for Generating Descriptive Image Paragraphs CVPR 2017
Image Retrieval Using Scene Graphs CVPR 2015