Ranjay Krishna
77 papers · 2015–2026 · 13 conferences · across top CS/AI conferences
Achievements
Academic Marathon (10) · Conference Polyglot (12) · Keyword Pioneer · Interdisciplinary Bridge · Hot Topic Early Bird
Interdisciplinary Bridge
Academic Marathon (10)
Renaissance Researcher (11)
Conference Loyalist (23)
Triple Crown
Dynamic Duo (11)
Mega-Team (50)
Keyword Champion (4)
Deep Specialist (29)
Grand Slam
Keyword Collector (279)
The Questioner (2)
Prolific Year (11)
Century Club (76)
Unstoppable (9)
Trend Setter
Conferences
CVPR (23)
NIPS (15)
ICCV (7)
CORL (6)
ECCV (6)
EMNLP (6)
ACL (4)
ICLR (4)
ICML (2)
AAAI (1)
IJCNLP (1)
NAACL (1)
RSS (1)
Keywords
vision-language model (14)
large language model (9)
compositional reasoning (7)
multimodal learning (7)
visual question answering (7)
visual reasoning (6)
benchmark evaluation (5)
image captioning (4)
multimodal language model (4)
text-to-image generation (4)
contrastive learning (4)
language model (4)
zero-shot learning (3)
image understanding (3)
image generation (3)
knowledge distillation (3)
image classification (3)
question answering (3)
active learning (3)
video understanding (3)
Papers
Towards Acyclic Preference Evaluation of Language Models via Multiple Evaluators
AAAI 2026
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
ICLR 2025
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
ACL 2025
Synthetic Visual Genome
CVPR 2025
NVILA: Efficient Frontier Visual Language Models
CVPR 2025
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
CVPR 2025
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
CVPR 2025
One Diffusion to Generate Them All
CVPR 2025
RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations
CVPR 2025
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
CVPR 2025
Semantic and Expressive Variations in Image Captions Across Languages
CVPR 2025
Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
ICLR 2025
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
ICCV 2025
Contrastive Flow Matching
ICCV 2025
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
ICCV 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
CVPR 2025
DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
NAACL 2025
SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation
ICML 2025
ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training
CORL 2025
GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
CORL 2025
LATTE: Learning to Think with Vision Specialists
EMNLP 2025
Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency
EMNLP 2025
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
CVPR 2024
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
NIPS 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
NIPS 2024
Task Me Anything
NIPS 2024
Multilingual Diversity Improves Vision-Language Representations
NIPS 2024
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
NIPS 2024
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition
NIPS 2024
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
NIPS 2024
I Can Tell What I am Doing: Toward Real-World Natural Language Grounding of Robot Experiences
CORL 2024
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction in Robotics
CORL 2024
Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
CORL 2024
Found in the middle: Calibrating Positional Attention Bias Improves Long Context Utilization
ACL 2024
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
CVPR 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models
CVPR 2024
Holodeck: Language Guided Generation of 3D Embodied AI Environments
CVPR 2024
SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World
CVPR 2024
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
CVPR 2024
m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
ECCV 2024
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
ECCV 2024
The Hard Positive Truth about Vision-Language Compositionality
ECCV 2024
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
ECCV 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive
ECCV 2024
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
ECCV 2024
ImageInWords: Unlocking Hyper-Detailed Image Descriptions
EMNLP 2024
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
EMNLP 2024
Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning
EMNLP 2024
Selective Visual Representations Improve Convergence and Generalization for Embodied AI
ICLR 2024
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation
ICLR 2024
Offline Training of Language Model Agents with Functions as Learnable Weights
ICML 2024
THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation
RSS 2024
AR2-D2: Training a Robot Without a Robot
CORL 2023
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
ICCV 2023
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
CVPR 2023
OBJECT 3DIT: Language-guided 3D-aware Image Editing
NIPS 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
NIPS 2023
DataComp: In search of the next generation of multimodal datasets
NIPS 2023
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
ACL 2023
Quilt-1M: One Million Image-Text Pairs for Histopathology
NIPS 2023
Cola: A Benchmark for Compositional Text-to-image Retrieval
NIPS 2023
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
NIPS 2023
Agile Modeling: From Concept to Classifier in Minutes
ICCV 2023
Measuring Compositional Consistency for Video Question Answering
CVPR 2022
ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward
NIPS 2022
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering
IJCNLP 2021
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering
ACL 2021
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
CVPR 2021
Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs
CVPR 2020
Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning
EMNLP 2020
HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models
NIPS 2019
Scene Graph Prediction With Limited Labels
ICCV 2019
Information Maximizing Visual Question Generation
CVPR 2019
Referring Relationships
CVPR 2018
Dense-Captioning Events in Videos
ICCV 2017
A Hierarchical Approach for Generating Descriptive Image Paragraphs
CVPR 2017
Image Retrieval Using Scene Graphs
CVPR 2015