Visual-CoT: Chain-of-Thought Reasoning
Advancing Multi-Modal Language Models with Visual Chain-of-Thought
Paper (NeurIPS 2024 Spotlight) | GitHub | Dataset
1. Introduction to Visual-CoT
Visual Chain-of-Thought (VisCoT) is a multi-modal language model that enables:
- Region Identification: Detect key regions in images using bounding boxes
 - Step-by-Step Reasoning: Apply Chain-of-Thought methodology for visual understanding
 - Question Answering: Provide interpretable explanations for visual content
 
1.1 Dataset Statistics
- 438,000 question-answer pairs with bounding box annotations
 - 13 diverse benchmarks (DocVQA, GQA, TextVQA, etc.)
 - Based on LLaVA-1.5 architecture with CLIP ViT-L/14 vision encoder
 
Note: This Space uses Zero GPU which requires authentication. Please login or create a free account if you encounter quota errors.
Model Selection
Choose model variant (larger = better quality, slower)
Current Model Status
2. Interactive Demonstration
Procedure:
- Upload an image
 - Enter a question about the image
 - The model will:
- Step 1: Detect region of interest (ROI) and output bounding box
 - Step 2: Analyze the ROI and generate answer
 
 
Load Random Benchmark Example:
3. Results
3.1 Step 1: Region Detection
3.2 Step 2: Answer Generation
3.3 Visualization
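The visualization step simply overlays the Step 1 bounding box on the input image. Below is a minimal Pillow sketch; the file names and the absolute-pixel [x1, y1, x2, y2] coordinate format are illustrative assumptions, not the demo's internal code.

```python
# Minimal sketch: overlay a predicted ROI box on the input image with Pillow.
# The [x1, y1, x2, y2] pixel format and file names are illustrative assumptions.
from PIL import Image, ImageDraw

def draw_roi(image_path: str, bbox: list[int], out_path: str = "roi_overlay.png") -> None:
    """Draw a single bounding box on the image and save the result."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    x1, y1, x2, y2 = bbox
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    image.save(out_path)

# Example: visualize the ROI predicted in Step 1.
# draw_roi("example.jpg", [120, 45, 360, 210])
```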
Try These Example Questions
| Input Image | Question | 
|---|---|
Explore Visual-CoT Benchmark Examples
Load and browse real examples from the Visual-CoT benchmark datasets. Each example includes: image, question, ground-truth bounding box, and answer.
Choose from 8 visual reasoning benchmarks
Image
Annotations
Available Benchmark Datasets
- GQA: Scene graph QA (72K balanced images)
  - Path: lmms-lab/GQA
- RefCOCO: Referring expression comprehension (8.8K validation)
  - Path: lmms-lab/RefCOCO
- RefCOCO+: RefCOCO with no location words (3.8K validation)
  - Path: lmms-lab/RefCOCOplus
- RefCOCOg: RefCOCO with longer expressions (7.5K validation)
  - Path: lmms-lab/RefCOCOg
- POPE: Object probing evaluation (9K test)
  - Path: lmms-lab/POPE
- ScienceQA: Science question answering (4.2K validation)
  - Path: lmms-lab/ScienceQA
- MM-GCoT: Multi-Modal Graph CoT (63.9K training)
  - Path: AQUA6/MM-GCoT
- VGR: Visual Grounding & Reasoning (90K training)
  - Path: BytedanceDouyinContent/VGR
Total: 8 benchmarks from Visual Chain-of-Thought Reasoning Collection
Source: Hugging Face Collection
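As a rough illustration of how these benchmark examples can be pulled from the Hub, the sketch below uses the `datasets` library with one of the repository paths listed above; the split name and column names are assumptions, so check each dataset card before relying on them.

```python
# Sketch: browse benchmark examples from the Hugging Face Hub.
# NOTE: the split name ("test") and field names are assumptions;
# check each dataset card for its actual configs, splits, and columns.
from datasets import load_dataset

dataset = load_dataset("lmms-lab/POPE", split="test", streaming=True)

for example in dataset.take(3):
    # Inspect the available columns before relying on specific field names.
    print(sorted(example.keys()))
```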
Paper Information
Title: Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Authors: Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li
Conference: NeurIPS 2024 (Spotlight)
Abstract: We introduce Visual-CoT, a comprehensive dataset and benchmark for evaluating chain-of-thought reasoning in multi-modal language models. Our dataset comprises 438K question-answer pairs with intermediate bounding box annotations highlighting key regions essential for answering questions. We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable reasoning steps.
Model Architecture
Components
Vision Encoder: CLIP ViT-L/14
- Input resolution: 224px or 336px
 - Output: 577 visual tokens (336px) or 196 tokens (224px)
 - Feature dimension: 1024
 
Multi-modal Projector: 2-layer MLP with GELU (see the sketch after this list)
- Maps vision features (1024D) to LLM embedding space (4096D)
 - Trainable parameters: ~8.4M
 
Language Model: Vicuna v1.5 (instruction-tuned LLaMA)
- Variants: 7B or 13B parameters
 - Context length: 2048 tokens
 - Base: LLaMA architecture
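
For concreteness, here is a minimal PyTorch sketch of a 2-layer GELU MLP that projects 1024-dimensional vision features into a 4096-dimensional LLM embedding space. The layer shapes follow the description above, but this is an illustrative stand-in, not the released implementation.

```python
# Sketch of a 2-layer MLP projector with GELU (illustrative, not the released code).
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Maps per-token vision features into the LLM embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_visual_tokens, vision_dim)
        return self.mlp(vision_tokens)

projector = VisionProjector()
tokens = torch.randn(1, 577, 1024)   # e.g. 577 visual tokens at 336px
print(projector(tokens).shape)       # torch.Size([1, 577, 4096])
```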
 
Multi-Turn Processing Pipeline
Image + Question
    ↓
[Turn 1] ROI Detection
    → Outputs: Bounding box coordinates [x1, y1, x2, y2]
    → Purpose: Identify key regions for reasoning
    ↓
[Turn 2] Question Answering
    → Input: Image + Question + Detected bbox
    → Output: Final answer grounded in visual evidence
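
A schematic sketch of this two-turn loop is shown below. The prompt wording, the bbox regex, and the `generate` callable are placeholders; the released inference code defines its own prompt templates and coordinate conventions.

```python
# Schematic two-turn loop: detect an ROI, then answer using it.
# The prompts and the bbox-parsing regex are illustrative assumptions.
import re
from typing import Callable

def visual_cot_answer(generate: Callable[[str, str], str],
                      image_path: str, question: str) -> tuple[list[float], str]:
    # Turn 1: ask the model for the region of interest as [x1, y1, x2, y2].
    roi_prompt = f"{question} Please provide the bounding box of the region that can help answer the question."
    roi_text = generate(roi_prompt, image_path)
    match = re.search(r"\[?\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]?", roi_text)
    bbox = [float(v) for v in match.groups()] if match else []

    # Turn 2: answer the question, conditioning on the detected bbox.
    answer_prompt = f"{question} Focus on the region {bbox}."
    answer = generate(answer_prompt, image_path)
    return bbox, answer
```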
Training Strategy
Stage 1: Feature Alignment (Pretrain)
- Dataset: 558K LAION-CC-SBU subset with BLIP captions
 - Objective: Connect frozen CLIP encoder to frozen LLM
 - Trainable: Only the MLP projector (~8.4M params)
- Duration: 3.5 hours (7B) to 5.5 hours (13B) on 8×A100 GPUs
 - Hyperparameters:
- Batch size: 256
 - Learning rate: 1e-3
 - Epochs: 1
 - Max sequence length: 2048
 
 
Stage 2: Visual Instruction Tuning
Dataset Mix:
- 665K multimodal instruction-following (LLaVA-1.5)
 - 1.4M positional annotation data (Shikra)
 - 373K Visual-CoT data (ours)
 - Total: ~2.4M training instances
 
Training Details:
- Duration: ~60 hours (7B-224) on 8×A100 GPUs
 - Batch size: 128
 - Learning rate: 2e-5 (backbone), 2e-6 (vision encoder)
 - Epochs: 1
 - DeepSpeed ZeRO-3 for memory efficiency
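
For reference, the hyperparameters of the two stages listed above can be summarized in a small config dictionary; the key names here are illustrative and do not correspond to the actual training-script arguments.

```python
# Summary of the stage hyperparameters listed above (illustrative key names only).
TRAINING_CONFIG = {
    "stage1_feature_alignment": {
        "trainable": "mlp_projector",
        "batch_size": 256,
        "learning_rate": 1e-3,
        "epochs": 1,
        "max_seq_length": 2048,
    },
    "stage2_visual_instruction_tuning": {
        "batch_size": 128,
        "learning_rate_backbone": 2e-5,
        "learning_rate_vision_encoder": 2e-6,
        "epochs": 1,
        "memory_optimization": "DeepSpeed ZeRO-3",
    },
}
```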
 
Dataset Construction
Visual-CoT Dataset (438K examples)
13 Diverse Benchmarks:
Document Understanding (4 datasets):
- DocVQA: Document visual QA
 - InfographicsVQA: Infographic comprehension
 - DUDE: Document understanding
 - SROIE: Scanned receipt information extraction
 
Scene Understanding (3 datasets):
- GQA: Scene graph compositional reasoning
 - Visual7W: Pointing and telling tasks
 - VSR: Visual spatial reasoning
 
Text in Images (2 datasets):
- TextVQA: Reading text in natural images
 - OCR-VQA: OCR-based question answering
 
General VQA (2 datasets):
- Visual Genome: Dense annotations
 - COCO: Common objects in context
 
Specialized (2 datasets):
- CUB: Fine-grained bird classification
 - Flickr30k: Image captioning & grounding
 
Annotation Details:
- Each example includes: image, question, answer, bounding box
 - Bounding boxes highlight key regions essential for reasoning
 - 98K examples have detailed reasoning steps
 - Train/val splits maintained from original benchmarks
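
To make the annotation format concrete, a single example can be pictured as the record below; the field names and values are illustrative guesses rather than the dataset's actual column names.

```python
# Illustrative shape of one Visual-CoT example (field names are assumptions,
# not the dataset's actual column names -- check the dataset card).
example = {
    "image": "path/or/PIL.Image",
    "question": "What is the total amount on the receipt?",
    "answer": "$42.50",
    "bbox": [512, 1033, 790, 1080],   # key region as [x1, y1, x2, y2]
    "reasoning": None,                # present for the ~98K examples with detailed steps
}
```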
 
Evaluation & Results
Visual-CoT Benchmark Metrics
Answer Accuracy: GPT-3.5-based evaluation
- Compares generated answer with ground truth
 - Accounts for semantic equivalence
 - Results: 82.7% average accuracy
 
Detection Accuracy: IoU-based bounding box evaluation (see the IoU sketch below)
- IoU > 0.5 threshold for correct detection
 - Results: 75.3% detection accuracy
 - Validates spatial grounding ability
 
Reasoning Quality: Chain-of-thought coherence
- Multi-turn consistency
 - Interpretability of intermediate steps
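
Since detection accuracy uses an IoU > 0.5 criterion, here is a generic IoU computation for axis-aligned [x1, y1, x2, y2] boxes; it is a minimal sketch, not the benchmark's evaluation script.

```python
# Generic IoU check for axis-aligned [x1, y1, x2, y2] boxes (not the official eval script).
def iou(box_a: list[float], box_b: list[float]) -> float:
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])
    inter = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box counts as correct when IoU with the ground truth exceeds 0.5.
print(iou([10, 10, 50, 50], [15, 15, 50, 50]) > 0.5)  # True
```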
 
Model Comparison
| Model | Resolution | Params | Answer Acc | Detection Acc | 
|---|---|---|---|---|
| VisCoT-7B-224 | 224px | 7B | 80.1% | 72.5% | 
| VisCoT-7B-336 | 336px | 7B | 81.8% | 74.2% | 
| VisCoT-13B-224 | 224px | 13B | 81.5% | 73.8% | 
| VisCoT-13B-336 | 336px | 13B | 82.7% | 75.3% | 
Trade-offs:
- Higher resolution → Better detail recognition, slower inference
 - Larger model → Better reasoning, more memory
 - 336px + 13B = Best quality but highest compute cost
 
Resources
- Paper: arXiv:2403.16999
 - Code: GitHub
 - Dataset: Hugging Face
 - Project Page: https://hao-shao.com/projects/viscot.html
 - Models:
 
Citation
If you find our work useful, please cite:
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
License
- Code: Apache License 2.0
 - Dataset: Research use only
 - Models: Subject to base LLM license (LLaMA)
 
Acknowledgements
This work builds upon LLaVA-1.5, Shikra, CLIP, and Vicuna.
Powered by Zero GPU on Hugging Face Spaces