Visual-CoT: Chain-of-Thought Reasoning
Advancing Multi-Modal Language Models with Visual Chain-of-Thought
Paper (NeurIPS 2024 Spotlight) | GitHub | Dataset
1. Introduction to Visual-CoT
Visual Chain-of-Thought (VisCoT) is a multi-modal language model that enables:
- Region Identification: Detect key regions in images using bounding boxes
- Step-by-Step Reasoning: Apply Chain-of-Thought methodology for visual understanding
- Question Answering: Provide interpretable explanations for visual content
1.1 Dataset Statistics
- 438,000 question-answer pairs with bounding box annotations
- 13 diverse benchmarks (DocVQA, GQA, TextVQA, etc.)
- Based on LLaVA-1.5 architecture with CLIP ViT-L/14 vision encoder
Note: This Space uses Zero GPU, which requires authentication. Please log in or create a free account if you encounter quota errors.
2. Interactive Demonstration
Procedure:
- Upload an image
- Enter a question about the image
- The model will:
- Step 1: Detect region of interest (ROI) and output bounding box
- Step 2: Analyze the ROI and generate answer
3. Results
3.1 Step 1: Region Detection
3.2 Step 2: Answer Generation
3.3 Visualization
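To reproduce the visualization step offline, here is a minimal sketch using PIL that draws a predicted region of interest on the input image. It assumes the bounding box is given as absolute [x1, y1, x2, y2] pixel coordinates; rescale first if your model returns normalized values.

```python
# Minimal sketch: draw a predicted ROI box on the input image with PIL.
# Assumes bbox = [x1, y1, x2, y2] in absolute pixel coordinates.
from PIL import Image, ImageDraw

def draw_roi(image_path: str, bbox, out_path: str = "roi.png") -> Image.Image:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(bbox, outline="red", width=3)
    img.save(out_path)
    return img

# Example usage:
# draw_roi("example.jpg", [120, 45, 380, 290])
```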
Try These Example Questions
| Input Image | Question |
|---|---|
Explore Visual-CoT Benchmark Examples
Load and browse real examples from the Visual-CoT benchmark datasets. Each example includes: image, question, ground-truth bounding box, and answer.
Choose from 8 visual reasoning benchmarks
Available Benchmark Datasets
- GQA: Scene graph QA (72K balanced images)
  - Path: lmms-lab/GQA
- RefCOCO: Referring expression comprehension (8.8K validation)
  - Path: lmms-lab/RefCOCO
- RefCOCO+: RefCOCO with no location words (3.8K validation)
  - Path: lmms-lab/RefCOCOplus
- RefCOCOg: RefCOCO with longer expressions (7.5K validation)
  - Path: lmms-lab/RefCOCOg
- POPE: Object probing evaluation (9K test)
  - Path: lmms-lab/POPE
- ScienceQA: Science question answering (4.2K validation)
  - Path: lmms-lab/ScienceQA
- MM-GCoT: Multi-Modal Graph CoT (63.9K training)
  - Path: AQUA6/MM-GCoT
- VGR: Visual Grounding & Reasoning (90K training)
  - Path: BytedanceDouyinContent/VGR
Total: 8 benchmarks from Visual Chain-of-Thought Reasoning Collection
Source: Hugging Face Collection
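As a quick way to browse these benchmarks, here is a minimal sketch using the Hugging Face datasets library. The split name and field names are assumptions; check each dataset card for the exact configuration and schema.

```python
# Minimal sketch: load one benchmark from the collection with the `datasets` library.
# Split and column names vary per dataset; consult the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("lmms-lab/POPE", split="test")  # split name assumed
print(ds)              # shows the available columns and number of rows
example = ds[0]
print(example.keys())  # typically an image plus question/answer-style fields
```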
Paper Information
Title: Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Authors: Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li
Conference: NeurIPS 2024 (Spotlight)
Abstract: We introduce Visual-CoT, a comprehensive dataset and benchmark for evaluating chain-of-thought reasoning in multi-modal language models. Our dataset comprises 438K question-answer pairs with intermediate bounding box annotations highlighting key regions essential for answering questions. We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable reasoning steps.
Model Architecture
Components
Vision Encoder: CLIP ViT-L/14
- Input resolution: 224px or 336px
- Output: 577 visual tokens (336px) or 257 tokens (224px)
- Feature dimension: 1024
Multi-modal Projector: 2-layer MLP with GELU
- Maps vision features (1024D) to LLM embedding space (4096D)
- Trainable parameters: ~8.4M
Language Model: Vicuna v1.5 (instruction-tuned LLaMA)
- Variants: 7B or 13B parameters
- Context length: 2048 tokens
- Base: LLaMA architecture
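The sketch below illustrates how these components fit together: a CLIP ViT-L/14 encoder feeding a 2-layer GELU MLP that projects 1024-dimensional vision features into the 4096-dimensional embedding space of the 7B language model. This is an illustrative PyTorch re-implementation of the projector, not the official code.

```python
# Illustrative sketch (not the official implementation): CLIP ViT-L/14 features
# projected into the LLM embedding space via a 2-layer GELU MLP.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_visual_tokens, 1024) -> (batch, num_visual_tokens, 4096)
        return self.mlp(vision_features)

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
projector = VisionProjector()

pixel_values = torch.randn(1, 3, 336, 336)           # one preprocessed image
with torch.no_grad():
    feats = encoder(pixel_values).last_hidden_state   # (1, 577, 1024)
    visual_tokens = projector(feats)                  # (1, 577, 4096)
print(visual_tokens.shape)
```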
Multi-Turn Processing Pipeline
Image + Question
    ↓
[Turn 1] ROI Detection
    → Outputs: Bounding box coordinates [x1, y1, x2, y2]
    → Purpose: Identify key regions for reasoning
    ↓
[Turn 2] Question Answering
    → Input: Image + Question + Detected bbox
    → Output: Final answer grounded in visual evidence
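In code, the two turns reduce to two generation calls. The sketch below uses a placeholder generate(image, prompt) function and an assumed prompt wording and bounding-box output format; the exact templates used by the released checkpoints may differ.

```python
# Illustrative two-turn inference loop. `generate(image, prompt)` is a placeholder
# for any VisCoT checkpoint's generation call; prompt wording and bbox format are assumptions.
import re

def viscot_two_turn(generate, image, question):
    # Turn 1: ask for the region of interest as a bounding box.
    turn1 = (f"{question} Please provide the bounding box coordinate of the "
             "region that can help you answer the question better.")
    bbox_text = generate(image, turn1)                 # e.g. "[120, 45, 380, 290]"
    bbox = [float(v) for v in re.findall(r"-?\d+\.?\d*", bbox_text)[:4]]

    # Turn 2: answer the question while attending to the detected region.
    turn2 = f"{question} Focus on the region {bbox} when answering."
    answer = generate(image, turn2)
    return bbox, answer
```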
Training Strategy
Stage 1: Feature Alignment (Pretrain)
- Dataset: 558K LAION-CC-SBU subset with BLIP captions
- Objective: Connect frozen CLIP encoder to frozen LLM
- Trainable: Only the MLP projector (~8.4M params)
- Duration: 3.5 hours (7B) to 5.5 hours (13B) on 8×A100 GPUs
- Hyperparameters:
- Batch size: 256
- Learning rate: 1e-3
- Epochs: 1
- Max sequence length: 2048
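A minimal PyTorch sketch of this stage, assuming generic vision_encoder, projector, and llm modules: the encoder and LLM are frozen, and only the projector receives gradient updates at the learning rate listed above.

```python
# Stage 1 sketch: freeze the CLIP encoder and the LLM, train only the MLP projector.
import torch
import torch.nn as nn

def configure_stage1(vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
    for module in (vision_encoder, llm):
        for p in module.parameters():
            p.requires_grad = False          # frozen during feature alignment
    for p in projector.parameters():
        p.requires_grad = True               # only the projector parameters are trained
    # Hyperparameters from the list above: lr 1e-3, batch size 256, 1 epoch.
    return torch.optim.AdamW(projector.parameters(), lr=1e-3)
```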
Stage 2: Visual Instruction Tuning
Dataset Mix:
- 665K multimodal instruction-following (LLaVA-1.5)
- 1.4M positional annotation data (Shikra)
- 373K Visual-CoT data (ours)
- Total: ~2.4M training instances
Training Details:
- Duration: ~60 hours (7B-224) on 8×A100 GPUs
- Batch size: 128
- Learning rate: 2e-5 (backbone), 2e-6 (vision encoder)
- Epochs: 1
- DeepSpeed ZeRO-3 for memory efficiency
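A corresponding sketch for Stage 2, where everything is unfrozen and the vision encoder uses a 10x smaller learning rate than the backbone. This is expressed here as optimizer parameter groups; in practice training is driven through DeepSpeed ZeRO-3 configs, and the projector learning rate shown is an assumption.

```python
# Stage 2 sketch: unfreeze all components and fine-tune with per-module learning rates.
import torch
import torch.nn as nn

def configure_stage2(vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
    for module in (vision_encoder, projector, llm):
        for p in module.parameters():
            p.requires_grad = True
    param_groups = [
        {"params": llm.parameters(), "lr": 2e-5},             # backbone
        {"params": projector.parameters(), "lr": 2e-5},       # projector lr assumed to follow backbone
        {"params": vision_encoder.parameters(), "lr": 2e-6},  # vision encoder
    ]
    # Batch size 128, 1 epoch; ZeRO-3 sharding is handled by DeepSpeed in the actual setup.
    return torch.optim.AdamW(param_groups)
```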
Dataset Construction
Visual-CoT Dataset (438K examples)
13 Diverse Benchmarks:
Document Understanding (4 datasets):
- DocVQA: Document visual QA
- InfographicsVQA: Infographic comprehension
- DUDE: Document understanding
- SROIE: Scanned receipt information extraction
Scene Understanding (3 datasets):
- GQA: Scene graph compositional reasoning
- Visual7W: Pointing and telling tasks
- VSR: Visual spatial reasoning
Text in Images (2 datasets):
- TextVQA: Reading text in natural images
- OCR-VQA: OCR-based question answering
General VQA (2 datasets):
- Visual Genome: Dense annotations
- COCO: Common objects in context
Specialized (2 datasets):
- CUB: Fine-grained bird classification
- Flickr30k: Image captioning & grounding
Annotation Details:
- Each example includes: image, question, answer, bounding box
- Bounding boxes highlight key regions essential for reasoning
- 98K examples have detailed reasoning steps
- Train/val splits maintained from original benchmarks
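Conceptually, a single training example looks like the dictionary below. The field names and values are illustrative only, not the exact schema of the released annotation files.

```python
# Illustrative Visual-CoT record; field names and values are hypothetical,
# not the exact schema of the released JSON annotations.
example = {
    "image": "path/to/image.png",                    # source image from one of the 13 benchmarks
    "question": "What is the date printed on the receipt?",
    "bbox": [120, 340, 410, 395],                    # [x1, y1, x2, y2] key region for the answer
    "answer": "March 12, 1998",
    "reasoning": "The date appears in the header region of the receipt.",  # present for ~98K examples
}
```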
Evaluation & Results
Visual-CoT Benchmark Metrics
Answer Accuracy: GPT-3.5-based evaluation
- Compares generated answer with ground truth
- Accounts for semantic equivalence
- Results: 82.7% average accuracy
Detection Accuracy: IoU-based bounding box evaluation
- IoU > 0.5 threshold for correct detection
- Results: 75.3% detection accuracy
- Validates spatial grounding ability
Reasoning Quality: Chain-of-thought coherence
- Multi-turn consistency
- Interpretability of intermediate steps
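To make the detection criterion above concrete, the following sketch computes IoU between a predicted and a ground-truth box and applies the 0.5 threshold.

```python
# IoU-based detection check: a predicted box is counted as correct when its
# intersection-over-union with the ground-truth box exceeds 0.5.
def iou(box_a, box_b):
    """Boxes are [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_correct(pred_box, gt_box, threshold=0.5):
    return iou(pred_box, gt_box) > threshold
```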
Model Comparison
| Model | Resolution | Params | Answer Acc | Detection Acc |
|---|---|---|---|---|
| VisCoT-7B-224 | 224px | 7B | 80.1% | 72.5% |
| VisCoT-7B-336 | 336px | 7B | 81.8% | 74.2% |
| VisCoT-13B-224 | 224px | 13B | 81.5% | 73.8% |
| VisCoT-13B-336 | 336px | 13B | 82.7% | 75.3% |
Trade-offs:
- Higher resolution → better detail recognition, slower inference
- Larger model → better reasoning, more memory
- 336px + 13B = Best quality but highest compute cost
Resources
- Paper: arXiv:2403.16999
- Code: GitHub
- Dataset: Hugging Face
- Project Page: https://hao-shao.com/projects/viscot.html
- Models: VisCoT-7B and VisCoT-13B at 224px and 336px resolutions (see Model Comparison above)
Citation
If you find our work useful, please cite:
@article{shao2024visual,
title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
journal={arXiv preprint arXiv:2403.16999},
year={2024}
}
License
- Code: Apache License 2.0
- Dataset: Research use only
- Models: Subject to base LLM license (LLaMA)
Acknowledgements
This work is built upon LLaVA-1.5, Vicuna, and CLIP, with additional positional annotation data from Shikra.
Powered by Zero GPU on Hugging Face Spaces