Visual-CoT: Chain-of-Thought Reasoning
Advancing Multi-Modal Language Models with Visual Chain-of-Thought
Paper (NeurIPS 2024 Spotlight) | GitHub | Dataset
1. Introduction to Visual-CoT
Visual Chain-of-Thought (VisCoT) is a multi-modal language model that enables:
- Region Identification: Detect key regions in images using bounding boxes
 - Step-by-Step Reasoning: Apply Chain-of-Thought methodology for visual understanding
 - Question Answering: Provide interpretable explanations for visual content
 
1.1 Dataset Statistics
- 438,000 question-answer pairs with bounding box annotations
 - 13 diverse benchmarks (DocVQA, GQA, TextVQA, etc.)
 - Based on LLaVA-1.5 architecture with CLIP ViT-L/14 vision encoder
 
Note: This Space uses Zero GPU which requires authentication. Please login or create a free account if you encounter quota errors.
Model Selection
Choose model variant (larger = better quality, slower)
Current Model Status
2. Interactive Demonstration
Procedure:
- Upload an image
 - Enter a question about the image
 - The model will:
- Step 1: Detect region of interest (ROI) and output bounding box
 - Step 2: Analyze the ROI and generate answer
 
 
Load Random Benchmark Example:
3. Results
3.1 Step 1: Region Detection
3.2 Step 2: Answer Generation
3.3 Visualization
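The visualization step simply overlays the Step 1 bounding box on the input image. Below is a minimal Pillow sketch; the file names and the absolute-pixel [x1, y1, x2, y2] coordinate format are illustrative assumptions, not the demo's internal code.

```python
# Minimal sketch: overlay a predicted ROI box on the input image with Pillow.
# The [x1, y1, x2, y2] pixel format and file names are illustrative assumptions.
from PIL import Image, ImageDraw

def draw_roi(image_path: str, bbox: list[int], out_path: str = "roi_overlay.png") -> None:
    """Draw a single bounding box on the image and save the result."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    x1, y1, x2, y2 = bbox
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    image.save(out_path)

# Example: visualize the ROI predicted in Step 1.
# draw_roi("example.jpg", [120, 45, 360, 210])
```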
Try These Example Questions
| Input Image | Question | 
|---|---|
Explore Visual-CoT Benchmark Examples
Load and browse real examples from the Visual-CoT benchmark datasets. Each example includes: image, question, ground-truth bounding box, and answer.
Choose from 8 visual reasoning benchmarks
Image
Annotations
Available Benchmark Datasets
- GQA: Scene graph QA (72K balanced images)
  - Path: lmms-lab/GQA
- RefCOCO: Referring expression comprehension (8.8K validation)
  - Path: lmms-lab/RefCOCO
- RefCOCO+: RefCOCO with no location words (3.8K validation)
  - Path: lmms-lab/RefCOCOplus
- RefCOCOg: RefCOCO with longer expressions (7.5K validation)
  - Path: lmms-lab/RefCOCOg
- POPE: Object probing evaluation (9K test)
  - Path: lmms-lab/POPE
- ScienceQA: Science question answering (4.2K validation)
  - Path: lmms-lab/ScienceQA
- MM-GCoT: Multi-Modal Graph CoT (63.9K training)
  - Path: AQUA6/MM-GCoT
- VGR: Visual Grounding & Reasoning (90K training)
  - Path: BytedanceDouyinContent/VGR
Total: 8 benchmarks from Visual Chain-of-Thought Reasoning Collection
Source: Hugging Face Collection
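As a rough illustration of how these benchmark examples can be pulled from the Hub, the sketch below uses the `datasets` library with one of the repository paths listed above; the split name and column names are assumptions, so check each dataset card before relying on them.

```python
# Sketch: browse benchmark examples from the Hugging Face Hub.
# NOTE: the split name ("test") and field names are assumptions;
# check each dataset card for its actual configs, splits, and columns.
from datasets import load_dataset

dataset = load_dataset("lmms-lab/POPE", split="test", streaming=True)

for example in dataset.take(3):
    # Inspect the available columns before relying on specific field names.
    print(sorted(example.keys()))
```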
Paper Information
Title: Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Authors: Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li
Conference: NeurIPS 2024 (Spotlight)
Abstract: We introduce Visual-CoT, a comprehensive dataset and benchmark for evaluating chain-of-thought reasoning in multi-modal language models. Our dataset comprises 438K question-answer pairs with intermediate bounding box annotations highlighting key regions essential for answering questions. We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable reasoning steps.
Model Architecture
Components
Vision Encoder: CLIP ViT-L/14
- Input resolution: 224px or 336px
 - Output: 577 visual tokens (336px) or 196 tokens (224px)
 - Feature dimension: 1024
 
Multi-modal Projector: 2-layer MLP with GELU (see the sketch after this list)
- Maps vision features (1024D) to LLM embedding space (4096D)
 - Trainable parameters: ~8.4M
 
Language Model: Vicuna v1.5 (instruction-tuned LLaMA)
- Variants: 7B or 13B parameters
 - Context length: 2048 tokens
 - Base: LLaMA architecture
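
For concreteness, here is a minimal PyTorch sketch of a 2-layer GELU MLP that projects 1024-dimensional vision features into a 4096-dimensional LLM embedding space. The layer shapes follow the description above, but this is an illustrative stand-in, not the released implementation.

```python
# Sketch of a 2-layer MLP projector with GELU (illustrative, not the released code).
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Maps per-token vision features into the LLM embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_visual_tokens, vision_dim)
        return self.mlp(vision_tokens)

projector = VisionProjector()
tokens = torch.randn(1, 577, 1024)   # e.g. 577 visual tokens at 336px
print(projector(tokens).shape)       # torch.Size([1, 577, 4096])
```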
 
Multi-Turn Processing Pipeline
Image + Question
    ↓
[Turn 1] ROI Detection
    → Outputs: Bounding box coordinates [x1, y1, x2, y2]
    → Purpose: Identify key regions for reasoning
    ↓
[Turn 2] Question Answering
    → Input: Image + Question + Detected bbox
    → Output: Final answer grounded in visual evidence
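
A schematic sketch of this two-turn loop is shown below. The prompt wording, the bbox regex, and the `generate` callable are placeholders; the released inference code defines its own prompt templates and coordinate conventions.

```python
# Schematic two-turn loop: detect an ROI, then answer using it.
# The prompts and the bbox-parsing regex are illustrative assumptions.
import re
from typing import Callable

def visual_cot_answer(generate: Callable[[str, str], str],
                      image_path: str, question: str) -> tuple[list[float], str]:
    # Turn 1: ask the model for the region of interest as [x1, y1, x2, y2].
    roi_prompt = f"{question} Please provide the bounding box of the region that can help answer the question."
    roi_text = generate(roi_prompt, image_path)
    match = re.search(r"\[?\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]?", roi_text)
    bbox = [float(v) for v in match.groups()] if match else []

    # Turn 2: answer the question, conditioning on the detected bbox.
    answer_prompt = f"{question} Focus on the region {bbox}."
    answer = generate(answer_prompt, image_path)
    return bbox, answer
```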
Training Strategy
Stage 1: Feature Alignment (Pretrain)
- Dataset: 558K LAION-CC-SBU subset with BLIP captions
 - Objective: Connect frozen CLIP encoder to frozen LLM
 - Trainable: Only the MLP projector (~8.4M params)
- Duration: 3.5 hours (7B) to 5.5 hours (13B) on 8×A100 GPUs
 - Hyperparameters:
- Batch size: 256
 - Learning rate: 1e-3
 - Epochs: 1
 - Max sequence length: 2048
 
 
Stage 2: Visual Instruction Tuning
Dataset Mix:
- 665K multimodal instruction-following (LLaVA-1.5)
 - 1.4M positional annotation data (Shikra)
 - 373K Visual-CoT data (ours)
 - Total: ~2.4M training instances
 
Training Details:
- Duration: ~60 hours (7B-224) on 8×A100 GPUs
 - Batch size: 128
 - Learning rate: 2e-5 (backbone), 2e-6 (vision encoder)
 - Epochs: 1
 - DeepSpeed ZeRO-3 for memory efficiency
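
For reference, the hyperparameters of the two stages listed above can be summarized in a small config dictionary; the key names here are illustrative and do not correspond to the actual training-script arguments.

```python
# Summary of the stage hyperparameters listed above (illustrative key names only).
TRAINING_CONFIG = {
    "stage1_feature_alignment": {
        "trainable": "mlp_projector",
        "batch_size": 256,
        "learning_rate": 1e-3,
        "epochs": 1,
        "max_seq_length": 2048,
    },
    "stage2_visual_instruction_tuning": {
        "batch_size": 128,
        "learning_rate_backbone": 2e-5,
        "learning_rate_vision_encoder": 2e-6,
        "epochs": 1,
        "memory_optimization": "DeepSpeed ZeRO-3",
    },
}
```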
 
Dataset Construction
Visual-CoT Dataset (438K examples)
13 Diverse Benchmarks:
Document Understanding (4 datasets):
- DocVQA: Document visual QA
 - InfographicsVQA: Infographic comprehension
 - DUDE: Document understanding
 - SROIE: Scanned receipt information extraction
 
Scene Understanding (3 datasets):
- GQA: Scene graph compositional reasoning
 - Visual7W: Pointing and telling tasks
 - VSR: Visual spatial reasoning
 
Text in Images (2 datasets):
- TextVQA: Reading text in natural images
 - OCR-VQA: OCR-based question answering
 
General VQA (2 datasets):
- Visual Genome: Dense annotations
 - COCO: Common objects in context
 
Specialized (2 datasets):
- CUB: Fine-grained bird classification
 - Flickr30k: Image captioning & grounding
 
Annotation Details:
- Each example includes: image, question, answer, bounding box
 - Bounding boxes highlight key regions essential for reasoning
 - 98K examples have detailed reasoning steps
 - Train/val splits maintained from original benchmarks
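
To make the annotation format concrete, a single example can be pictured as the record below; the field names and values are illustrative guesses rather than the dataset's actual column names.

```python
# Illustrative shape of one Visual-CoT example (field names are assumptions,
# not the dataset's actual column names -- check the dataset card).
example = {
    "image": "path/or/PIL.Image",
    "question": "What is the total amount on the receipt?",
    "answer": "$42.50",
    "bbox": [512, 1033, 790, 1080],   # key region as [x1, y1, x2, y2]
    "reasoning": None,                # present for the ~98K examples with detailed steps
}
```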
 
Evaluation & Results
Visual-CoT Benchmark Metrics
Answer Accuracy: GPT-3.5-based evaluation
- Compares generated answer with ground truth
 - Accounts for semantic equivalence
 - Results: 82.7% average accuracy
 
Detection Accuracy: IoU-based bounding box evaluation (see the IoU sketch below)
- IoU > 0.5 threshold for correct detection
 - Results: 75.3% detection accuracy
 - Validates spatial grounding ability
 
Reasoning Quality: Chain-of-thought coherence
- Multi-turn consistency
 - Interpretability of intermediate steps
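
Since detection accuracy uses an IoU > 0.5 criterion, here is a generic IoU computation for axis-aligned [x1, y1, x2, y2] boxes; it is a minimal sketch, not the benchmark's evaluation script.

```python
# Generic IoU check for axis-aligned [x1, y1, x2, y2] boxes (not the official eval script).
def iou(box_a: list[float], box_b: list[float]) -> float:
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])
    inter = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box counts as correct when IoU with the ground truth exceeds 0.5.
print(iou([10, 10, 50, 50], [15, 15, 50, 50]) > 0.5)  # True
```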
 
Model Comparison
| Model | Resolution | Params | Answer Acc | Detection Acc | 
|---|---|---|---|---|
| VisCoT-7B-224 | 224px | 7B | 80.1% | 72.5% | 
| VisCoT-7B-336 | 336px | 7B | 81.8% | 74.2% | 
| VisCoT-13B-224 | 224px | 13B | 81.5% | 73.8% | 
| VisCoT-13B-336 | 336px | 13B | 82.7% | 75.3% | 
Trade-offs:
- Higher resolution → Better detail recognition, slower inference
 - Larger model → Better reasoning, more memory
 - 336px + 13B = Best quality but highest compute cost
 
Resources
- Paper: arXiv:2403.16999
 - Code: GitHub
 - Dataset: Hugging Face
 - Project Page: https://hao-shao.com/projects/viscot.html
 - Models:
 
Citation
If you find our work useful, please cite:
@article{shao2024visual,
  title={Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models},
  author={Shao, Hao and Qian, Shengju and Xiao, Han and Song, Guanglu and Zong, Zhuofan and Wang, Letian and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2403.16999},
  year={2024}
}
License
- Code: Apache License 2.0
 - Dataset: Research use only
 - Models: Subject to base LLM license (LLaMA)
 
Acknowledgements
This work builds upon LLaVA-1.5, Shikra, CLIP, and Vicuna.
Powered by Zero GPU on Hugging Face Spaces