🌋 Visual-CoT: Chain-of-Thought Reasoning

Advancing Multi-Modal Language Models with Visual Chain-of-Thought

📄 Paper (NeurIPS 2024 Spotlight) | 💻 GitHub | 🤗 Dataset

1. Introduction to Visual-CoT

Visual Chain-of-Thought (VisCoT) is a multi-modal language model that enables:

  1. Region Identification: Detect key regions in images using bounding boxes
  2. Step-by-Step Reasoning: Apply Chain-of-Thought methodology for visual understanding
  3. Question Answering: Provide interpretable explanations for visual content

1.1 Dataset Statistics

  • 438,000 question-answer pairs with bounding box annotations (an illustrative record sketch follows this list)
  • 13 diverse benchmarks (DocVQA, GQA, TextVQA, etc.)
  • Based on the LLaVA-1.5 architecture with a CLIP ViT-L/14 vision encoder
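
The sketch below shows what one annotated record might look like. The field names ("image", "question", "bbox", "answer") and the sample values are illustrative assumptions, not the dataset's confirmed schema; boxes are assumed to be pixel coordinates [x1, y1, x2, y2].

```python
# Hypothetical Visual-CoT record; field names and values are illustrative only.
record = {
    "image": "docvqa/example_0001.png",           # path to the source image
    "question": "What is the date on the form?",  # question about the image
    "bbox": [120, 340, 480, 410],                 # region of interest: [x1, y1, x2, y2]
    "answer": "March 3, 1998",                    # ground-truth short answer
}

x1, y1, x2, y2 = record["bbox"]
print(f"Q: {record['question']} | ROI: ({x1},{y1})-({x2},{y2}) | A: {record['answer']}")
```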

Note: This Space uses Zero GPU, which requires authentication. Please log in or create a free account if you encounter quota errors.

Model Selection

Use the Select Model dropdown to choose a model variant (larger = better quality, slower). The Current Model Status box reports the state of the selected model; a minimal Gradio sketch of this selector is shown below.
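
A minimal sketch, assuming hypothetical variant names and a placeholder status callback; this is not the Space's actual checkpoint list or loading logic.

```python
# Minimal Gradio sketch of a model-variant selector (hypothetical variant names).
import gradio as gr

VARIANTS = ["VisCoT-7B (faster)", "VisCoT-13B (higher quality, slower)"]

def on_select(variant: str) -> str:
    # Placeholder: a real Space would load the chosen checkpoint here.
    return f"Selected {variant}; loading on first request."

with gr.Blocks() as demo:
    choice = gr.Dropdown(VARIANTS, value=VARIANTS[0], label="Select Model")
    status = gr.Textbox(label="Current Model Status", interactive=False)
    choice.change(on_select, inputs=choice, outputs=status)

# demo.launch()
```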

2. Interactive Demonstration

Procedure:

  1. Upload an image
  2. Enter a question about the image
  3. The model will:
    • Step 1: Detect region of interest (ROI) and output bounding box
    • Step 2: Analyze the ROI and generate the answer (a code sketch of this two-step loop follows the list)
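
A minimal sketch of this two-step loop, assuming a hypothetical ask_model wrapper around the actual model call and a bounding-box response that can be parsed as four integers; the real pipeline may also pass the full image alongside the cropped region in the second step.

```python
# Hedged sketch of VisCoT-style two-step inference; ask_model is a hypothetical wrapper.
import re
from PIL import Image

def ask_model(image: Image.Image, prompt: str) -> str:
    """Placeholder for the real multi-modal model call."""
    raise NotImplementedError

def viscot_answer(image_path: str, question: str) -> str:
    image = Image.open(image_path).convert("RGB")

    # Step 1: ask the model to localize the region relevant to the question.
    bbox_text = ask_model(
        image,
        f"{question} Please provide the bounding box of the region "
        "that can help answer the question.",
    )
    x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", bbox_text)[:4])

    # Step 2: crop the region of interest and ask again on the zoomed-in view.
    roi = image.crop((x1, y1, x2, y2))
    return ask_model(roi, question)
```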

Load Random Benchmark Example

Use the Select Benchmark dropdown to load a random annotated example from one of the supported benchmarks.

3. Results

3.1 Step 1: Region Detection

3.2 Step 2: Answer Generation

3.3 Visualization
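
The visualization overlays the detected region on the input image. A minimal sketch using Pillow, assuming the predicted box arrives as pixel coordinates (x1, y1, x2, y2); the function name and arguments are illustrative.

```python
# Minimal sketch: draw the predicted ROI box on the image with Pillow.
from PIL import Image, ImageDraw

def draw_roi(image_path: str, bbox: tuple[int, int, int, int], out_path: str) -> None:
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(bbox, outline="red", width=3)  # highlight the detected region
    image.save(out_path)

# Example: draw_roi("page.png", (120, 340, 480, 410), "page_roi.png")
```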

📋 Try These Example Questions

Click an example below to load its input image and question; for examples that do not include an image, upload your own.

Powered by Zero GPU on Hugging Face Spaces