Multimodal AI: Vision + Language Pipelines
Build multimodal AI systems combining vision and language models. Covers architectures, document understanding, visual QA, model selection, pipeline design, and production deployment.
Multimodal AI combines vision, language, and other modalities into systems that understand the world more like humans do. Instead of separate pipelines for text analysis and image processing, multimodal models process documents with embedded charts, interpret screenshots alongside user descriptions, and analyze video content with natural language queries. The enterprise applications are immediate: automated document processing, visual inspection, content moderation, accessibility, and intelligent search across media types.
This guide covers the practical engineering of multimodal pipelines: choosing models, designing architectures, handling document understanding, building visual Q&A systems, and deploying at production scale.
Multimodal Architecture Patterns
Pattern 1: Native Multimodal (Single Model)
Models like GPT-4o, Gemini, and Claude 3.5 natively accept multiple input types:
```python
from openai import OpenAI
import base64

client = OpenAI()

def analyze_image_with_text(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_data}",
                    "detail": "high",  # high/low/auto
                }},
            ],
        }],
        max_tokens=1000,
    )
    return response.choices[0].message.content

# Example: analyze a dashboard screenshot
result = analyze_image_with_text(
    "dashboard_q3.png",
    "What are the key trends in this quarterly dashboard? "
    "Flag any metrics that are below target."
)
```
Pattern 2: Pipeline Architecture (Specialized Models)
Chain specialized models for complex processing:
```
Input Document (PDF with tables, charts, text)
          ↓
┌─────────────────────────┐
│ Document Parser         │ → Extract text, tables, images
│ (PyMuPDF/Unstructured)  │
└─────────────────────────┘
          ↓
┌────────────────────┐  ┌────────────────────┐  ┌────────────────────┐
│ Text Chunks        │  │ Table Extraction   │  │ Chart Analysis     │
│ → Embedding + RAG  │  │ → Structured JSON  │  │ → Data + Insights  │
└────────────────────┘  └────────────────────┘  └────────────────────┘
          ↓                       ↓                       ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Fusion Layer: Combine all modality outputs + user query → LLM       │
└─────────────────────────────────────────────────────────────────────┘
          ↓
    Final Response
```
```python
from unstructured.partition.auto import partition

def process_multimodal_document(file_path: str):
    """Extract and categorize elements from any document."""
    elements = partition(filename=file_path)
    result = {
        "text_chunks": [],
        "tables": [],
        "images": [],
        "metadata": {"source": file_path, "element_count": len(elements)},
    }
    for element in elements:
        if element.category == "Table":
            result["tables"].append({
                "html": element.metadata.text_as_html,
                "text": str(element),
                "page": element.metadata.page_number,
            })
        elif element.category == "Image":
            result["images"].append({
                "path": element.metadata.image_path,
                "page": element.metadata.page_number,
                "caption": extract_caption(element),  # application-defined helper
            })
        else:
            result["text_chunks"].append({
                "text": str(element),
                "category": element.category,
                "page": element.metadata.page_number,
            })
    return result
```
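The diagram's fusion layer can be sketched as plain prompt assembly: flatten each modality's output into labeled sections and append the user's query. This is a minimal sketch, assuming image captions and chart insights were filled in upstream; `build_fusion_prompt` is our own name, not part of any library:

```python
def build_fusion_prompt(parsed: dict, question: str) -> str:
    """Fusion layer sketch: flatten per-modality outputs from
    process_multimodal_document into a single LLM prompt."""
    sections = []
    if parsed["text_chunks"]:
        text = "\n".join(c["text"] for c in parsed["text_chunks"])
        sections.append(f"## Text\n{text}")
    if parsed["tables"]:
        tables = "\n\n".join(t["text"] for t in parsed["tables"])
        sections.append(f"## Tables\n{tables}")
    if parsed["images"]:
        captions = "\n".join(
            f"- {img.get('caption') or '(no caption)'}" for img in parsed["images"]
        )
        sections.append(f"## Figures\n{captions}")
    context = "\n\n".join(sections)
    return (
        "Answer using only the document content below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The resulting string goes to any text-only LLM; keeping modalities in labeled sections lets the model cite which modality an answer came from.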
Document Understanding
Intelligent Document Processing (IDP)
```python
import json

class DocumentProcessor:
    def __init__(self, vision_model="gpt-4o"):
        self.vision_model = vision_model

    def process_invoice(self, image_path: str) -> dict:
        """Extract structured data from invoice images."""
        prompt = """Extract all information from this invoice image.
Return valid JSON with:
{
  "vendor": {"name": "", "address": "", "tax_id": ""},
  "invoice_number": "",
  "date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD",
  "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
  "subtotal": 0,
  "tax_rate": 0,
  "tax_amount": 0,
  "total": 0,
  "currency": "USD",
  "payment_terms": ""
}"""
        result = analyze_image_with_text(image_path, prompt)
        parsed = json.loads(result)
        # Validate extracted data before it reaches downstream systems
        return self.validate_invoice(parsed)

    def classify_document(self, image_path: str) -> dict:
        """Classify document type from image."""
        prompt = """Classify this document. Return JSON:
{
  "type": "invoice|receipt|contract|report|form|letter|other",
  "confidence": 0.0-1.0,
  "language": "ISO 639-1 code",
  "has_signature": true/false,
  "has_tables": true/false,
  "page_count_estimate": 1
}"""
        return json.loads(analyze_image_with_text(image_path, prompt))
```
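The `validate_invoice` step referenced above is left undefined. One minimal sketch, assuming the JSON schema from the extraction prompt: cross-check the arithmetic the model claims, since vision models occasionally misread digits that still "add up" in their answer:

```python
def validate_invoice(invoice: dict, tolerance: float = 0.01) -> dict:
    """Sketch of invoice validation against business rules.
    Field names follow the extraction prompt's JSON schema."""
    errors = []
    # Line items should sum to the stated subtotal
    line_total = sum(item.get("total", 0) for item in invoice.get("line_items", []))
    if abs(line_total - invoice.get("subtotal", 0)) > tolerance:
        errors.append(
            f"line items sum to {line_total}, subtotal says {invoice.get('subtotal')}"
        )
    # Subtotal plus tax should equal the grand total
    expected_total = invoice.get("subtotal", 0) + invoice.get("tax_amount", 0)
    if abs(expected_total - invoice.get("total", 0)) > tolerance:
        errors.append(
            f"subtotal + tax = {expected_total}, total says {invoice.get('total')}"
        )
    if not invoice.get("invoice_number"):
        errors.append("missing invoice_number")
    return {"valid": not errors, "errors": errors, "invoice": invoice}
```

Invoices that fail these checks are exactly the ones worth routing to human review rather than retrying blindly.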
Visual Q&A Over Documents
```python
def document_visual_qa(document_pages: list[str], question: str):
    """Answer questions about multi-page documents with visual content."""
    # Step 1: Identify relevant pages
    page_relevance = []
    for i, page_image in enumerate(document_pages):
        relevance = analyze_image_with_text(
            page_image,
            f"Rate 0-10 how relevant this page is to: '{question}'. Return just the number."
        )
        page_relevance.append((i, int(relevance.strip())))

    # Step 2: Process top-3 most relevant pages
    top_pages = sorted(page_relevance, key=lambda x: x[1], reverse=True)[:3]

    # Step 3: Combine relevant pages for final answer
    page_contents = []
    for page_idx, score in top_pages:
        content = analyze_image_with_text(
            document_pages[page_idx],
            "Extract all text, data, and visual information from this page."
        )
        page_contents.append(f"[Page {page_idx + 1}]\n{content}")

    # Step 4: Answer the question using extracted content
    final_prompt = f"""Based on the following document content, answer: {question}

Document Content:
{chr(10).join(page_contents)}

Answer:"""
    response = llm.generate(final_prompt)  # llm = any text-only client
    return {
        "answer": response,
        "source_pages": [p[0] + 1 for p in top_pages],
    }
```
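Step 1 above calls `int(relevance.strip())`, which crashes the whole Q&A loop if the model replies "Relevance: 8" or "8/10" instead of a bare number. A defensive sketch of the parse (the function name is ours):

```python
import re

def parse_relevance_score(raw: str, default: int = 0) -> int:
    """Pull the first integer in 0-10 out of a model reply; fall back to
    `default` when the reply contains no usable number."""
    match = re.search(r"\b(10|[0-9])\b", raw)
    if match is None:
        return default
    return int(match.group(1))
```

Defaulting to 0 means an unparseable reply simply demotes that page rather than aborting the request; substitute `parse_relevance_score(relevance)` for the bare `int(...)` call.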
Model Selection
| Model | Vision | Language | Multi-Image | Best For | Cost |
|---|---|---|---|---|---|
| GPT-4o | Excellent | Excellent | Yes (up to 20) | General purpose, accuracy | $$ |
| Gemini 2.0 Pro | Excellent | Excellent | Yes (up to 3600) | Long documents, video | $$ |
| Gemini 2.0 Flash | Good | Good | Yes | Cost-sensitive, high volume | $ |
| Claude 3.5 Sonnet | Very Good | Excellent | Yes (up to 20) | Detailed analysis, reasoning | $$ |
| LLaVA (open source) | Good | Good | Limited | Self-hosted, privacy | Free |
| Florence 2 | Good | Limited | No | Object detection, captioning | Free |
| PaLI-X | Good | Good | No | Multilingual document understanding | Research |
Cost Comparison for Document Processing
| Task | GPT-4o | Gemini Flash | Self-hosted LLaVA |
|---|---|---|---|
| Single page invoice extraction | $0.015 | $0.002 | $0.001 (GPU amortized) |
| 10-page document Q&A | $0.15 | $0.02 | $0.01 |
| 1000 invoices/day | $450/mo | $60/mo | $200/mo (GPU lease) |
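The volume row follows directly from the per-page prices: 1,000 invoices/day at $0.015/page over a 30-day month is $450. A trivial helper makes such projections explicit (the function is a sketch of the arithmetic, assuming one vision call per document and a 30-day month):

```python
def monthly_api_cost(docs_per_day: float, cost_per_doc: float, days: int = 30) -> float:
    """Project monthly API spend from daily volume and per-document cost."""
    return round(docs_per_day * cost_per_doc * days, 2)
```

Multi-page documents multiply `cost_per_doc` by pages actually sent to the model, which is why relevance filtering (below) matters at volume.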
Production Pipeline
```python
class MultimodalPipeline:
    def __init__(self):
        self.preprocessor = DocumentPreprocessor()
        self.classifier = DocumentClassifier()
        self.extractors = {
            "invoice": InvoiceExtractor(),
            "receipt": ReceiptExtractor(),
            "contract": ContractExtractor(),
        }
        self.quality_checker = QualityChecker()

    async def process(self, file_path: str) -> dict:
        # Step 1: Preprocess
        pages = self.preprocessor.to_images(file_path, dpi=300)

        # Step 2: Classify
        doc_type = self.classifier.classify(pages[0])

        # Step 3: Extract with specialized extractor
        extractor = self.extractors.get(doc_type["type"])
        if not extractor:
            return {"status": "unsupported", "type": doc_type["type"]}
        extracted = await extractor.extract(pages)

        # Step 4: Quality check
        quality = self.quality_checker.validate(extracted, doc_type["type"])
        if quality["confidence"] < 0.8:
            return {
                "status": "needs_review",
                "extracted": extracted,
                "quality": quality,
            }
        return {
            "status": "success",
            "extracted": extracted,
            "quality": quality,
        }
```
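The `QualityChecker` used in step 4 is referenced but not shown. One simple sketch, under our own assumptions: score confidence by required-field coverage per document type (field lists and the class body here are illustrative, not a fixed API):

```python
# Assumed required fields per document type -- tune to your business rules.
REQUIRED_FIELDS = {
    "invoice": ["vendor", "invoice_number", "date", "total", "currency"],
    "receipt": ["vendor", "date", "total"],
    "contract": ["parties", "effective_date", "term"],
}

class QualityChecker:
    """Sketch: score extraction quality by required-field coverage."""

    def validate(self, extracted: dict, doc_type: str) -> dict:
        required = REQUIRED_FIELDS.get(doc_type, [])
        missing = [f for f in required if not extracted.get(f)]
        confidence = 1.0 if not required else (len(required) - len(missing)) / len(required)
        return {"confidence": round(confidence, 2), "missing_fields": missing}
```

Coverage scoring catches only missing fields, not wrong values; pairing it with arithmetic checks like the invoice validation above gives a stronger gate for the 0.8 review threshold.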
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Sending full-resolution images | High latency and cost for negligible quality gain | Resize to 1024px or use “low” detail mode for classification |
| No OCR fallback | Vision models miss small text or handwriting | Combine vision model with dedicated OCR (Tesseract, AWS Textract) |
| Single-model dependency | One model fails on specific document types | Ensemble approach: primary + fallback model |
| No quality validation | Extracted data has errors propagated to downstream systems | Validate all extracted fields against business rules |
| Processing full documents | Every page sent to vision model wastes tokens | Classify page relevance first, process only relevant pages |
| Ignoring image preprocessing | Skewed, low-contrast, or rotated images reduce accuracy | Deskew, enhance contrast, and normalize rotation before processing |
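The first anti-pattern recommends resizing to 1024px before upload. The aspect-ratio arithmetic can be sketched as a pure function; the actual resampling would then use an image library such as Pillow's `Image.resize`, not shown here:

```python
def downscale_dims(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) with the longer side capped at max_side,
    preserving aspect ratio; small-enough images pass through unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Downscaling a 4K screenshot this way cuts upload size and vision-token cost sharply while keeping chart labels legible at typical DPIs.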
Multimodal AI Checklist
- Use case defined: document processing, visual QA, content moderation, or inspection
- Model selected based on accuracy, cost, and privacy requirements
- Document preprocessing pipeline (PDF → images, resolution, enhancement)
- Extraction prompts tested on representative sample (50+ documents)
- Quality validation layer with confidence scoring
- Human review workflow for low-confidence extractions
- Cost projections at production volume
- OCR fallback for text-heavy documents
- Batch processing pipeline for high-volume workloads
- Monitoring: accuracy, latency, cost per document, review rate
- Error handling for corrupted/unsupported file types
- Data retention policy for processed documents and extracted data
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For multimodal AI consulting, visit garnetgrid.com.
:::