07 Active Research

Multimodal Reasoning

Models that reason jointly over text, images, documents, and structured data.

Vision-Language, Document AI, Structured Data, Grounding

Overview

Multimodal reasoning extends language model capability to visual and structured inputs. We focus on document understanding, visual grounding, and cross-modal reasoning for enterprise workflows that involve charts, diagrams, forms, and mixed-media documents.

Research Directions

Document understanding

Layout-aware parsing of PDFs, invoices, contracts, and forms with field extraction and structure recovery.

Visual grounding

Localizing model claims to specific regions of images or document pages.

Chart and diagram reasoning

Extracting quantitative information from charts, schematics, and technical drawings.

Cross-modal retrieval

Unified embedding spaces for joint search across text and image corpora.

Table understanding

Reasoning over complex nested tables with merged cells, implicit row headers, and footnotes.

Why document AI is still hard

PDFs are not documents; they are drawing instructions. A sentence can be encoded as dozens of text objects with no semantic grouping. Columns, headers, and footnotes exist only as geometric relationships between text boxes. Recovering readable structure from raw PDF streams requires a combination of spatial clustering, reading order detection, and semantic role classification. We train layout models on 2M+ annotated enterprise documents across legal, financial, and technical domains.
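As a rough illustration of the spatial clustering and reading order steps described above, the sketch below groups text boxes into columns by horizontal gaps and then into lines by vertical position. It is not the trained layout model: the `TextBox` type, the `min_gap` and `y_tol` values, and the function names are all assumptions made for this example, and semantic role classification is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextBox:
    """Hypothetical text fragment as it might come out of a PDF content stream."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward in this sketch)
    x1: float  # right edge
    y1: float  # bottom edge


def split_columns(boxes: List[TextBox], min_gap: float = 20.0) -> List[List[TextBox]]:
    """Start a new column wherever a horizontal gap wider than min_gap appears."""
    columns: List[List[TextBox]] = []
    right_edge = None
    for box in sorted(boxes, key=lambda b: b.x0):
        if right_edge is None or box.x0 - right_edge > min_gap:
            columns.append([])
            right_edge = box.x1
        else:
            right_edge = max(right_edge, box.x1)
        columns[-1].append(box)
    return columns


def group_lines(column: List[TextBox], y_tol: float = 3.0) -> List[str]:
    """Merge boxes whose top edges lie within y_tol into one line, left to right."""
    lines: List[List[TextBox]] = []
    for box in sorted(column, key=lambda b: (b.y0, b.x0)):
        if lines and abs(box.y0 - lines[-1][0].y0) <= y_tol:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x0)) for line in lines]


def reading_order(boxes: List[TextBox]) -> List[str]:
    """Read columns left to right, and lines top to bottom within each column."""
    return [line for column in split_columns(boxes) for line in group_lines(column)]
```

Fixed thresholds like these break down on skewed scans, full-width headers above multi-column text, and footnotes, which is part of why learned layout models are used instead of pure geometry.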

Visual grounding for auditability

When a multimodal model answers a question about a document, enterprise users need to verify the answer against the source. We implement pixel-level grounding: every extracted field is associated with a bounding box in the source document. If the grounding confidence falls below a calibrated threshold, the answer is rejected and replaced with a human review flag rather than returned as a potentially hallucinated response.
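The rejection rule can be sketched as a small gate over extracted fields. This is a minimal sketch under stated assumptions: the `GroundedField` and `FieldResult` types, the `gate_field` function, and the 0.85 default threshold are hypothetical, and the source does not say how the threshold is calibrated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass(frozen=True)
class GroundedField:
    """Hypothetical extraction result with its source region and confidence."""
    name: str
    value: str
    page: int
    bbox: BBox
    confidence: float  # grounding confidence in [0, 1]


@dataclass(frozen=True)
class FieldResult:
    """What the caller sees: either a grounded value or a review flag."""
    name: str
    value: Optional[str]
    needs_review: bool
    page: int
    bbox: BBox


def gate_field(field: GroundedField, threshold: float = 0.85) -> FieldResult:
    """Return the value only if its grounding is confident enough.

    The 0.85 default is a placeholder; a real threshold would be calibrated
    on held-out data.
    """
    if field.confidence < threshold:
        # Withhold the value, but keep the region so a reviewer knows where to look.
        return FieldResult(field.name, None, True, field.page, field.bbox)
    return FieldResult(field.name, field.value, False, field.page, field.bbox)
```

Keeping the bounding box on rejected fields is a design choice of this sketch: the reviewer still sees which region the model was looking at, even though the value itself is withheld.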

Chart and diagram reasoning

Charts encode quantitative information in visual form that standard OCR cannot recover. We train a two-stage pipeline: a detection model localizes chart elements (axes, legends, bars, lines), and a reasoning model interprets the detected structure to answer quantitative questions. On ChartQA, our pipeline achieves 84.2% accuracy, improving over baseline VLM prompting by 18 percentage points.
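To make the hand-off between the two stages concrete, the sketch below takes hypothetical detection output (axis tick positions and bar tops, in pixels) and converts bar heights into data values by linear interpolation. It assumes a linear y-axis and stands in for only the simplest part of the interpretation step; the `AxisTick` and `Bar` types and both functions are illustrative names, not the pipeline's actual interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AxisTick:
    """Hypothetical detected y-axis tick: pixel row plus its parsed label value."""
    pixel_y: float
    value: float


@dataclass(frozen=True)
class Bar:
    """Hypothetical detected bar: category label plus the pixel row of its top edge."""
    label: str
    top_pixel_y: float


def pixel_to_value(pixel_y: float, low: AxisTick, high: AxisTick) -> float:
    """Map a pixel row to a data value, assuming a linear axis between two ticks."""
    if low.pixel_y == high.pixel_y:
        raise ValueError("ticks must sit on different pixel rows")
    fraction = (pixel_y - low.pixel_y) / (high.pixel_y - low.pixel_y)
    return low.value + fraction * (high.value - low.value)


def read_bars(bars: List[Bar], ticks: List[AxisTick]) -> Dict[str, float]:
    """Estimate each bar's value from the two most widely separated axis ticks."""
    if len(ticks) < 2:
        raise ValueError("need at least two axis ticks to calibrate the axis")
    low = min(ticks, key=lambda t: t.value)
    high = max(ticks, key=lambda t: t.value)
    return {bar.label: pixel_to_value(bar.top_pixel_y, low, high) for bar in bars}
```

Logarithmic axes, stacked bars, and line charts would each need a different mapping, which is one reason a reasoning model rather than a fixed formula handles the interpretation stage.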
