Models that reason jointly over text, images, documents, and structured data.
Overview
Multimodal reasoning extends language model capability to visual and structured inputs. We focus on document understanding, visual grounding, and cross-modal reasoning for enterprise workflows that involve charts, diagrams, forms, and mixed-media documents.
Research Directions
Document understanding
Layout-aware parsing of PDFs, invoices, contracts, and forms with field extraction and structure recovery.
Visual grounding
Localizing model claims to specific regions of images or document pages.
Chart and diagram reasoning
Extracting quantitative information from charts, schematics, and technical drawings.
Cross-modal retrieval
Unified embedding spaces for joint search across text and image corpora.
Table understanding
Reasoning over complex nested tables with merged cells, implicit row headers, and footnotes.
PDFs are not documents; they are drawing instructions. A sentence can be encoded as dozens of text objects with no semantic grouping. Columns, headers, and footnotes exist only as geometric relationships between text boxes. Recovering readable structure from raw PDF streams requires a combination of spatial clustering, reading-order detection, and semantic role classification. We train layout models on more than 2 million annotated enterprise documents across legal, financial, and technical domains.
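To make the spatial clustering and reading-order steps concrete, here is a minimal sketch, not the trained layout model described above. It assumes text boxes have already been extracted with page coordinates (y growing downward) and that columns are separated by a horizontal gap wider than a fixed value. The names TextBox, cluster_columns, reading_order, and the gap parameter are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextBox:
    """A fragment of text with its horizontal extent and top edge (hypothetical type)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge; y grows downward on the page
    x1: float  # right edge


def cluster_columns(boxes: List[TextBox], gap: float = 20.0) -> List[List[TextBox]]:
    """Group boxes into columns by scanning left to right.

    A new column starts when a box's left edge lies more than `gap`
    beyond the right edge of the column built so far.
    """
    columns: List[List[TextBox]] = []
    col_right = None
    for box in sorted(boxes, key=lambda b: b.x0):
        if col_right is None or box.x0 - col_right > gap:
            columns.append([box])
            col_right = box.x1
        else:
            columns[-1].append(box)
            col_right = max(col_right, box.x1)
    return columns


def reading_order(boxes: List[TextBox], gap: float = 20.0) -> List[str]:
    """Read columns left to right, and boxes top to bottom within each column."""
    ordered: List[TextBox] = []
    for column in cluster_columns(boxes, gap):
        ordered.extend(sorted(column, key=lambda b: (b.y0, b.x0)))
    return [b.text for b in ordered]


# Two-column page whose boxes arrive in raw drawing order, not reading order.
page = [
    TextBox("left top", 50, 100, 250),
    TextBox("right top", 300, 100, 500),
    TextBox("left bottom", 50, 130, 250),
    TextBox("right bottom", 300, 130, 500),
]
print(reading_order(page))
# ['left top', 'left bottom', 'right top', 'right bottom']
```

A fixed gap is the simplest possible column heuristic. It fails on full-width headers and on figures that span columns, which is one reason a learned layout model is preferable for real documents.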
When a multimodal model answers a question about a document, enterprise users need to verify the answer against the source. We implement pixel-level grounding: every extracted field is associated with a bounding box in the source document. An answer whose grounding confidence falls below a calibrated threshold is rejected, and the output is replaced with a human review flag rather than a potentially hallucinated response.
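The rejection rule can be illustrated with a short sketch. It assumes each extracted field carries a bounding box and a grounding confidence already calibrated to the range 0 to 1. The types GroundedField and FieldResult, the function gate_field, and the 0.8 threshold are hypothetical placeholders, not the production values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass(frozen=True)
class GroundedField:
    """An extracted field plus the page region that supports it (hypothetical type)."""
    name: str
    value: str
    page: int
    bbox: BBox
    grounding_confidence: float  # assumed calibrated to [0, 1]


@dataclass(frozen=True)
class FieldResult:
    """What the caller sees: either a grounded value or a review flag."""
    name: str
    value: Optional[str]
    page: int
    bbox: BBox
    needs_human_review: bool


def gate_field(field: GroundedField, threshold: float = 0.8) -> FieldResult:
    """Withhold the value and flag for review when grounding is too weak.

    The bounding box is kept in both cases so a reviewer can see which
    region the model pointed at.
    """
    if field.grounding_confidence < threshold:
        return FieldResult(field.name, None, field.page, field.bbox, True)
    return FieldResult(field.name, field.value, field.page, field.bbox, False)


weak = GroundedField("invoice_total", "$1,250.00", 1, (410.0, 690.0, 520.0, 712.0), 0.55)
strong = GroundedField("invoice_date", "2024-03-01", 1, (60.0, 120.0, 170.0, 140.0), 0.93)

print(gate_field(weak))    # value withheld, needs_human_review=True
print(gate_field(strong))  # value returned, needs_human_review=False
```

Returning None for the value, instead of a low-confidence guess, keeps downstream systems from silently consuming an ungrounded answer.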
Charts encode quantitative information in visual form that standard OCR cannot recover. We train a two-stage pipeline: a detection model localizes chart elements (axes, legends, bars, lines), and a reasoning model interprets the detected structure to answer quantitative questions. On ChartQA, our pipeline achieves 84.2% accuracy, improving over baseline VLM prompting by 18 percentage points.
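As a simplified illustration of the second stage, the sketch below turns detected chart geometry into data values. It assumes the detection stage has already produced two labeled y-axis ticks and the top edge of each bar in pixel coordinates, and that the axis is linear. The types AxisTick and Bar and the functions pixel_to_value and read_bars are hypothetical; the actual reasoning model is learned, not a fixed formula.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AxisTick:
    """A detected y-axis tick: its pixel row and the numeric label next to it."""
    pixel_y: float
    value: float


@dataclass(frozen=True)
class Bar:
    """A detected bar: its category label and the pixel row of its top edge."""
    label: str
    top_pixel_y: float


def pixel_to_value(pixel_y: float, tick_a: AxisTick, tick_b: AxisTick) -> float:
    """Map a pixel row to a data value by linear interpolation between two ticks."""
    if tick_a.pixel_y == tick_b.pixel_y:
        raise ValueError("ticks must lie on different pixel rows")
    scale = (tick_b.value - tick_a.value) / (tick_b.pixel_y - tick_a.pixel_y)
    return tick_a.value + (pixel_y - tick_a.pixel_y) * scale


def read_bars(bars: List[Bar], tick_a: AxisTick, tick_b: AxisTick) -> Dict[str, float]:
    """Recover the data value encoded by each bar's height."""
    return {bar.label: pixel_to_value(bar.top_pixel_y, tick_a, tick_b) for bar in bars}


# Image rows grow downward, so the tick labeled 0 sits lower (larger pixel_y).
zero = AxisTick(pixel_y=400.0, value=0.0)
hundred = AxisTick(pixel_y=100.0, value=100.0)
bars = [Bar("Q1", 250.0), Bar("Q2", 100.0)]

values = read_bars(bars, zero, hundred)
print(values)
print(max(values, key=values.get))  # label of the tallest bar
```

Once values are recovered in data units, questions such as "which quarter was highest" reduce to ordinary comparisons, which is the kind of information plain OCR of the axis labels alone cannot provide.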