
Enterprise AI Analysis

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

This analysis summarizes key insights from the paper, highlighting the crucial distinction between schema compliance and actual value accuracy in structured output generation by Large Language Models across text, image, and audio modalities.

Executive Impact

This paper introduces SOB, a multi-source benchmark for evaluating structured output quality in LLMs. It covers text, images, and audio, and uses a text-normalized representation to isolate structured-output capability from raw vision/speech processing. SOB comprises 5,000 text, 209 image, and 115 audio records, each with a question, a JSON schema, and a ground-truth answer. Evaluation of 21 models shows near-perfect schema compliance but markedly lower value accuracy (best: 83.0% on text, 67.2% on images, 23.7% on audio), with further degradation at longer contexts. The benchmark, evaluation pipeline, and code are released.

83.0% Text Value Accuracy (Best)
67.2% Image Value Accuracy (Best)
23.7% Audio Value Accuracy (Best)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Introduction
Related Work
Methodology
Evaluation
Results and Discussion
Limitations and Future Work
Conclusion

Most benchmarks for LLMs focus on reasoning or code generation; few address structured output beyond schema compliance or beyond a single source modality. In practice, LLMs are widely used to extract structured records from varied content (invoices, medical records, etc.). This task, termed 'structured output,' requires returning JSON that conforms to a target schema and contains values faithfully grounded in the context. Current benchmarks fall short by not simultaneously evaluating multi-source extraction, schema-specified JSON generation, and per-field exact value grounding at scale. SOB aims to fill this gap.

Existing benchmarks for structured output either focus on schema compliance (JSONSchemaBench, StructEval, DeepJSONEval) or value correctness within a single domain (ExtractBench, LLMStructBench). Constrained decoding methods improve syntactic validity but can degrade semantic accuracy. General multi-modal benchmarks (MMBench, MMMU) address broad reasoning, while domain-specific ones (DocVQA, ChartQA, AudioBench) focus on QA but don't require schema-compliant JSON output with grounded per-field value accuracy. SOB unifies these dimensions, evaluating multi-source extraction (text, images, audio) with value-level accuracy and cross-source comparison.

SOB processes three open-source datasets: HotpotQA (text), olmOCR-bench (images), and AMI Meeting Corpus (audio). For each record, a human-authored JSON schema and ground-truth structured output are verified by an LLM reviewer. All models receive text-normalized context, ensuring structured-output capability is isolated from raw vision/speech processing quality. Schemas are categorized as medium (depth 2) or hard (depth ≥3), with most being hard. The evaluation pipeline checks for parse validity, schema compliance, and then compares path-flattened leaf nodes against ground truth using seven metrics. A key example illustrates multi-hop reasoning and nested structure extraction.
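The medium/hard split is driven by schema nesting depth (medium = depth 2, hard = depth ≥3). A minimal sketch of such a classifier is below; the exact depth-counting convention (each object or array level adds one, leaves count as one) is an assumption for illustration, not the paper's definition:

```python
def schema_depth(schema: dict) -> int:
    """Nesting depth of a JSON schema: each object/array level adds one."""
    if not isinstance(schema, dict):
        return 0
    if schema.get("type") == "object":
        child_depths = [schema_depth(sub) for sub in schema.get("properties", {}).values()]
        return 1 + max(child_depths, default=0)
    if schema.get("type") == "array":
        return 1 + schema_depth(schema.get("items", {}))
    return 1  # leaf field (string, number, boolean, ...)

def difficulty(schema: dict) -> str:
    # Assumed threshold, mirroring the medium (2) vs. hard (>= 3) split
    return "hard" if schema_depth(schema) >= 3 else "medium"

# Hypothetical invoice schema for illustration
invoice = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "items": {"type": "array",
                  "items": {"type": "object",
                            "properties": {"sku": {"type": "string"},
                                           "qty": {"type": "integer"}}}},
    },
}
print(schema_depth(invoice), difficulty(invoice))  # → 4 hard
```

Under this convention, a flat record of scalar fields has depth 2, so any schema with one further level of nested objects or arrays already lands in the hard bucket.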

Enterprise Process Flow

Source Record (Context, Question, JSON Schema)
Candidate LLM JSON Response
Parse & Schema Validation
Path Flattening
Field Comparison vs. Ground Truth
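The flow above can be sketched in a few lines. This is a simplified stand-in, not the released pipeline: schema validation is left as a stub (a real run would use a JSON Schema validator), and path flattening is assumed to join keys and array indices with dots:

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: leaf_value} pairs."""
    if isinstance(obj, dict):
        out = {}
        for key, val in obj.items():
            out.update(flatten(val, f"{prefix}{key}."))
        return out
    if isinstance(obj, list):
        out = {}
        for i, val in enumerate(obj):
            out.update(flatten(val, f"{prefix}{i}."))
        return out
    return {prefix[:-1]: obj}

def evaluate(response_text, ground_truth):
    # Step 1: parse validity
    try:
        candidate = json.loads(response_text)
    except json.JSONDecodeError:
        return {"parsed": False, "value_accuracy": 0.0}
    # Step 2: schema compliance check would run here (JSON Schema validator)
    # Step 3: path-flatten candidate and ground truth
    cand, gold = flatten(candidate), flatten(ground_truth)
    # Step 4: per-field exact match against ground-truth leaves
    correct = sum(1 for path, val in gold.items() if cand.get(path) == val)
    return {"parsed": True, "value_accuracy": correct / len(gold)}

# Hypothetical record: one numeric field is wrong, three are right
gt = {"vendor": "Acme", "total": 12.5, "lines": [{"sku": "A1", "qty": 2}]}
resp = '{"vendor": "Acme", "total": 13.0, "lines": [{"sku": "A1", "qty": 2}]}'
print(evaluate(resp, gt))  # → {'parsed': True, 'value_accuracy': 0.75}
```

The example makes the paper's central distinction concrete: the response parses and would pass any reasonable schema check, yet its value accuracy is only 0.75 because one leaf is hallucinated.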

SOB employs seven metrics: JSON Pass Rate, Value Accuracy (the primary metric; exact leaf-value match), Faithfulness (token F1, partial credit), Path Recall (structural completeness), Structure Coverage (structural precision/recall), Type Safety (JSON type correctness), and Perfect Response (exact full-object match). Semantic metrics are 'hardened' by a gating factor tied to structural correctness and coverage, so partial credit cannot mask structural failures. These metrics are grouped into categories such as Long Context Extraction, Complex Schema Handling, and Output Contract Reliability. Aggregation uses schema-complexity-weighted means, as complex schemas are more representative of production tasks. Inference is greedy (temperature 0.0), with reasoning disabled where possible to isolate extraction capability.
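A sketch of a hardened Faithfulness score over flattened paths follows. The token-F1 part is standard extractive-QA scoring; the hardening step here (multiplying mean F1 by path recall so missing structure suppresses partial credit) is an illustrative assumption, as the paper's exact gating formula is not reproduced in this summary:

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Whitespace-token F1, giving partial credit for overlapping tokens."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def hardened_faithfulness(pred_fields: dict, gold_fields: dict) -> float:
    """Mean token F1 over gold paths, gated by structural coverage.

    A gold path absent from the prediction contributes zero F1, and the
    mean is further scaled by path recall (assumed hardening factor).
    """
    path_recall = sum(1 for p in gold_fields if p in pred_fields) / len(gold_fields)
    f1s = [token_f1(str(pred_fields[p]), str(v))
           for p, v in gold_fields.items() if p in pred_fields]
    mean_f1 = sum(f1s) / len(gold_fields)  # absent paths count as 0
    return mean_f1 * path_recall           # gate semantic credit on structure

# Hypothetical flattened records: one gold field is missing entirely
gold = {"patient.name": "Jane Doe", "patient.dx": "type 2 diabetes"}
pred = {"patient.name": "Jane Doe"}
print(round(hardened_faithfulness(pred, gold), 3))  # → 0.25
```

Without the gate the score would be 0.5 (a perfect match on half the fields); hardening pushes it to 0.25, reflecting that structural omissions should not be offset by fluent partial matches.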

Evaluation of 21 models reveals consistent patterns: high JSON Pass Rate (near-perfect schema compliance) but significantly lower Value Accuracy, indicating that producing schema-valid JSON is easier than extracting correct grounded values. The gap between these metrics is 15-25 percentage points. Value Accuracy drops sharply across modalities: 83.0% for text, 67.2% for images, and 23.7% for audio. Model rankings shift across modalities, and model size doesn't predict structured output quality. Structured hallucinations are hard to detect because they appear structurally correct. The benchmark shows that schema constraints have a smaller effect on grounded extraction than on structural reliability.

15-25 percentage points: the average gap between JSON Pass Rate and Value Accuracy across models.

JSON Pass Rate vs. Value Accuracy Across Models

Model              JSON Pass Rate (%)   Value Accuracy (%)
GLM-4.7            97.2                 83.0
Qwen3.5-35B        97.4                 82.8
GPT-5.4            99.9                 82.5
Gemini-2.5-Flash   98.3                 82.2
Interfaze-Beta     97.5                 82.1
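The compliance-vs-accuracy gap can be computed directly from the table rows above (figures taken from the table; these are the top-ranked models, so the full 21-model spread is wider than this average):

```python
rows = {  # model: (json_pass_rate_%, value_accuracy_%), from the table above
    "GLM-4.7":          (97.2, 83.0),
    "Qwen3.5-35B":      (97.4, 82.8),
    "GPT-5.4":          (99.9, 82.5),
    "Gemini-2.5-Flash": (98.3, 82.2),
    "Interfaze-Beta":   (97.5, 82.1),
}
gaps = {model: round(pass_rate - acc, 1)
        for model, (pass_rate, acc) in rows.items()}
avg_gap = round(sum(gaps.values()) / len(gaps), 1)
print(gaps)
print(f"average gap: {avg_gap} percentage points")  # → 15.5
```

Even the best models leave roughly 15 points on the table: every one of these responses parsed and validated, yet about one field in six carries a wrong value.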

SOB's ground truth is human-authored with LLM cross-checks, reducing LLM-as-generator bias, but schemas reflect team conventions. Future versions will incorporate production-standard schemas. The evaluation isolates structured extraction from raw vision/ASR quality by using text-normalized contexts, meaning audio results are an upper bound (not including ASR error). Metrics are strict (exact-match, ordered arrays), favoring precision. Future work includes semantic-aware comparison, order-sensitive flags, sensitivity analysis, constrained decoding baselines, additional modalities (video, code), and a live leaderboard.

SOB addresses the critical need for evaluating structured output quality in LLMs by focusing on the correctness of values within JSON, not just schema compliance. It highlights a significant gap: models achieve high schema compliance but struggle with value accuracy, especially across diverse modalities like text, images, and audio. The benchmark demonstrates that schema compliance alone is an insufficient measure of structured output quality. By releasing the dataset, evaluation pipeline, and code, SOB aims to enable the community to build and measure what truly matters for production structured output: getting the values right.


AI Implementation Roadmap

A typical timeline for integrating and scaling AI solutions within an enterprise environment.

Phase 1: Discovery & Strategy (2-4 Weeks)

Initial consultations, current state assessment, identification of high-impact use cases, and strategic planning for AI integration.

Phase 2: Pilot Program (6-10 Weeks)

Development and deployment of a small-scale AI pilot in a controlled environment to validate concepts and gather initial performance data.

Phase 3: Iteration & Refinement (4-8 Weeks)

Based on pilot results, iterative improvements to the AI models and integration processes, focusing on accuracy and efficiency.

Phase 4: Scaled Deployment (8-16 Weeks)

Full-scale deployment across relevant departments, comprehensive training for end-users, and establishment of monitoring systems.

Phase 5: Continuous Optimization (Ongoing)

Regular performance reviews, model updates, and exploration of new AI capabilities to maintain competitive advantage.

Ready to Transform Your Enterprise with AI?

Connect with our experts to explore how tailored AI solutions can drive efficiency and innovation in your organization.
