Stress-test sequel
The same representation comparison under a harder 50-example setup.
Mark 2 follows the main Balanced-30 explainer and asks what changes when the project is pushed further. It keeps the same preselected relevant tables (oracle tables), so retrieval stays intentionally out of scope, and it adds broader questions, SQL repair, and a stronger Direct Table QA prompt.
What changed
The sequel is not a replacement for Balanced-30. It is a wider stress test that checks whether the same representation tradeoffs still hold when the setup becomes more demanding.
- A broader 50-question subset adds more schemas, joins, grouping cases, and broad answer sets.
- Runs still receive preselected relevant tables, so the page tests answer generation rather than retrieval.
- Text-to-SQL gains error-feedback repair, making it stronger than the no-repair main run; a minimal sketch of such a loop follows this list.
- Direct Table QA repeats the question and adds answer-checking pressure before returning rows.
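The page names the strategy (error_feedback_v1) but does not show the loop itself, so here is a minimal sketch of what an error-feedback repair turn could look like. The `generate_sql` callable, the prompt wording, and the sqlite3 backend are assumptions for illustration, not the repo's actual implementation.

```python
import sqlite3

def run_with_repair(question, schema, conn, generate_sql, max_repairs=1):
    """Hypothetical error-feedback loop: when the generated SQL fails, feed
    the engine error back to the model and ask for a corrected query."""
    prompt = (f"Schema:\n{schema}\n\nQuestion: {question}\n"
              "Return a single SQL query.")
    sql = generate_sql(prompt)  # assumed model-call wrapper, not the repo's API
    for attempt in range(max_repairs + 1):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            if attempt == max_repairs:
                raise  # repair budget exhausted; surface the last error
            # Repair turn: show the model its failing query and the error text.
            prompt += (f"\n\nThe previous query failed.\nQuery: {sql}\n"
                       f"Error: {err}\nReturn a corrected SQL query.")
            sql = generate_sql(prompt)
```

A single repair turn of this shape is often enough to recover from syntax or wrong-column errors, which plausibly contributes to the failure-rate drop reported in the stress summary below.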
Bottom line
Mark 2 shows that the winner depends on the cost and failure mode you care about, even when the answer-row metrics remain directly comparable.
- Text-to-SQL on gpt-5.2 leads with 87.0% cell recall and 2 failed records on the 50-example matrix.
- Direct Table QA on gpt-5-mini reaches 86.0% cell recall, but at 6,173 tokens per record and with 1 context-limit error.
- The cheapest row, gpt-5.2 Text-to-SQL, uses 374 tokens per record while staying executable, inspectable, and easy to debug.
Mark 2 results
These are answer-row metrics over the 50-example Mark 2 subset. Cell precision and recall compare returned cells with expected answer cells; row count fit and order accuracy catch shape and ordering issues. A sketch of one plausible scoring follows the table.
| Model | Path | Cell precision | Cell recall | Row count fit | Order accuracy | Failed records | Context errors | Tokens/record |
|---|---|---|---|---|---|---|---|---|
| gpt-4o-mini | Direct Table QA | 0.594 | 0.697 | 0.739 | 0.940 | 2 | 0 | 4,236 |
| gpt-4o-mini | Text-to-SQL | 0.672 | 0.710 | 0.762 | 0.980 | 7 | 0 | 395 |
| gpt-5-mini | Direct Table QA | 0.800 | 0.860 | 0.846 | 0.967 | 1 | 1 | 6,173 |
| gpt-5-mini | Text-to-SQL | 0.707 | 0.760 | 0.789 | 0.980 | 6 | 0 | 975 |
| gpt-5.2 | Direct Table QA | 0.805 | 0.856 | 0.892 | 0.960 | 1 | 1 | 4,575 |
| gpt-5.2 | Text-to-SQL | 0.811 | 0.870 | 0.848 | 0.980 | 2 | 0 | 374 |
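As flagged above, here is a minimal sketch of how cell precision, cell recall, and row count fit can be computed. It treats cells as multisets and is an illustration of the metric definitions, not the project's actual scorer, which may normalize values differently.

```python
from collections import Counter

def answer_row_metrics(returned, expected):
    """Illustrative scoring: compare the multiset of returned cells against
    the multiset of expected cells, plus a shape check on row counts."""
    got = Counter(cell for row in returned for cell in row)
    want = Counter(cell for row in expected for cell in row)
    overlap = sum((got & want).values())  # matched cells, with multiplicity
    precision = overlap / sum(got.values()) if got else 0.0
    recall = overlap / sum(want.values()) if want else 0.0
    # Row count fit: 1.0 when row counts match, shrinking as they diverge.
    if returned or expected:
        row_fit = min(len(returned), len(expected)) / max(len(returned), len(expected))
    else:
        row_fit = 1.0
    return precision, recall, row_fit

print(answer_row_metrics([("Alice", 3)], [("Alice", 3), ("Bob", 1)]))
# (1.0, 0.5, 0.5): everything returned was correct, half the cells came back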
Stress signals
The stress summary compares the Mark 2 headline run with the earlier checkpoint. It is useful context, not a new headline benchmark claim.
- Direct Table QA: recall +3.17 pp, failure rate +2.00 pp, tokens per record +917. The path remains competitive but carries much higher token cost and one Mark 2 context failure.
- Text-to-SQL: recall +9.73 pp, failure rate -14.18 pp, tokens per record +84. The path remains much cheaper per record; the Mark 2 setup adds SQL repair and extra diagnostics.
- report/tables/m2_evidence_prompt_comparison.md: The repeated-question answer-checking prompt improved Direct Table QA but did not improve Text-to-SQL.
- report/tables/m2_evidence_variance_summary.md: Repeated gpt-5.2 runs show modest metric variance, so small deltas should be interpreted cautiously.
- report/tables/m2_evidence_context_stress.md: Direct Table QA is more exposed to context pressure because it serializes table content into the prompt (see the serialization sketch after this list).
- report/tables/m2_11_local_diagnostic.md: Local-provider failures document environment behavior; they are not used for the public model comparison.
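To make the context-pressure point concrete, here is a rough sketch of why prompt size scales with table content under Direct Table QA. The serialization format and the 4-characters-per-token heuristic are assumptions for illustration, not the repo's exact --table-serialization compact behavior.

```python
def serialize_table(name, header, rows):
    """Assumed compact serialization: a header line, then one pipe-joined
    line per row. Every row lands in the prompt verbatim."""
    lines = [f"table {name}: " + " | ".join(header)]
    lines += [" | ".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

def approx_tokens(text):
    # Crude heuristic (~4 characters per token) to flag context pressure
    # before sending a prompt; a real tokenizer will differ.
    return len(text) // 4

rows = [(i, f"name_{i}", i * 1.5) for i in range(5000)]
print(approx_tokens(serialize_table("orders", ["id", "name", "amount"], rows)))
```

Prompt size grows linearly with row count, so a wide oracle table can push the serialized prompt past the context limit. The SQL path avoids this because its prompt carries the schema rather than the rows, which is consistent with the large Tokens/record gap in the results table.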
Pattern-level evidence
Per-pattern metrics show where each representation path bends: broad grouped outputs, ordering cases, and multi-condition selections put different pressure on the two paths. A sketch of how such a table can be rebuilt from per-record results follows it.
| Path | Pattern | Records | Cell precision | Cell recall | Row count fit | Order accuracy |
|---|---|---|---|---|---|---|
| Direct Table QA | Aggregation | 3 | 1.000 | 1.000 | 1.000 | 1.000 |
| Direct Table QA | Grouping | 21 | 0.753 | 0.807 | 0.939 | 1.000 |
| Direct Table QA | Multi-condition selection | 12 | 0.792 | 0.867 | 0.758 | 1.000 |
| Direct Table QA | Ordering | 4 | 0.708 | 0.708 | 1.000 | 0.500 |
| Direct Table QA | Projection | 6 | 0.903 | 1.000 | 0.903 | 1.000 |
| Direct Table QA | Simple selection | 4 | 0.915 | 0.905 | 0.845 | 1.000 |
| Text-to-SQL | Aggregation | 3 | 1.000 | 1.000 | 1.000 | 1.000 |
| Text-to-SQL | Grouping | 21 | 0.814 | 0.857 | 0.925 | 1.000 |
| Text-to-SQL | Multi-condition selection | 12 | 0.750 | 0.708 | 0.708 | 1.000 |
| Text-to-SQL | Ordering | 4 | 0.792 | 1.000 | 0.792 | 0.750 |
| Text-to-SQL | Projection | 6 | 0.715 | 1.000 | 0.715 | 1.000 |
| Text-to-SQL | Simple selection | 4 | 1.000 | 1.000 | 1.000 | 1.000 |
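Rebuilding a per-pattern table like the one above from per-record results is a plain groupby. The file path and column names below are assumptions about what the stored run records contain, not documented fields.

```python
import pandas as pd

# Hypothetical per-record results file; path and columns are assumptions.
records = pd.read_json("report/runs/mark2_records.json")

per_pattern = (
    records.groupby(["path", "pattern"])
    .agg(records=("record_id", "size"),
         cell_precision=("cell_precision", "mean"),
         cell_recall=("cell_recall", "mean"),
         row_count_fit=("row_count_fit", "mean"),
         order_accuracy=("order_accuracy", "mean"))
    .round(3)
)
print(per_pattern.to_markdown())  # to_markdown needs the optional tabulate package
```

Averaging per-record metrics within each (path, pattern) group is all the table above needs, and to_markdown keeps the output in the same pipe-table format this page uses.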
Artifact trail
The Mark 2 page is a public reading layer over the stored report, tables, figures, and run matrix.
The headline matrix ran the 50-example manifest across three models with preselected relevant tables:

- Manifest: data/subset/manifest_50.json
- Models: gpt-4o-mini, gpt-5-mini, gpt-5.2
- Retrieval: preselected relevant tables (oracle)
- Command (shown for gpt-4o-mini):

```
.venv/Scripts/python.exe scripts/run_experiment.py --manifest data/subset/manifest_50.json --outputs-dir data/outputs/runs --report-runs-dir report/runs --provider openai-compatible --base-url https://api.openai.com/v1 --model gpt-4o-mini --text-prompt-version text_to_sql_v3_schema_grounded --table-qa-prompt-version table_qa_v6_question_repeated_check --table-serialization compact --sql-repair-strategy error_feedback_v1 --retrieval-mode oracle_tables --refresh-cache
```

- report/final_report_mark2.md: Narrative write-up for the 50-example stress-test sequel.
- report/tables/final_mark2_model_comparison.md: Compact metric table for each Mark 2 model and representation path.
- report/tables/final_mark2_stress_summary.md: Comparison of the headline Mark 2 run against the earlier checkpoint.
- report/tables/final_mark2_diagnostics_summary.md: Supporting checks for prompt changes, variance, context pressure, and provider diagnostics.
- report/tables/final_mark2_per_pattern.md: Per-pattern answer-row metrics for the 50-question stress test.
- report/tables/final_mark2_model_matrix_runs.json: Run IDs, prompts, serialization, and command metadata for the Mark 2 matrix.
- report/figures/final_mark2_model_metrics.png: Generated figure summarizing Mark 2 model metrics.
- report/figures/final_mark1_vs_mark2_degradation.png: Generated figure comparing the earlier checkpoint with Mark 2.