Evidence Appendix: 30 Spider Development Examples
Curated artifacts behind the SpiderLens lesson.
This page explains the public data behind the explainer: final run groups, Balanced-30 metrics, representative questions, diagnostics, and reproducibility settings.
How to Read This Evidence
Balanced-30 is a controlled comparison: every run uses the same 30 Spider examples and the same preselected relevant tables. That keeps the focus on answer generation rather than table retrieval.
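For intuition, here is a minimal sketch of that protocol. The names `run_text_to_sql`, `run_table_qa`, and the example fields are hypothetical stand-ins for this sketch, not the project's actual API.

```python
# Sketch of the Balanced-30 protocol: both pipelines get the same question
# and the same preselected relevant tables, so retrieval quality never varies.
# run_text_to_sql and run_table_qa are hypothetical stand-ins for the real code.
def compare_pipelines(examples, run_text_to_sql, run_table_qa):
    results = []
    for ex in examples:                    # the same 30 Spider examples in every run
        tables = ex["preselected_tables"]  # fixed, hand-picked relevant tables
        sql_rows = run_text_to_sql(ex["question"], tables)  # generate SQL, execute it
        qa_rows = run_table_qa(ex["question"], tables)      # answer from serialized rows
        results.append({"id": ex["id"], "text_to_sql": sql_rows, "table_qa": qa_rows})
    return results
```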
Final Runs
The final matrix has three fresh repeats for each model. Each run evaluates both pipelines under the same prompt, serialization, and no-repair settings.
| Run | Model | Repeat | Serialization | Retrieval | Status |
|---|---|---|---|---|---|
| run_0045 | gpt-4o-mini | 1 / 3 | compact | Preselected tables | OK |
| run_0046 | gpt-4o-mini | 2 / 3 | compact | Preselected tables | OK |
| run_0047 | gpt-4o-mini | 3 / 3 | compact | Preselected tables | OK |
| run_0048 | gpt-5-mini | 1 / 3 | compact | Preselected tables | OK |
| run_0049 | gpt-5-mini | 2 / 3 | compact | Preselected tables | OK |
| run_0050 | gpt-5-mini | 3 / 3 | compact | Preselected tables | OK |
| run_0051 | gpt-5.2 | 1 / 3 | compact | Preselected tables | OK |
| run_0052 | gpt-5.2 | 2 / 3 | compact | Preselected tables | OK |
| run_0053 | gpt-5.2 | 3 / 3 | compact | Preselected tables | OK |
Balanced-30 Metrics
These are mean scores over the three repeats. Precision and recall are cell-level metrics over answer rows; cardinality measures row-count fit; the Failed column is the mean number of failed examples per repeat.
| Model | Pipeline | Precision | Recall | Cardinality | Order | Failed |
|---|---|---|---|---|---|---|
| gpt-4o-mini | Direct Table QA | 0.617 | 0.669 | 0.837 | 0.822 | 0.3 |
| gpt-4o-mini | Text-to-SQL | 0.640 | 0.700 | 0.673 | 1.000 | 7.3 |
| gpt-5-mini | Direct Table QA | 0.895 | 0.920 | 0.944 | 0.956 | 0.0 |
| gpt-5-mini | Text-to-SQL | 0.770 | 0.780 | 0.824 | 0.941 | 5.0 |
| gpt-5.2 | Direct Table QA | 0.733 | 0.778 | 0.926 | 0.904 | 0.0 |
| gpt-5.2 | Text-to-SQL | 0.808 | 0.833 | 0.842 | 0.967 | 4.0 |
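Concretely, the cell-level scores can be read as multiset overlap between the predicted and gold answer tables. The sketch below is one plausible implementation for intuition, not the harness's actual scoring code; the authoritative definitions are in report/final_report.pdf.

```python
from collections import Counter

def cell_scores(pred_rows, gold_rows):
    """One plausible reading of the cell-level metrics (sketch only)."""
    pred_cells = Counter(c for row in pred_rows for c in row)
    gold_cells = Counter(c for row in gold_rows for c in row)
    overlap = sum((pred_cells & gold_cells).values())   # multiset intersection
    precision = overlap / max(1, sum(pred_cells.values()))
    recall = overlap / max(1, sum(gold_cells.values()))
    # Cardinality: how well the predicted row count fits the gold row count.
    n_pred, n_gold = len(pred_rows), len(gold_rows)
    cardinality = min(n_pred, n_gold) / max(1, max(n_pred, n_gold))
    return precision, recall, cardinality
```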
Pipeline Deltas
Deltas are Direct Table QA minus Text-to-SQL for each model. Positive metric deltas favor Direct Table QA; negative failed-record deltas mean fewer failures.
| Model | Precision delta | Recall delta | Cardinality delta | Order delta | Failed delta |
|---|---|---|---|---|---|
| gpt-4o-mini | -0.022 | -0.031 | 0.164 | -0.178 | -7.0 |
| gpt-5-mini | 0.126 | 0.141 | 0.120 | 0.015 | -5.0 |
| gpt-5.2 | -0.076 | -0.055 | 0.085 | -0.063 | -4.0 |
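As a worked example of the sign convention, using the rounded gpt-5-mini means from the results table above (small mismatches with the published deltas, such as 0.125 versus 0.126, arise because the published deltas are computed from unrounded repeat means):

```python
# Delta convention: Direct Table QA minus Text-to-SQL, per model.
direct = {"precision": 0.895, "recall": 0.920, "failed": 0.0}  # gpt-5-mini, Direct Table QA
sql    = {"precision": 0.770, "recall": 0.780, "failed": 5.0}  # gpt-5-mini, Text-to-SQL
deltas = {k: round(direct[k] - sql[k], 3) for k in direct}
print(deltas)  # {'precision': 0.125, 'recall': 0.14, 'failed': -5.0}
# Positive metric deltas favor Direct Table QA; the negative failed
# delta means Direct Table QA had fewer failed examples per repeat.
```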
Diagnostics
These diagnostics catch shape and ordering errors that cell overlap can hide, including exact row matches and top-k position accuracy.
| Model | Pipeline | Tuple F1 | Exact match | Constraint pass | Top-k exact | Top-k position |
|---|---|---|---|---|---|---|
| gpt-4o-mini | Direct Table QA | 0.557 | 0.367 | 0.367 | 0.200 | 0.200 |
| gpt-4o-mini | Text-to-SQL | 0.611 | 0.578 | 0.578 | 1.000 | 1.000 |
| gpt-5-mini | Direct Table QA | 0.868 | 0.800 | 0.800 | 0.667 | 0.733 |
| gpt-5-mini | Text-to-SQL | 0.730 | 0.722 | 0.722 | 0.600 | 0.644 |
| gpt-5.2 | Direct Table QA | 0.710 | 0.600 | 0.600 | 0.400 | 0.422 |
| gpt-5.2 | Text-to-SQL | 0.780 | 0.767 | 0.767 | 0.800 | 0.800 |
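For intuition, the sketch below gives one plausible reading of two of these diagnostics; the harness's exact definitions live in report/tables/final_balanced30_report_diagnostics.md and the report.

```python
from collections import Counter

def tuple_f1(pred_rows, gold_rows):
    # Whole rows are compared as tuples, so a row with one wrong cell scores zero.
    pred = Counter(map(tuple, pred_rows))
    gold = Counter(map(tuple, gold_rows))
    overlap = sum((pred & gold).values())        # multiset row intersection
    p = overlap / max(1, sum(pred.values()))
    r = overlap / max(1, sum(gold.values()))
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def topk_position_accuracy(pred_rows, gold_rows):
    # Fraction of gold ranks where the predicted row sits at the same position.
    hits = sum(tuple(p) == tuple(g) for p, g in zip(pred_rows, gold_rows))
    return hits / max(1, len(gold_rows))
```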
12 Examples From the Balanced-30 Subset
The full subset has 30 questions. These first 12 show the range of patterns; where the original benchmark sentence is hard to read, it is quoted verbatim as "Original Spider wording".
1. Covers a simple aggregate count and tests aggregate-only output.
2. Original Spider wording: "Find the number of distinct type of pets." Covers a count-distinct aggregate to test duplicate handling without joins (a toy sketch follows this list).
3. Covers a count over a nested anti-selection in the museum domain.
4. Covers a filtered AVG aggregate in the museum domain.
5. Original Spider wording: "What is maximum and minimum death toll caused each time?" Covers a min/max aggregate without joins.
6. Original Spider wording: "Show name, country, age for all singers ordered by age from the oldest to the youngest." Covers an ordered projection case and exercises order-sensitive scoring.
7. Covers an audited joined top-1 ordering case.
8. Covers a single-table top-1 ordering case.
9. Covers filtered ordering without a limit.
10. Covers a single-table distinct top-3 ordering case over match winners.
11. Covers a single-table filtered distinct projection.
12. Original Spider wording: "Which distinct car models are the produced after 1980?" Covers a 21-row distinct filtered output that stresses broad recall and deduplication.
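To illustrate why the count-distinct example (item 2) stresses duplicate handling, here is a toy sketch. The schema names (Spider's pet_1 domain, Pets.PetType) are taken from the public Spider dev set, not from project artifacts.

```python
# Toy illustration of the count-distinct example above (not project output).
rows = [("dog",), ("cat",), ("dog",)]             # serialized table rows seen by Direct Table QA
direct_answer = len({pet for (pet,) in rows})     # the model must deduplicate itself
gold_style_sql = "SELECT COUNT(DISTINCT PetType) FROM Pets"  # DISTINCT handles it in SQL
assert direct_answer == 2
```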
Artifact Index
These files provide traceability from the website claims back to reports, generated tables, figures, and raw run artifacts.
- `report/final_report.pdf`: Complete project report with methodology, protocol, results, error analysis, questions, schemas, and reproducibility.
- `report/final_report.tex`: LaTeX source used to produce the submitted PDF.
- `report/final_report.md`: Markdown version of the submitted report for easier browsing.
- `report/tables/final_balanced30_repeated_runs.json`: Lists the nine final runs, models, settings, commands, metadata paths, and metrics paths.
- `report/tables/final_balanced30_variance_summary.md`: Mean, standard deviation, min/max, failure rates, and timing for each model/pipeline row.
- `report/tables/final_balanced30_main_results.md`: Expanded metric table behind the compact website results.
- `report/tables/final_balanced30_pipeline_deltas.md`: Direct Table QA minus Text-to-SQL deltas by model.
- `report/tables/final_balanced30_report_diagnostics.md`: Tuple F1, exact match, constraint pass, and top-k/order diagnostics.
- `report/tables/final_balanced30_report_failures.md`: Curated wins/losses/different-failure cases used for error analysis.
- `report/tables/subset_summary_30_balanced.md`: The 30-question subset with pattern, difficulty, preselected tables, row counts, and rationale.
- `report/figures/final_balanced30_repeated_metrics.png`: Visual summary of repeated assignment metrics across models and pipelines.
- `report/figures/final_balanced30_failure_breakdown.png`: Shows technical execution failures versus semantic wrong-answer cases.
- `report/figures/final_balanced30_token_cost.png`: Shows the context-cost tradeoff between serialized table content and schema-only SQL prompts.
- `report/figures/final_balanced30_tuple_topk_diagnostics.png`: Shows row-shape and ranking diagnostics beyond cell overlap.
- `report/runs/run_0045`: Raw gpt-4o-mini repeat evidence, including error examples used in the paper.
- `report/runs/run_0048`: Raw gpt-5-mini repeat evidence for strong Table-QA and Text-to-SQL comparison.
- `report/runs/run_0050`: Raw gpt-5-mini repeat evidence for the dev-0056 and dev-0104 examples.
- `report/tables/final_mark1_subset_questions.md`: Earlier checkpoint evidence kept for historical context; the website uses Balanced-30 for its claims.
- `report/tables/final_mark2_model_comparison.md`: Larger follow-up experiment kept for historical context; the website uses Balanced-30 for its claims.
Reproducibility Settings
The command below regenerates the final repeated matrix from the stored experiment configuration.
- Manifest: `data/subset/manifest_30_balanced.json`
- Models: gpt-4o-mini, gpt-5-mini, gpt-5.2
- Retrieval: Preselected relevant tables

```
.venv/Scripts/python.exe scripts/run_repeated_variance.py \
  --manifest data/subset/manifest_30_balanced.json \
  --models gpt-4o-mini gpt-5-mini gpt-5.2 \
  --repeat-count 3 \
  --cache-policy fresh \
  --retrieval-mode oracle_tables \
  --text-prompt-version text_to_sql_v3_schema_grounded \
  --table-qa-prompt-version table_qa_v4_order_distinct_complete \
  --table-serialization compact \
  --sql-repair-strategy none
```