
Stress-test sequel

SpiderLens Mark 2

The same representation comparison under a harder 50-example setup.

Mark 2 follows the main Balanced-30 explainer and asks what changes when the project is pushed further. It keeps the same preselected relevant tables (oracle tables), so retrieval remains intentionally out of scope, and it adds broader questions, SQL repair, and a stronger Direct Table QA prompt.

What changed

Mark 2 keeps the comparison controlled, then turns up the pressure.

The sequel is not a replacement for Balanced-30. It is a wider stress test that checks whether the same representation tradeoffs still hold when the setup becomes more demanding.

50 examples

A broader 50-question subset adds more schemas, joins, grouping cases, and broad answer sets.

Controlled tables

Runs still receive preselected relevant tables, so the page tests answer generation rather than retrieval.

SQL repair enabled

Text-to-SQL gets error-feedback repair, making this a stronger sequel than the no-repair main run.
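One way to picture error-feedback repair is a generate-execute-regenerate loop: if the first SQL attempt fails, the error message is fed back to the model for one corrected attempt. This is a minimal sketch, not the project's actual implementation; `generate_sql` stands in for the model call and its signature is an assumption.

```python
import sqlite3

def run_with_repair(conn, question, schema, generate_sql, max_repairs=1):
    """Execute generated SQL; on error, feed the message back for a repair attempt.

    `generate_sql(question, schema, error=None)` is a hypothetical model call:
    when `error` is set, the model is asked to fix the failing query.
    """
    sql = generate_sql(question, schema)
    for attempt in range(max_repairs + 1):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            if attempt == max_repairs:
                raise  # out of repair budget; surface the last error
            sql = generate_sql(question, schema, error=str(exc))
```

A loop like this is why the Mark 2 Text-to-SQL rows fail less often than the no-repair main run: many first-attempt errors are trivially fixable once the engine's error message is in the prompt.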

Stronger Direct QA prompt

Direct Table QA repeats the question and adds answer-checking pressure before returning rows.
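The shape of such a prompt can be sketched as follows. This is a hypothetical reconstruction of the question-repeated, answer-checked template (the real `table_qa_v6_question_repeated_check` prompt text is not shown on this page), illustrating the two additions: the question appears twice and a self-check instruction precedes the answer.

```python
def build_table_qa_prompt(question, serialized_tables):
    """Sketch of a question-repeated, answer-checking Direct Table QA prompt.

    The exact wording is an assumption; only the structure (repeat the
    question, then demand a self-check before returning rows) mirrors
    what the page describes.
    """
    return (
        f"Question: {question}\n\n"
        f"Tables:\n{serialized_tables}\n\n"
        f"Restate the question before answering: {question}\n"
        "Before returning rows, check that every returned cell answers the "
        "question and that no expected row is missing. "
        "Return only the answer rows."
    )
```

Per the prompt-comparison diagnostic below, this style of pressure helped Direct Table QA but did not move Text-to-SQL.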

Bottom line

The sequel is less tidy, which is exactly the point.

Mark 2 shows that the winner depends on the cost and failure mode you care about, even when the answer-row metrics remain directly comparable.

Strongest answer recovery gpt-5.2 Text-to-SQL

87.0% cell recall with 2 failed records on the 50-example matrix.

Direct QA tradeoff gpt-5-mini Direct Table QA

86.0% cell recall, but 6,173 tokens per record and 1 context-limit error.

Why SQL still matters gpt-5.2 Text-to-SQL

The cheapest row uses 374 tokens per record while staying executable, inspectable, and easy to debug.

Mark 2 results

Six model-path rows, one harder comparison.

These are answer-row metrics over the 50-example Mark 2 subset. Cell precision and recall compare returned cells with expected answer cells; row count fit and order accuracy catch shape and ordering issues.

Model Path Cell precision Cell recall Row count fit Order accuracy Failed Context errors Tokens/record
gpt-4o-mini Direct Table QA 0.594 0.697 0.739 0.940 2 0 4,236
gpt-4o-mini Text-to-SQL 0.672 0.710 0.762 0.980 7 0 395
gpt-5-mini Direct Table QA 0.800 0.860 0.846 0.967 1 1 6,173
gpt-5-mini Text-to-SQL 0.707 0.760 0.789 0.980 6 0 975
gpt-5.2 Direct Table QA 0.805 0.856 0.892 0.960 1 1 4,575
gpt-5.2 Text-to-SQL 0.811 0.870 0.848 0.980 2 0 374
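The cell metrics above compare returned cells against expected answer cells. One plausible reading, treating cells as a multiset so duplicates count, can be sketched like this; the actual scorer may differ in normalization details.

```python
from collections import Counter

def cell_precision_recall(returned_rows, expected_rows):
    """Multiset overlap of answer cells (one assumed reading of the metric).

    Precision: fraction of returned cells that appear in the expected cells.
    Recall: fraction of expected cells covered by the returned cells.
    """
    returned = Counter(cell for row in returned_rows for cell in row)
    expected = Counter(cell for row in expected_rows for cell in row)
    overlap = sum((returned & expected).values())  # multiset intersection
    precision = overlap / sum(returned.values()) if returned else 0.0
    recall = overlap / sum(expected.values()) if expected else 0.0
    return precision, recall
```

Under this reading, a run can have high recall but low precision when it returns the right cells plus extra rows, which is the typical Direct Table QA failure shape in the table above.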

Stress signals

Cost, context, and repair become part of the story.

The stress summary compares the Mark 2 headline run with the earlier checkpoint. It is useful context, not a new headline benchmark claim.

Direct Table QA 22 to 50 examples

Recall +3.17 pp; failure rate +2.00 pp; tokens per record +917.

Direct Table QA remains competitive but carries much higher token cost and one Mark 2 context failure.

Text-to-SQL 22 to 50 examples

Recall +9.73 pp; failure rate -14.18 pp; tokens per record +84.

Text-to-SQL remains much cheaper per record; the Mark 2 setup adds SQL repair and extra diagnostics.

Supporting diagnostics

Prompt comparison report/tables/m2_evidence_prompt_comparison.md

The repeated-question answer-checking prompt improved Direct Table QA but did not improve Text-to-SQL.

Variance check report/tables/m2_evidence_variance_summary.md

Repeated gpt-5.2 runs show modest metric variance, so small deltas should be interpreted cautiously.

Context pressure report/tables/m2_evidence_context_stress.md

Direct Table QA is more exposed to context pressure because it serializes table content into the prompt.
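The mechanism is easy to see in miniature: serializing every relevant table into the prompt makes prompt size grow with data size, while a SQL query stays roughly constant. This sketch assumes a pipe-delimited "compact" serialization and a rough characters-per-token heuristic; both are illustrative, not the project's exact format.

```python
def serialize_compact(name, columns, rows):
    """Pipe-delimited table serialization (a sketch of a 'compact' setting)."""
    header = f"table {name}: " + "|".join(columns)
    body = "\n".join("|".join(str(cell) for cell in row) for row in rows)
    return header + "\n" + body if body else header

def approx_tokens(text):
    # crude heuristic: roughly 4 characters per token (assumption)
    return max(1, len(text) // 4)
```

Doubling the row count roughly doubles the serialized prompt, which is why Direct Table QA hits context limits on Mark 2's broader answer sets while Text-to-SQL's per-record token cost barely moves.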

Provider diagnostic report/tables/m2_11_local_diagnostic.md

Local-provider failures document environment behavior; they are not used for the public model comparison.

Pattern-level evidence

The hard cases are not evenly hard.

Per-pattern metrics show where each representation path bends: broad grouped outputs, ordering cases, and multi-condition selections each put pressure on the two paths in different places.

Path Pattern Records Cell precision Cell recall Row count fit Order accuracy
Direct Table QA Aggregation 3 1.000 1.000 1.000 1.000
Direct Table QA Grouping 21 0.753 0.807 0.939 1.000
Direct Table QA Multi-condition selection 12 0.792 0.867 0.758 1.000
Direct Table QA Ordering 4 0.708 0.708 1.000 0.500
Direct Table QA Projection 6 0.903 1.000 0.903 1.000
Direct Table QA Simple selection 4 0.915 0.905 0.845 1.000
Text-to-SQL Aggregation 3 1.000 1.000 1.000 1.000
Text-to-SQL Grouping 21 0.814 0.857 0.925 1.000
Text-to-SQL Multi-condition selection 12 0.750 0.708 0.708 1.000
Text-to-SQL Ordering 4 0.792 1.000 0.792 0.750
Text-to-SQL Projection 6 0.715 1.000 0.715 1.000
Text-to-SQL Simple selection 4 1.000 1.000 1.000 1.000

Artifact trail

Everything on this page points back to generated evidence.

The Mark 2 page is a public reading layer over the stored report, tables, figures, and run matrix.

Manifest data/subset/manifest_50.json
Models gpt-4o-mini, gpt-5-mini, gpt-5.2
Retrieval setting Preselected relevant tables
.venv/Scripts/python.exe scripts/run_experiment.py --manifest data/subset/manifest_50.json --outputs-dir data/outputs/runs --report-runs-dir report/runs --provider openai-compatible --base-url https://api.openai.com/v1 --model gpt-4o-mini --text-prompt-version text_to_sql_v3_schema_grounded --table-qa-prompt-version table_qa_v6_question_repeated_check --table-serialization compact --sql-repair-strategy error_feedback_v1 --retrieval-mode oracle_tables --refresh-cache
Mark 2 companion report report/final_report_mark2.md

Narrative write-up for the 50-example stress-test sequel.

Mark 2 model comparison report/tables/final_mark2_model_comparison.md

Compact metric table for each Mark 2 model and representation path.

Mark 2 stress summary report/tables/final_mark2_stress_summary.md

Comparison of the headline Mark 2 run against the earlier checkpoint.

Mark 2 diagnostics report/tables/final_mark2_diagnostics_summary.md

Supporting checks for prompt changes, variance, context pressure, and provider diagnostics.

Mark 2 pattern metrics report/tables/final_mark2_per_pattern.md

Per-pattern answer-row metrics for the 50-question stress test.

Mark 2 run matrix report/tables/final_mark2_model_matrix_runs.json

Run IDs, prompts, serialization, and command metadata for the Mark 2 matrix.

Mark 2 model figure report/figures/final_mark2_model_metrics.png

Generated figure summarizing Mark 2 model metrics.

Mark 1 to Mark 2 comparison figure report/figures/final_mark1_vs_mark2_degradation.png

Generated figure comparing the earlier checkpoint with Mark 2.