Contextual-SQL Reward Model

Contextual-SQL Reward Model is the scoring component of the Contextual-SQL system, which achieved #1 on the BIRD benchmark leaderboard in February 2025. This model is finetuned from Qwen-2.5-32B-Instruct to rank SQL query candidates by scoring their execution correctness given a database schema and natural language query.

This reward model is one component of the full Contextual-SQL pipeline. For the complete text-to-SQL system (including SQL generation with Qwen2.5-Coder-32B-Instruct), please see the GitHub repository.

For more details about the complete system and methodology, check out our blog post.

Model Details

The Contextual-SQL system achieves state-of-the-art performance through a multi-stage pipeline where this reward model plays a critical role:

  • Inference-Time Scaling: The system generates multiple diverse SQL candidates through parallel sampling with varied prompts and temperatures from Qwen2.5-Coder-32B-Instruct, then selects the best candidate using execution validation, consistency scoring, and this learned reward model.

  • Reward Model Training: This model was finetuned from Qwen-2.5-32B-Instruct on the BIRD training set to rank candidate SQL queries. Training uses a classification objective with hard negative mining: each question is paired with one correct SQL candidate and 15 high-likelihood incorrect candidates that serve as hard negatives.

  • Multi-Signal Candidate Selection: The final ranking combines this reward model's probability scores P(Y|X,Q) with the generator's log-probabilities P(X|Q) (weighted at α=0.4), creating a joint likelihood that captures both the quality of the SQL candidate and the correctness of its execution output.
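
The multi-signal selection step can be illustrated with a minimal sketch. The exact combination formula is not given here, so the code assumes a log-linear form, log P(Y|X,Q) + α · log P(X|Q); the `Candidate` class, the function names, and the example numbers are all hypothetical and not taken from the repository.

```python
import math
from dataclasses import dataclass

ALPHA = 0.4  # weight on the generator log-probability, as quoted above


@dataclass
class Candidate:
    sql: str
    reward_prob: float        # P(Y|X,Q): reward model's probability that the SQL is correct
    generator_logprob: float  # log P(X|Q): generator's log-probability of the SQL


def joint_score(c: Candidate, alpha: float = ALPHA) -> float:
    # Assumed combination (not confirmed by the source):
    # log P(Y|X,Q) + alpha * log P(X|Q). The floor avoids log(0).
    return math.log(max(c.reward_prob, 1e-12)) + alpha * c.generator_logprob


def select_best(candidates: list[Candidate], alpha: float = ALPHA) -> Candidate:
    # Pick the candidate with the highest joint score.
    return max(candidates, key=lambda c: joint_score(c, alpha))


# Made-up numbers for illustration only.
candidates = [
    Candidate("SELECT name FROM customers", 0.90, -6.0),
    Candidate("SELECT name FROM customers ORDER BY revenue DESC", 0.80, -2.0),
    Candidate("SELECT id FROM customers", 0.10, -1.0),
]
best = select_best(candidates)
print(best.sql)
```

Under this assumed form, a candidate that the reward model rates slightly lower can still win if the generator considers it much more likely, which is the trade-off that α controls.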

Model Description

  • Developed by: Contextual AI
  • Language(s) (NLP): English
  • Finetuned from model: Qwen-2.5-32B-Instruct
  • Model type: Reward model for SQL query scoring
  • License: Same as base model (Apache 2.0)

Usage

This model is designed to be used as part of the full Contextual-SQL pipeline for scoring SQL candidates. It is not a standalone text-to-SQL generator.

Quick Start (Scoring a Single SQL Candidate)

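from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ContextualAI/ctx-bird-reward-250121"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load in the checkpoint's native dtype and place the weights on the available GPUs.
# device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")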

# Example inputs
db_schema = """CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    region VARCHAR(50),
    revenue FLOAT
);"""

question = "Show me top 5 highest revenue customers by region"
evidence = "Revenue is stored in the revenue column"
sql_candidate = "SELECT region, name, revenue FROM customers ORDER BY revenue DESC LIMIT 5"
execution_result = "5 rows returned"
num_rows = 5

# Format the prompt
messages = [
    {
        "role": "system",
        "content": "You are a judge that can check whether a given SQL correctly answers a given natural language user query. You'll be given Database Schema, Question, External Knowledge, SQL, logprob Score and its Execution Result.",
    },
    {
        "role": "user",
        "content": (
            f"-- Database Schema: \n{db_schema}\n"
            f"-- Question: {question}\n"
            f"-- External Knowledge: {evidence}\n"
            f"-- SQL: {sql_candidate}\n"
            f"-- Execution Result #rows: {num_rows}\n"
            f"-- Execution Result START\n{execution_result}\n"
            f"-- END Execution Result\n"
            f"-- Does SQL correctly answer Question?\n"
        ),
    }
]

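# Tokenize the chat prompt; return_dict=True returns input_ids and attention_mask,
# which is what generate(**inputs) and inputs["input_ids"] below expect
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate the model's verdict and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=10)
score = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Reward score: {score}")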
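
The quick start above decodes the model's output as text, whereas ranking candidates needs a number. One possible way to get one, sketched below, is to normalize the logits of the two verdict tokens at the first generated position. This assumes the model answers with a Yes/No-style token, which is not confirmed here; `yes_probability` is a hypothetical helper, not part of the repository.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the (assumed) Yes/No verdict token logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


# Made-up logits for illustration only.
print(round(yes_probability(2.0, 0.0), 4))
```

With transformers, the per-step scores can be requested by passing `output_scores=True` and `return_dict_in_generate=True` to `generate`. Which token ids correspond to the verdict tokens is not stated here and would need to be checked against the tokenizer.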

Complete Pipeline

For production use, we recommend using the full Contextual-SQL pipeline from the GitHub repository. The pipeline consists of:

  1. Candidate Generation: Generate multiple SQL query candidates using Qwen2.5-Coder-32B-Instruct
  2. SQL Execution: Execute candidates against the database
  3. Reward Scoring: Score candidates with this reward model
  4. Final Selection: Select the best candidate based on combined scoring
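
As a rough illustration of step 2, the sketch below runs candidates against an in-memory SQLite database and produces the row count and result text that the reward prompt expects. The helper name `execute_candidate`, the row formatting, and the sample data are assumptions made for illustration. The repository's `src/process_sqls.py` may work differently, and no query timeout is implemented here.

```python
import sqlite3


def execute_candidate(conn: sqlite3.Connection, sql: str, max_rows: int = 20):
    """Run one SQL candidate and return (num_rows, result_text)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # Invalid candidates become an error string instead of raising.
        return 0, f"ERROR: {exc}"
    # Truncate long results so the reward prompt stays short.
    text = "\n".join(str(row) for row in rows[:max_rows])
    return len(rows), text


# Toy database mirroring the schema from the quick start (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(100), "
    "region VARCHAR(50), revenue FLOAT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Acme", "EMEA", 120.0), (2, "Globex", "APAC", 300.0), (3, "Initech", "EMEA", 80.0)],
)

num_rows, result = execute_candidate(
    conn, "SELECT name, revenue FROM customers ORDER BY revenue DESC LIMIT 2"
)
bad_rows, bad_result = execute_candidate(conn, "SELECT missing FROM customers")
print(num_rows, result, bad_result, sep="\n")
```

Returning errors as text rather than raising lets a failed candidate still be passed on to later stages, where it can be filtered out or scored low.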

Installation

# Clone the repository
git clone https://github.com/ContextualAI/bird-sql.git
cd bird-sql

# Install dependencies
pip install -r requirements.txt

# Download the reward model
mkdir -p models/reward
huggingface-cli download ContextualAI/ctx-bird-reward-250121 \
  --local-dir models/reward

Running the Pipeline

# Step 1: Generate SQL candidates
python src/generate.py \
  --input_file data/test_all.jsonl \
  --output_dir output/generations/ \
  --num_gpus 2

# Step 2: Execute SQL candidates
python src/process_sqls.py \
  --input_file data/test_all.jsonl \
  --generations_dir output/generations/ \
  --output_dir output/with_results/ \
  --compare_against_gt \
  --sql_timeout 30.0

# Step 3: Score with reward model
VLLM_USE_V1=0 python src/reward.py \
  --input_file output/with_results/data_with_results.jsonl \
  --output_dir output/with_rewards \
  --num_gpus 2

# Step 4: Select best candidates
python src/analysis.py \
  --rewards_dir output/with_rewards \
  --gt_sql_file data/test_gold_sqls.txt \
  --output_dir output/analysis \
  --num_cpus 100

System Requirements

  • GPU: 2+ GPUs with 80 GB of memory each (for the full pipeline)
  • Python: 3.10+
  • See the repository for complete requirements

Evaluation

The complete Contextual-SQL system (using this reward model) achieves state-of-the-art results on the BIRD benchmark:

Model            BIRD Dev Set   BIRD Test Set
Contextual-SQL   73.50          75.63

To reproduce these evaluation results, follow the instructions in the GitHub repository.

Citation

If you find our work helpful, please cite:

@misc{agrawal2025text2sql,
  author       = {Sheshansh Agrawal and Thien Nguyen},
  title        = {Open-Sourcing the Best Local Text-to-SQL System},
  year         = {2025},
  url          = {https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system/}
}