Contextual-SQL Reward Model is the scoring component of the Contextual-SQL system, which achieved #1 on the BIRD benchmark leaderboard in February 2025. The model is finetuned from Qwen2.5-32B-Instruct to rank SQL query candidates: given a database schema, a natural language question, and a candidate's execution result, it scores how likely the candidate is to be correct.
This reward model is one component of the full Contextual-SQL pipeline. For the complete text-to-SQL system (including SQL generation with Qwen2.5-Coder-32B-Instruct), please see the GitHub repository.
For more details about the complete system and methodology, check out our blog post.
Model Details
The Contextual-SQL system achieves state-of-the-art performance through a multi-stage pipeline where this reward model plays a critical role:
Inference-Time Scaling: The system generates multiple diverse SQL candidates through parallel sampling with varied prompts and temperatures from Qwen2.5-Coder-32B-Instruct, then selects the best candidate using execution validation, consistency scoring, and this learned reward model.
Reward Model Training: This model was finetuned from Qwen2.5-32B-Instruct on the BIRD training set to rank candidate SQL queries. Training uses a classification objective with hard negative mining: each question is paired with one correct SQL candidate and 15 high-likelihood incorrect candidates that serve as hard negatives.
Multi-Signal Candidate Selection: The final ranking combines this reward model's probability score P(Y|X,Q) with the generator's log-probability log P(X|Q) (weighted at α=0.4). The result is a joint likelihood that captures both the quality of the SQL candidate and the correctness of its execution output.
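The combination of the two signals can be sketched as follows. This is a minimal illustration, not the repository's implementation: the exact formula is not given here, so the sketch assumes the joint score is log P(Y|X,Q) + α · log P(X|Q). The names `Candidate`, `combined_score`, and `select_best`, and the example values, are hypothetical.

```python
import math
from dataclasses import dataclass

ALPHA = 0.4  # weight on the generator log-probability, as stated above


@dataclass
class Candidate:
    sql: str
    reward_prob: float  # P(Y|X,Q) from the reward model, in (0, 1]
    gen_logprob: float  # log P(X|Q) from the generator, <= 0


def combined_score(c: Candidate, alpha: float = ALPHA) -> float:
    # Assumed form of the joint score: log P(Y|X,Q) + alpha * log P(X|Q).
    # The clamp avoids log(0) for candidates the reward model rejects outright.
    return math.log(max(c.reward_prob, 1e-12)) + alpha * c.gen_logprob


def select_best(candidates: list[Candidate]) -> Candidate:
    # Pick the candidate with the highest combined score.
    return max(candidates, key=combined_score)


# Made-up values for illustration only.
candidates = [
    Candidate("SELECT name FROM customers", reward_prob=0.30, gen_logprob=-1.0),
    Candidate(
        "SELECT name FROM customers ORDER BY revenue DESC",
        reward_prob=0.90,
        gen_logprob=-3.0,
    ),
]
best = select_best(candidates)
print(best.sql)
```

With these values the second candidate is selected: its higher reward probability outweighs its lower generator log-probability once the latter is scaled by α.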
Model Description
- Developed by: Contextual AI
- Language(s) (NLP): English
- Finetuned from model: Qwen2.5-32B-Instruct
- Model type: Reward model for SQL query scoring
- License: Same as base model (Apache 2.0)
Model Sources
- Repository: https://github.com/ContextualAI/bird-sql
- Blog Post: https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system
Usage
This model is designed to be used as part of the full Contextual-SQL pipeline for scoring SQL candidates. It is not a standalone text-to-SQL generator.
Quick Start (Scoring a Single SQL Candidate)
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ContextualAI/ctx-bird-reward-250121"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" loads the checkpoint's native precision;
# device_map="auto" (requires the accelerate package) spreads the 32B model across available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Example inputs
db_schema = """CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    region VARCHAR(50),
    revenue FLOAT
);"""
question = "Show me top 5 highest revenue customers by region"
evidence = "Revenue is stored in the revenue column"
sql_candidate = "SELECT region, name, revenue FROM customers ORDER BY revenue DESC LIMIT 5"
execution_result = "5 rows returned"  # placeholder for the candidate's actual execution output
num_rows = 5

# Format the prompt
messages = [
    {
        "role": "system",
        "content": "You are a judge that can check whether a given SQL correctly answers a given natural language user query. You'll be given Database Schema, Question, External Knowledge, SQL, logprob Score and its Execution Result.",
    },
    {
        "role": "user",
        "content": (
            f"-- Database Schema: \n{db_schema}\n"
            f"-- Question: {question}\n"
            f"-- External Knowledge: {evidence}\n"
            f"-- SQL: {sql_candidate}\n"
            f"-- Execution Result #rows: {num_rows}\n"
            f"-- Execution Result START\n{execution_result}\n"
            f"-- END Execution Result\n"
            f"-- Does SQL correctly answer Question?\n"
        ),
    },
]

# Generate reward score
# return_dict=True makes apply_chat_template return input_ids and attention_mask
# as a mapping, so it can be unpacked into generate() and indexed below.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
# Decode only the newly generated tokens, skipping the prompt
score = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Reward score: {score}")
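In the quick start, `execution_result` and `num_rows` are placeholders. The sketch below shows one way to fill them by actually running the candidate against a SQLite database and then building the user message in the same layout. The helper names `run_sql` and `build_user_content`, the sample rows, and the one-tuple-per-line serialization of results are assumptions for illustration; the serialization used by the repository may differ.

```python
import sqlite3


def run_sql(conn: sqlite3.Connection, sql: str, max_rows: int = 20) -> tuple[int, str]:
    """Execute a candidate and return (row count, text preview of the rows)."""
    rows = conn.execute(sql).fetchall()
    # Assumed serialization: one Python tuple per line, truncated to max_rows.
    preview = "\n".join(str(row) for row in rows[:max_rows])
    return len(rows), preview


def build_user_content(db_schema, question, evidence, sql, num_rows, execution_result):
    """Assemble the user message using the same layout as the quick start."""
    return (
        f"-- Database Schema: \n{db_schema}\n"
        f"-- Question: {question}\n"
        f"-- External Knowledge: {evidence}\n"
        f"-- SQL: {sql}\n"
        f"-- Execution Result #rows: {num_rows}\n"
        f"-- Execution Result START\n{execution_result}\n"
        f"-- END Execution Result\n"
        f"-- Does SQL correctly answer Question?\n"
    )


db_schema = """CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    region VARCHAR(50),
    revenue FLOAT
);"""

# In-memory database with a few made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(db_schema)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Acme", "EMEA", 120.0), (2, "Globex", "APAC", 300.0), (3, "Initech", "EMEA", 80.0)],
)

sql_candidate = "SELECT region, name, revenue FROM customers ORDER BY revenue DESC LIMIT 5"
num_rows, execution_result = run_sql(conn, sql_candidate)
user_content = build_user_content(
    db_schema,
    "Show me top 5 highest revenue customers by region",
    "Revenue is stored in the revenue column",
    sql_candidate,
    num_rows,
    execution_result,
)
print(user_content)
```

The resulting string can be used as the `content` of the user message in `messages`.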
Complete Pipeline
For production use, we recommend the full Contextual-SQL pipeline from the GitHub repository. The pipeline consists of:
- Candidate Generation: Generate multiple SQL query candidates using Qwen2.5-Coder-32B-Instruct
- SQL Execution: Execute candidates against the database
- Reward Scoring: Score candidates with this reward model
- Final Selection: Select the best candidate based on combined scoring
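The execution step and a consistency signal can be sketched as below. This is an illustration under assumptions, not the repository's code: it treats "consistency" as the fraction of executable candidates that return the same result set, and the helper names `execute` and `consistency_scores`, the table, and the candidate queries are hypothetical. In the full pipeline a signal like this would be combined with the reward model score rather than used alone.

```python
import sqlite3
from collections import Counter


def execute(conn: sqlite3.Connection, sql: str):
    """Return the result rows as a hashable key, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # execution validation: failing candidates are dropped
    # frozenset ignores row order and duplicate rows; a simplification.
    return frozenset(rows)


def consistency_scores(conn: sqlite3.Connection, candidates: list[str]):
    """Score each executable candidate by the share of candidates agreeing with it."""
    executed = [(sql, execute(conn, sql)) for sql in candidates]
    valid = [(sql, result) for sql, result in executed if result is not None]
    votes = Counter(result for _, result in valid)
    return [(sql, votes[result] / len(valid)) for sql, result in valid]


# Made-up table and candidates, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [("Acme", 120.0), ("Globex", 300.0)])

candidates = [
    "SELECT name FROM customers ORDER BY revenue DESC LIMIT 1",
    "SELECT name FROM customers WHERE revenue = (SELECT MAX(revenue) FROM customers)",
    "SELECT name FROM customers ORDER BY revenue LIMIT 1",
    "SELECT nme FROM customers",  # invalid column, fails at execution
]
scores = consistency_scores(conn, candidates)
for sql, score in scores:
    print(f"{score:.2f}  {sql}")
```

Here the first two candidates agree on the result and each receive two of three votes, the third receives one, and the fourth is discarded because it does not execute.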
Installation
# Clone the repository
git clone https://github.com/ContextualAI/bird-sql.git
cd bird-sql
# Install dependencies
pip install -r requirements.txt
# Download the reward model
mkdir -p models/reward
huggingface-cli download ContextualAI/ctx-bird-reward-250121 \
    --local-dir models/reward
Running the Pipeline
# Step 1: Generate SQL candidates
python src/generate.py \
    --input_file data/test_all.jsonl \
    --output_dir output/generations/ \
    --num_gpus 2

# Step 2: Execute SQL candidates
python src/process_sqls.py \
    --input_file data/test_all.jsonl \
    --generations_dir output/generations/ \
    --output_dir output/with_results/ \
    --compare_against_gt \
    --sql_timeout 30.0

# Step 3: Score with reward model
VLLM_USE_V1=0 python src/reward.py \
    --input_file output/with_results/data_with_results.jsonl \
    --output_dir output/with_rewards \
    --num_gpus 2

# Step 4: Select best candidates
python src/analysis.py \
    --rewards_dir output/with_rewards \
    --gt_sql_file data/test_gold_sqls.txt \
    --output_dir output/analysis \
    --num_cpus 100
System Requirements
- GPU: 2+ GPUs with 80 GB of memory each (for the full pipeline)
- Python: 3.10+
- See the repository for complete requirements
Evaluation
The complete Contextual-SQL system (using this reward model) achieves state-of-the-art results on the BIRD benchmark:
| Model | BIRD Dev Set | BIRD Test Set |
|---|---|---|
| Contextual-SQL | 73.50 | 75.63 |
To reproduce these evaluation results, follow the instructions in the GitHub repository.
Citation
If you find our work helpful, please cite:
@misc{agrawal2025text2sql,
author = {Sheshansh Agrawal and Thien Nguyen},
title = {Open-Sourcing the Best Local Text-to-SQL System},
year = {2025},
url = {https://contextual.ai/blog/open-sourcing-the-best-local-text-to-sql-system/}
}