# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.
## Project Structure
```
dataset_builder/
├── README.md                             # This file
│
├── data1/                                # DATA1: Domain-Specific Code Dataset
│   ├── main.py                           # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                        # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                           # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py               # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py                  # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                       # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py         # Compute stars/keyword statistics
│   ├── compute_statistics.py             # Compute code statistics from JSONL analysis files
│   ├── rename.py                         # Rename repo directories to owner___repo format
│   ├── rename2.py                        # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                    # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py        # Export repo files to CSV grouped by keyword
│   ├── reporting/                        # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                       # Reporting entry point
│   │   ├── visualization.py              # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py             # Scan repo-level metadata
│   │   ├── code_file_stats.py            # File-level code statistics
│   │   ├── code_file_stats_fast.py       # Optimized file-level statistics
│   │   ├── stage_a_stats.py              # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py              # Stage B (clone/filter) statistics
│   │   └── join_insights.py              # Join and cross-analyze insights
│   └── README.md                         # DATA1 dataset documentation
│
├── data2/                                # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/           # README summarization pipeline
│   │   ├── pipeline.py                   # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py      # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py     # Extract functions from repos
│   │   ├── schemas.py                    # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt      # Prompt for function extraction
│   │       └── readme_summary.txt        # Prompt for README summarization
│   ├── step22/                           # Function scoring, generation, alignment
│   │   ├── build.py                      # Build tree-sitter language parsers
│   │   ├── func_stat.py                  # Extract functions using tree-sitter
│   │   ├── md_stat.py                    # Extract & save README summaries
│   │   ├── emb_qwen_func.py              # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py                # Score READMEs using Qwen embedding model
│   │   ├── function_req.py               # Filter functions by score threshold
│   │   ├── gemini_generation.py          # Generate docstrings using Gemini API
│   │   ├── alignment.py                  # Align functions with generated docstrings
│   │   ├── prompt.txt                    # Prompt template for docstring generation
│   │   ├── depend_analysis.py            # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py       # Find functions missing scores
│   │   ├── folder_stat.py                # Repository folder statistics
│   │   ├── ppt.py                        # Visualization of alignment data
│   │   └── debug_parser.py               # Debug tree-sitter parser loading
│   └── README.md                         # DATA2 dataset documentation
│
└── data3/                                # DATA3: Programming Problems Generation Dataset
    ├── main.py                           # RepoAgent: generate docs for repos
    ├── gemini.py                         # Gemini API connectivity test
    ├── load_dataset.py                   # Load and inspect datasets
    ├── instruct_generation.py            # Score functions for scientific relevance
    ├── extract_functions.py              # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py           # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py                 # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py  # Generate problems using Gemini API
    ├── generate_problems_batch.py        # Batch problem generation (OpenAI Batch API)
    ├── generate_problems_openai.py       # Problem generation via OpenAI API
    ├── enrich_programming_problems.py    # Enrich problems with source code context
    ├── vllm_high.py                      # vLLM-based high-throughput inference
    ├── vllm_qwen_batch.py                # Qwen model batch inference via vLLM
    ├── show_pricing.py                   # Display API pricing information
    ├── check_enhanced.py                 # Validate enhanced dataset
    ├── check_index_distribution.py       # Check index distribution
    ├── check_match.py                    # Check data matching
    ├── check_relationship.py             # Check data relationships
    ├── is_sci_prompt.txt                 # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt                # Prompt variant for scientific classification
    ├── score_prompt.txt                  # Prompt: score function relevance
    ├── *.sh                              # Various shell scripts for batch processing
    └── README.md                         # DATA3 dataset documentation
```
## Dataset Building Pipelines
### DATA1: Domain-Specific Code Dataset

**Goal:** Collect, filter, and export domain-specific code from GitHub repositories.

Pipeline (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`) (a search sketch follows this list)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance using an LLM (reads READMEs)
   - Shallow-clone relevant repos
   - Filter to keep only code files
2. **External Data** (`download_dataset.py`)
   - Download the ChemPile code dataset from HuggingFace
3. **Merge & Deduplicate** (`merge_dataset.py`) (a dedup sketch also follows)
   - Merge crawled repos with ChemPile data
   - Deduplicate by content hash
4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and star-count statistics
5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword
6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations
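As a rough illustration of step 1, the sketch below queries the GitHub search API for repositories matching a keyword. The `GITHUB_TOKEN` environment variable, the star threshold, and the example keyword are assumptions made for the sketch; the real query construction lives in `main.py`/`main_v2.py`.

```python
"""Minimal sketch of the DATA1 repo search (step 1); not the project's code."""
import os

import requests

GITHUB_API = "https://api.github.com/search/repositories"
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",  # assumed env var
}

def search_repos(keyword: str, min_stars: int = 10, per_page: int = 50) -> list[dict]:
    """Return repository metadata for repos matching a domain keyword."""
    params = {
        "q": f"{keyword} stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }
    resp = requests.get(GITHUB_API, headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]

if __name__ == "__main__":
    for repo in search_repos("molecular dynamics"):
        print(repo["full_name"], repo["stargazers_count"])
```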
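Step 3's content-hash deduplication amounts to keeping one copy per distinct file digest. A sketch, where the `*.py` glob and the first-occurrence-wins policy are assumptions rather than `merge_dataset.py`'s actual rules:

```python
"""Sketch of dedup-by-content-hash (step 3); policy details are assumed."""
import hashlib
from pathlib import Path

def dedupe_files(roots: list[Path]) -> dict[str, Path]:
    """Map each distinct SHA-256 digest to the first file that produced it."""
    seen: dict[str, Path] = {}
    for root in roots:
        for path in sorted(root.rglob("*.py")):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            seen.setdefault(digest, path)  # first occurrence wins; later dupes dropped
    return seen
```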
### DATA2: Code-Documentation Alignment Dataset

**Goal:** Generate high-quality docstrings for scientific code functions.

Pipeline (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos
2. **Function Extraction** (`step22/func_stat.py`) (a tree-sitter sketch follows this list)
   - Parse code with tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries into the function dataset directories
4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`) (a scoring sketch also follows)
   - Score function quality using a Qwen embedding model
   - Score README quality using a Qwen embedding model
5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score
6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support
7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings
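A minimal sketch of step 2, for Python only. It loads the grammar from the prebuilt `tree-sitter-python` wheel, whereas the project compiles its parsers with `step22/build.py`, so treat the parser setup (and the recent-bindings API) as assumptions:

```python
"""Sketch of tree-sitter function extraction (step 2), Python grammar only."""
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Assumes recent py-tree-sitter bindings with per-language wheels;
# the project's build.py compiles grammars instead.
parser = Parser()
parser.language = Language(tspython.language())

def extract_functions(source: bytes) -> list[str]:
    """Return the source text of every function definition in the file."""
    tree = parser.parse(source)
    found, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "function_definition":
            found.append(source[node.start_byte:node.end_byte].decode("utf-8"))
        stack.extend(node.children)
    return found

print(extract_functions(b"def f(x):\n    return x * 2\n"))
```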
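Step 4 can be pictured as cosine similarity between an embedding of each function and an embedding of a reference description. The model name, the mean pooling, and the reference text below are all assumptions; the actual scoring in `emb_qwen_func.py`/`emb_qwen_md.py` may differ (for example, it may run through vLLM):

```python
"""Illustrative embedding-based scoring (step 4); model and pooling assumed."""
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "Qwen/Qwen3-Embedding-0.6B"  # assumed; substitute the project's model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool token embeddings into one L2-normalized vector per text."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

functions = ["def rk4_step(f, t, y, h): ...", "def parse_args(): ..."]
reference = embed(["a well-documented scientific computing function"])
scores = embed(functions) @ reference.T              # cosine similarities
print(scores.squeeze(-1).tolist())
```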
### DATA3: Programming Problems Generation Dataset

**Goal:** Generate programming problems inspired by scientific code.

Pipeline (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories
2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset
3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Uses `is_sci_prompt.txt` and `score_prompt.txt` as prompts
4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data
5. **Problem Generation** (`generate_programming_problems.py`) (a batch-request sketch follows this list)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control
6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context
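The batch variant of step 5 (`generate_problems_batch.py`) goes through the OpenAI Batch API, which consumes a JSONL file of request objects. A sketch of building such a file; the model name, prompt text, and record field are assumptions:

```python
"""Sketch of an OpenAI Batch API input file (step 5, batch variant)."""
import json

def write_batch_file(rows: list[dict], path: str = "batch_input.jsonl") -> None:
    """Write one chat-completion request per function, keyed by custom_id."""
    with open(path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            request = {
                "custom_id": f"problem-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",  # assumed model
                    "messages": [{
                        "role": "user",
                        "content": "Write a programming problem inspired by:\n"
                                   + row["code"],  # assumed record field
                    }],
                },
            }
            f.write(json.dumps(request) + "\n")
```

The file is then uploaded with `purpose="batch"` and submitted to the Batches endpoint; results return asynchronously and are matched back by `custom_id`.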
## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`, `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)
## Notes

- Scripts contain hardcoded paths that must be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes (a generic version of the pattern is sketched below)
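A generic version of that checkpoint/resume loop, combined with the budget circuit breaker mentioned under DATA2 and DATA3. All names, the JSONL checkpoint format, and the flat per-call cost estimate are illustrative assumptions, not code taken from the scripts:

```python
"""Generic checkpoint/resume + budget circuit-breaker pattern (illustrative)."""
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.jsonl")  # assumed checkpoint location
BUDGET_USD = 50.0                      # assumed total budget
COST_PER_CALL_USD = 0.002              # assumed flat per-call estimate

def run(items: list[dict], generate) -> None:
    """Process each item once, appending results; safe to re-run after a crash."""
    done: set[str] = set()
    if CHECKPOINT.exists():  # resume: collect ids already processed
        done = {json.loads(line)["id"] for line in CHECKPOINT.open()}
    spent = len(done) * COST_PER_CALL_USD
    with CHECKPOINT.open("a") as out:
        for item in items:
            if item["id"] in done:
                continue
            if spent + COST_PER_CALL_USD > BUDGET_USD:  # circuit breaker
                print("Budget exhausted; stopping early.")
                break
            result = generate(item)  # the expensive API call
            out.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            spent += COST_PER_CALL_USD
```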