# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.
## Project Structure
```
dataset_builder/
├── README.md                             # This file
│
├── data1/                                # DATA1: Domain-Specific Code Dataset
│   ├── main.py                           # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                        # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                           # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py               # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py                  # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                       # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py         # Compute stars/keyword statistics
│   ├── compute_statistics.py             # Compute code statistics from JSONL analysis files
│   ├── rename.py                         # Rename repo directories to owner___repo format
│   ├── rename2.py                        # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                    # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py        # Export repo files to CSV grouped by keyword
│   ├── reporting/                        # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                       # Reporting entry point
│   │   ├── visualization.py              # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py             # Scan repo-level metadata
│   │   ├── code_file_stats.py            # File-level code statistics
│   │   ├── code_file_stats_fast.py       # Optimized file-level statistics
│   │   ├── stage_a_stats.py              # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py              # Stage B (clone/filter) statistics
│   │   └── join_insights.py              # Join and cross-analyze insights
│   └── README.md                         # DATA1 dataset documentation
│
├── data2/                                # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/           # README summarization pipeline
│   │   ├── pipeline.py                   # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py      # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py     # Extract functions from repos
│   │   ├── schemas.py                    # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt      # Prompt for function extraction
│   │       └── readme_summary.txt        # Prompt for README summarization
│   ├── step22/                           # Function scoring, generation, alignment
│   │   ├── build.py                      # Build tree-sitter language parsers
│   │   ├── func_stat.py                  # Extract functions using tree-sitter
│   │   ├── md_stat.py                    # Extract & save README summaries
│   │   ├── emb_qwen_func.py              # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py                # Score READMEs using Qwen embedding model
│   │   ├── function_req.py               # Filter functions by score threshold
│   │   ├── gemini_generation.py          # Generate docstrings using Gemini API
│   │   ├── alignment.py                  # Align functions with generated docstrings
│   │   ├── prompt.txt                    # Prompt template for docstring generation
│   │   ├── depend_analysis.py            # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py       # Find functions missing scores
│   │   ├── folder_stat.py                # Repository folder statistics
│   │   ├── ppt.py                        # Visualization of alignment data
│   │   └── debug_parser.py               # Debug tree-sitter parser loading
│   └── README.md                         # DATA2 dataset documentation
│
└── data3/                                # DATA3: Programming Problems Generation Dataset
    ├── main.py                           # RepoAgent: generate docs for repos
    ├── gemini.py                         # Gemini API connectivity test
    ├── load_dataset.py                   # Load and inspect datasets
    ├── instruct_generation.py            # Score functions for scientific relevance
    ├── extract_functions.py              # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py           # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py                 # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py  # Generate problems using Gemini API
    ├── generate_problems_batch.py        # Batch problem generation (OpenAI Batch API)
    ├── generate_problems_openai.py       # Problem generation via OpenAI API
    ├── enrich_programming_problems.py    # Enrich problems with source code context
    ├── vllm_high.py                      # vLLM-based high-throughput inference
    ├── vllm_qwen_batch.py                # Qwen model batch inference via vLLM
    ├── show_pricing.py                   # Display API pricing information
    ├── check_enhanced.py                 # Validate enhanced dataset
    ├── check_index_distribution.py       # Check index distribution
    ├── check_match.py                    # Check data matching
    ├── check_relationship.py             # Check data relationships
    ├── is_sci_prompt.txt                 # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt                # Prompt variant for scientific classification
    ├── score_prompt.txt                  # Prompt: score function relevance
    ├── *.sh                              # Various shell scripts for batch processing
    └── README.md                         # DATA3 dataset documentation
```
## Dataset Building Pipelines
### DATA1: Domain-Specific Code Dataset

**Goal:** Collect, filter, and export domain-specific code from GitHub repositories.

Pipeline (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`) (a search sketch follows this list)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance using an LLM (reads READMEs)
   - Shallow-clone relevant repos
   - Filter to keep only code files
2. **External Data** (`download_dataset.py`)
   - Download the ChemPile code dataset from HuggingFace
3. **Merge & Deduplicate** (`merge_dataset.py`) (a dedup sketch also follows)
   - Merge crawled repos with ChemPile data
   - Deduplicate by content hash
4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and star-count statistics
5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword
6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations
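As a rough illustration of step 1, the sketch below queries the GitHub search API for repositories matching a keyword. The `GITHUB_TOKEN` environment variable, the star threshold, and the example keyword are assumptions made for the sketch; the real query construction lives in `main.py`/`main_v2.py`.

```python
"""Minimal sketch of the DATA1 repo search (step 1); not the project's code."""
import os

import requests

GITHUB_API = "https://api.github.com/search/repositories"
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",  # assumed env var
}

def search_repos(keyword: str, min_stars: int = 10, per_page: int = 50) -> list[dict]:
    """Return repository metadata for repos matching a domain keyword."""
    params = {
        "q": f"{keyword} stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }
    resp = requests.get(GITHUB_API, headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]

if __name__ == "__main__":
    for repo in search_repos("molecular dynamics"):
        print(repo["full_name"], repo["stargazers_count"])
```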
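Step 3's content-hash deduplication amounts to keeping one copy per distinct file digest. A sketch, where the `*.py` glob and the first-occurrence-wins policy are assumptions rather than `merge_dataset.py`'s actual rules:

```python
"""Sketch of dedup-by-content-hash (step 3); policy details are assumed."""
import hashlib
from pathlib import Path

def dedupe_files(roots: list[Path]) -> dict[str, Path]:
    """Map each distinct SHA-256 digest to the first file that produced it."""
    seen: dict[str, Path] = {}
    for root in roots:
        for path in sorted(root.rglob("*.py")):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            seen.setdefault(digest, path)  # first occurrence wins; later dupes dropped
    return seen
```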
### DATA2: Code-Documentation Alignment Dataset

**Goal:** Generate high-quality docstrings for scientific code functions.

Pipeline (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos
2. **Function Extraction** (`step22/func_stat.py`) (a tree-sitter sketch follows this list)
   - Parse code with tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries into the function dataset directories
4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`) (a scoring sketch also follows)
   - Score function quality using a Qwen embedding model
   - Score README quality using a Qwen embedding model
5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score
6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support
7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings
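A minimal sketch of step 2, for Python only. It loads the grammar from the prebuilt `tree-sitter-python` wheel, whereas the project compiles its parsers with `step22/build.py`, so treat the parser setup (and the recent-bindings API) as assumptions:

```python
"""Sketch of tree-sitter function extraction (step 2), Python grammar only."""
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Assumes recent py-tree-sitter bindings with per-language wheels;
# the project's build.py compiles grammars instead.
parser = Parser()
parser.language = Language(tspython.language())

def extract_functions(source: bytes) -> list[str]:
    """Return the source text of every function definition in the file."""
    tree = parser.parse(source)
    found, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "function_definition":
            found.append(source[node.start_byte:node.end_byte].decode("utf-8"))
        stack.extend(node.children)
    return found

print(extract_functions(b"def f(x):\n    return x * 2\n"))
```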
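Step 4 can be pictured as cosine similarity between an embedding of each function and an embedding of a reference description. The model name, the mean pooling, and the reference text below are all assumptions; the actual scoring in `emb_qwen_func.py`/`emb_qwen_md.py` may differ (for example, it may run through vLLM):

```python
"""Illustrative embedding-based scoring (step 4); model and pooling assumed."""
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "Qwen/Qwen3-Embedding-0.6B"  # assumed; substitute the project's model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool token embeddings into one L2-normalized vector per text."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

functions = ["def rk4_step(f, t, y, h): ...", "def parse_args(): ..."]
reference = embed(["a well-documented scientific computing function"])
scores = embed(functions) @ reference.T              # cosine similarities
print(scores.squeeze(-1).tolist())
```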
### DATA3: Programming Problems Generation Dataset

**Goal:** Generate programming problems inspired by scientific code.

Pipeline (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories
2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset
3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Uses `is_sci_prompt.txt` and `score_prompt.txt` as prompts
4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data
5. **Problem Generation** (`generate_programming_problems.py`) (a batch-request sketch follows this list)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control
6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context
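The batch variant of step 5 (`generate_problems_batch.py`) goes through the OpenAI Batch API, which consumes a JSONL file of request objects. A sketch of building such a file; the model name, prompt text, and record field are assumptions:

```python
"""Sketch of an OpenAI Batch API input file (step 5, batch variant)."""
import json

def write_batch_file(rows: list[dict], path: str = "batch_input.jsonl") -> None:
    """Write one chat-completion request per function, keyed by custom_id."""
    with open(path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            request = {
                "custom_id": f"problem-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",  # assumed model
                    "messages": [{
                        "role": "user",
                        "content": "Write a programming problem inspired by:\n"
                                   + row["code"],  # assumed record field
                    }],
                },
            }
            f.write(json.dumps(request) + "\n")
```

The file is then uploaded with `purpose="batch"` and submitted to the Batches endpoint; results return asynchronously and are matched back by `custom_id`.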
## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`, `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)
## Notes

- Scripts contain hardcoded paths that must be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes (a generic version of the pattern is sketched below)
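A generic version of that checkpoint/resume loop, combined with the budget circuit breaker mentioned under DATA2 and DATA3. All names, the JSONL checkpoint format, and the flat per-call cost estimate are illustrative assumptions, not code taken from the scripts:

```python
"""Generic checkpoint/resume + budget circuit-breaker pattern (illustrative)."""
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.jsonl")  # assumed checkpoint location
BUDGET_USD = 50.0                      # assumed total budget
COST_PER_CALL_USD = 0.002              # assumed flat per-call estimate

def run(items: list[dict], generate) -> None:
    """Process each item once, appending results; safe to re-run after a crash."""
    done: set[str] = set()
    if CHECKPOINT.exists():  # resume: collect ids already processed
        done = {json.loads(line)["id"] for line in CHECKPOINT.open()}
    spent = len(done) * COST_PER_CALL_USD
    with CHECKPOINT.open("a") as out:
        for item in items:
            if item["id"] in done:
                continue
            if spent + COST_PER_CALL_USD > BUDGET_USD:  # circuit breaker
                print("Budget exhausted; stopping early.")
                break
            result = generate(item)  # the expensive API call
            out.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            spent += COST_PER_CALL_USD
```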