# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.

## Project Structure

```
dataset_builder/
├── README.md                          # This file
│
├── data1/                             # DATA1: Domain-Specific Code Dataset
│   ├── main.py                        # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                     # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                        # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py            # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py               # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                    # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py      # Compute stars/keyword statistics
│   ├── compute_statistics.py          # Compute code statistics from JSONL analysis files
│   ├── rename.py                      # Rename repo directories to owner___repo format
│   ├── rename2.py                     # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                 # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py     # Export repo files to CSV grouped by keyword
│   ├── reporting/                     # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                    # Reporting entry point
│   │   ├── visualization.py           # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py          # Scan repo-level metadata
│   │   ├── code_file_stats.py         # File-level code statistics
│   │   ├── code_file_stats_fast.py    # Optimized file-level statistics
│   │   ├── stage_a_stats.py           # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py           # Stage B (clone/filter) statistics
│   │   └── join_insights.py           # Join and cross-analyze insights
│   └── README.md                      # DATA1 dataset documentation
│
├── data2/                             # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/        # README summarization pipeline
│   │   ├── pipeline.py                # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py   # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py  # Extract functions from repos
│   │   ├── schemas.py                 # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt   # Prompt for function extraction
│   │       └── readme_summary.txt     # Prompt for README summarization
│   ├── step22/                        # Function scoring, generation, alignment
│   │   ├── build.py                   # Build tree-sitter language parsers
│   │   ├── func_stat.py               # Extract functions using tree-sitter
│   │   ├── md_stat.py                 # Extract & save README summaries
│   │   ├── emb_qwen_func.py           # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py             # Score READMEs using Qwen embedding model
│   │   ├── function_req.py            # Filter functions by score threshold
│   │   ├── gemini_generation.py       # Generate docstrings using Gemini API
│   │   ├── alignment.py               # Align functions with generated docstrings
│   │   ├── prompt.txt                 # Prompt template for docstring generation
│   │   ├── depend_analysis.py         # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py    # Find functions missing scores
│   │   ├── folder_stat.py             # Repository folder statistics
│   │   ├── ppt.py                     # Visualization of alignment data
│   │   └── debug_parser.py            # Debug tree-sitter parser loading
│   └── README.md                      # DATA2 dataset documentation
│
└── data3/                             # DATA3: Programming Problems Generation Dataset
    ├── main.py                        # RepoAgent: generate docs for repos
    ├── gemini.py                      # Gemini API connectivity test
    ├── load_dataset.py                # Load and inspect datasets
    ├── instruct_generation.py         # Score functions for scientific relevance
    ├── extract_functions.py           # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py        # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py              # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py  # Generate problems using Gemini API
    ├── generate_problems_batch.py     # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py    # Problem generation via OpenAI API
    ├── enrich_programming_problems.py # Enrich problems with source code context
    ├── vllm_high.py                   # VLLM-based high-throughput inference
    ├── vllm_qwen_batch.py             # Qwen model batch inference via VLLM
    ├── show_pricing.py                # Display API pricing information
    ├── check_enhanced.py              # Validate enhanced dataset
    ├── check_index_distribution.py    # Check index distribution
    ├── check_match.py                 # Check data matching
    ├── check_relationship.py          # Check data relationships
    ├── is_sci_prompt.txt              # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt             # Prompt variant for scientific classification
    ├── score_prompt.txt               # Prompt: score function relevance
    ├── *.sh                           # Various shell scripts for batch processing
    └── README.md                      # DATA3 dataset documentation
```

## Dataset Building Pipelines

### DATA1: Domain-Specific Code Dataset

**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.

**Pipeline** (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance using an LLM that reads each repo's README
   - Shallow-clone the relevant repos
   - Filter to keep only code files
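
The search step can be sketched with the standard library alone; the real implementation in `main.py` adds LLM-based keyword expansion and relevance checks. The query shape and the `GITHUB_TOKEN` environment variable are assumptions for illustration, not the script's actual configuration.

```python
import json
import os
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"

def build_search_url(keyword, min_stars=10, per_page=50):
    """Build a GitHub search-API URL for repos matching a keyword."""
    query = f"{keyword} stars:>={min_stars}"
    params = urllib.parse.urlencode(
        {"q": query, "sort": "stars", "per_page": per_page}
    )
    return f"{API}?{params}"

def search_repos(keyword, token=""):
    """Fetch matching repository records from the GitHub search API."""
    req = urllib.request.Request(build_search_url(keyword))
    req.add_header("Accept", "application/vnd.github+json")
    token = token or os.environ.get("GITHUB_TOKEN", "")
    if token:  # unauthenticated requests work but are heavily rate-limited
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["items"]
```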

2. **External Data** (`download_dataset.py`)
   - Download ChemPile code dataset from HuggingFace

3. **Merge & Deduplicate** (`merge_dataset.py`)
   - Merge crawled repos with ChemPile data
   - Deduplicate by content hash
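
Content-hash deduplication can be sketched as below; the whitespace normalization rule and record layout are illustrative, not what `merge_dataset.py` literally does.

```python
import hashlib

def content_key(code):
    """Hash of whitespace-normalized content, so trivially reformatted copies collide."""
    normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Keep the first record for each distinct content hash."""
    seen, kept = set(), []
    for rec in records:
        key = content_key(rec["content"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept
```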

4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and stars statistics

5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword

6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations

### DATA2: Code-Documentation Alignment Dataset

**Goal**: Generate high-quality docstrings for scientific code functions.

**Pipeline** (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos

2. **Function Extraction** (`step22/func_stat.py`)
   - Parse code using tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
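
`func_stat.py` uses tree-sitter so the same pass works across all of the listed languages; for Python alone, the idea can be sketched with the standard `ast` module (a stand-in for illustration, not the actual parser):

```python
import ast

def extract_functions(source):
    """Return (name, source_segment) for every function defined in a Python file."""
    tree = ast.parse(source)
    out = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment recovers the exact text span of the definition
            out.append((node.name, ast.get_source_segment(source, node)))
    return out
```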

3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries to function dataset directories

4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
   - Score function quality using the Qwen embedding model
   - Score README quality using the Qwen embedding model
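
The scoring step reduces to comparing embedding vectors; a dependency-free sketch of the similarity ranking (the actual scripts run the Qwen embedding model, which these hand-written vectors merely stand in for):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_similarity(candidates, reference):
    """Sort (name, vector) pairs by similarity to a reference vector, best first."""
    return sorted(candidates, key=lambda nv: cosine(nv[1], reference), reverse=True)
```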

5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score

6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support
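
The budget circuit breaker amounts to a small guard object that accumulates estimated cost and aborts the run once a hard limit is hit. The per-token prices below are placeholders for illustration, not real Gemini pricing:

```python
class BudgetExceeded(RuntimeError):
    """Raised when the run's estimated spend reaches the configured limit."""

class BudgetMonitor:
    """Accumulate estimated API cost and trip once a hard limit is reached."""

    def __init__(self, limit_usd):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, prompt_tokens, output_tokens,
               in_price=1.25e-6, out_price=5e-6):
        # prices are illustrative placeholders, not actual API pricing
        self.spent += prompt_tokens * in_price + output_tokens * out_price
        if self.spent >= self.limit:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} of ${self.limit:.2f} budget"
            )
```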

7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings

### DATA3: Programming Problems Generation Dataset

**Goal**: Generate programming problems inspired by scientific code.

**Pipeline** (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories

2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset

3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts

4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data
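
The merge step is essentially an inner join on a shared function identifier. A stdlib sketch of the idea follows; the `func_id` column name is hypothetical, and `merge_datasets.py` defines the real join key and columns:

```python
import csv
import io

def merge_on_key(scores_csv, code_csv, key="func_id"):
    """Inner-join two CSV documents on a shared key column (names are illustrative)."""
    code_rows = {row[key]: row for row in csv.DictReader(io.StringIO(code_csv))}
    merged = []
    for row in csv.DictReader(io.StringIO(scores_csv)):
        if row[key] in code_rows:
            # score columns take precedence on name collisions other than the key
            merged.append({**code_rows[row[key]], **row})
    return merged
```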

5. **Problem Generation** (`generate_programming_problems.py`)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control

6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context

## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)

## Notes

- Scripts contain hardcoded paths that must be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Building the full datasets requires significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
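
The checkpoint/resume pattern used by the long-running scripts boils down to an append-only progress log: each finished item is written immediately, and a rerun skips everything already logged. A minimal JSONL sketch (field names are illustrative):

```python
import json
import os

def load_done(path):
    """Return the set of IDs already processed, from the checkpoint file if present."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def process_with_resume(items, path, worker):
    """Append one JSONL record per finished item; reruns skip completed work."""
    done = load_done(path)
    with open(path, "a") as f:
        for item in items:
            if item["id"] in done:
                continue
            result = worker(item)
            f.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            f.flush()  # persist progress even if the run is killed mid-loop
```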