SunDou committed (verified) · Commit ddd7e5c · Parent: a440642

Upload README.md with huggingface_hub

Files changed (1): README.md (+207 −0)
# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific-computing code-intelligence research.

## Project Structure

```
dataset_builder/
├── README.md                         # This file
│
├── data1/                            # DATA1: Domain-Specific Code Dataset
│   ├── main.py                       # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                    # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                       # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py           # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py              # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                   # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py     # Compute stars/keyword statistics
│   ├── compute_statistics.py         # Compute code statistics from JSONL analysis files
│   ├── rename.py                     # Rename repo directories to owner___repo format
│   ├── rename2.py                    # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py    # Export repo files to CSV grouped by keyword
│   ├── reporting/                    # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                   # Reporting entry point
│   │   ├── visualization.py          # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py         # Scan repo-level metadata
│   │   ├── code_file_stats.py        # File-level code statistics
│   │   ├── code_file_stats_fast.py   # Optimized file-level statistics
│   │   ├── stage_a_stats.py          # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py          # Stage B (clone/filter) statistics
│   │   └── join_insights.py          # Join and cross-analyze insights
│   └── README.md                     # DATA1 dataset documentation
│
├── data2/                            # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/       # README summarization pipeline
│   │   ├── pipeline.py               # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py  # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py # Extract functions from repos
│   │   ├── schemas.py                # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt  # Prompt for function extraction
│   │       └── readme_summary.txt    # Prompt for README summarization
│   ├── step22/                       # Function scoring, generation, alignment
│   │   ├── build.py                  # Build tree-sitter language parsers
│   │   ├── func_stat.py              # Extract functions using tree-sitter
│   │   ├── md_stat.py                # Extract & save README summaries
│   │   ├── emb_qwen_func.py          # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py            # Score READMEs using Qwen embedding model
│   │   ├── function_req.py           # Filter functions by score threshold
│   │   ├── gemini_generation.py      # Generate docstrings using Gemini API
│   │   ├── alignment.py              # Align functions with generated docstrings
│   │   ├── prompt.txt                # Prompt template for docstring generation
│   │   ├── depend_analysis.py        # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py   # Find functions missing scores
│   │   ├── folder_stat.py            # Repository folder statistics
│   │   ├── ppt.py                    # Visualization of alignment data
│   │   └── debug_parser.py           # Debug tree-sitter parser loading
│   └── README.md                     # DATA2 dataset documentation
│
├── data3/                            # DATA3: Programming Problems Generation Dataset
│   ├── main.py                       # RepoAgent: generate docs for repos
│   ├── gemini.py                     # Gemini API connectivity test
│   ├── load_dataset.py               # Load and inspect datasets
│   ├── instruct_generation.py        # Score functions for scientific relevance
│   ├── extract_functions.py          # Extract functions from enhanced_dataset.csv
│   ├── extract_functions_v2.py       # Extract functions v2 (better CSV/JSON handling)
│   ├── merge_datasets.py             # Merge res2.csv with dataset_all.csv
│   ├── generate_programming_problems.py # Generate problems using Gemini API
│   ├── generate_problems_batch.py    # Batch problem generation (OpenAI batch API)
│   ├── generate_problems_openai.py   # Problem generation via OpenAI API
│   ├── enrich_programming_problems.py # Enrich problems with source code context
│   ├── vllm_high.py                  # VLLM-based high-throughput inference
│   ├── vllm_qwen_batch.py            # Qwen model batch inference via VLLM
│   ├── show_pricing.py               # Display API pricing information
│   ├── check_enhanced.py             # Validate enhanced dataset
│   ├── check_index_distribution.py   # Check index distribution
│   ├── check_match.py                # Check data matching
│   ├── check_relationship.py         # Check data relationships
│   ├── is_sci_prompt.txt             # Prompt: classify code as scientific computing
│   ├── is_sci_prompt1.txt            # Prompt variant for scientific classification
│   ├── score_prompt.txt              # Prompt: score function relevance
│   ├── *.sh                          # Various shell scripts for batch processing
│   └── README.md                     # DATA3 dataset documentation
```

## Dataset Building Pipelines

### DATA1: Domain-Specific Code Dataset

**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.

**Pipeline** (executed in order):
1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance using an LLM (reads READMEs)
   - Clone relevant repos (shallow clone)
   - Filter to keep only code files

2. **External Data** (`download_dataset.py`)
   - Download the ChemPile code dataset from HuggingFace

3. **Merge & Deduplicate** (`merge_dataset.py`)
   - Merge crawled repos with ChemPile data
   - Deduplicate by content hash

4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and stars statistics

5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword

6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations
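
Step 3's content-hash deduplication can be sketched as follows. This is a minimal illustration, not the exact logic in `merge_dataset.py`; the SHA-256-over-normalized-text choice and the record field names are assumptions:

```python
import hashlib


def content_hash(text: str) -> str:
    # Normalize line endings and trailing whitespace so trivially
    # different copies of the same file collapse to one hash.
    # (This normalization strategy is an assumption for illustration.)
    normalized = "\n".join(line.rstrip() for line in text.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    # Keep the first record seen for each distinct content hash.
    seen: set[str] = set()
    unique = []
    for rec in records:
        h = content_hash(rec["content"])
        if h not in seen:
            seen.add(h)
            unique.append(rec)
    return unique
```

Hashing normalized text (rather than raw bytes) lets files that differ only in trailing whitespace or line endings collapse to a single entry.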

### DATA2: Code-Documentation Alignment Dataset

**Goal**: Generate high-quality docstrings for scientific code functions.

**Pipeline** (executed in order):
1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos

2. **Function Extraction** (`step22/func_stat.py`)
   - Parse code using tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)

3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries to the function dataset directories

4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
   - Score function quality using a Qwen embedding model
   - Score README quality using a Qwen embedding model

5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score

6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support

7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings
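
Step 5's threshold filtering can be sketched like this. It is a minimal illustration only: the field names (`func_score`, `readme_score`), the weighted-average combination, and the 0.7/0.3 weighting are assumptions, not what `function_req.py` necessarily does:

```python
def combined_score(func_score: float, readme_score: float,
                   func_weight: float = 0.7) -> float:
    # Weighted combination of the two embedding scores.
    # (The 0.7/0.3 split is a placeholder, not the project's actual weighting.)
    return func_weight * func_score + (1.0 - func_weight) * readme_score


def filter_functions(functions: list[dict], threshold: float = 0.5) -> list[dict]:
    # Keep only functions whose combined score clears the threshold.
    return [
        f for f in functions
        if combined_score(f["func_score"], f["readme_score"]) >= threshold
    ]
```

Combining the function-level and README-level scores lets the filter prefer well-written functions from well-documented repositories over either signal alone.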

### DATA3: Programming Problems Generation Dataset

**Goal**: Generate programming problems inspired by scientific code.

**Pipeline** (executed in order):
1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories

2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset

3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific-computing relevance
   - Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts

4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data

5. **Problem Generation** (`generate_programming_problems.py`)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control

6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context
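
Both step 5 here and DATA2's docstring generation mention budget monitoring with a circuit breaker. A minimal sketch of that pattern, halting API calls once estimated spend reaches a hard cap (the class name, the per-token rate, and the cost model are all assumptions for illustration):

```python
class BudgetExceeded(RuntimeError):
    """Raised when estimated spend reaches the configured cap."""


class BudgetMonitor:
    """Track estimated API spend and trip once a hard cap is reached."""

    def __init__(self, budget_usd: float, cost_per_1k_tokens: float = 0.002):
        # cost_per_1k_tokens is a placeholder rate, not a real API price.
        self.budget_usd = budget_usd
        self.cost_per_1k_tokens = cost_per_1k_tokens
        self.spent_usd = 0.0

    def record(self, tokens: int) -> None:
        # Call after each API response with the tokens it consumed;
        # raises instead of silently overspending.
        self.spent_usd += tokens / 1000 * self.cost_per_1k_tokens
        if self.spent_usd >= self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.budget_usd:.2f} budget"
            )
```

Raising an exception (rather than returning a flag) makes the breaker hard to ignore: the generation loop stops even if the caller forgets to check.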

## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)

## Notes

- Scripts contain hardcoded paths that must be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
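
The checkpoint/resume support mentioned in the last note typically looks like the following append-only JSONL pattern. This is a generic sketch, not the code of any specific script here; the `id`/`result` field names and the helper names are assumptions:

```python
import json
from pathlib import Path


def load_done_ids(checkpoint_path: Path) -> set[str]:
    # IDs already processed in a previous run, one JSON record per line.
    if not checkpoint_path.exists():
        return set()
    with checkpoint_path.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def process_with_resume(items: list[dict], checkpoint_path: Path,
                        process) -> None:
    # Skip items recorded in the checkpoint; append each new result
    # immediately so an interrupted run loses at most one item.
    done = load_done_ids(checkpoint_path)
    with checkpoint_path.open("a") as f:
        for item in items:
            if item["id"] in done:
                continue
            result = process(item)
            f.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            f.flush()
```

Because the checkpoint is append-only and flushed per item, a killed run can simply be restarted with the same command and it will pick up where it left off.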