Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.

Project Structure

dataset_builder/
├── README.md                          # This file
│
├── data1/                             # DATA1: Domain-Specific Code Dataset
│   ├── main.py                        # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                     # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                        # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py            # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py               # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                    # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py      # Compute stars/keyword statistics
│   ├── compute_statistics.py          # Compute code statistics from JSONL analysis files
│   ├── rename.py                      # Rename repo directories to owner___repo format
│   ├── rename2.py                     # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                 # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py     # Export repo files to CSV grouped by keyword
│   ├── reporting/                     # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                    # Reporting entry point
│   │   ├── visualization.py           # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py          # Scan repo-level metadata
│   │   ├── code_file_stats.py         # File-level code statistics
│   │   ├── code_file_stats_fast.py    # Optimized file-level statistics
│   │   ├── stage_a_stats.py           # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py           # Stage B (clone/filter) statistics
│   │   └── join_insights.py           # Join and cross-analyze insights
│   └── README.md                      # DATA1 dataset documentation
│
├── data2/                             # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/        # README summarization pipeline
│   │   ├── pipeline.py                # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py   # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py  # Extract functions from repos
│   │   ├── schemas.py                 # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt   # Prompt for function extraction
│   │       └── readme_summary.txt     # Prompt for README summarization
│   ├── step22/                        # Function scoring, generation, alignment
│   │   ├── build.py                   # Build tree-sitter language parsers
│   │   ├── func_stat.py               # Extract functions using tree-sitter
│   │   ├── md_stat.py                 # Extract & save README summaries
│   │   ├── emb_qwen_func.py           # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py             # Score READMEs using Qwen embedding model
│   │   ├── function_req.py            # Filter functions by score threshold
│   │   ├── gemini_generation.py       # Generate docstrings using Gemini API
│   │   ├── alignment.py               # Align functions with generated docstrings
│   │   ├── prompt.txt                 # Prompt template for docstring generation
│   │   ├── depend_analysis.py         # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py    # Find functions missing scores
│   │   ├── folder_stat.py             # Repository folder statistics
│   │   ├── ppt.py                     # Visualization of alignment data
│   │   └── debug_parser.py            # Debug tree-sitter parser loading
│   └── README.md                      # DATA2 dataset documentation
│
└── data3/                             # DATA3: Programming Problems Generation Dataset
    ├── main.py                        # RepoAgent: generate docs for repos
    ├── gemini.py                      # Gemini API connectivity test
    ├── load_dataset.py                # Load and inspect datasets
    ├── instruct_generation.py         # Score functions for scientific relevance
    ├── extract_functions.py           # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py        # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py              # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py  # Generate problems using Gemini API
    ├── generate_problems_batch.py     # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py    # Problem generation via OpenAI API
    ├── enrich_programming_problems.py # Enrich problems with source code context
    ├── vllm_high.py                   # vLLM-based high-throughput inference
    ├── vllm_qwen_batch.py             # Qwen model batch inference via vLLM
    ├── show_pricing.py                # Display API pricing information
    ├── check_enhanced.py              # Validate enhanced dataset
    ├── check_index_distribution.py    # Check index distribution
    ├── check_match.py                 # Check data matching
    ├── check_relationship.py          # Check data relationships
    ├── is_sci_prompt.txt              # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt             # Prompt variant for scientific classification
    ├── score_prompt.txt               # Prompt: score function relevance
    ├── *.sh                           # Various shell scripts for batch processing
    └── README.md                      # DATA3 dataset documentation

Dataset Building Pipelines

DATA1: Domain-Specific Code Dataset

Goal: Collect, filter, and export domain-specific code from GitHub repositories.

Pipeline (executed in order):

  1. Keyword Expansion & Search (main.py / main_v2.py)

    • Expand scientific keywords using an LLM
    • Search the GitHub API for repositories matching the expanded keywords (see the search sketch after this list)
    • Check relevance using an LLM that reads repository READMEs
    • Clone relevant repos (shallow clone)
    • Filter to keep only code files
  2. External Data (download_dataset.py)

    • Download ChemPile code dataset from HuggingFace
  3. Merge & Deduplicate (merge_dataset.py)

    • Merge crawled repos with ChemPile data
    • Deduplicate by content hash (see the hashing sketch after this list)
  4. Analysis (analysis.py, compute_*.py)

    • Analyze code metrics (lines, comments, functions, tokens)
    • Compute keyword and stars statistics
  5. Export (scripts/export_files_to_csv.py)

    • Export final dataset to CSV files grouped by keyword
  6. Reporting (reporting/)

    • Generate statistical reports and visualizations
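
A minimal sketch of the repository search in step 1, using the GitHub Search REST API via requests. The keyword qualifier, star threshold, page count, and GITHUB_TOKEN environment variable name are illustrative assumptions; the actual query construction lives in main.py / main_v2.py.

    import os
    import requests

    HEADERS = {"Authorization": f"Bearer {os.environ.get('GITHUB_TOKEN', '')}"}

    def search_repos(keyword, min_stars=50, pages=2):
        """Return basic metadata for repositories matching one keyword."""
        repos = []
        for page in range(1, pages + 1):
            resp = requests.get(
                "https://api.github.com/search/repositories",
                headers=HEADERS,
                params={
                    "q": f"{keyword} in:readme,description stars:>={min_stars}",
                    "sort": "stars",
                    "order": "desc",
                    "per_page": 100,
                    "page": page,
                },
                timeout=30,
            )
            resp.raise_for_status()
            for item in resp.json().get("items", []):
                repos.append({
                    "full_name": item["full_name"],
                    "stars": item["stargazers_count"],
                    "clone_url": item["clone_url"],
                    "keyword": keyword,
                })
        return repos

Each returned record keeps the keyword that surfaced it, which is what the later CSV export groups on.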
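
Deduplication in step 3 keys on a content hash. A minimal sketch, assuming SHA-256 over whitespace-normalized file text (merge_dataset.py may normalize differently):

    import hashlib
    from pathlib import Path

    def content_key(path):
        """Hash file content after stripping trailing whitespace on each line."""
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        normalized = "\n".join(line.rstrip() for line in text.splitlines())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def deduplicate(paths):
        """Keep the first file seen for each content hash."""
        seen, unique = set(), []
        for path in paths:
            key = content_key(path)
            if key not in seen:
                seen.add(key)
                unique.append(path)
        return unique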

DATA2: Code-Documentation Alignment Dataset

Goal: Generate high-quality docstrings for scientific code functions.

Pipeline (executed in order):

  1. README Summarization (instruction_generation/)

    • Summarize repository READMEs using an LLM
    • Extract structured information from repos
  2. Function Extraction (step22/func_stat.py)

    • Parse code using tree-sitter to extract functions (see the parsing sketch after this list)
    • Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
  3. README Processing (step22/md_stat.py)

    • Copy README summaries to function dataset directories
  4. Embedding Scoring (step22/emb_qwen_func.py, emb_qwen_md.py)

    • Score function quality using the Qwen embedding model (see the scoring sketch after this list)
    • Score README quality using the Qwen embedding model
  5. Function Filtering (step22/function_req.py)

    • Filter functions by combined quality score
  6. Docstring Generation (step22/gemini_generation.py)

    • Generate docstrings using the Gemini API
    • Budget monitoring with circuit breaker
    • Checkpoint/resume support
  7. Alignment (step22/alignment.py)

    • Merge function data with generated docstrings
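
Function extraction in step 2 walks tree-sitter parse trees. The sketch below shows the idea for Python only, using the tree_sitter >= 0.22 bindings; the real func_stat.py builds its parsers via build.py and handles several languages, so treat this as an illustration rather than the project's exact code.

    import tree_sitter_python as tspython
    from tree_sitter import Language, Parser

    parser = Parser(Language(tspython.language()))

    def extract_functions(source: bytes):
        """Yield (name, code) pairs for every function definition in a Python file."""
        tree = parser.parse(source)

        def walk(node):
            if node.type == "function_definition":
                name_node = node.child_by_field_name("name")
                yield (source[name_node.start_byte:name_node.end_byte].decode(),
                       source[node.start_byte:node.end_byte].decode())
            for child in node.children:
                yield from walk(child)

        yield from walk(tree.root_node)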
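
One way to picture the embedding scoring in step 4: embed each function and a short description of the target domain, then use cosine similarity as the score. The sketch below does this with transformers and mean pooling; the model name, pooling choice, and reference text are assumptions, and emb_qwen_func.py (which serves the Qwen embedding model, e.g. through vLLM) may define the score differently.

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"  # assumed embedding model
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)

    def embed(texts):
        """Mean-pooled embeddings for a batch of texts (one plausible pooling scheme)."""
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=2048, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state       # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    def score_functions(functions, domain_description):
        """Cosine similarity of each function body against a domain description."""
        ref = embed([domain_description])
        return F.cosine_similarity(embed(functions), ref).tolist()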

DATA3: Programming Problems Generation Dataset

Goal: Generate programming problems inspired by scientific code.

Pipeline (executed in order):

  1. Documentation Generation (main.py)

    • Use RepoAgent to generate documentation for repositories
  2. Function Extraction (extract_functions.py, extract_functions_v2.py)

    • Extract individual functions from enhanced dataset
  3. Scientific Relevance Scoring (instruct_generation.py)

    • Score functions for scientific computing relevance
    • Use is_sci_prompt.txt and score_prompt.txt as prompts
  4. Dataset Merge (merge_datasets.py)

    • Merge function scores with source code data
  5. Problem Generation (generate_programming_problems.py)

    • Generate programming problems using the Gemini API
    • Filter by relevance score
    • Budget monitoring and cost control (see the budget-tracking sketch after this list)
  6. Enrichment (enrich_programming_problems.py)

    • Enrich generated problems with source code context
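
The budget monitoring in step 5 amounts to accumulating token costs after every API call and stopping once a spend ceiling is reached. A minimal sketch of that pattern using the OpenAI client (also used by generate_problems_openai.py); the model name, per-token prices, and budget are placeholders, not the project's actual settings.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PRICE_IN, PRICE_OUT = 0.15, 0.60  # placeholder USD per 1M tokens
    BUDGET_USD = 20.0                 # placeholder spend ceiling

    def generate_problems(prompts, model="gpt-4o-mini"):
        """Generate one problem per prompt, stopping early if the budget runs out."""
        spent, results = 0.0, []
        for prompt in prompts:
            if spent >= BUDGET_USD:
                print(f"Budget of ${BUDGET_USD:.2f} reached after {len(results)} items.")
                break
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            spent += (resp.usage.prompt_tokens * PRICE_IN +
                      resp.usage.completion_tokens * PRICE_OUT) / 1_000_000
            results.append(resp.choices[0].message.content)
        return results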

Dependencies

Common

  • pandas, tqdm, jsonlines
  • python-dotenv

DATA1

  • langchain, langchain-openai, pydantic, loguru
  • requests (GitHub API)
  • matplotlib, seaborn, wordcloud (reporting)
  • datasets (HuggingFace)

DATA2

  • tree-sitter, tree-sitter-python, tree-sitter-c, etc.
  • vllm, transformers, torch (embedding scoring)
  • google-cloud-aiplatform, vertexai (Gemini API)

DATA3

  • google-cloud-aiplatform, vertexai (Gemini API)
  • openai (OpenAI API)
  • vllm, transformers, torch (local inference)

Notes

  • Scripts contain hardcoded paths that need to be updated for your environment
  • API credentials (GitHub token, Gemini, OpenAI) need to be configured separately
  • Large datasets require significant storage and compute resources
  • Most scripts support checkpoint/resume for long-running processes (see the sketch below)
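
The checkpoint/resume support typically follows a simple pattern: append each result to a JSONL file as soon as it is produced, and on restart skip any item whose ID already appears there. A generic sketch (the file name and key names are illustrative):

    import json
    from pathlib import Path

    CHECKPOINT = Path("results.jsonl")  # illustrative output path

    def load_done_ids():
        """Collect IDs already written to the checkpoint file, if it exists."""
        if not CHECKPOINT.exists():
            return set()
        with CHECKPOINT.open() as f:
            return {json.loads(line)["id"] for line in f if line.strip()}

    def run(items, process):
        """Process items, appending each result immediately so reruns can resume."""
        done = load_done_ids()
        with CHECKPOINT.open("a") as f:
            for item in items:
                if item["id"] in done:
                    continue
                record = {"id": item["id"], "output": process(item)}
                f.write(json.dumps(record) + "\n")
                f.flush()  # make progress durable after every item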