---
tags:
- code
- programming
- dataset
pretty_name: "Coding Dataset"
---

# Coding Dataset

Production-grade dataset for training AI coding agents.

## Dataset Summary

- **Total Examples**: 6 (demo)
- **Languages**: Python, JavaScript, Java
- **Task Types**: Code Generation
- **License**: CC0-1.0

## Dataset Structure

### Data Splits

- train: 70% of data
- validation: 15% of data
- test: 15% of data

### Features

- `id` (string): Unique identifier
- `code` (string): Source code snippet
- `code_description` (string): Natural language description
- `programming_language` (string): Language (python, javascript, java, etc.)
- `task_type` (string): Type of task
- `difficulty_level` (string): Difficulty (beginner, intermediate, advanced, expert)
- `quality_score` (float): Quality score 0.0-1.0
- `is_tested` (bool): Code is tested
- `has_bugs` (bool): Known bugs exist
- `lines_of_code` (int): Number of lines
- `collected_at` (string): Collection timestamp

## Usage

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("romcmu863/code-dataset")

# Access splits
train = dataset['train']
validation = dataset['validation']
test = dataset['test']

# Get first example
example = train[0]
print(example['code_description'])
print(example['code'])
```

## License

CC0-1.0

## Created

2025-10-25
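
## Example: Filtering by Features

The `quality_score`, `is_tested`, and `has_bugs` features listed under Features lend themselves to selecting a cleaner subset before training. The following is a minimal sketch of that idea, not part of the dataset's tooling: the records are hypothetical stand-ins that follow the documented schema (their values are invented for illustration), and `select_clean` and the `0.8` threshold are assumptions.

```python
from typing import Any

# Hypothetical records following the documented feature schema.
# The values are invented for illustration, not taken from the dataset.
records: list[dict[str, Any]] = [
    {"id": "ex-001", "programming_language": "python",
     "quality_score": 0.92, "is_tested": True, "has_bugs": False},
    {"id": "ex-002", "programming_language": "javascript",
     "quality_score": 0.95, "is_tested": True, "has_bugs": True},
    {"id": "ex-003", "programming_language": "java",
     "quality_score": 0.55, "is_tested": True, "has_bugs": False},
    {"id": "ex-004", "programming_language": "python",
     "quality_score": 0.80, "is_tested": False, "has_bugs": False},
]


def select_clean(examples: list[dict[str, Any]],
                 min_quality: float = 0.8) -> list[dict[str, Any]]:
    """Keep tested, bug-free examples at or above a quality threshold."""
    return [
        ex for ex in examples
        if ex["is_tested"]
        and not ex["has_bugs"]
        and ex["quality_score"] >= min_quality
    ]


clean = select_clean(records)
print([ex["id"] for ex in clean])
```

With a split loaded through `load_dataset`, the same condition could likely be applied with the `filter` method of a `datasets.Dataset`, for example `train.filter(lambda ex: ex["is_tested"] and not ex["has_bugs"])`.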