# Evaluation Tasks

This directory contains evaluation tasks organized by use case.
## Structure

```
tasks/
├── sql_generation/          # SQL generation tasks
│   └── nyc_taxi_small/      # NYC Taxi dataset
├── code_generation/         # Code generation tasks
│   ├── python_algorithms/   # Python algorithm tasks
│   └── go_algorithms/       # Go algorithm tasks
└── documentation/           # Documentation generation tasks
    ├── technical_docs/      # Technical documentation tasks
    └── api_documentation/   # API documentation tasks
```
## Use Cases

### 1. SQL Generation

- **Purpose**: Evaluate models on generating SQL queries from natural language
- **Datasets**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution success, result matching, dialect compliance
### 2. Code Generation

- **Purpose**: Evaluate models on generating source code from natural language
- **Languages**: Python, Go, JavaScript, Java
- **Datasets**: Algorithm implementations, web services, data structures
- **Metrics**: Syntax correctness, compilation success, execution success, code quality
### 3. Documentation Generation

- **Purpose**: Evaluate models on generating technical documentation from natural language
- **Formats**: Markdown, HTML, JSON, YAML
- **Datasets**: API docs, technical guides, installation instructions
- **Metrics**: Accuracy, completeness, clarity, format compliance
## Task Structure

Each task directory contains:

### Required Files

- `cases.yaml` - Test cases with questions and reference outputs
- `loader.py` - Data loading and test execution utilities (see the sketch after this list)
- `schema.sql` - Database schema (for SQL tasks)
- `test_data.json` - Test data for evaluation (for code/doc tasks)
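As a rough illustration, a `loader.py` might read `cases.yaml` like this. This is a minimal sketch: the top-level `cases` key and the `id`/`question`/`reference` field names are assumptions for the example, not a required schema.

```python
# Hypothetical loader.py entry point. The cases.yaml layout assumed here
# (a top-level "cases" list with id/question/reference fields) is an
# illustration only; each task defines its own schema.
import yaml

def load_cases(path: str = "cases.yaml") -> list[dict]:
    """Return test cases as a list of question/reference dicts."""
    with open(path) as f:
        return yaml.safe_load(f)["cases"]

if __name__ == "__main__":
    for case in load_cases():
        print(case["id"], "->", case["question"])
```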
### Optional Files

- `README.md` - Task-specific documentation
- `requirements.txt` - Task-specific dependencies
- `config.yaml` - Task-specific configuration
## Adding New Tasks

1. Create a new directory under the appropriate use case
2. Add the required files (`cases.yaml`, `loader.py`)
3. Define test cases with questions and reference outputs
4. Implement data loading and evaluation logic (a skeleton follows this list)
5. Update the main configuration files
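A bare-bones skeleton for step 4 might look like the following; the `evaluate` name, signature, and returned keys are illustrative assumptions, not a required interface.

```python
# Skeleton evaluation hook for a new task's loader.py. Extend the
# returned dict with whatever metrics the task needs; exact_match is
# just the simplest possible starting point.
def evaluate(model_output: str, reference: str) -> dict:
    """Score one model output against its reference output."""
    return {"exact_match": model_output.strip() == reference.strip()}
```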
## Evaluation Metrics

### SQL Generation

- **Correctness**: Exact match with reference SQL
- **Execution Success**: SQL executes without errors
- **Result Matching**: F1 score comparing query results (see the sketch below)
- **Dialect Compliance**: Proper SQL transpilation
- **Readability**: SQL structure and formatting
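One way to realize the result-matching score is to compare the predicted and reference result rows as sets; this is a hedged sketch, and the actual scorer may normalize ordering, duplicates, or types differently.

```python
# Illustrative row-set F1 between two query results. Treating rows as a
# set (ignoring order and duplicates) is an assumption of this sketch.
def result_f1(predicted_rows, reference_rows) -> float:
    pred = set(map(tuple, predicted_rows))
    ref = set(map(tuple, reference_rows))
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```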
### Code Generation

- **Syntax Correctness**: Code parses without syntax errors (sketched below)
- **Compilation Success**: Code builds successfully
- **Execution Success**: Code runs and produces expected output
- **Code Quality**: Follows language best practices
- **Performance**: Code efficiency and optimization
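For Python outputs, the syntax check can be as simple as parsing with the standard library; compiled languages would instead shell out to their toolchains (e.g. `go build`). A minimal sketch:

```python
# Minimal syntax check for generated Python code using the stdlib parser.
import ast

def is_syntactically_valid(source: str) -> bool:
    """True if the source parses as valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```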
### Documentation Generation

- **Accuracy**: Content matches reference documentation
- **Completeness**: Covers all required information (a toy check follows)
- **Clarity**: Easy to understand and follow
- **Format Compliance**: Follows specified documentation format
- **Technical Correctness**: Technically accurate information
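A completeness score could, for example, check that every required section heading appears in the generated document; the heading names below are placeholders, since the required sections vary per task.

```python
# Toy completeness check: fraction of required headings present in the
# generated markdown. The REQUIRED_SECTIONS list is a per-task assumption.
REQUIRED_SECTIONS = ["## Installation", "## Usage", "## API Reference"]

def completeness(doc: str) -> float:
    """Return the fraction of required sections found in the document."""
    found = sum(1 for section in REQUIRED_SECTIONS if section in doc)
    return found / len(REQUIRED_SECTIONS)
```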