# Evaluation Tasks
This directory contains evaluation tasks organized by use case.
## Structure
```
tasks/
├── sql_generation/          # SQL generation tasks
│   └── nyc_taxi_small/      # NYC Taxi dataset
├── code_generation/         # Code generation tasks
│   ├── python_algorithms/   # Python algorithm tasks
│   └── go_algorithms/       # Go algorithm tasks
└── documentation/           # Documentation generation tasks
    ├── technical_docs/      # Technical documentation tasks
    └── api_documentation/   # API documentation tasks
```
## Use Cases
### 1. SQL Generation
- **Purpose**: Evaluate models on natural language to SQL query generation
- **Datasets**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution success, result matching, dialect compliance
### 2. Code Generation
- **Purpose**: Evaluate models on natural language to source code generation
- **Languages**: Python, Go, JavaScript, Java
- **Datasets**: Algorithm implementations, web services, data structures
- **Metrics**: Syntax correctness, compilation success, execution success, code quality
### 3. Documentation Generation
- **Purpose**: Evaluate models on natural language to technical documentation generation
- **Formats**: Markdown, HTML, JSON, YAML
- **Datasets**: API docs, technical guides, installation instructions
- **Metrics**: Accuracy, completeness, clarity, format compliance
## Task Structure
Each task directory contains:
### Required Files
- `cases.yaml` - Test cases with questions and reference outputs
- `loader.py` - Data loading and test execution utilities
- `schema.sql` - Database schema (for SQL tasks)
- `test_data.json` - Test data for evaluation (for code/doc tasks)
### Optional Files
- `README.md` - Task-specific documentation
- `requirements.txt` - Task-specific dependencies
- `config.yaml` - Task-specific configuration
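As a minimal sketch of how this layout could be checked, the helper below reports which required files are missing from a task directory. The function name `missing_required_files` and the `use_case` argument are hypothetical and not part of this repository; the mapping assumes SQL tasks need `schema.sql` and code/doc tasks need `test_data.json`, as listed under Required Files.

```python
from pathlib import Path

# Files every task needs, per the "Required Files" list.
COMMON_FILES = ["cases.yaml", "loader.py"]

# Extra required file per use case (assumption: keyed by the
# top-level directory name shown in the Structure section).
EXTRA_FILES = {
    "sql_generation": ["schema.sql"],
    "code_generation": ["test_data.json"],
    "documentation": ["test_data.json"],
}


def missing_required_files(task_dir, use_case):
    """Return the required file names that do not exist in task_dir."""
    task_path = Path(task_dir)
    required = COMMON_FILES + EXTRA_FILES.get(use_case, [])
    return [name for name in required if not (task_path / name).is_file()]
```

For example, `missing_required_files("tasks/sql_generation/nyc_taxi_small", "sql_generation")` returns an empty list when the task is complete.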
## Adding New Tasks
1. Create a new directory under the appropriate use case
2. Add the required files (`cases.yaml`, `loader.py`, plus `schema.sql` for SQL tasks or `test_data.json` for code/doc tasks)
3. Define test cases with questions and reference outputs
4. Implement data loading and evaluation logic
5. Update the main configuration files
## Evaluation Metrics
### SQL Generation
- **Correctness**: Exact match with reference SQL
- **Execution Success**: SQL executes without errors
- **Result Matching**: F1 score comparing query results
- **Dialect Compliance**: Generated SQL transpiles correctly to the target dialect
- **Readability**: SQL structure and formatting
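The Result Matching metric could be computed along the following lines. This is a sketch under stated assumptions, not necessarily how this repository does it: each query result is treated as a multiset of rows, row order is ignored, and two empty results count as a perfect match. The name `result_f1` is hypothetical.

```python
from collections import Counter


def result_f1(predicted_rows, reference_rows):
    """F1 between two query results, each treated as a multiset of rows."""
    pred = Counter(tuple(row) for row in predicted_rows)
    ref = Counter(tuple(row) for row in reference_rows)
    if not pred and not ref:
        return 1.0  # assumption: two empty results are a perfect match
    # Multiset intersection: rows that appear in both results.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Comparing results rather than SQL text gives credit to queries that are written differently from the reference but return the same rows.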
### Code Generation
- **Syntax Correctness**: Code parses without syntax errors
- **Compilation Success**: Code builds successfully
- **Execution Success**: Code runs and produces expected output
- **Code Quality**: Follows language best practices
- **Performance**: Code efficiency and optimization
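For Python tasks, a Syntax Correctness check might look like the sketch below, which parses the generated source with the standard library's `ast` module and never executes it. The function name `python_syntax_ok` is hypothetical; other languages (Go, JavaScript, Java) would need their own parser or compiler invocation.

```python
import ast


def python_syntax_ok(source):
    """Return True if source parses as Python; the code is never run."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False
    return True
```

Parsing only establishes that the code is well-formed; Execution Success still requires running it against the test data.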
### Documentation Generation
- **Accuracy**: Content matches reference documentation
- **Completeness**: Covers all required information
- **Clarity**: Easy to understand and follow
- **Format Compliance**: Follows specified documentation format
- **Technical Correctness**: Technically accurate information
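One simple proxy for Completeness on Markdown output is the share of expected section headings that the generated document contains. The sketch below assumes each test case supplies a list of required headings; that field, and the name `heading_coverage`, are illustrative assumptions rather than part of this repository.

```python
import re

# Matches ATX headings such as "## Usage" or "# Usage #".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", re.MULTILINE)


def heading_coverage(markdown_text, required_headings):
    """Fraction of required headings present in the Markdown text."""
    if not required_headings:
        return 1.0  # nothing required, so nothing is missing
    found = {m.group(1).strip().lower() for m in HEADING_RE.finditer(markdown_text)}
    hits = sum(1 for h in required_headings if h.strip().lower() in found)
    return hits / len(required_headings)
```

A score like this only measures structure, so it complements rather than replaces the Accuracy and Technical Correctness checks.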