# Evaluation Tasks
This directory contains evaluation tasks organized by use case.
## Structure
```
tasks/
├── sql_generation/            # SQL generation tasks
│   └── nyc_taxi_small/        # NYC Taxi dataset
├── code_generation/           # Code generation tasks
│   ├── python_algorithms/     # Python algorithm tasks
│   └── go_algorithms/         # Go algorithm tasks
└── documentation/             # Documentation generation tasks
    ├── technical_docs/        # Technical documentation tasks
    └── api_documentation/     # API documentation tasks
```
## Use Cases
### 1. SQL Generation
- **Purpose**: Evaluate models on natural language to SQL query generation
- **Datasets**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake (see the transpilation sketch after this list)
- **Metrics**: Correctness, execution success, result matching, dialect compliance
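The repository does not name a transpilation tool, but a library such as sqlglot illustrates the dialect step; the query below is a hypothetical example, not a case from this task set.
```python
# Sketch: transpiling one Presto query into the other target dialects.
# Assumes the sqlglot library; the query itself is hypothetical.
import sqlglot

presto_sql = (
    "SELECT passenger_count, AVG(trip_distance) AS avg_dist "
    "FROM trips GROUP BY passenger_count"
)
for dialect in ("bigquery", "snowflake"):
    # transpile() returns one string per input statement
    print(dialect, sqlglot.transpile(presto_sql, read="presto", write=dialect)[0])
```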
### 2. Code Generation
- **Purpose**: Evaluate models on natural language to source code generation
- **Languages**: Python, Go, JavaScript, Java
- **Datasets**: Algorithm implementations, web services, data structures
- **Metrics**: Syntax correctness, compilation success, execution success, code quality
### 3. Documentation Generation
- **Purpose**: Evaluate models on natural language to technical documentation generation
- **Formats**: Markdown, HTML, JSON, YAML
- **Datasets**: API docs, technical guides, installation instructions
- **Metrics**: Accuracy, completeness, clarity, format compliance
## Task Structure
Each task directory contains:
### Required Files
- `cases.yaml` - Test cases with questions and reference outputs
- `loader.py` - Data loading and test execution utilities (a loading sketch follows this list)
- `schema.sql` - Database schema (for SQL tasks)
- `test_data.json` - Test data for evaluation (for code/doc tasks)
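As a rough sketch of how these files fit together, the snippet below loads a task's `cases.yaml`. It assumes PyYAML and a hypothetical layout with a top-level `cases` list whose entries carry `id`, `question`, and `reference` keys; the actual schema is defined per task.
```python
# Minimal loading sketch (assumes PyYAML and a hypothetical cases.yaml
# layout: a top-level "cases" list of id/question/reference entries).
from pathlib import Path
import yaml

def load_cases(task_dir: str) -> list[dict]:
    """Return the test cases declared in <task_dir>/cases.yaml."""
    raw = yaml.safe_load(Path(task_dir, "cases.yaml").read_text())
    return raw["cases"]

for case in load_cases("tasks/sql_generation/nyc_taxi_small"):
    print(case["id"], case["question"])
```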
### Optional Files
- `README.md` - Task-specific documentation
- `requirements.txt` - Task-specific dependencies
- `config.yaml` - Task-specific configuration
## Adding New Tasks
1. Create a new directory under the appropriate use case
2. Add the required files (`cases.yaml`, `loader.py`)
3. Define test cases with questions and reference outputs
4. Implement data loading and evaluation logic (a minimal scoring sketch follows this list)
5. Update the main configuration files
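The evaluation logic in step 4 varies by use case; a hypothetical scoring hook might look like the following, where the function name and report shape are illustrative rather than a prescribed interface.
```python
# Hypothetical evaluation hook for a new task's loader.py; the interface
# and the exact-match metric are illustrative, not prescribed by this repo.
def evaluate(case: dict, model_output: str) -> dict:
    """Score one model output against the case's reference output."""
    reference = case["reference"].strip()
    candidate = model_output.strip()
    return {
        "id": case["id"],
        "exact_match": candidate == reference,  # simplest possible check
    }
```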
## Evaluation Metrics
### SQL Generation
- **Correctness**: Exact match with reference SQL
- **Execution Success**: SQL executes without errors
- **Result Matching**: F1 score comparing query results (see the sketch after this list)
- **Dialect Compliance**: Correct transpilation into the target SQL dialect
- **Readability**: SQL structure and formatting
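Treating each result set as a multiset of rows, the F1 computation can be sketched as follows (assuming rows arrive as hashable tuples; this is not necessarily the repository's exact scorer).
```python
# Sketch: F1 over query result rows, with results treated as multisets
# (assumes rows are hashable tuples).
from collections import Counter

def result_f1(predicted: list[tuple], reference: list[tuple]) -> float:
    if not predicted and not reference:
        return 1.0  # two empty result sets agree perfectly
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# One matching row out of two on each side -> precision = recall = F1 = 0.5
print(result_f1([(1, "a"), (2, "b")], [(1, "a"), (3, "c")]))
```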
### Code Generation
- **Syntax Correctness**: Code parses without syntax errors (the sketch after this list covers the first three checks)
- **Compilation Success**: Code builds successfully
- **Execution Success**: Code runs and produces expected output
- **Code Quality**: Follows language best practices
- **Performance**: Code efficiency and optimization
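For Python candidates, the first three checks can be wired together as below; the timeout and the stdout comparison are assumptions, not settings taken from this repository.
```python
# Sketch: syntax and execution checks for a Python candidate file
# (10-second timeout and stdout comparison are assumed, not repo settings).
import subprocess

def check_candidate(path: str, expected_stdout: str) -> dict:
    with open(path) as f:
        source = f.read()
    try:
        compile(source, path, "exec")  # syntax check only, no execution
    except SyntaxError:
        return {"syntax": False, "runs": False, "output_match": False}
    proc = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=10
    )
    runs = proc.returncode == 0
    return {
        "syntax": True,
        "runs": runs,
        "output_match": runs and proc.stdout.strip() == expected_stdout.strip(),
    }
```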
### Documentation Generation
- **Accuracy**: Content matches reference documentation
- **Completeness**: Covers all required information
- **Clarity**: Easy to understand and follow
- **Format Compliance**: Output follows the specified documentation format (see the parse-check sketch after this list)
- **Technical Correctness**: Information is technically accurate
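For the JSON and YAML output formats, format compliance can be checked with a plain parse attempt, as in the sketch below (assuming PyYAML; Markdown and HTML would need format-specific linters).
```python
# Sketch: format-compliance checks for JSON and YAML output
# (assumes PyYAML; Markdown/HTML checks would need dedicated linters).
import json
import yaml

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def is_valid_yaml(text: str) -> bool:
    try:
        yaml.safe_load(text)
        return True
    except yaml.YAMLError:
        return False
```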