# Evaluation Tasks

This directory contains evaluation tasks organized by use case.

## Structure

```
tasks/
β”œβ”€β”€ sql_generation/           # SQL generation tasks
β”‚   └── nyc_taxi_small/      # NYC Taxi dataset
β”œβ”€β”€ code_generation/          # Code generation tasks
β”‚   β”œβ”€β”€ python_algorithms/   # Python algorithm tasks
β”‚   └── go_algorithms/       # Go algorithm tasks
└── documentation/           # Documentation generation tasks
    β”œβ”€β”€ technical_docs/      # Technical documentation tasks
    └── api_documentation/   # API documentation tasks
```

## Use Cases

### 1. SQL Generation
- **Purpose**: Evaluate models on translating natural language questions into SQL queries
- **Datasets**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution success, result matching, dialect compliance (see the transpilation sketch below)
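
Dialect compliance is usually checked by transpiling a query between dialects and confirming the result still parses. A minimal sketch, assuming the `sqlglot` library is available; the helper function and the example query are illustrative, not part of this repository:

```python
import sqlglot


def check_dialect_compliance(sql: str, source: str, target: str) -> bool:
    """Transpile `sql` from `source` to `target` and confirm the result parses."""
    try:
        transpiled = sqlglot.transpile(sql, read=source, write=target)[0]
        sqlglot.parse_one(transpiled, read=target)  # raises on invalid SQL
        return True
    except sqlglot.errors.ParseError:
        return False


# Example: a Presto query rewritten for BigQuery
print(check_dialect_compliance(
    "SELECT passenger_count, COUNT(*) FROM trips GROUP BY 1",
    source="presto", target="bigquery",
))
```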

### 2. Code Generation
- **Purpose**: Evaluate models on generating source code from natural language specifications
- **Languages**: Python, Go, JavaScript, Java
- **Datasets**: Algorithm implementations, web services, data structures
- **Metrics**: Syntax correctness, compilation success, execution success, code quality

### 3. Documentation Generation
- **Purpose**: Evaluate models on generating technical documentation from natural language prompts
- **Formats**: Markdown, HTML, JSON, YAML
- **Datasets**: API docs, technical guides, installation instructions
- **Metrics**: Accuracy, completeness, clarity, format compliance

## Task Structure

Each task directory contains:

### Required Files
- `cases.yaml` - Test cases with questions and reference outputs (see the sketch below)
- `loader.py` - Data loading and test execution utilities
- `schema.sql` - Database schema (for SQL tasks)
- `test_data.json` - Test data for evaluation (for code/doc tasks)
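
The exact schema of `cases.yaml` is defined by each task's loader; the layout and field names below (`cases`, `question`, `reference_sql`, the `trips_per_day` case) are illustrative assumptions, not a fixed contract:

```yaml
# Hypothetical cases.yaml layout -- field names are assumptions
task: nyc_taxi_small
dialect: presto
cases:
  - id: trips_per_day
    question: "How many taxi trips were taken on each day in January 2023?"
    reference_sql: |
      SELECT DATE(pickup_datetime) AS day, COUNT(*) AS trips
      FROM trips
      WHERE pickup_datetime >= DATE '2023-01-01'
        AND pickup_datetime <  DATE '2023-02-01'
      GROUP BY 1
      ORDER BY 1
```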

### Optional Files
- `README.md` - Task-specific documentation
- `requirements.txt` - Task-specific dependencies
- `config.yaml` - Task-specific configuration

## Adding New Tasks

1. Create a new directory under the appropriate use case
2. Add the required files (`cases.yaml`, `loader.py`)
3. Define test cases with questions and reference outputs
4. Implement data loading and evaluation logic in `loader.py` (a minimal sketch follows this list)
5. Update the main configuration files
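
As a starting point, a `loader.py` typically only needs to expose the task's cases and a way to score a model output against the reference. The interface below is a hedged sketch built on the hypothetical `cases.yaml` above; the function names are assumptions, not the interface the harness requires:

```python
# loader.py -- illustrative skeleton; function names are assumptions
from pathlib import Path

import yaml  # PyYAML, assumed to be installed


def load_cases(task_dir: str) -> list[dict]:
    """Read test cases from the task's cases.yaml."""
    with open(Path(task_dir) / "cases.yaml", encoding="utf-8") as f:
        return yaml.safe_load(f)["cases"]


def evaluate(case: dict, model_output: str) -> dict:
    """Score one model output against its reference (exact match shown here)."""
    expected = case["reference_sql"].strip()
    return {"id": case["id"], "exact_match": model_output.strip() == expected}
```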

## Evaluation Metrics

### SQL Generation
- **Correctness**: Exact match with reference SQL
- **Execution Success**: SQL executes without errors
- **Result Matching**: F1 score comparing query results (see the sketch after this list)
- **Dialect Compliance**: Proper SQL transpilation
- **Readability**: SQL structure and formatting
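
A common way to compute the result-matching F1 is to treat the rows returned by the candidate and reference queries as multisets; the sketch below assumes that convention and is not necessarily the exact scoring used here:

```python
from collections import Counter


def result_f1(candidate_rows: list[tuple], reference_rows: list[tuple]) -> float:
    """F1 over result rows, treating each result set as a multiset of rows."""
    cand, ref = Counter(candidate_rows), Counter(reference_rows)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(result_f1([(1, "a"), (2, "b")], [(1, "a"), (3, "c")]))  # 0.5
```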

### Code Generation
- **Syntax Correctness**: Code parses without syntax errors (see the check sketched below)
- **Compilation Success**: Code builds successfully
- **Execution Success**: Code runs and produces expected output
- **Code Quality**: Follows language best practices
- **Performance**: Code efficiency and optimization
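
For Python submissions, syntax correctness can be checked without executing the code by parsing it; a minimal sketch (for compiled languages the equivalent check would invoke the compiler instead):

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python, i.e. contains no syntax errors."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


print(is_valid_python("def add(a, b):\n    return a + b"))  # True
print(is_valid_python("def add(a, b) return a + b"))        # False
```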

### Documentation Generation
- **Accuracy**: Content matches reference documentation
- **Completeness**: Covers all required information
- **Clarity**: Easy to understand and follow
- **Format Compliance**: Follows the specified documentation format (see the check sketched below)
- **Technical Correctness**: Technically accurate information
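
For the structured output formats (JSON and YAML), format compliance can be approximated by checking that the generated document parses at all. The sketch below assumes PyYAML is available and is only a first-pass check, not the full scoring:

```python
import json

import yaml  # PyYAML, assumed to be installed


def parses_as(fmt: str, text: str) -> bool:
    """Rough format-compliance check: does the text parse as the requested format?"""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "yaml":
            yaml.safe_load(text)
        else:
            return True  # Markdown/HTML need a structural check, omitted here
        return True
    except (json.JSONDecodeError, yaml.YAMLError):
        return False


print(parses_as("json", '{"endpoint": "/v1/trips", "method": "GET"}'))  # True
```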