
# Evaluation Tasks

This directory contains evaluation tasks organized by use case.

## Structure

```
tasks/
├── sql_generation/           # SQL generation tasks
│   └── nyc_taxi_small/       # NYC Taxi dataset
├── code_generation/          # Code generation tasks
│   ├── python_algorithms/    # Python algorithm tasks
│   └── go_algorithms/        # Go algorithm tasks
└── documentation/            # Documentation generation tasks
    ├── technical_docs/       # Technical documentation tasks
    └── api_documentation/    # API documentation tasks
```

## Use Cases

### 1. SQL Generation

- **Purpose:** Evaluate models on natural-language-to-SQL query generation
- **Datasets:** NYC Taxi Small
- **Dialects:** Presto, BigQuery, Snowflake
- **Metrics:** Correctness, execution success, result matching, dialect compliance

### 2. Code Generation

- **Purpose:** Evaluate models on natural-language-to-source-code generation
- **Languages:** Python, Go, JavaScript, Java
- **Datasets:** Algorithm implementations, web services, data structures
- **Metrics:** Syntax correctness, compilation success, execution success, code quality

### 3. Documentation Generation

- **Purpose:** Evaluate models on natural-language-to-technical-documentation generation
- **Formats:** Markdown, HTML, JSON, YAML
- **Datasets:** API docs, technical guides, installation instructions
- **Metrics:** Accuracy, completeness, clarity, format compliance

## Task Structure

Each task directory contains:

### Required Files

- `cases.yaml` - Test cases with questions and reference outputs
- `loader.py` - Data loading and test execution utilities
- `schema.sql` - Database schema (SQL tasks only)
- `test_data.json` - Test data for evaluation (code and documentation tasks only)
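
A `cases.yaml` might look like the following. This is an illustrative sketch: the field names (`task`, `dialects`, `cases`, `id`, `question`, `reference_sql`) are assumptions about the schema, not something the harness prescribes.

```yaml
# Hypothetical cases.yaml layout -- field names are illustrative assumptions.
task: nyc_taxi_small
dialects: [presto, bigquery, snowflake]
cases:
  - id: avg_fare_by_borough
    question: "What is the average fare amount per pickup borough?"
    reference_sql: |
      SELECT pickup_borough, AVG(fare_amount) AS avg_fare
      FROM trips
      GROUP BY pickup_borough
```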

### Optional Files

- `README.md` - Task-specific documentation
- `requirements.txt` - Task-specific dependencies
- `config.yaml` - Task-specific configuration

## Adding New Tasks

1. Create a new directory under the appropriate use case
2. Add the required files (`cases.yaml`, `loader.py`)
3. Define test cases with questions and reference outputs
4. Implement data loading and evaluation logic
5. Update the main configuration files
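
The loading step above can be sketched as follows. The `Case` fields and the `id`/`question`/`reference` keys are assumptions about the `cases.yaml` schema, not the project's actual interface:

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One evaluation case: a question plus its reference output."""
    case_id: str
    question: str
    reference: str


def load_cases(raw_cases):
    """Turn parsed cases.yaml entries (a list of dicts) into Case objects.

    The key names here are hypothetical; adapt them to the actual schema.
    """
    return [Case(c["id"], c["question"], c["reference"]) for c in raw_cases]


# Example: the kind of structure parsing cases.yaml might yield.
raw = [{"id": "q1",
        "question": "Total trips per day?",
        "reference": "SELECT pickup_date, COUNT(*) FROM trips GROUP BY pickup_date"}]
cases = load_cases(raw)
```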

## Evaluation Metrics

### SQL Generation

- **Correctness:** Exact match against the reference SQL
- **Execution Success:** The generated SQL executes without errors
- **Result Matching:** F1 score comparing predicted and reference query results
- **Dialect Compliance:** The SQL transpiles correctly to the target dialect
- **Readability:** SQL structure and formatting
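
The result-matching metric can be sketched as a set-based F1 over result rows. This is a simplification, assuming rows are hashable tuples and ignoring row order and duplicates:

```python
def result_f1(predicted_rows, reference_rows):
    """Set-based F1 between two query result sets.

    Simplified sketch: rows must be hashable (e.g. tuples), and both
    ordering and duplicate rows are ignored.
    """
    pred, ref = set(predicted_rows), set(reference_rows)
    if not pred and not ref:
        return 1.0  # both queries returned nothing: treat as a perfect match
    tp = len(pred & ref)  # rows present in both result sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, if the predicted and reference results each have two rows and share exactly one, precision and recall are both 0.5, so the F1 score is 0.5.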

### Code Generation

- **Syntax Correctness:** The code parses without syntax errors
- **Compilation Success:** The code compiles or builds successfully
- **Execution Success:** The code runs and produces the expected output
- **Code Quality:** The code follows language best practices
- **Performance:** Code efficiency and optimization
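
The syntax-correctness gate can be checked cheaply before any compilation or execution step. A minimal sketch for Python candidates, using the standard library's `ast` module:

```python
import ast


def python_syntax_ok(source: str) -> bool:
    """Return True if the candidate Python source parses.

    This covers only the first gate (syntax); compilation and execution
    checks go further and need a sandboxed runner.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```

Go or Java candidates would need their own toolchain (e.g. invoking the compiler in check-only mode) for the equivalent gate.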

### Documentation Generation

- **Accuracy:** Content matches the reference documentation
- **Completeness:** Covers all required information
- **Clarity:** Easy to understand and follow
- **Format Compliance:** Follows the specified documentation format
- **Technical Correctness:** Information is technically accurate
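
For machine-readable output formats, format compliance can be gated mechanically. A minimal sketch for JSON output; a YAML check would follow the same pattern (e.g. via `yaml.safe_load`), while Markdown and HTML typically need a linter rather than a parser pass/fail:

```python
import json


def json_format_ok(candidate: str) -> bool:
    """Return True if the candidate documentation is well-formed JSON.

    This checks format compliance only; accuracy and completeness
    require comparison against the reference documentation.
    """
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False
```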