
Esoteric Benchmark Runner

A CLI tool to execute EsoLang-Bench tasks on large language models, measure their code generation accuracy for esoteric languages, and generate detailed performance reports. This tool automates the benchmarking process for AI researchers and developers working with LLMs.

What It Does

  • Executes EsoLang-Bench tasks using OpenAI models (e.g., GPT-4).
  • Measures code generation accuracy and execution time.
  • Supports output in JSON or CSV format.

Installation

  • Python 3.7+
  • Required Python packages:
      • click
      • pandas
      • openai (1.0 or later; the script uses the v1 client API)

Install the required packages using pip:

pip install click pandas openai
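
If you prefer the requirements.txt route used in the Quick Install below, a minimal file could look like this (the version pins are suggested lower bounds, not tested constraints):

click>=8.0
pandas>=1.5
openai>=1.0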

Usage

python esolang_benchmark_runner.py --model <MODEL_NAME> --language <LANGUAGE> --tasks <TASKS_FILE> --output <OUTPUT_FILE>

Arguments

  • --model: The OpenAI model to use (e.g., gpt-4).
  • --language: The target esoteric programming language (e.g., brainfuck, befunge).
  • --tasks: Path to the JSON file containing tasks. Each task should be a JSON object with a prompt key and an optional expected_output key.
  • --output: Path to save the output metrics. Must have a .json or .csv extension.

Example

1. Create a tasks file tasks.json:

[
    {
        "prompt": "Translate to brainfuck",
        "expected_output": "mocked_generated_code"
    }
]

2. Run the tool:

python esolang_benchmark_runner.py --model gpt-4 --language brainfuck --tasks tasks.json --output output.json

3. Check the output file output.json for the results.
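
Each record in the results file corresponds to one task and carries the fields produced by the runner (see Source Code below); failed API calls additionally carry an error field. The values in this sample are illustrative placeholders, not real model output:

[
    {
        "prompt": "Translate to brainfuck",
        "generated_code": "++[>++<-]>.",
        "expected_output": "mocked_generated_code",
        "accuracy": 0,
        "execution_time": 1.42
    }
]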

Source Code

import json
import time

import click
import openai
import pandas as pd

@click.command()
@click.option('--model', required=True, help='The OpenAI model to use (e.g., gpt-4).')
@click.option('--language', required=True, help='The target esoteric language (e.g., brainfuck, befunge).')
@click.option('--tasks', required=True, type=click.Path(exists=True), help='Path to the JSON file containing tasks.')
@click.option('--output', required=True, type=click.Path(), help='Path to save the output metrics (JSON or CSV).')
def main(model: str, language: str, tasks: str, output: str):
    """
    Executes EsoLang-Bench tasks using the specified LLM and generates performance metrics.
    """
    try:
        # Load tasks from the provided JSON file
        with open(tasks, 'r') as f:
            task_data = json.load(f)

        if not isinstance(task_data, list) or not all('prompt' in task for task in task_data):
            raise ValueError("Tasks file must be a JSON array of objects with a 'prompt' key.")

        # Fail fast: validate the output extension before any API calls are made
        if not output.endswith(('.json', '.csv')):
            raise ValueError("Output file must have a .json or .csv extension.")

        client = openai.OpenAI()  # Reads OPENAI_API_KEY from the environment
        results = []

        for task in task_data:
            prompt = task['prompt']
            expected_output = task.get('expected_output', '')

            start_time = time.perf_counter()
            try:
                # Call the chat completions API (openai>=1.0 client), steering the
                # model toward the target esoteric language via a system message
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": f"Respond with {language} code only."},
                        {"role": "user", "content": prompt}
                    ]
                )
                generated_code = response.choices[0].message.content or ''
                execution_time = time.perf_counter() - start_time

                # Measure accuracy (simple string comparison for now)
                accuracy = int(generated_code.strip() == expected_output.strip())

                results.append({
                    'prompt': prompt,
                    'generated_code': generated_code,
                    'expected_output': expected_output,
                    'accuracy': accuracy,
                    'execution_time': execution_time
                })

            except Exception as e:
                results.append({
                    'prompt': prompt,
                    'generated_code': None,
                    'expected_output': expected_output,
                    'accuracy': 0,
                    'execution_time': None,
                    'error': str(e)
                })

        # Save results to the specified output file
        if output.endswith('.json'):
            with open(output, 'w') as f:
                json.dump(results, f, indent=4)
        elif output.endswith('.csv'):
            df = pd.DataFrame(results)
            df.to_csv(output, index=False)

        click.echo(f"Results saved to {output}")

    except Exception as e:
        click.echo(f"Error: {e}", err=True)
        raise SystemExit(1)

if __name__ == "__main__":
    main()
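
The accuracy metric above is an exact string match, which marks semantically correct programs wrong whenever the model formats its answer differently. A natural extension, sketched below rather than shipped with the tool, is to execute both programs and compare what they print. For brainfuck this only needs a small interpreter (run_brainfuck is a hypothetical helper, not part of the runner):

def run_brainfuck(code: str, max_steps: int = 100_000) -> str:
    """Interpret an output-only brainfuck program and return what it prints.

    A minimal sketch: 30,000 wrapping byte cells, no ',' (input) support,
    and a step limit so non-terminating programs cannot hang the benchmark.
    """
    # Pre-match brackets so loop jumps are O(1)
    jumps, stack = {}, []
    for i, ch in enumerate(code):
        if ch == '[':
            stack.append(i)
        elif ch == ']':
            if not stack:
                raise ValueError("Unbalanced ']' in program")
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        raise ValueError("Unbalanced '[' in program")

    tape = [0] * 30000
    ptr = pc = steps = 0
    printed = []
    while pc < len(code) and steps < max_steps:
        ch = code[pc]
        if ch == '>':
            ptr += 1
        elif ch == '<':
            ptr -= 1
        elif ch == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == '.':
            printed.append(chr(tape[ptr]))
        elif ch == '[' and tape[ptr] == 0:
            pc = jumps[pc]  # jump past the matching ']'
        elif ch == ']' and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the matching '['
        pc += 1
        steps += 1
    return ''.join(printed)

With this in place, the accuracy line could become accuracy = int(run_brainfuck(generated_code) == run_brainfuck(expected_output)), comparing program behavior instead of source text.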


Details

  • Tool Name: esolang_benchmark_runner
  • Category: LLM Esoteric Code Benchmarks
  • Generated: March 20, 2026
  • Tests: Passing ✅
  • Fix Loops: 3

Quick Install

Clone just this tool:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-03-20/esolang_benchmark_runner
cd generated_tools/2026-03-20/esolang_benchmark_runner
pip install -r requirements.txt 2>/dev/null || true
python esolang_benchmark_runner.py --help