🔬 LLM Esoteric Code Benchmarks · March 20, 2026
Tests passing
Esoteric Benchmark Runner
A CLI tool to execute EsoLang-Bench tasks on large language models, measure their code generation accuracy for esoteric languages, and generate detailed performance reports. This tool automates the benchmarking process for AI researchers and developers working with LLMs.
What It Does
- Executes EsoLang-Bench tasks using OpenAI models (e.g., GPT-4).
- Measures code generation accuracy and execution time.
- Supports output in JSON or CSV format.
Installation
- Python 3.7+
- Required Python packages:
  - click
  - pandas
  - openai
Install the required packages using pip:
pip install click pandas openai
Usage
python esolang_benchmark_runner.py --model <MODEL_NAME> --language <LANGUAGE> --tasks <TASKS_FILE> --output <OUTPUT_FILE>
Arguments
- --model: The OpenAI model to use (e.g., gpt-4).
- --language: The target esoteric programming language (e.g., brainfuck, befunge).
- --tasks: Path to the JSON file containing tasks. Each task should be a JSON object with a prompt key and an optional expected_output key.
- --output: Path to save the output metrics. Must have a .json or .csv extension.
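Before a run, a tasks file can be checked against this schema with a small standalone helper. This is a sketch for illustration; validate_tasks is not part of the tool itself:

```python
import json

def validate_tasks(path: str) -> list:
    """Load a tasks file and check it matches the runner's expected schema:
    a JSON array of objects, each with a 'prompt' key."""
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("Tasks file must be a JSON array.")
    for i, task in enumerate(data):
        if not isinstance(task, dict) or "prompt" not in task:
            raise ValueError(f"Task {i} must be an object with a 'prompt' key.")
    return data
```

Running this before invoking the benchmark surfaces schema mistakes without spending any API calls.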
Example
1. Create a tasks file tasks.json:
[
{
"prompt": "Translate to brainfuck",
"expected_output": "mocked_generated_code"
}
]
2. Run the tool:
python esolang_benchmark_runner.py --model gpt-4 --language brainfuck --tasks tasks.json --output output.json
3. Check the output file output.json for the results.
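The per-task records in the output file can then be aggregated with pandas. A minimal sketch: the summarize helper below is illustrative, not part of the tool, and for a real run you would build the frame with pd.read_json("output.json") or pd.read_csv instead of the mock records shown here:

```python
import pandas as pd

def summarize(results: list) -> dict:
    """Collapse per-task records into overall benchmark figures."""
    df = pd.DataFrame(results)
    return {
        "tasks": len(df),
        "accuracy": float(df["accuracy"].mean()),
        # Failed tasks record execution_time as null, so drop those here.
        "mean_time": float(df["execution_time"].dropna().mean()),
    }

# Two mock records with the shape the runner emits:
sample = [
    {"prompt": "p1", "accuracy": 1, "execution_time": 0.5},
    {"prompt": "p2", "accuracy": 0, "execution_time": 1.5},
]
print(summarize(sample))  # {'tasks': 2, 'accuracy': 0.5, 'mean_time': 1.0}
```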
Source Code
import json
import time
import click
import pandas as pd
import openai


@click.command()
@click.option('--model', required=True, help='The OpenAI model to use (e.g., gpt-4).')
@click.option('--language', required=True, help='The target esoteric language (e.g., brainfuck, befunge).')
@click.option('--tasks', required=True, type=click.Path(exists=True), help='Path to the JSON file containing tasks.')
@click.option('--output', required=True, type=click.Path(), help='Path to save the output metrics (JSON or CSV).')
def main(model: str, language: str, tasks: str, output: str):
    """
    Executes EsoLang-Bench tasks using the specified LLM and generates performance metrics.
    """
    try:
        # Validate the output extension up front so no API calls are
        # wasted on a run whose results cannot be saved.
        if not output.endswith(('.json', '.csv')):
            raise ValueError("Output file must have a .json or .csv extension.")

        # Load tasks from the provided JSON file
        with open(tasks, 'r') as f:
            task_data = json.load(f)
        if not isinstance(task_data, list) or not all(
            isinstance(task, dict) and 'prompt' in task for task in task_data
        ):
            raise ValueError("Tasks file must be a JSON array of objects with a 'prompt' key.")

        client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
        results = []
        for task in task_data:
            prompt = task['prompt']
            expected_output = task.get('expected_output', '')
            start_time = time.time()
            try:
                # Call the OpenAI chat completions API (openai>=1.0 client)
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                generated_code = response.choices[0].message.content
                execution_time = time.time() - start_time
                # Measure accuracy (exact string match for now)
                accuracy = int(generated_code.strip() == expected_output.strip())
                results.append({
                    'prompt': prompt,
                    'generated_code': generated_code,
                    'expected_output': expected_output,
                    'accuracy': accuracy,
                    'execution_time': execution_time,
                })
            except Exception as e:
                results.append({
                    'prompt': prompt,
                    'generated_code': None,
                    'expected_output': expected_output,
                    'accuracy': 0,
                    'execution_time': None,
                    'error': str(e),
                })

        # Save results to the specified output file
        if output.endswith('.json'):
            with open(output, 'w') as f:
                json.dump(results, f, indent=4)
        else:
            df = pd.DataFrame(results)
            df.to_csv(output, index=False)
        click.echo(f"Results saved to {output}")
    except Exception as e:
        click.echo(f"Error: {e}", err=True)
        raise SystemExit(1)


if __name__ == "__main__":
    main()
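The accuracy metric in the listing is deliberately simple: an exact match after stripping leading and trailing whitespace. Pulled out as a standalone helper (hypothetical; the tool keeps this inline), it behaves like this:

```python
def score(generated: str, expected: str) -> int:
    """Exact-match accuracy as used by the runner: 1 if the stripped
    generated code equals the stripped expected output, else 0."""
    return int(generated.strip() == expected.strip())

print(score("  ++[>+<-]\n", "++[>+<-]"))  # 1: surrounding whitespace is ignored
print(score("+++", "---"))                # 0: any other difference fails
```

Exact match is a strict metric for esoteric languages, where many distinct programs compute the same result; actually executing the generated program and comparing its output would be a natural next step.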
Details
- Tool Name
- esolang_benchmark_runner
- Category
- LLM Esoteric Code Benchmarks
- Generated
- March 20, 2026
- Tests
- Passing ✓
- Fix Loops
- 3
Quick Install
Clone just this tool:
git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-03-20/esolang_benchmark_runner
cd generated_tools/2026-03-20/esolang_benchmark_runner
pip install -r requirements.txt 2>/dev/null || true
python esolang_benchmark_runner.py