AI Model Benchmark
What It Does
This tool lets developers benchmark GPT-5 and Claude 4.7 against a custom dataset of prompts. It evaluates responses on latency, response length, and BLEU score (when a reference answer is provided) and generates a comparative report in HTML or JSON. Useful for developers optimizing workflows or choosing the right model for a specific task.
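For the reference-based metric, BLEU measures n-gram overlap between a model response and the reference answer. A minimal sketch of how such a score can be computed with NLTK (the example strings are illustrative, not taken from the tool's dataset):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Artificial Intelligence is the simulation of human intelligence."
candidate = "AI is the simulation of human intelligence by machines."

# sentence_bleu takes a list of tokenized references and a tokenized candidate;
# smoothing keeps short sentences with missing n-grams from scoring zero.
score = sentence_bleu([reference.split()], candidate.split(),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")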
Installation
1. Install the required Python packages:
pip install pandas nltk openai anthropic
2. Ensure you have access to the GPT-5 and Claude 4.7 APIs (a quick credential check is sketched below).
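If the keys are not configured, the benchmark runs will fail at request time, so it can help to verify them up front. A minimal sketch, assuming the keys are exposed via the standard OPENAI_API_KEY and ANTHROPIC_API_KEY environment variables:
import os

# Both SDKs read their API keys from these environment variables by default.
for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    if not os.environ.get(var):
        raise SystemExit(f"Missing environment variable: {var}")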
Usage
Dataset file (dataset.json):
[
{"prompt": "What is AI?", "reference": "Artificial Intelligence is the simulation of human intelligence."}
]
Command:
python ai_model_benchmark.py --dataset dataset.json --output report.html
Output (report.html):
<html>
<head><title>AI Model Benchmark Report</title></head>
<body>
<h1>AI Model Benchmark Report</h1>
<h2>GPT-5</h2>
<table border='1'>
<tr><th>Prompt</th><th>Response</th><th>Latency (s)</th><th>Response Length (chars)</th><th>BLEU Score</th></tr>
<tr><td>What is AI?</td><td>Artificial Intelligence is the simulation of human intelligence.</td><td>0.10</td><td>50</td><td>1.0</td></tr>
</table>
<h2>Claude-4.7</h2>
<table border='1'>
<tr><th>Prompt</th><th>Response</th><th>Latency (s)</th><th>Response Length (chars)</th><th>BLEU Score</th></tr>
<tr><td>What is AI?</td><td>Artificial Intelligence is the simulation of human intelligence.</td><td>0.10</td><td>50</td><td>1.0</td></tr>
</table>
</body>
</html>
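The dataset can also be supplied as a CSV file with prompt and reference columns, and an output path ending in .json writes the raw metrics as JSON instead of HTML. For example (file names below are illustrative):
Dataset file (dataset.csv):
prompt,reference
What is AI?,Artificial Intelligence is the simulation of human intelligence.
Command:
python ai_model_benchmark.py --dataset dataset.csv --output report.json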
Source Code
import argparse
import json
import time

import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from openai import OpenAI
from anthropic import Anthropic


def load_dataset(file_path):
    """Load prompts (and optional references) from a JSON or CSV file."""
    if str(file_path).endswith('.json'):
        with open(file_path, 'r') as f:
            return json.load(f)
    elif str(file_path).endswith('.csv'):
        return pd.read_csv(file_path).to_dict('records')
    else:
        raise ValueError("Unsupported file format. Use JSON or CSV.")
def evaluate_model_responses(prompts, model_name, generate_response):
    """Run every prompt through a model and collect latency/length/BLEU metrics."""
    metrics = []
    for prompt_data in prompts:
        prompt = prompt_data['prompt']
        reference = prompt_data.get('reference', None)
        start_time = time.time()
        try:
            response = generate_response(prompt)
        except Exception as e:
            response = f"Error: {str(e)}"
        latency = time.time() - start_time
        response_length = len(response)  # length in characters
        # BLEU is only computed when a reference answer is provided;
        # smoothing avoids zero scores for short responses.
        bleu_score = (
            sentence_bleu([reference.split()], response.split(),
                          smoothing_function=SmoothingFunction().method1)
            if reference else None
        )
        metrics.append({
            'prompt': prompt,
            'response': response,
            'latency': latency,
            'response_length': response_length,
            'bleu_score': bleu_score
        })
    return metrics
def generate_gpt5_response(prompt):
    """Query GPT-5 via the OpenAI chat completions API (reads OPENAI_API_KEY)."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


def generate_claude47_response(prompt):
    """Query Claude 4.7 via the Anthropic messages API (reads ANTHROPIC_API_KEY)."""
    client = Anthropic()
    response = client.messages.create(
        model="claude-4.7",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
def generate_report(metrics_gpt5, metrics_claude, output_file):
    """Write the collected metrics to a JSON or HTML report."""
    report = {
        'GPT-5': metrics_gpt5,
        'Claude-4.7': metrics_claude
    }
    if str(output_file).endswith('.json'):
        with open(output_file, 'w') as f:
            json.dump(report, f, indent=4)
    elif str(output_file).endswith('.html'):
        html_content = "<html><head><title>AI Model Benchmark Report</title></head><body>"
        html_content += "<h1>AI Model Benchmark Report</h1>"
        for model, metrics in report.items():
            html_content += f"<h2>{model}</h2><table border='1'>"
            html_content += "<tr><th>Prompt</th><th>Response</th><th>Latency (s)</th><th>Response Length (chars)</th><th>BLEU Score</th></tr>"
            for metric in metrics:
                html_content += (
                    f"<tr><td>{metric['prompt']}</td><td>{metric['response']}</td>"
                    f"<td>{metric['latency']:.2f}</td><td>{metric['response_length']}</td>"
                    f"<td>{metric['bleu_score']}</td></tr>"
                )
            html_content += "</table>"
        html_content += "</body></html>"
        with open(output_file, 'w') as f:
            f.write(html_content)
    else:
        raise ValueError("Unsupported output format. Use JSON or HTML.")
def main():
    parser = argparse.ArgumentParser(description="AI Model Benchmark Tool")
    parser.add_argument('--dataset', required=True, help="Path to the dataset file (JSON or CSV)")
    parser.add_argument('--output', required=True, help="Path to the output report file (JSON or HTML)")
    args = parser.parse_args()

    try:
        prompts = load_dataset(args.dataset)
    except Exception as e:
        print(f"Error loading dataset: {str(e)}")
        return

    try:
        metrics_gpt5 = evaluate_model_responses(prompts, "GPT-5", generate_gpt5_response)
        metrics_claude = evaluate_model_responses(prompts, "Claude-4.7", generate_claude47_response)
        generate_report(metrics_gpt5, metrics_claude, args.output)
        print(f"Benchmark report generated: {args.output}")
    except Exception as e:
        print(f"Error during benchmarking: {str(e)}")


if __name__ == "__main__":
    main()
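For a quick check of the metric and report pipeline without spending API calls, the evaluation loop can be driven by a stub generator. A sketch, assuming the script is importable as ai_model_benchmark from the same directory (fake_model is a hypothetical stand-in, not part of the tool):
from ai_model_benchmark import evaluate_model_responses, generate_report

# Hypothetical stub that echoes the prompt so no API calls are made.
def fake_model(prompt):
    return f"Stub answer to: {prompt}"

prompts = [{"prompt": "What is AI?",
            "reference": "Artificial Intelligence is the simulation of human intelligence."}]

metrics = evaluate_model_responses(prompts, "Stub", fake_model)
generate_report(metrics, metrics, "stub_report.html")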
Details
- Tool Name: ai_model_benchmark
- Category: Advanced AI Model Releases
- Generated: April 20, 2026
- Tests: Passing ✓
- Fix Loops: 2
Quick Install
Clone just this tool:
git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-04-20/ai_model_benchmark
cd generated_tools/2026-04-20/ai_model_benchmark
pip install -r requirements.txt 2>/dev/null || true
python ai_model_benchmark.py