Generative AI Advancements · May 11, 2026
Tests passing
Model Stability Benchmark
This tool evaluates the stability of generative AI models by running multiple iterations of the same prompt and analyzing the consistency of the outputs. It provides metrics on output variance, token-level differences, and semantic similarity, helping developers identify how deterministic or stable their models are.
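The core idea, comparing every pair of outputs produced for the same prompt, can be sketched without any API calls. The snippet below is only an illustration: `pairwise_scores`, `exact_match`, and the canned outputs are hypothetical names and data, not part of the tool, and exact string match stands in for the tool's own metrics.

```python
from itertools import combinations
from statistics import mean


def pairwise_scores(outputs, metric):
    """Apply `metric` to every unordered pair of outputs."""
    return [metric(a, b) for a, b in combinations(outputs, 2)]


def exact_match(a, b):
    """Return 1.0 when two outputs are identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


# Canned outputs standing in for repeated model responses (illustrative only).
sample_outputs = ["Bonjour le monde", "Bonjour le monde", "Salut le monde"]

scores = pairwise_scores(sample_outputs, exact_match)
agreement_rate = mean(scores)  # fraction of pairs that agree exactly
```

A fully deterministic model would give an agreement rate of 1.0 here; lower values mean more pairs of runs disagreed.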
What It Does
- Measures output consistency across multiple runs.
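- Calculates token-level differences between outputs (the number of whitespace-separated tokens that appear in only one of two outputs).
- Computes semantic similarity as the cosine similarity between response embeddings.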
- Saves results to a CSV file for further analysis.
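The two pairwise measures listed above can be tried on toy data. This is a minimal sketch in plain Python, assuming whitespace tokenization and small hand-written vectors in place of real embeddings; `token_difference` and `cosine_similarity` are illustrative helper names, not functions from the tool.

```python
import math


def token_difference(a, b):
    """Count whitespace-separated tokens that occur in exactly one of the two texts."""
    return len(set(a.split()) ^ set(b.split()))


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; None if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return None  # direction is undefined for a zero vector
    return dot / (norm_u * norm_v)


# 'cat' and 'dog' each appear in only one of the two texts.
diff = token_difference("the cat sat", "the dog sat")

# Toy vectors standing in for embeddings (illustrative only).
sim_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
sim_undefined = cosine_similarity([0.0, 0.0], [1.0, 1.0])
```

Because the token measure works on sets, word order and repeated words do not affect it; two outputs with the same vocabulary in a different order score a difference of zero.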
Installation
1. Clone this repository:
git clone <repository_url>
cd model_stability_benchmark
2. Install dependencies:
pip install -r requirements.txt
Usage
python model_stability_benchmark.py --api_key "sk-abc123" \
--prompt "Translate this to French" \
--iterations 10 \
--output results.csv
Source Code
import argparse
import csv

import numpy as np
from scipy.spatial.distance import cosine
from tqdm import tqdm
import openai

# Note: openai.Completion and openai.Embedding are the pre-1.0 interface
# of the openai package; this script assumes that legacy interface.


def generate_responses(api_key, prompt, iterations):
    """Generate responses from the OpenAI model for the given prompt."""
    openai.api_key = api_key
    responses = []
    for _ in tqdm(range(iterations), desc="Generating responses"):
        try:
            response = openai.Completion.create(
                engine="text-davinci-003",
                prompt=prompt,
                max_tokens=100
            )
            responses.append(response.choices[0].text.strip())
        except Exception as e:
            print(f"Error generating response: {e}")
            # Keep a placeholder so the number of responses matches iterations.
            responses.append("")
    return responses


def calculate_token_differences(responses):
    """Calculate token-level differences between responses."""
    # Tokens are whitespace-separated words; order and repeats are ignored.
    token_sets = [set(response.split()) for response in responses]
    differences = []
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            diff = len(token_sets[i].symmetric_difference(token_sets[j]))
            differences.append(diff)
    if not differences:
        # Fewer than two responses: there are no pairs to compare.
        return float("nan"), float("nan")
    return np.mean(differences), np.std(differences)


def calculate_semantic_similarity(responses):
    """Calculate semantic similarity as pairwise cosine similarity of embeddings."""
    embeddings = []
    for response in responses:
        try:
            embedding = openai.Embedding.create(input=response, model="text-embedding-ada-002")
            embeddings.append(embedding['data'][0]['embedding'])
        except Exception as e:
            print(f"Error generating embedding: {e}")
            # Zero-vector placeholder (1536 is the text-embedding-ada-002 size).
            # Cosine similarity is undefined for it, so such pairs are skipped below.
            embeddings.append([0] * 1536)
    similarities = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if not any(embeddings[i]) or not any(embeddings[j]):
                continue
            # scipy's cosine() returns a distance; 1 - distance is the similarity.
            sim = 1 - cosine(embeddings[i], embeddings[j])
            similarities.append(sim)
    if not similarities:
        return float("nan"), float("nan")
    return np.mean(similarities), np.std(similarities)


def save_to_csv(output_file, metrics):
    """Save metrics to a CSV file."""
    with open(output_file, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["Metric", "Mean", "Standard Deviation"])
        for metric, values in metrics.items():
            writer.writerow([metric, values[0], values[1]])


def main():
    parser = argparse.ArgumentParser(description="Model Stability Benchmark")
    parser.add_argument('--api_key', required=True, help="OpenAI API key")
    parser.add_argument('--prompt', required=True, help="Prompt text to evaluate")
    parser.add_argument('--iterations', type=int, required=True, help="Number of iterations")
    parser.add_argument('--output', required=True, help="Output CSV file path")
    args = parser.parse_args()

    responses = generate_responses(args.api_key, args.prompt, args.iterations)
    token_mean, token_std = calculate_token_differences(responses)
    semantic_mean, semantic_std = calculate_semantic_similarity(responses)

    metrics = {
        "Token Difference": (token_mean, token_std),
        "Semantic Similarity": (semantic_mean, semantic_std)
    }
    save_to_csv(args.output, metrics)
    print(f"Metrics saved to {args.output}")


if __name__ == "__main__":
    main()
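The CSV written by `save_to_csv` has three columns: Metric, Mean, and Standard Deviation. A minimal sketch for loading it back for further analysis might look like the following; `load_metrics` is a hypothetical helper, and the sample numbers are made up purely to show the layout.

```python
import csv
import io

# Sample content in the layout written by save_to_csv; the values are invented.
sample_csv = (
    "Metric,Mean,Standard Deviation\n"
    "Token Difference,4.2,1.1\n"
    "Semantic Similarity,0.97,0.01\n"
)


def load_metrics(file_obj):
    """Return {metric_name: (mean, std)} parsed from an open CSV file object."""
    reader = csv.DictReader(file_obj)
    return {
        row["Metric"]: (float(row["Mean"]), float(row["Standard Deviation"]))
        for row in reader
    }


# An in-memory file keeps the example self-contained.
metrics = load_metrics(io.StringIO(sample_csv))
```

With a real results file, the same function can be given a handle opened as `open("results.csv", newline="")`.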
Details
- Tool Name: model_stability_benchmark
- Category: Generative AI Advancements
- Generated: May 11, 2026
- Tests: Passing
Quick Install
Clone just this tool:
git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-05-11/model_stability_benchmark
cd generated_tools/2026-05-11/model_stability_benchmark
pip install -r requirements.txt 2>/dev/null || true
python model_stability_benchmark.py