Generative AI Advancements, May 11, 2026
Tests passing
Synthetic Dataset Generator
This tool generates synthetic datasets with generative AI models. Developers can use it to create custom datasets for training or fine-tuning models; the style, complexity, and diversity of the generated data are steered through the prompt. It reduces the time and effort needed to assemble a dataset by hand.
What It Does
- Generate synthetic datasets using OpenAI's GPT models.
- Specify the number of data samples to generate.
- Choose between JSON or CSV output formats.
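Judging by the source listing further down, the JSON output is a flat array of strings and the CSV output has a single column named Generated Data. The following is a minimal sketch, under that assumption, of reading either file back into a list of strings. The helper name load_samples is hypothetical and not part of the tool, and it uses the standard-library csv module rather than pandas to stay dependency-free.

```python
import csv
import json
from pathlib import Path
from typing import List


def load_samples(path: str) -> List[str]:
    """Read a generated dataset back into a list of strings.

    Assumes the layout seen in the source listing: a JSON array of strings,
    or a CSV with one column named 'Generated Data'.
    """
    file = Path(path)
    if file.suffix == ".json":
        with file.open(encoding="utf-8") as f:
            return json.load(f)
    if file.suffix == ".csv":
        # newline="" lets the csv module handle quoted multi-line cells.
        with file.open(newline="", encoding="utf-8") as f:
            return [row["Generated Data"] for row in csv.DictReader(f)]
    raise ValueError(f"Unsupported file type: {file.suffix}")
```

After a run, load_samples("synthetic_dataset.json") or load_samples("synthetic_dataset.csv") should return the same list of generated texts.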
Installation
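- Python 3.7+
- openai (a release before 1.0; the script calls the legacy openai.Completion interface, which later releases removed)
- pandas
- pytest (only needed to run the tests)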
Usage
Run the script with the following command:
python synthetic_dataset_generator.py --api_key <your_openai_api_key> \
--prompt "<your_prompt>" \
--count <number_of_samples> \
--output <csv_or_json>
Arguments
- --api_key: Your OpenAI API key (required).
- --prompt: The prompt template for data generation (required).
- --count: The number of data samples to generate (required, must be a positive integer).
- --output: The output file format, either csv or json (required).
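In the source listing, --count is declared with type=int, so zero or a negative number passes argument parsing and is only rejected later inside generate_synthetic_data. One way to enforce the "positive integer" rule at parse time is a custom argparse type. This is a sketch, not the tool's own code: positive_int and build_parser are hypothetical names, and only two of the four arguments are shown.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type that accepts only integers greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser (hypothetical helper) to show the custom type."""
    parser = argparse.ArgumentParser(description="Synthetic Dataset Generator")
    parser.add_argument("--count", type=positive_int, required=True,
                        help="Number of data samples to generate")
    parser.add_argument("--output", choices=["csv", "json"], required=True,
                        help="Output file format (csv or json)")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--count", "10", "--output", "json"])
print(args.count, args.output)  # prints: 10 json
```

With this type in place, passing --count 0 makes argparse print a usage error and exit before any API call is attempted.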
Example
Generate 10 synthetic product descriptions in JSON format:
python synthetic_dataset_generator.py --api_key "your_openai_api_key" \
--prompt "Generate a list of product descriptions" \
--count 10 \
--output json
Source Code
import argparse
import json
import pandas as pd
import openai
import os


def generate_synthetic_data(api_key, prompt, count, output_format):
    """
    Generate synthetic data using OpenAI's API.

    Args:
        api_key (str): OpenAI API key.
        prompt (str): Prompt template for data generation.
        count (int): Number of data samples to generate.
        output_format (str): Output format, either 'csv' or 'json'.

    Returns:
        str: Path to the output file containing the synthetic dataset.
    """
    # Validate arguments before making any API calls.
    if count <= 0:
        raise ValueError("Count must be a positive integer.")
    if output_format not in ['csv', 'json']:
        raise ValueError("Output format must be either 'csv' or 'json'.")

    openai.api_key = api_key
    data = []

    # One request per sample, each with the same prompt. openai.Completion is
    # the legacy interface and exists only in openai releases before 1.0;
    # text-davinci-003 is a legacy model and may no longer be served.
    for _ in range(count):
        try:
            response = openai.Completion.create(
                engine="text-davinci-003",
                prompt=prompt,
                max_tokens=100
            )
            generated_text = response.choices[0].text.strip()
            data.append(generated_text)
        except Exception as e:
            # Chain the original exception so its traceback is preserved.
            raise RuntimeError(f"Error during data generation: {e}") from e

    # The file name is fixed; only the extension depends on the format.
    output_file = f"synthetic_dataset.{output_format}"
    if output_format == 'json':
        with open(output_file, 'w') as f:
            json.dump(data, f, indent=4)
    elif output_format == 'csv':
        df = pd.DataFrame(data, columns=['Generated Data'])
        df.to_csv(output_file, index=False)
    return output_file


def main():
    parser = argparse.ArgumentParser(description="Synthetic Dataset Generator")
    parser.add_argument('--api_key', required=True, help="OpenAI API key")
    parser.add_argument('--prompt', required=True, help="Prompt template for data generation")
    parser.add_argument('--count', type=int, required=True, help="Number of data samples to generate")
    parser.add_argument('--output', required=True, choices=['csv', 'json'], help="Output file format (csv or json)")
    args = parser.parse_args()

    # Report any failure as a single line instead of a traceback.
    try:
        output_file = generate_synthetic_data(
            api_key=args.api_key,
            prompt=args.prompt,
            count=args.count,
            output_format=args.output
        )
        print(f"Synthetic dataset generated and saved to {output_file}")
    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
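The listing imports os without using it, and it requires the API key on the command line, where it can end up in shell history. A possible refinement, sketched here as an assumption rather than as the tool's behavior, is to fall back to the OPENAI_API_KEY environment variable when --api_key is omitted. The helper name resolve_api_key is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str],
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the key from the command line, else from OPENAI_API_KEY.

    Hypothetical helper: the original script accepts the key only through
    the --api_key argument.
    """
    key = cli_value or env.get("OPENAI_API_KEY", "")
    if not key:
        raise ValueError("No API key: pass --api_key or set OPENAI_API_KEY.")
    return key
```

To use it, --api_key would have to be declared without required=True, and main() would pass resolve_api_key(args.api_key) to generate_synthetic_data instead of args.api_key.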
Details
- Tool Name: synthetic_dataset_generator
- Category: Generative AI Advancements
- Generated: May 11, 2026
- Tests: Passing
- Fix Loops: 2
Quick Install
Clone just this tool:
git clone --depth 1 --filter=blob:none --sparse \
    https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-05-11/synthetic_dataset_generator
cd generated_tools/2026-05-11/synthetic_dataset_generator
pip install -r requirements.txt 2>/dev/null || true
python synthetic_dataset_generator.py