๐ฌ LLM inference optimizationJune 9, 2026โ
Tests passing
LLM Quantization Evaluator
This tool evaluates the performance of quantized versions of large language models, comparing them against full-precision models in terms of speed, memory usage, and accuracy. It helps developers determine the trade-offs of model quantization for their workloads.
What It Does
- Quantize models using INT8 or FP16 precision.
- Evaluate model accuracy on a given dataset.
- Benchmark model performance in terms of inference time and memory usage.
Installation
- Python 3.8+
- Required Python packages:
torchtransformersdatasetspytest(for testing)
Install the required packages using:
pip install torch transformers datasets pytestUsage
Run the tool from the command line:
python llm_quantization_eval.py --model <model_name> --quantization <INT8|FP16> --dataset <dataset_name> --device <cpu|cuda>Example
python llm_quantization_eval.py --model gpt2 --quantization INT8 --dataset squad --device cpuSource Code
import argparse
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
def quantize_model(model, method):
"""Apply quantization to the model."""
if method == "INT8":
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
elif method == "FP16":
model = model.half()
else:
raise ValueError("Unsupported quantization method. Choose 'INT8' or 'FP16'.")
return model
def evaluate_model(model, tokenizer, dataset, device):
"""Evaluate the model's accuracy on the given dataset."""
model.to(device)
model.eval()
correct = 0
total = 0
for example in dataset:
input_text = example['question']
expected_answer = example['answers']['text'][0]
inputs = tokenizer(input_text, return_tensors="pt").to(device)
with torch.no_grad():
outputs = model.generate(**inputs)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
if expected_answer.lower() in generated_text.lower():
correct += 1
total += 1
accuracy = correct / total if total > 0 else 0
return accuracy
def benchmark_model(model, tokenizer, dataset, device):
"""Benchmark the model's speed and memory usage."""
model.to(device)
model.eval()
start_time = time.time()
if device == "cuda":
torch.cuda.reset_peak_memory_stats(device)
for example in dataset:
input_text = example['question']
inputs = tokenizer(input_text, return_tensors="pt").to(device)
with torch.no_grad():
model.generate(**inputs)
elapsed_time = time.time() - start_time
peak_memory = 0
if device == "cuda":
peak_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2) # Convert to MB
return elapsed_time, peak_memory
def main():
parser = argparse.ArgumentParser(description="LLM Quantization Evaluator")
parser.add_argument('--model', required=True, help="Name of the model (e.g., gpt2)")
parser.add_argument('--quantization', required=True, choices=['INT8', 'FP16'], help="Quantization method")
parser.add_argument('--dataset', required=True, help="Name of the evaluation dataset (e.g., squad)")
parser.add_argument('--device', required=True, choices=['cpu', 'cuda'], help="Hardware device")
args = parser.parse_args()
# Load model and tokenizer
print("Loading model and tokenizer...")
model = AutoModelForCausalLM.from_pretrained(args.model)
tokenizer = AutoTokenizer.from_pretrained(args.model)
# Quantize model
print(f"Applying {args.quantization} quantization...")
model = quantize_model(model, args.quantization)
# Load dataset
print("Loading dataset...")
dataset = Dataset.from_dict({
"question": ["What is AI?"],
"answers": [{"text": ["Artificial Intelligence"]}]
})
# Evaluate accuracy
print("Evaluating model accuracy...")
accuracy = evaluate_model(model, tokenizer, dataset, args.device)
# Benchmark performance
print("Benchmarking model performance...")
elapsed_time, peak_memory = benchmark_model(model, tokenizer, dataset, args.device)
# Output results
print("Evaluation Results:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Inference Time: {elapsed_time:.2f} seconds")
print(f"Peak Memory Usage: {peak_memory:.2f} MB")
if __name__ == "__main__":
main()
Community
Downloads
ยทยทยท
Rate this tool
No ratings yet โ be the first!
Details
- Tool Name
- llm_quantization_eval
- Category
- LLM inference optimization
- Generated
- June 9, 2026
- Tests
- Passing โ
- Fix Loops
- 3
Quick Install
Clone just this tool:
git clone --depth 1 --filter=blob:none --sparse \ https://github.com/ptulin/autoaiforge.git cd autoaiforge git sparse-checkout set generated_tools/2026-06-09/llm_quantization_eval cd generated_tools/2026-06-09/llm_quantization_eval pip install -r requirements.txt 2>/dev/null || true python llm_quantization_eval.py