💬 LLM inference optimization · June 9, 2026 · ✅ Tests passing

Dynamic LLM Router

This tool routes an incoming request across a set of LLMs and devices according to how much memory each device currently has free. It is aimed at setups where several models or devices are available and a request should fall back to the next option when one fails to load or generate.

What It Does

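Checks which of the requested devices are usable and how much memory each has free (GPU memory for cuda, system RAM for cpu).
Tries devices in descending order of free memory and, on each device, tries the models in the order given, returning the first successful response.
Supports multiple Hugging Face causal language models on CPU and CUDA devices.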
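
The ordering described above can be sketched without loading any model. This is a minimal illustration, not the tool's own code: `candidate_order` is a hypothetical helper, and the free-memory figures are made up.

```python
def candidate_order(free_memory, models):
    """Return (device, model) pairs in the order a router like this would try them.

    free_memory: hypothetical mapping of device name -> free bytes.
    models: model names in the order they were supplied.
    """
    # Devices with more free memory come first.
    devices = sorted(free_memory, key=free_memory.get, reverse=True)
    # On each device, models keep the order in which they were given.
    return [(device, model) for device in devices for model in models]


if __name__ == "__main__":
    free = {"cuda": 6 * 1024**3, "cpu": 16 * 1024**3}  # made-up numbers
    for pair in candidate_order(free, ["gpt2", "EleutherAI/gpt-neo-125M"]):
        print(pair)
```

Note that GPU memory and system RAM are compared directly in bytes, so a machine with a lot of free RAM will try the CPU before the GPU.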

Installation

Install the required dependencies using pip:

pip install transformers torch psutil pytest

Usage

Run the tool from the command line:

python dynamic_llm_router.py --input "Your input text here" --models "gpt2,EleutherAI/gpt-neo-125M" --devices "cuda,cpu"

Arguments

--input: The input text to be processed by the LLM.
--models: A comma-separated list of model names (e.g., gpt2,EleutherAI/gpt-neo-125M).
--devices: A comma-separated list of devices to use (e.g., cuda,cpu). Only cuda and cpu are recognized; any other value is ignored.
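
One way to parse these comma-separated arguments is to let argparse convert them to lists directly. The sketch below is an assumption about how this could be done, not the tool's actual parser; `split_csv` and `build_parser` are hypothetical names.

```python
import argparse


def split_csv(value):
    """Split a comma-separated string, trimming whitespace and dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    """Build a parser with the same three arguments, converting lists up front."""
    parser = argparse.ArgumentParser(description="Dynamic LLM Router")
    parser.add_argument("--input", required=True, help="Input text for the LLM.")
    parser.add_argument("--models", required=True, type=split_csv)
    parser.add_argument("--devices", required=True, type=split_csv)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--input", "hi", "--models", "gpt2, EleutherAI/gpt-neo-125M", "--devices", "cuda,cpu"]
    )
    print(args.models, args.devices)
```

Converting in the parser means a stray space after a comma does not end up inside a model or device name.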

Example

python dynamic_llm_router.py --input "What is the capital of France?" --models "gpt2" --devices "cpu"
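
The script prints a JSON object with the keys model, device, and response on success, or a single error key on failure. A calling program could decode that output roughly as follows; `parse_router_output` is a hypothetical helper and the sample response text is a placeholder, not real model output.

```python
import json


def parse_router_output(stdout_text):
    """Decode the router's JSON output and raise if it reports an error."""
    result = json.loads(stdout_text)
    if "error" in result:
        raise RuntimeError(result["error"])
    return result["model"], result["device"], result["response"]


if __name__ == "__main__":
    # Placeholder payload with the same shape as a successful run.
    sample = json.dumps({"model": "gpt2", "device": "cpu", "response": "<generated text>"})
    print(parse_router_output(sample))
```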

Source Code

import argparse
import json
import logging
import psutil
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def setup_logging():
    logging.basicConfig(
        format='%(asctime)s - %(levelname)s - %(message)s',
        level=logging.INFO
    )

def get_device_availability(devices):
    """Return a mapping of usable device name -> free memory in bytes."""
    available_devices = {}
    for device in devices:
        if device == 'cuda' and torch.cuda.is_available():
            # mem_get_info reports free memory as seen by the driver, so memory
            # held by other processes is accounted for; memory_allocated() only
            # counts tensors allocated by this process.
            free_bytes, _total_bytes = torch.cuda.mem_get_info(0)
            available_devices['cuda'] = free_bytes
        elif device == 'cpu':
            available_devices['cpu'] = psutil.virtual_memory().available
    return available_devices

def load_model(model_name, device):
    """Load the specified model and tokenizer on the given device."""
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        if device == 'cuda' and torch.cuda.is_available():
            model = model.to('cuda')
        return model, tokenizer
    except Exception as e:
        logging.error(f"Error loading model {model_name} on {device}: {e}")
        raise

def route_request(input_text, models, devices):
    """Route the request to the optimal model and device."""
    available_devices = get_device_availability(devices)
    if not available_devices:
        raise RuntimeError("No available devices.")

    # Sort devices by available memory (descending)
    sorted_devices = sorted(available_devices.items(), key=lambda x: x[1], reverse=True)

    for device, _ in sorted_devices:
        for model_name in models:
            try:
                model, tokenizer = load_model(model_name, device)
                inputs = tokenizer(input_text, return_tensors="pt")
                if device == 'cuda':
                    inputs = {key: value.to('cuda') for key, value in inputs.items()}
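                # Bound generation explicitly: the library's default max_length
                # (20 tokens, prompt included) is too small for longer prompts.
                # 50 new tokens is an arbitrary cap.
                outputs = model.generate(**inputs, max_new_tokens=50)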
                response = tokenizer.decode(outputs[0], skip_special_tokens=True)
                return {
                    "model": model_name,
                    "device": device,
                    "response": response
                }
            except Exception as e:
                logging.warning(f"Failed to process with model {model_name} on {device}: {e}")
    raise RuntimeError("Failed to process the request with all available models and devices.")

def main():
    parser = argparse.ArgumentParser(description="Dynamic LLM Router")
    parser.add_argument('--input', type=str, required=True, help="Input text for the LLM.")
    parser.add_argument('--models', type=str, required=True, help="Comma-separated list of model names.")
    parser.add_argument('--devices', type=str, required=True, help="Comma-separated list of devices (e.g., cuda,cpu).")

    args = parser.parse_args()

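    input_text = args.input
    # Tolerate spaces around commas and ignore empty entries.
    models = [m.strip() for m in args.models.split(',') if m.strip()]
    devices = [d.strip().lower() for d in args.devices.split(',') if d.strip()]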

    try:
        result = route_request(input_text, models, devices)
        print(json.dumps(result, indent=2))
    except Exception as e:
        logging.error(f"Error: {e}")
        print(json.dumps({"error": str(e)}))
        # Exit non-zero so callers and shell scripts can detect the failure.
        raise SystemExit(1)

if __name__ == "__main__":
    setup_logging()
    main()
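
The fallback behavior in route_request can be exercised without downloading a model by passing in a stand-in for the load-and-generate step. The following is a simplified sketch of that pattern; `first_success` and `fake_run` are hypothetical names, and the out-of-memory failure is simulated.

```python
import logging


def first_success(candidates, run):
    """Call run(device, model) for each candidate and return the first result that does not raise."""
    for device, model in candidates:
        try:
            return {"model": model, "device": device, "response": run(device, model)}
        except Exception as exc:
            # Log the failure and move on to the next candidate.
            logging.warning("Failed with %s on %s: %s", model, device, exc)
    raise RuntimeError("Failed to process the request with all available models and devices.")


def fake_run(device, model):
    # Stand-in for loading a model and generating text; pretends the GPU is out of memory.
    if device == "cuda":
        raise MemoryError("simulated out-of-memory")
    return f"echo from {model}"


if __name__ == "__main__":
    print(first_success([("cuda", "gpt2"), ("cpu", "gpt2")], fake_run))
```

Injecting the `run` callable keeps the retry logic testable on machines without a GPU or network access.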

Details

Tool Name: dynamic_llm_router
Category: LLM inference optimization
Generated: June 9, 2026
Tests: Passing ✅
Fix Loops: 2

Quick Install

Clone just this tool:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-06-09/dynamic_llm_router
cd generated_tools/2026-06-09/dynamic_llm_router
pip install -r requirements.txt 2>/dev/null || true
python dynamic_llm_router.py --input "What is the capital of France?" --models "gpt2" --devices "cpu"
