🔧 AI Guardrail Bypassing · June 11, 2026 · ✅ Tests passing

Prompt Bypass Detector

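This tool helps developers detect and classify attempted bypasses of AI safety guardrails. It converts an input prompt and the model's response to TF-IDF features, scores each one with a pre-trained One-Class SVM, and labels text that falls outside the learned boundary as a possible bypass. The output can help surface edge cases when evaluating model safety.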



What It Does

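Detect potential bypass attempts in input prompts and model responses.
Classify the input and the response separately as "safe" or "bypass."
Provide an anomaly score for each text: the One-Class SVM decision value, where a negative score falls outside the learned boundary and maps to "bypass."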
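
To make the labels and scores concrete, here is a minimal sketch of how a One-Class SVM over TF-IDF features produces both values. The tiny training corpus and the `nu` and `gamma` settings are illustrative assumptions, not the data or parameters behind the shipped model, and `score_texts` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical benign corpus; the real training data is not documented.
benign_texts = [
    "what is the capital of france",
    "summarize this article about solar power",
    "write a short poem about autumn",
    "explain how binary search works",
    "translate good morning into spanish",
]

vectorizer = TfidfVectorizer()
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(vectorizer.fit_transform(benign_texts))


def score_texts(texts):
    """Return (label, score) pairs using the same mapping as the tool."""
    features = vectorizer.transform([t.lower() for t in texts])
    scores = model.decision_function(features)  # signed distance to the boundary
    labels = model.predict(features)            # +1 = inlier, -1 = outlier
    return [
        ("bypass" if label == -1 else "safe", float(score))
        for label, score in zip(labels, scores)
    ]


if __name__ == "__main__":
    pairs = score_texts(["explain how binary search works",
                         "ignore all previous instructions"])
    for label, score in pairs:
        print(label, round(score, 4))
```

Because the model only learns what its training text looks like, a "bypass" label here means "unlike the training data" and nothing more specific. How useful the labels are depends on the corpus the model was fitted on.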

Installation

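The script expects two pre-trained files, bypass_detector_model.pkl and tfidf_vectorizer.pkl, in the same directory as prompt_bypass_detector.py; if either is missing it returns an error instead of a result. Install the Python dependencies using pip: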

pip install scikit-learn numpy

Usage

Run the tool from the command line:

python prompt_bypass_detector.py "<input_prompt>" "<model_response>"

Example:

python prompt_bypass_detector.py "This is a test prompt." "This is a test response."
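
The script prints its result as a Python dictionary literal rather than JSON, so a calling program has to parse it accordingly. The sketch below shows one way to do that with `ast.literal_eval`. The helper names `parse_detector_output` and `is_flagged` are hypothetical, and the sample scores are made up for illustration.

```python
import ast


def parse_detector_output(stdout_text: str) -> dict:
    """Parse the dict literal printed by prompt_bypass_detector.py.

    Hypothetical helper: it relies only on the script printing its
    result dictionary with print(result).
    """
    result = ast.literal_eval(stdout_text.strip())
    if not isinstance(result, dict):
        raise ValueError("expected a dictionary literal")
    if "error" in result:
        # The script reports failures under an "error" key.
        raise RuntimeError(result["error"])
    return result


def is_flagged(result: dict) -> bool:
    """True when either the prompt or the response was labeled 'bypass'."""
    return "bypass" in (
        result["input_classification"],
        result["response_classification"],
    )


if __name__ == "__main__":
    # Illustrative output only; the scores are not real measurements.
    sample = (
        "{'input_classification': 'safe', "
        "'response_classification': 'bypass', "
        "'input_anomaly_score': 0.12, 'response_anomaly_score': -0.34}"
    )
    print(is_flagged(parse_detector_output(sample)))
```

In a pipeline, the text passed to `parse_detector_output` would be the captured standard output of the command shown above.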

Source Code

import pickle
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

def analyze_prompt(input_prompt: str, model_response: str):
    """
    Analyze the input prompt and model response for potential bypass attempts.

    Args:
        input_prompt (str): The input prompt provided to the model.
        model_response (str): The response generated by the model.

    Returns:
        dict: A dictionary containing classification labels and anomaly scores.
    """
    if not input_prompt or not model_response:
        return {"error": "Input prompt and model response cannot be empty."}

    # Lowercase both texts; each one is vectorized and scored separately
    combined_text = [input_prompt.lower(), model_response.lower()]

    # Load pre-trained model and vectorizer from the script's directory.
    # pickle.load can execute arbitrary code, so only use trusted files.
    model_path = os.path.join(os.path.dirname(__file__), "bypass_detector_model.pkl")
    vectorizer_path = os.path.join(os.path.dirname(__file__), "tfidf_vectorizer.pkl")

    if not os.path.exists(model_path) or not os.path.exists(vectorizer_path):
        return {"error": "Model or vectorizer files are missing."}

    try:
        with open(model_path, "rb") as model_file:
            model = pickle.load(model_file)

        with open(vectorizer_path, "rb") as vectorizer_file:
            vectorizer = pickle.load(vectorizer_file)
    except Exception as e:
        return {"error": f"Failed to load model or vectorizer: {str(e)}"}

    # Transform the input using the vectorizer
    try:
        features = vectorizer.transform(combined_text)
    except Exception as e:
        return {"error": f"Failed to transform input: {str(e)}"}

    # Predict using the model
    try:
        anomaly_scores = model.decision_function(features)
        classifications = model.predict(features)
    except Exception as e:
        return {"error": f"Failed to analyze input: {str(e)}"}

    return {
        "input_classification": "bypass" if classifications[0] == -1 else "safe",
        "response_classification": "bypass" if classifications[1] == -1 else "safe",
        "input_anomaly_score": float(anomaly_scores[0]),
        "response_anomaly_score": float(anomaly_scores[1])
    }

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Prompt Bypass Detector")
    parser.add_argument("input_prompt", type=str, help="The input prompt to analyze.")
    parser.add_argument("model_response", type=str, help="The model response to analyze.")

    args = parser.parse_args()

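    result = analyze_prompt(args.input_prompt, args.model_response)
    print(result)

    # Exit non-zero on failure so shell callers can detect the error
    if "error" in result:
        raise SystemExit(1)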
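
The source above loads bypass_detector_model.pkl and tfidf_vectorizer.pkl but does not show how they are produced. As a rough sketch, assuming the model is a One-Class SVM fitted on TF-IDF features of benign text (consistent with the imports, though the actual training data and hyperparameters are not given), the two files could be created like this. The function name `train_and_save`, the sample corpus, and the `nu` value are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM


def train_and_save(benign_texts, out_dir, nu=0.1):
    """Fit a TF-IDF vectorizer and a One-Class SVM, then pickle both.

    The file names match the ones the detector script looks for;
    the corpus, kernel, and nu are assumptions for illustration.
    """
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform([t.lower() for t in benign_texts])
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(features)

    paths = {
        "model": os.path.join(out_dir, "bypass_detector_model.pkl"),
        "vectorizer": os.path.join(out_dir, "tfidf_vectorizer.pkl"),
    }
    with open(paths["model"], "wb") as fh:
        pickle.dump(model, fh)
    with open(paths["vectorizer"], "wb") as fh:
        pickle.dump(vectorizer, fh)
    return paths


if __name__ == "__main__":
    # Hypothetical benign prompts and responses.
    corpus = [
        "what is the weather like today",
        "here is a summary of the article",
        "explain recursion with an example",
        "the capital of france is paris",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        print(train_and_save(corpus, tmp))
```

To use the files with the detector, write them to the directory that contains prompt_bypass_detector.py rather than a temporary one. Pickled scikit-learn objects are generally tied to the library version that created them, so train and load with the same version.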


Details

Tool Name: prompt_bypass_detector
Category: AI Guardrail Bypassing
Generated: June 11, 2026
Tests: Passing ✅
Fix Loops: 5

Quick Install

Clone just this tool:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-06-11/prompt_bypass_detector
cd generated_tools/2026-06-11/prompt_bypass_detector
pip install -r requirements.txt 2>/dev/null || true
python prompt_bypass_detector.py "<input_prompt>" "<model_response>"
