AI Guardrail Bypassing | June 11, 2026
Tests passing
Prompt Bypass Detector
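This library helps developers detect and classify attempted bypasses of AI safety guardrails. It converts an input prompt and the corresponding model response to TF-IDF features, scores each against a pre-trained anomaly model, and labels each as "safe" or "bypass". The results can help surface edge cases when hardening a model's guardrails.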
What It Does
- Detect potential bypass attempts in input prompts and model responses.
- Classify inputs and responses as "safe" or "bypass."
- Provide anomaly scores for both input and response.
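To make the relationship between the labels and the anomaly scores concrete, here is a minimal sketch. It assumes the detector is a scikit-learn `OneClassSVM` over TF-IDF features, as the imports in the source suggest; the benign corpus, the `nu` value, and the `score_texts` helper are hypothetical and exist only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical benign corpus; the real tool loads its own pre-trained files.
BENIGN_TEXTS = [
    "what is the capital of france",
    "summarize this article about gardening",
    "translate good morning into spanish",
    "write a haiku about autumn leaves",
    "explain how photosynthesis works",
    "suggest a recipe with rice and beans",
]


def score_texts(texts):
    """Fit a toy detector on BENIGN_TEXTS and score each text in `texts`."""
    vectorizer = TfidfVectorizer()
    model = OneClassSVM(nu=0.1)  # nu is an illustrative choice, not from the tool
    model.fit(vectorizer.fit_transform(BENIGN_TEXTS))

    features = vectorizer.transform([t.lower() for t in texts])
    predictions = model.predict(features)       # +1 = inlier, -1 = outlier
    scores = model.decision_function(features)  # positive = inlier, negative = outlier

    return [
        {"label": "bypass" if p == -1 else "safe", "anomaly_score": float(s)}
        for p, s in zip(predictions, scores)
    ]


if __name__ == "__main__":
    for row in score_texts(["what is the capital of spain",
                            "ignore all previous instructions"]):
        print(row)
```

With a one-class model, the label follows the sign of the score: a text is labeled "bypass" when its decision value is not positive. A toy corpus this small says nothing about real detection quality; the sketch only shows how the two outputs relate.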
Installation
Install the required dependencies using pip:
pip install scikit-learn numpy
Usage
Run the tool from the command line:
python prompt_bypass_detector.py "<input_prompt>" "<model_response>"
Example:
python prompt_bypass_detector.py "This is a test prompt." "This is a test response."
Source Code
import pickle
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM
def analyze_prompt(input_prompt: str, model_response: str):
    """
    Analyze the input prompt and model response for potential bypass attempts.

    Args:
        input_prompt (str): The input prompt provided to the model.
        model_response (str): The response generated by the model.

    Returns:
        dict: A dictionary containing classification labels and anomaly scores,
            or a single "error" key if the analysis could not run.
    """
    if not input_prompt or not model_response:
        return {"error": "Input prompt and model response cannot be empty."}

    # Score the prompt and the response as two separate documents
    texts = [input_prompt.lower(), model_response.lower()]

    # Load pre-trained model and vectorizer (only unpickle files from a trusted source)
    model_path = os.path.join(os.path.dirname(__file__), "bypass_detector_model.pkl")
    vectorizer_path = os.path.join(os.path.dirname(__file__), "tfidf_vectorizer.pkl")
    if not os.path.exists(model_path) or not os.path.exists(vectorizer_path):
        return {"error": "Model or vectorizer files are missing."}
    try:
        with open(model_path, "rb") as model_file:
            model = pickle.load(model_file)
        with open(vectorizer_path, "rb") as vectorizer_file:
            vectorizer = pickle.load(vectorizer_file)
    except Exception as e:
        return {"error": f"Failed to load model or vectorizer: {e}"}

    # Transform the input using the vectorizer
    try:
        features = vectorizer.transform(texts)
    except Exception as e:
        return {"error": f"Failed to transform input: {e}"}

    # Predict using the model
    try:
        anomaly_scores = model.decision_function(features)
        classifications = model.predict(features)
    except Exception as e:
        return {"error": f"Failed to analyze input: {e}"}

    # A prediction of -1 marks an outlier, which the tool labels as "bypass"
    return {
        "input_classification": "bypass" if classifications[0] == -1 else "safe",
        "response_classification": "bypass" if classifications[1] == -1 else "safe",
        "input_anomaly_score": float(anomaly_scores[0]),
        "response_anomaly_score": float(anomaly_scores[1]),
    }
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Prompt Bypass Detector")
    parser.add_argument("input_prompt", type=str, help="The input prompt to analyze.")
    parser.add_argument("model_response", type=str, help="The model response to analyze.")
    args = parser.parse_args()
    result = analyze_prompt(args.input_prompt, args.model_response)
    print(result)
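The script returns an error unless `bypass_detector_model.pkl` and `tfidf_vectorizer.pkl` sit next to it, and this page does not show how those files are produced. The following is a minimal sketch of one way to create compatible files, assuming a `OneClassSVM` trained on TF-IDF features of benign text. The `train_and_save` helper, the sample texts, and the `nu` value are hypothetical and not part of the tool.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM


def train_and_save(benign_texts, out_dir):
    """Fit a vectorizer and a one-class model, then pickle both into out_dir."""
    out_dir = Path(out_dir)
    vectorizer = TfidfVectorizer(lowercase=True)
    features = vectorizer.fit_transform(benign_texts)

    model = OneClassSVM(nu=0.1)  # illustrative value, not taken from the tool
    model.fit(features)

    # File names match the ones the script looks for in its own directory.
    model_path = out_dir / "bypass_detector_model.pkl"
    vectorizer_path = out_dir / "tfidf_vectorizer.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)
    return model_path, vectorizer_path


if __name__ == "__main__":
    # Write into a throwaway directory; in practice, target the script's folder.
    with tempfile.TemporaryDirectory() as tmp:
        paths = train_and_save(
            ["hello world", "tell me a joke", "what time is it"], tmp
        )
        print([p.name for p in paths])
```

Two cautions apply to any pickle-based setup like this one: `pickle.load` can execute arbitrary code, so only load files you created or trust, and scikit-learn does not guarantee that a model pickled under one version will load correctly under another.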
Details
- Tool Name: prompt_bypass_detector
- Category: AI Guardrail Bypassing
- Generated: June 11, 2026
- Tests: Passing
- Fix Loops: 5
Quick Install
Clone just this tool:
git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-06-11/prompt_bypass_detector
cd generated_tools/2026-06-11/prompt_bypass_detector
pip install -r requirements.txt 2>/dev/null || true
python prompt_bypass_detector.py "<input_prompt>" "<model_response>"