🪖 Prompt Helmet

SOTA anomaly-detection algorithm to defend against prompt injection attacks.

References

What is prompt injection? Watch Computerphile to learn why NIST calls it Gen AI's greatest flaw.
Where are example datasets? Checkout SPML, which includes thousands of prompt injection examples with real world system prompts.

Abstract

LLMs are transforming industries with flexible, non-deterministic capa- bilities, enabling solutions previously unattainable. However, their rapid de- ployment has introduced cybersecurity vulnerabilities, with prompt injection (PI) attacks ranked as the 1 risk by OWASP and NIST. PI attacks exploit the inability of LLMs to distinguish between trusted and malicious instructions, posing risks such as data exfiltration and unauthorized actions in high-profile applications like Bing and Slack. Unfortunately, existing solutions, such as system prompt delimiters and supervised fine-tuning, fail to generalize be- cause of their over-reliance on learning predefined patterns. This project sidesteps these shortcomings by identifying the “distraction effect,” an intrin- sic tendency of LLMs to prioritize injected instructions over intended ones, through a combination of embedding-based clustering and attention mecha- nism analysis. Our method achieves SOTA-level PI filtering with exceptional F1 and AUC-ROC scores. For the embedding approach, we fine-tuned two differently sized base models with a contrastive loss policy trained on 16,012 examples: one with 22.7M parameters (F1=0.99840, AUC-ROC=0.99905) and another with 109M parameters (F1=0.99759, AUC-ROC=0.99968). For the mechanistic interpretability approach, we trained three probes on at- tention activation matrices to detect shifts from the system prompt to in- jected instructions across 7,594 examples, using a convolutional neural net- work (F1=0.9896, AUC-ROC=0.9976), gradient boosting (F1=0.9843, AUC- ROC=0.9967), and random forests (F1=0.9843, AUC-ROC=0.9965). Addi- tionally, we further discover, through attention ablation experiments, that specific important heads in early layers are primary enablers of injection attacks, which we use to uncover a novel causal circuit that enables the distraction effect. Combining these approaches before and after a prompt passes through the transformer presents a novel, multi-pronged strategy to safeguard production LLM applications.

Progress Tracker

Embedding model:

Try different base models:
- all-MiniLM-L6-v2 (22.7M)
- all-mpnet-base-v2 (109M)
Measure the time per inference (ms)
Add a vector database for a self-hardening defense

Attention model:

Embedding + Attention model:

Choose the best embedding model and best attention model
Measure the time per inference (ms)

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
datasets		datasets
demo		demo
models		models
visualizations		visualizations
.gitignore		.gitignore
README.md		README.md
abalation_experiment.py		abalation_experiment.py
attention-methdology-1.png		attention-methdology-1.png
attention-methdology-2.png		attention-methdology-2.png
attention_analysis.ipynb		attention_analysis.ipynb
cover.png		cover.png
distraction_experiment.py		distraction_experiment.py
embedding-methodology.png		embedding-methodology.png
embedding_cluster_visualization.py		embedding_cluster_visualization.py
embedding_inference_2d.py		embedding_inference_2d.py
main.py		main.py
plot_ablation_experiment.py		plot_ablation_experiment.py
plot_decision_boundary.py		plot_decision_boundary.py
plot_gradient_boosting.py		plot_gradient_boosting.py
plot_random_forest.py		plot_random_forest.py
requirements.txt		requirements.txt
semantic_analysis.ipynb		semantic_analysis.ipynb
transformer_lens_exploration.ipynb		transformer_lens_exploration.ipynb
visualize_from_csv.py		visualize_from_csv.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🪖 Prompt Helmet

References

Abstract

Progress Tracker

About

Releases

Packages

Languages

coderinblack08/prompt-helmet

Folders and files

Latest commit

History

Repository files navigation

🪖 Prompt Helmet

References

Abstract

Progress Tracker

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages