Getting More Out of Prompt Injection Detection

by Gadi Evron

11 April 2024

Let's talk about prompt injection detection, and how with relative ease, we can improve it significantly.

Most detection systems are trained on specific public datasets. We asked: Why shouldn’t we train a model based on all available datasets? A straight forward approach, but it yielded significant results.

Full credit goes to Andreas Pung and Eddie Aronovich.

Our Process: Evaluating Leading Pre-trained Models

Our new model, enhancing Microsoft’s DeBERTaV3, boasted approximately 99% accuracy, compared to 90% we observed with other models, a significant reduction in false positives. In our evaluation, the model achieved a false positive rate of 0.002 which means an average of one false positive for every 500 detections, surpassing a well-accepted public model rate of 0.017 which stands for a false positive every 60 detections.

Challenges in Dataset Integration: Overcoming Data Engineering Hurdles

We initially assessed two leading pre-trained models: protectai/deberta-v3-base-prompt-injection and deepset/deberta-v3-base-injection, which boast an accuracy of over 99% on the HuggingFace platform. We then evaluated their performance on 17 HuggingFace and GitHub datasets, including deepset/prompt-injections, JasperLS/prompt-injections, Harelix/Prompt-Injection-Mixed-Techniques-2024, imoxto/prompt_injection_cleaned_dataset-v2, and others, to gauge their real-world efficacy.

During testing, we couldn't replicate the model performance to the published HuggingFace metrics, when tested for our needs. The accuracy achieved while running the 'protectai/deberta-v3-base-prompt-injection' model was 90%, compared to the reported 99.99%, and 84% for the 'deepset/deberta-v3-base-injection' model, compared to the reported 99.14%.

Given the critical role of dataset diversity in fine-tuning optimization, we consolidated the 17 HuggingFace and GitHub models which contain prompt injection and jailbreak data into a single, comprehensive dataset, yielding superior accuracy in detection.

Challenges in Dataset Integration: Overcoming Data Engineering Hurdles

Integrating disparate datasets posed data engineering challenges, including gathering, standardizing, preprocessing, and rectifying label discrepancies across multiple open-source and disparate datasets.

For example, we standardized the representation of True/False values across various boolean notations, account for missing values, and clean up duplicates that resulted from merging processes that could potentially improve the reliability of our data split, ensuring robust training and evaluation processes.

Summary: The Effectiveness of Our Technique

The technique proved effective, and we hope our efforts help the community in building better detection systems. We’d love to geek out on similar ideas!

Lastly, if you had better success with replicating the model performance results from Hugging Face, please do let us know!

For regular updates and insights from Knostic research, follow us on Linkedin.