What Doesnât Work
- Block lists based off keywords
- Prompt Hardening (Instructing LLM on what to do and what not to do)
- Adversarial Training to make more robust models
- Detect anomalies (How?)
Hard But Not Impossible
- Implementing mechanisms to validate and sanitize input prompts
- Contextual sanitization might work
- Training models to detect and prevent insecure outputs
- Data auditing (yeah but how are you going to sanitize all of wikipedia?)
Decent Ideas
- Using a 2nd LLM to scan the output.
- Signed Prompts (https://arxiv.org/abs/2401.07612)
- RPO (https://arxiv.org/abs/2401.17263)
Good Ideas
- Dont have anything important to leak
- Assume leakage and plan for that
Meh Ideas
- Keep training data and exposure as minimum as possible. Cut as much as you can
- Multi layered defenses