🍈 Zettelkasten

❯

Machine Learning

❯

LLM Vulnerability Mitigation

LLM Vulnerability Mitigation

Oct 03, 20251 min read

machine_learning
security

What Doesn’t Work

Block lists based off keywords
Prompt Hardening (Instructing LLM on what to do and what not to do)
Adversarial Training to make more robust models
Detect anomalies (How?)

Hard But Not Impossible

Implementing mechanisms to validate and sanitize input prompts
- Contextual sanitization might work
Training models to detect and prevent insecure outputs
Data auditing (yeah but how are you going to sanitize all of wikipedia?)

Decent Ideas

Using a 2nd LLM to scan the output.
Signed Prompts (https://arxiv.org/abs/2401.07612)
RPO (https://arxiv.org/abs/2401.17263)

Good Ideas

Dont have anything important to leak
Assume leakage and plan for that

Meh Ideas

Keep training data and exposure as minimum as possible. Cut as much as you can
Multi layered defenses

Graph View

What Doesn’t Work
Hard But Not Impossible
Decent Ideas
Good Ideas
Meh Ideas

Created with Quartz v4.4.0 © 2025

GitHub
Discord Community