What Doesn’t Work

Hard But Not Impossible

  • Implementing mechanisms to validate and sanitize input prompts
    • Contextual sanitization might work
  • Training models to detect and prevent insecure outputs
  • Data auditing (yeah, but how are you going to sanitize all of Wikipedia?)
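
A rough sketch of what "contextual sanitization" could look like: the same cleanup runs everywhere, but the policy (length cap, flag vs. redact) depends on where the text came from. Everything here is an assumption for illustration: the context names, the limits, and the two regex patterns are made up, and pattern matching like this is easy to bypass, which is why it sits under "hard" rather than "solved".

```python
import re
import unicodedata

# Hypothetical per-context policies; names and limits are illustrative only.
POLICIES = {
    "user_chat": {"max_len": 4000, "redact": False},
    "retrieved_document": {"max_len": 20000, "redact": True},
}

# Hypothetical instruction-like phrases; a real list would need ongoing tuning.
SUSPICIOUS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?(system )?prompt", re.IGNORECASE),
]


def sanitize_prompt(text: str, context: str) -> tuple[str, list[str]]:
    """Return (cleaned_text, warnings) using the policy for `context`."""
    if context not in POLICIES:
        raise ValueError(f"unknown context: {context!r}")
    policy = POLICIES[context]
    warnings: list[str] = []

    # Normalize Unicode so compatibility look-alikes collapse to one form.
    cleaned = unicodedata.normalize("NFKC", text)

    # Drop control/format characters, keeping newlines and tabs.
    cleaned = "".join(
        ch for ch in cleaned
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )

    # Enforce the context-specific length cap.
    if len(cleaned) > policy["max_len"]:
        cleaned = cleaned[: policy["max_len"]]
        warnings.append("truncated")

    # Flag instruction-like phrases; redact them only in untrusted contexts.
    for pattern in SUSPICIOUS:
        if pattern.search(cleaned):
            warnings.append(f"suspicious: {pattern.pattern}")
            if policy["redact"]:
                cleaned = pattern.sub("[removed]", cleaned)

    return cleaned, warnings
```

Usage would be something like `sanitize_prompt(page_text, "retrieved_document")` before the text is pasted into a prompt, with the warnings logged for review rather than trusted as a verdict.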

Decent Ideas

Good Ideas

  • Don’t have anything important to leak
  • Assume leakage and plan for that

Meh Ideas

  • Keep training data and exposure to a minimum. Cut as much as you can
  • Multi-layered defenses