This is a talk by David Van Bruwaene.

Notes

  • Isaac Asimov’s I Robot believed that AI will have intelligence, meaning it will do things, but we dont know exactly what it will do
  • We dont have much control over these autonamous systems
  • In I Robot, there is a central control over the robot, that governs Three Laws of Robotics
  • They create a AI that evaluates compliance of other agents (Agent as a judge)
  • AI Regulations have been implemented within
  • Google loses huge valuation when Google Gemini came out, and started generating people to be black or chinese
  • Chatbots recommend suicide
  • Classifiers that give racist recommendations to prisoners sentences. Trained off of racist data, so it has a data bias.
  • In the story I Robot, the robots lock humans up
  • Data inference to determine the private data present within training datasets by interrogating model outputs
  • Model extraction (Learn or steal a model by training a new model on sample inputs)
  • They use adversarial testing to check for certain detections
  • Legal skills are very good for coding surprisingly
  • Humans need to be able to decide, and have reasoning. This is where humans will remain on top
  • You will have standardized test results from these benchmarks
  • There are certain benchmarks for certain biases within models (how many black, how many chinese for image generation)
  • Testing aligning is to just tell the model, that it is not in a sandbox.
  • Probes are used to test specific categories
  • Ascenion has a python SDK to work within a CICD pipeline
  • If you know its not a secure system, dont trust to secure it
  • Use cheap regex guards first
  • Probes are categorized for specific attacks, these are usually categories gotten from OWASP Top 10 for LLM, other security vulns
  • Probes also include evaluation criteria, they look for a target goal, that the model has to aim for
  • How do we test probing models? We use human-generated test sets that will generalize and standardize evaluations
  • Evaluation attack, you can use Mechanistic Interpretability, to see what circuits in the model are learning specific contexts.
  • The field is 3yrs old, it is growing fast, super fast