Show HN: LLaMA 3 tokenizer runs in the browser

(belladoreai.github.io)

10 points | by belladoreai 11 days ago

3 comments

  • bschmidt1 11 days ago
    I'm not sure it's working correctly. I entered the word "what" and it says "4 characters, 3 tokens"; when I type a space it says "4 tokens". Shouldn't it just be 1 token, and shouldn't the space be excluded from the count in this case?

    Also occasionally a space appears as a capital G (in Chrome)

    Probably a minor issue. Question: is there a special ruleset that llama3 follows that other LMs don't, as far as what qualifies as a token?

    • belladoreai 11 days ago
      > I'm not sure it's working correctly. I entered the word "what" and it says "4 characters, 3 tokens"; when I type a space it says "4 tokens". Shouldn't it just be 1 token, and shouldn't the space be excluded from the count in this case?

      When you entered the word "what", the 3 tokens were: the start-of-string token, the token "what", and the end-of-string token. I've now made a change to hide the special start-of-string and end-of-string tokens, so the visualization is a bit simpler.

      Adding a space to the input changes the tokenization of the input. Sometimes the resulting token count stays the same (if the space is merged into some other text), and sometimes it increases by one (if the space does not get merged).

      That part of the tokenizer is working correctly.
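
      If you want to check the counts yourself, something along these lines should work with the npm package (a sketch; the exact import name and encode options are in the README):

          import llama3Tokenizer from 'llama3-tokenizer-js';

          // Counting with the special tokens included (as the demo originally did),
          // "what" is 3 tokens: start-of-string, "what", end-of-string.
          llama3Tokenizer.encode("what").length;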

      > Also occasionally a space appears as a capital G (in Chrome)

      Fixed, thanks for reporting! This is a fork of my earlier tokenizer for LLaMA 1, and the demo visualizer had special handling for tokens 0-256 in LLaMA 1. This LLaMA 3 tokenizer doesn't have the same special tokens, so some tokens were visualized in a weird way (like that G thing you reported). I've now removed that special handling, which fixed the visualization issue.

      > Question: Is there a special ruleset that llama3 follows that other LMs don't as far as what qualifies as a token?

      Different models use different tokenization schemes. Most models use some variant of Byte Pair Encoding, trained on their own data (the tokenizer itself is also trained, not only the language model).
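
      For intuition, here's a toy sketch of how BPE merges work (illustrative only; the merge list below is made up, not the real llama3 merge table):

          // Toy BPE: start from individual characters, then repeatedly merge
          // adjacent pairs according to a learned merge list.
          function mergePair(tokens, [a, b], merged) {
            const out = [];
            for (let i = 0; i < tokens.length; i++) {
              if (tokens[i] === a && tokens[i + 1] === b) {
                out.push(merged);
                i++; // skip the second half of the merged pair
              } else {
                out.push(tokens[i]);
              }
            }
            return out;
          }

          let tokens = [..."what"];                         // ["w","h","a","t"]
          tokens = mergePair(tokens, ["w", "h"], "wh");     // ["wh","a","t"]
          tokens = mergePair(tokens, ["a", "t"], "at");     // ["wh","at"]
          tokens = mergePair(tokens, ["wh", "at"], "what"); // ["what"]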

      • bschmidt1 11 days ago
        Hm, I hadn't heard of tokenizing like that; typically it's just words, or occasionally a word plus some adjacent stuff like punctuation or a space. "What " might be a different token than "What", but the total token count shouldn't increment; it would just be a different token, right?

        > Different models use different tokenization schemes

        Curious then why this is called "LLaMA 3 tokenizer". What does it have to do with llama3?

        • belladoreai 10 days ago
          > "What " might be a different token than "What" but the total token count shouldn't increment, would just be a different token, right?

          The input string "What" (without trailing space) tokenizes into 1 token. The input string "What " tokenizes into 2 tokens. In theory, one might have a tokenizer that would simply tokenize "What " into a single token, but the actual tokenizers we have will tokenize that into at least 2 tokens.
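
          You can verify this with the library (same caveat as before: a sketch, and the exact API is in the README):

              import llama3Tokenizer from 'llama3-tokenizer-js';

              // Ignoring the special start/end tokens:
              // "What"  -> 1 content token
              // "What " -> 2 content tokens (the trailing space doesn't merge here)
              llama3Tokenizer.encode("What").length;
              llama3Tokenizer.encode("What ").length; // one more than above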

          > Curious then why this is called "LLaMA 3 tokenizer". What does it have to do with llama3?

          When you input text into any of the LLaMA 3 models, the first step in the process is tokenizing your input. This library is called "LLaMA 3 tokenizer" because it produces the same tokenization as the official LLaMA 3 repo.

          When I said that different models use different tokenization schemes, I was comparing against other models, such as LLaMA 1 or GPT-4. Different models use different tokenizers, so the same text is tokenized into different tokens depending on whether you're using GPT-4 or LLaMA 3 or whatnot.

          • bschmidt1 10 days ago
            Thanks for clarifying, this is exactly where I was confused.

            I just read about how both sentencepiece and tiktoken tokenize.

            Thanks for making this (in JavaScript, no less!) and putting it online! I'm going to use it in my auto-completion library (here: https://github.com/bennyschmidt/next-token-prediction/blob/m...) instead of just `.split(' ')`, as I'm pretty sure it will be more nuanced :)

            Awesome work!

            • bschmidt1 9 days ago
              Well, I installed your npm package and tried to integrate it, but no matter what, every token is always " word" with a leading space, and it isolates foreign symbols as standalone tokens. I tried different options to strip those or to not include preceding spaces, but it's always that way. It's probably just how llama3 tokenizes text, but I can't get use out of it for my autocomplete library, unfortunately. I would need the tokens to be, more or less, words or occasional phrases.

              I really love that it has 0 deps and that you provided the npm package, and I would love to defer this part of my work to an efficient library like this.

              • belladoreai 9 days ago
                I don't think I really understand your use case.

                My library solves the following problem: how to tokenize text in a way that is compatible with llama3.

                If you don't have any particular constraint (as in "tokenize text in a way that is compatible with model X"), then you can just write your own tokenization that tokenizes the text however you want. It doesn't really make sense to use a complicated tokenization scheme from some LLM if you don't need to be compatible with that model.

                If you really want each word to be its own token, you can easily do that by just splitting on whitespace and punctuation (though that will lead to a huge vocabulary).
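
                Something like this would do it (Unicode-aware, so accented words stay intact):

                    // Naive word "tokenizer": keep runs of letters/digits,
                    // drop whitespace and punctuation.
                    const words = (text) => text.match(/[\p{L}\p{N}]+/gu) ?? [];

                    words("Hello, world! It's 2024."); // ["Hello","world","It","s","2024"]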

  • mrbishalsaha 10 days ago
    Really good. I'm actually using js-tiktoken and wish there was a package that handled all the other LLMs too, but it's still something I can work with.
    • belladoreai 10 days ago
      If you need to work with multiple LLMs, you probably want to use transformers.js.
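
      Rough sketch of what that looks like (the model id here is just an example; any tokenizer on the Hugging Face Hub should work):

          import { AutoTokenizer } from '@xenova/transformers';

          // Load a tokenizer by model id and count tokens. Swap the id
          // to see how different models tokenize the same text.
          const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt-4');
          const numTokens = tokenizer.encode('Hello world').length;
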
      • mrbishalsaha 10 days ago
        Isn't it too much for just calculating the number of tokens?
        • belladoreai 9 days ago
          It's the best option you have if you need to work with multiple LLMs in the browser.