Good explanation of tokenizing English text for regular search. But it is far from universal, and it will not work well in Finnish, for example.
Folding diacritics makes "vähä" (little) into "vaha" (wax).
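A minimal sketch of the usual fold (NFKD decomposition, then dropping the combining marks) shows the collision directly:

    import unicodedata

    def fold_diacritics(text: str) -> str:
        # Decompose each character (NFKD), then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(fold_diacritics("vähä"))  # prints "vaha" -- "little" now collides with "wax"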
Dropping stop words like "the" throws away the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).
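A naive English-oriented stopword filter will happily discard that word; a rough sketch (the stopword list and the Danish example sentence are just illustrative):

    # Illustrative only: a tiny English stopword list applied to a Danish query.
    STOPWORDS = {"the", "a", "an", "of", "in", "to"}

    def strip_stopwords(query):
        return [w for w in query.lower().split() if w not in STOPWORDS]

    # "drikker du the?" -- Danish for "do you drink tea?", where "the" is the key word.
    print(strip_stopwords("drikker du the"))  # ['drikker', 'du'] -- the tea is gone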
Stemming Finnish words is also much more complex, as we tend to append suffixes to words instead of putting small words in front of them. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" turns it into a question: "from my house?"
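For a feel of how an off-the-shelf stemmer copes with those forms, NLTK ships a Snowball stemmer for Finnish; a quick sketch, assuming NLTK is installed (the stems are whatever the Snowball rules produce and don't necessarily collapse every form back to "talo"):

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("finnish")
    for word in ["talo", "talosta", "talostani", "talostaniko"]:
        # Print each surface form next to its Snowball stem; see how many of the
        # case/possessive/question suffixes actually get stripped.
        print(word, "->", stemmer.stem(word))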
If that sounds too easy, consider Japanese. From what little I know, they don't use whitespace to separate words, they mix two phonetic alphabets with Chinese ideograms, and so on.
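Even without a real tokenizer you can see the first problem: naive whitespace splitting just hands back the whole sentence as one "word" (the example sentence is mine):

    # A typical Japanese sentence has no spaces between words, so str.split()
    # returns the entire sentence as a single token.
    sentence = "私は猫が好きです"  # "I like cats"
    print(sentence.split())  # ['私は猫が好きです'] -- one "token"

    # Real segmentation needs a morphological analyzer (e.g. MeCab/fugashi),
    # which also has to handle the kanji/hiragana/katakana mix.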
My biggest complaints about search come from day-to-day uses:
I use search in my email pretty heavily, and I'm mostly interested in specific words in the email, and in whether those emails come from specific people or a specific domain. But the mobile version of Gmail produces different results from the mobile Outlook app, which in turn differs from the desktop version of Gmail, and all of them are pretty terrible at search as it pertains to email.
I have a hard time getting them to pull up emails in search that I know exist, that I know contain certain words, and that I know have certain email addresses in the body.
I recognize a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email-based attachments that no one cares enough to try?
Just curious - if we remove stop words from prompts before sending them to an LLM, wouldn't that reduce the token count? And would the LLM's response stay the same (original prompt vs. one without stop words)?
Search engines are often keyword-based and can afford to throw out stopwords. Modern LLMs can't afford to lose the nuance and semantics those words signal, though, so they don't automatically strip them.
Yeah, it'll be fewer input tokens if you omit them yourself. It's not guaranteed to keep the response the same, though: you're asking the model to work with less context and more ambiguity at that point. So stripping your prompt of stopwords is probably going to save you a trivial amount of money and potentially cost a lot in model performance.
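If you want to measure the saving rather than guess, count tokens before and after stripping; a sketch using tiktoken (the stopword list, prompt, and cl100k_base encoding are just illustrative choices):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "and"}

    prompt = "What is the capital of the country to the north of France?"
    stripped = " ".join(w for w in prompt.split() if w.lower() not in STOPWORDS)

    # Compare token counts; the difference is the (usually small) saving,
    # while the stripped prompt is noticeably more ambiguous.
    print(len(enc.encode(prompt)), "tokens ->", len(enc.encode(stripped)), "tokens")
    print(stripped)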