Ask HN: Books about full text search?

I would love to learn more about FTS at a very low level, and I'm looking for books on the topic. Any good suggestions?

232 points | by sopromo 899 days ago

18 comments

binarymax 898 days ago
“Relevant search” by Doug Turnbull and John Berryman, published by Manning, is THE best book to get started with tuning search engines.
I’ve been a search engineer for >10 years and this is always the first book I recommend.
https://www.manning.com/books/relevant-search
- softwaredoug 898 days ago
  Aww, thanks Max <3
  - deanebarker 898 days ago
    Before I read your book, I thought, "I know all about search!"
    After I read it...
    "...I knew NOTHING about search."
    No book has ever knocked me off my pedestal so brutally and so thoroughly.
  - aliswe 898 days ago
    For a moment I thought you were Doug Cutting
ssn 898 days ago
Three reference textbooks are available openly:
* Introduction to Information Retrieval, http://informationretrieval.org/
* Information Retrieval in Practice, http://www.search-engines-book.com/
* Entity-Oriented Search, https://eos-book.org/
Modern Information Retrieval is also a classic reference. Not openly available but some contents are (were?) available online. Their site seems to be down but the Internet Archive has a copy.
Additional resources here:
* https://nlp.stanford.edu/IR-book/information-retrieval.html
* http://web.archive.org/web/20220708135205/http://grupoweb.up...
- firebones 898 days ago
  I am biased, but implementing the Introduction to Information Retrieval chapters in your favorite language, bit by bit, is a really good way to get a feel for the tradeoffs in index capabilities.
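  As a taste of what that exercise looks like, here is a minimal sketch of the first structure such a textbook builds: an inverted index with a Boolean AND query answered by merging sorted postings lists. The function names (`tokenize`, `build_index`, `intersect`, `search_and`) and the toy tokenizer are my own assumptions for illustration, not code from the book.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a toy tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Intersect two sorted postings lists with a linear-time merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def search_and(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    # Start with the shortest postings list to keep intermediate results small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result
```

  Even this much exposes tradeoffs: sorted postings make intersection cheap but updates awkward, and a sketch like this stores no positions, so phrase queries are out of reach.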
100k 898 days ago
At a general audience level, "Index" is on my list to read. It covers the invention of the index up to digital search engines. https://www.nytimes.com/2022/02/09/books/review-index-histor...
"Introduction to Information Retrieval" is a textbook which is available online https://nlp.stanford.edu/IR-book/ Here's a review: http://glinden.blogspot.com/2009/02/book-review-introduction...
Another textbook which IMHO is a bit lower level is "Information Retrieval: Implementing and Evaluating Search Engines". The book website is down for me right now, but you can find it on Amazon here: https://www.amazon.com/Information-Retrieval-Implementing-Ev...
Another commenter linked to "Relevant Search", which is great if you want to learn how to effectively use a search engine to improve relevance (as opposed to how to implement a search engine). It's old, but another book in that vein that was really helpful for me earlier in my career is Lucene in Action: https://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp...
- driscoll42 898 days ago
  Going to second the rec on "Index": it's a very understandable, well-researched book that a general audience or even a skilled practitioner would enjoy.
DamonHD 899 days ago
Managing Gigabytes
https://books.google.co.uk/books/about/Managing_Gigabytes.ht...
Old but good!
- CoolestBeans 898 days ago
  Came here to recommend Managing Gigabytes as well. People these days are managing far more than gigabytes but the fundamental ideas remain useful.
- dekhn 898 days ago
  Check out the first review on the Amazon page. Norvig read it around the time he started at Google.
francoisprunier 898 days ago
Not a book, but this paper from 2019 covers a lot of ground and reviews the different topics extensively: https://tonellotto.github.io/publication/fntir/fntir_main.pd...
pixelmonkey 898 days ago
Take a look at my post “Lucene: The Good Parts”—
https://blog.parse.ly/lucene/
The book mentioned there is Lucene in Action.
And then this YouTube presentation by a Lucene/Elasticsearch committer will give you a nice overview of some related algorithms—
https://youtu.be/eQ-rXP-D80U
brudgers 898 days ago
Not a book but Hellerstein’s CS186 from 2015 starting with Lecture 17 gave me a basic understanding (I think).
Playlist https://youtube.com/playlist?list=PLhMnuBfGeCDPtyC9kUf_hG_Qw...
Also from that lecture series, the low level is always IO. One disk read tends to dwarf n^2 in-memory algorithms.
And IO is all about tuning caches and hardware for the specific structural relationships in the data, the way in which it is accessed, and the hardware everything runs on.
Good luck.
MonkoftheFunk 898 days ago
Hotz... Is that you... Trying to learn to improve Twitter search? ;)
fiedzia 899 days ago
https://www.manning.com/books/relevant-search
Also "taming text"
- arooaroo 898 days ago
  Manning also have a book on Lucene, the library that powers Solr and Elasticsearch. IIRC the book covers how Lucene actually works under the hood and would therefore act as a good reference on the subject in general.
- gardenfelder 898 days ago
  Taming Text is about building a question-answering system; it came out about the time Watson came online; it's not a plan, rather a cookbook of experiments using Apache products like Solr and OpenNLP, but is a great tutorial on how question answering works.
vdfs 898 days ago
Lucene in Action is a good introduction to Lucene, which can be helpful for learning Elasticsearch (the most used FTS engine these days).
- _tom_ 898 days ago
  Lucene in Action covers Lucene 3.0, and is from 2010. Current version is 9.4.2. So much has changed.
tgv 898 days ago
Check the literature of open courses on Text Retrieval. E.g. https://stanford.edu/class/cs276/
Beefin 898 days ago
A series of tutorials and comparisons that aims to teach the foundations of vector search:
https://vectorsearch.dev/
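For anyone new to the term: at its simplest, vector search ranks documents by how close their vectors are to a query vector. The sketch below does that by brute force with cosine similarity. The vectors and the names `cosine` and `top_k` are made up for illustration; real systems typically use learned embeddings and approximate nearest-neighbor indexes rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k document IDs whose vectors are most similar to the query."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```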
cb321 898 days ago
It's all in the Nim programming language, but if you prefer reading code or running diffs then you might get a vague sense of (some) low level nuts & bolts from: https://github.com/c-blake/nimsearch
User23 898 days ago
Is there some better alternative to Knuth-Morris-Pratt or Boyer-Moore? Both can easily be adapted to regular expression matching and as far as I know there’s no faster algorithm that doesn’t do preprocessing.
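For reference, a minimal sketch of Knuth-Morris-Pratt follows. Note that KMP does preprocess the pattern (it builds a failure table) but never the text, which is the main contrast with index-based full-text search. The names `failure_table` and `kmp_find_all` are my own, chosen for illustration.

```python
def failure_table(pattern):
    """fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every (possibly overlapping) match of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches
```

The scan never moves backwards in the text, so the total work is linear in the combined length of the text and the pattern.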
Beefin 898 days ago
Stanford's NLP course:
https://www.youtube.com/playlist?list=PLoROMvodv4rOSH4v6133s...
unixhero 898 days ago
Just use Postgres full-text search; it's good enough: http://rachbelaid.com/postgres-full-text-search-is-good-enou...
- johnthescott 898 days ago
  For Postgres, I highly recommend the RUM index over the core FTS. RUM is written by postgrespro, who also wrote the core FTS and JSON indexing in PG.
  https://github.com/postgrespro/rum
  RUM handles 20+ million PDF pages interactively.
  - SPBS 898 days ago
    Pleasantly surprised that RUM is just a drop-in replacement for the built-in GIN index; you can still use Postgres' native FTS operations with it.
  - unixhero 898 days ago
    Sounds very interesting. Never heard of rum, thank you for suggesting it.