Modern Information Retrieval is also a classic reference. Not openly available but some contents are (were?) available online. Their site seems to be down but the Internet Archive has a copy.
I am biased, but implementing the Intro to Information Retrieval chapters in your favorite language, bit by bit, is a really good way to get a feel for the tradeoffs between index capabilities.
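To give a flavor of what that looks like, here is a minimal sketch of the usual first step: a boolean inverted index with AND queries answered by intersecting sorted postings lists. This is my own illustration in Python, not code from the book; the function names (`tokenize`, `build_index`, `intersect`, `search`), the whitespace tokenizer, and the toy documents are all assumptions.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real tokenizers do much more.
    return text.lower().split()


def build_index(docs):
    """Map each term to a sorted list of the doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Merge-intersect two sorted postings lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query):
    """AND query: return the doc IDs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Start with the shortest lists so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result


# Hypothetical toy collection, just to exercise the index.
docs = [
    "new home sales top forecasts",
    "home sales rise in july",
    "increase in home sales in july",
    "july new home sales rise",
]
index = build_index(docs)
print(search(index, "home sales july"))
```

Even at this size you start hitting the tradeoffs: keeping postings sorted makes intersection cheap but makes updates more awkward, and storing only doc IDs rules out phrase queries until you add positions.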
Another textbook which IMHO is a bit lower level is "Information Retrieval: Implementing and Evaluating Search Engines". The book website is down for me right now, but you can find it on Amazon here: https://www.amazon.com/Information-Retrieval-Implementing-Ev...
Another commenter linked to "Relevant Search", which is great if you want to learn how to effectively use a search engine to improve relevance (as opposed to how to implement a search engine). It's old, but another book in that vein that was really helpful for me earlier in my career is Lucene in Action: https://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp...
Going to second the rec on "Index". It's a very understandable, well-researched book that a general audience or even a skilled practitioner would enjoy.
Also from that lecture series, the low level is always IO. One disk read tends to dwarf n^2 in-memory algorithms.
And IO is all about tuning caches and hardware for the specific structural relationships in the data, the way in which it is accessed, and the hardware everything runs on.
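A rough back-of-envelope version of that claim, as a sketch in Python. The latency figures are my own ballpark assumptions (about 10 ms for one random read on a spinning disk, about 1 ns per simple in-memory operation), not measurements from the lecture series, and the function names are made up for illustration.

```python
import math

# Assumed ballpark latencies; real hardware varies widely.
SEEK_SECONDS = 10e-3   # one random read on a spinning disk
OP_SECONDS = 1e-9      # one simple in-memory operation


def ops_per_read(read_seconds=SEEK_SECONDS, op_seconds=OP_SECONDS):
    """How many in-memory operations fit in the time of one disk read."""
    return round(read_seconds / op_seconds)


def break_even_n(read_seconds=SEEK_SECONDS, op_seconds=OP_SECONDS):
    """Largest n for which n^2 in-memory operations still cost no more than one read."""
    return math.isqrt(ops_per_read(read_seconds, op_seconds))


print(ops_per_read())   # operations that fit in one assumed disk read
print(break_even_n())   # input size where an n^2 algorithm matches one read
```

Under those assumptions, an n^2 algorithm over a few thousand in-memory items costs about the same as a single random disk read, which is why access patterns and caching dominate the design.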
Manning also have a book on Lucene, the library that powers Solr and Elasticsearch. IIRC the book covers how Lucene actually works under the hood and would therefore act as a good reference on the subject in general.
Taming Text is about building a question-answering system, and it came out about the time Watson came online. It's less a plan than a cookbook of experiments using Apache products like Solr and OpenNLP, but it is a great tutorial on how question answering works.
It's all in the Nim programming language, but if you prefer reading code or running diffs then you might get a vague sense of (some) low level nuts & bolts from: https://github.com/c-blake/nimsearch
Is there some better alternative to Knuth-Morris-Pratt or Boyer-Moore? Both can easily be adapted to regular expression matching and as far as I know there’s no faster algorithm that doesn’t do preprocessing.
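For anyone following along, here is a minimal sketch of plain Knuth-Morris-Pratt substring search in Python, for reference only. It covers exact matching, not the regular expression adaptation, and it does preprocess the pattern (the failure table). The names `build_failure` and `kmp_search` are my own.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start offsets of all (possibly overlapping) matches."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("abababca", "aba"))
```

The text is scanned once, left to right, and never re-read, so the search is O(len(text) + len(pattern)).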
I’ve been a search engineer for >10 years and this is always the first book I recommend.
https://www.manning.com/books/relevant-search
After I read it...
"...I knew NOTHING about search."
No book has ever knocked me off my pedestal so brutally and so thoroughly.
* Introduction to Information Retrieval, http://informationretrieval.org/
* Information Retrieval in Practice, http://www.search-engines-book.com/
* Entity-Oriented Search, https://eos-book.org/
Additional resources here:
* https://nlp.stanford.edu/IR-book/information-retrieval.html
* http://web.archive.org/web/20220708135205/http://grupoweb.up...
"Introduction to Information Retrieval" is a textbook which is available online https://nlp.stanford.edu/IR-book/ Here's a review: http://glinden.blogspot.com/2009/02/book-review-introduction...
https://books.google.co.uk/books/about/Managing_Gigabytes.ht...
Old but good!
https://blog.parse.ly/lucene/
The book mentioned there is Lucene in Action.
And then this YouTube presentation by a Lucene/Elasticsearch committer will give you a nice overview of some related algorithms:
https://youtu.be/eQ-rXP-D80U
Playlist https://youtube.com/playlist?list=PLhMnuBfGeCDPtyC9kUf_hG_Qw...
Good luck.
Also "Taming Text".
https://vectorsearch.dev/
https://www.youtube.com/playlist?list=PLoROMvodv4rOSH4v6133s...