Show HN: Adventures in OCR

(blog.medusis.com)

121 points | by bambax 4 days ago

16 comments

pronoiac 4 days ago
Oh wow! I've worked on turning PAIP (Paradigms of Artificial Intelligence Programming) from a book into a bunch of Markdown files, but that's "only" about a thousand pages long, compared to the roughly 27,000 pages of all those volumes. I have advice, possibly helpful, possibly not.
Getting higher quality scans could save you some headaches. Check the Internet Archive. Or, get library copies, and the right camera setup.
Scantailor might help; it lets you semi-automate a chunk of things, with interactive adjustments. I don't know how its deskewing would compare to ImageMagick. The signature marks might be filtered out here.
I wrote out some of my process for handling scans here - https://github.com/norvig/paip-lisp/releases/tag/v1.2 . I maybe should blog about it.
If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:
* giving each paragraph its own line
* then, linefeed at punctuation, maybe with quotation marks and parentheses? It's been a while
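A minimal sketch of those two steps, assuming English-style punctuation and ASCII capitals at sentence starts. The function name and the regex are illustrative guesses, not the script actually used on PAIP:

```python
import re

# Break after ., ! or ? (optionally followed by closing quotes or brackets)
# when the next token looks like the start of a sentence. This is a heuristic,
# not a parser: it only recognises ASCII capitals and opening quotes/brackets.
_SENTENCE_END = re.compile(r'([.!?]["\')\]]*)\s+(?=[A-Z"\'(\[])')


def semantic_linefeeds(paragraph: str) -> str:
    """Put each sentence of a single paragraph on its own line."""
    one_line = " ".join(paragraph.split())        # step 1: paragraph on one line
    return _SENTENCE_END.sub(r"\1\n", one_line)   # step 2: break at punctuation


print(semantic_linefeeds('He left.  "Why?" she asked.\n(Nobody knew.) The end.'))
```

Abbreviations followed by a capital (such as "M. Chéruel") would be split incorrectly, so a real script would likely need an exception list.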
- bambax 4 days ago
  You are right that the quality of the scans is paramount! Unfortunately I don't have access to the physical books and have to work with the scans as they are (they're not good). But I will look at Scantailor, it looks interesting.
  For now I reconstruct paragraphs in html but I could do markdown just as well (where paragraph breaks are marked by double line breaks, and single line breaks don't count).
  Collaborative proofreading would be cool but it would require some way of properly tracking who wrote what, and I'm not sure what to use or if I should build a simple system from scratch. Do you have recommendations?
  - pronoiac 3 days ago
    I got a copy of the 30-year-old book from eBay or Amazon for $20, chopped the spine off, and fed it through a scanner. Doing that to a century-old book feels wrong!
    ScanTailor was tricky to start with; dunno if there's a manual. I remember belatedly realizing that there's automation at each step, that one can then quickly skim and manually adjust.
    For collaborative editing, git via GitHub worked for us. Tracking who did what, and when, is easy. It allowed for sweeping edits covering multiple chapters. Building some porcelain on top of that, for less technical folks, could be good.
  - pronoiac 2 days ago
    > Pour obtenir un document de Gallica en haute définition, contacter [email protected].
    roughly:
    > To obtain a Gallica document in high definition, contact [email protected].
    My expectations would be very low, but I'd reach out to them anyway.
  - jfil 3 days ago
    Because you're creating webpages from the text, one option for collaborative notes/corrections is to use a Web Annotation system like Hypothes.is.
- 2Gkashmiri 3 days ago
  A few years ago I got so good at the whole scan > ScanTailor > PDF workflow that I could scan a 100-150 page book, send it to ScanTailor, edit and improve it, export to TIFF, convert to PDF and OCR it, all in half an hour.
  I got very good at this, but page turning was a bore.
  The PDF came out in a mechanical fashion without much effort.
  I made a few scripts to convert the TIFFs to PDF, stitch them together and run OCR.
  - pronoiac 3 days ago
    Page turning? So, non-destructive, with cameras? How's the quality?
ksampath02 4 days ago
You could try Aryn DocParse, which segments your documents first before running OCR: https://www.aryn.ai/ (full disclosure: I work there).
- bambax 4 days ago
  I will try that, thanks.
eigenvalue 3 days ago
Out of curiosity, I tried submitting the first 200 pages of the PDF he used to my new tool that I also submitted today [0] to Show HN, ( fixmydocuments.com ), and it generated the following without any further interaction besides submitting the PDF file:
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
I think it's not a bad result, and any minor imperfections could be revised easily in the markdown. My feature to turn the document into presentation slides got a bit confused because of the French language, so some slides ended up getting translated into English. But again, it wouldn't be hard to revise the slide contents using ChatGPT or Claude to make them all either French or English:
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
[0] https://news.ycombinator.com/item?id=42453651
- bambax 3 days ago
  Thanks, but I'm sorry to say, the result is... bad. It invents words ("rusticas" in the second line of the title of the output, isn't anywhere in the source file -- it's not even a French word). And it completely drowns the footnotes inside the main text, inventing layout and text enrichment in the process.
  Footnotes are an important part of this project, if not the main point. If they are mangled with the main text then it's pointless. In your rendering there doesn't seem to be footnotes at all? just text with random titles here and there, and even more random tables that (to me) make no sense.
  I wouldn't call that "minor imperfections". As it is, it really isn't usable.
  - eigenvalue 3 days ago
    OK, thanks for the feedback. I really only tested with English language input documents, so I had low expectations going in. And you're right that this is certainly a challenging case with a lot of document structure and not very high quality scans.
lassenordahl 4 days ago
OCR to original structure is a really fun problem! I did something similar in an internship for newspapers pre-LLM Vision models, and it ended up being a bunch of interval problems re-aligning and formatting the extracted text. Found that Azure's OCR model was the most accurate by bounding box, which helped a lot.
Funny how vision models would almost be able to one-shot it, modulo some hallucination issues. Some of the research back then ~2020 was starting to use vision models for layout generations.
- aziaziazi 4 days ago
  Author's take on the LLM approach for a first pass:
  > Trying to get LLMs with vision to properly identify zones also were found to be slow and unreliable, and the risk of hallucinated results is unacceptable, especially as a first step. Non-deterministic systems may be fine for creative projects, but not here. (Once we have a reliable reference we can then play with LLMs and if necessary, control the results by measuring the distance to the source.)
  He tried it for fixing footnotes and the result went "classic LLM":
  > It was a complete flop. Using OpenRouter, I tested over 200 models. More than 70% couldn't even count the footnotes right, but that wasn't the worst part.
  > The "best" models just made stuff up to meet the requirements. They lied in three ways:
  > - Basic (stupid) lies: wrong counts but claiming they matched ('foonotes: 5, references: 3, match: true')
  > - Better lies: claiming they placed references when they hadn't
  > - Premium lies: making up new text to attach footnotes to when they weren't sure where they went (against explicit instructions in the prompt never to do that)
throwaway81523 4 days ago
You could upload the books to the Internet Archive and let their OCR pipeline take a try. It is (or at least was) written around Abbyy. Results weren't great but they were a start.
I wonder what eventually happened with Ocropus which was supposed to help with page segmentation. I was a bit disappointed to see that this article used Google Vision as its OCR engine. I was hoping for something self hosted.
- pronoiac 2 days ago
  I searched the Internet Archive for Les Mémoires de Saint-Simon - https://archive.org/search?query=Les+Mémoires+de+Saint-Simon... - skimmed the results for items with the right number of pages, and came up empty.
  I uploaded a new item - https://archive.org/details/memoires-de-saint-simon-nouvelle... - though I made a mess of the metadata. It's still processing.
- cxr 4 days ago
  The Internet Archive's OCR is built around tesseract nowadays, but you're right about piggybacking off their pipeline. Upload a text to archive.org and get hOCR for free.
- zozbot234 4 days ago
  The book is being worked on here https://fr.wikisource.org/wiki/Livre:Saint-Simon_-_M%C3%A9mo... already (volume 1 of 20). Not the same edition as what OP is working with, but it's a start.
  - bambax 4 days ago
    That edition (the Chéruel edition) is the first integral edition of the Mémoires. It's been OCRed a long time ago and has been available in text form for 20+ years. But it has almost no footnotes.
    The edition I'm working on here, the "Boislisle", is completely different thanks to the richness and coverage of its footnotes (but the main text should be almost identical).
    - cdrini 4 days ago
      If it's public domain, you can create a new record for it on Wikisource once you think it's ready for the human touch. This is the purpose of Wikisource though, taking the messy automated OCR, and allowing volunteers to correct/proofread/format everything.
- bambax 4 days ago
  Well, in my experience Google Vision is far, far ahead of Tesseract.
bondeau 1 day ago
I’ve used Surya (https://github.com/VikParuchuri/surya) before. It is very good (on par with Google Vision, potentially better layout analysis), but yours is a challenging use case. I wonder if it would be useful.
constantinum 4 days ago
> The "best" models just made stuff up to meet the requirements. They lied in three ways:
> The main difficulty of this project lies in correctly identifying page zones; wouldn't it be possible to properly find the zones during the OCR phase itself instead of rebuilding them afterwards?
Anyone curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so no hallucination side effects. It also preserves the layout of the input document for more context and clarity.
[1] https://unstract.com/llmwhisperer/
Examples of extracting complex layout:
https://imgur.com/a/YQMkLpA
https://imgur.com/a/NlZOrtX
https://imgur.com/a/htIm6cf
- bambax 3 days ago
  Looks interesting, but the cost is prohibitive for a hobby project. Also, it doesn't really solve my problem.
  Google Vision already returns the coordinates of each word (and even of each letter), so it's easy to know where the word was on the page, and even, if necessary, to rebuild the page with the words correctly placed -- that's fundamentally what I do with the mouseover on the interactive demo: https://divers.medusis.net/boislisle/pub (at the paragraph level).
  But my problem isn't to know where the words are (Google Vision provides that); it's to know what belongs to what, what is footnotes, what is main text, etc. This is what the post discusses. Just having the text following the same layout as in the original wouldn't help, because I'm not trying to reproduce the layout or the typesetting, I want to rebuild the content semantically, so as to do different "flows".
  That said, it got me thinking... there may be an opportunity to do a cheaper version of LLMwhisperer? ;-)
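  To make the "what belongs to what" problem concrete, here is a rough sketch of the simplest possible zone heuristic: split the page at the widest vertical gap between text lines. It is not the method from the blog post. The `Line` class, the pixel coordinates and the 1.8 threshold are assumptions for illustration; the coordinates would have to be derived from whatever word boxes the OCR engine returns.

```python
from dataclasses import dataclass


@dataclass
class Line:
    top: float      # y coordinate of the top of the line, in pixels
    bottom: float   # y coordinate of the bottom of the line
    text: str


def split_footnotes(lines: list[Line], min_gap_ratio: float = 1.8):
    """Split a page into (main_text, footnotes) at the widest vertical gap.

    The widest gap must be at least min_gap_ratio times the median gap;
    otherwise the whole page is treated as main text.
    """
    ordered = sorted(lines, key=lambda line: line.top)
    gaps = [b.top - a.bottom for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return ordered, []
    median = sorted(gaps)[len(gaps) // 2]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[widest] <= 0 or gaps[widest] < min_gap_ratio * median:
        return ordered, []
    return ordered[:widest + 1], ordered[widest + 1:]
```

  A heuristic like this only covers the easy pages. When the gap is small, or the footnote block starts with the continuation of a note from the previous page, it returns everything as main text, which is the hard case discussed in this thread.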
- lethibo 2 days ago
  Hi,
  You should give a chance to groundx from eyelevel. Ready to discuss this further with you, and open an account on Ground x for you to test it: https://www.groundx.ai/
  Thibaut
TacticalCoder 4 days ago
If it's to be really 100% automated I don't think there's much of a solution besides recreating the exact layout, using the very same font, and then superimposing the "OCR then re-rendered" text on the original scan to see if they're close enough. This means finding the various fonts, sizes, types (italic, bold, etc.).
But we'll get there eventually with AIs. We'll be able to tell: "Find me the exact font, styles, etc. And re-render it using InDesign (or LaTeX or whatever fancies you), then compare with the source and see what you got wrong. Rinse and repeat".
We'll eventually have the ability to do just that.
complexworld 4 days ago
Getting the footnotes right is going to be really tricky. Sometimes I couldn't even read the superscript numbering on the original scans. And that was after zooming in to the max.
Reliably identifying the superscript locations should be enough since they are in the same order as the footnotes.
It's a little early for feature requests... but I would love to see an EPUB edition! It shouldn't be too hard once done with the hard work of getting the data structured.
- bambax 4 days ago
  Yes. The original idea was to have some LLM place footnotes references in the text, based on the content of the footnotes themselves, but as I say in the blog post, that failed spectacularly.
  Now another idea is to manually put placeholders for footnote references in the text, and then number them automatically. Before that, I manually enter the number of footnotes on each page, for verification. I have already done this for the first two volumes; it's pretty fast. Having the number of footnotes on a page lets me:
  - check that the number of footnotes is correct
  - (and therefore) check that the footnote numbers are correct (from 1 to n, in order)
  - check that the number of footnote references is correct (it should exactly match the number of footnotes)
  - and finally, properly number the placeholders.
  Manually inputting numbers in the main text would be very difficult and error-prone, but simply putting placeholders and checking them automatically should be much faster and safer.
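  The checks listed above could look roughly like this for a single page. This is only a sketch: the `[^]` placeholder syntax, the function name and the assumption that every footnote starts with "N." are mine, and a footnote continued from the previous page would need separate handling.

```python
import re

PLACEHOLDER = "[^]"   # hypothetical marker typed by hand where a reference belongs
_LEADING_NUMBER = re.compile(r"^\s*(\d+)\.")


def number_placeholders(main_text: str, footnotes: list[str], expected_count: int) -> str:
    """Validate one page, then replace each placeholder with [^1], [^2], ... in order."""
    # 1. the number of footnotes matches the manually entered count
    if len(footnotes) != expected_count:
        raise ValueError(f"expected {expected_count} footnotes, found {len(footnotes)}")
    # 2. footnote numbers run from 1 to n, in order
    numbers = []
    for note in footnotes:
        match = _LEADING_NUMBER.match(note)
        numbers.append(int(match.group(1)) if match else None)
    if numbers != list(range(1, expected_count + 1)):
        raise ValueError(f"footnote numbers out of sequence: {numbers}")
    # 3. the number of references matches the number of footnotes
    found = main_text.count(PLACEHOLDER)
    if found != expected_count:
        raise ValueError(f"expected {expected_count} references, found {found}")
    # 4. number the placeholders in reading order
    counter = iter(range(1, expected_count + 1))
    return re.sub(re.escape(PLACEHOLDER), lambda _: f"[^{next(counter)}]", main_text)
```

  Because every check raises before anything is rewritten, a page either comes out fully numbered or is left untouched for manual review.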
gregschlom 4 days ago
"A very crude method would be to remove the last line every 16 pages but that would not be very robust if there were missing scans or inserts, etc. I prefer to check every last line of every page for the content of the signature mark, and measuring a Levenshtein distance to account for OCR errors."
I'm curious: did you also check whether the signature mark was indeed found every 16 pages? Were there any scans missing?
Great project btw!
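For readers wondering what the quoted check might look like, here is a minimal sketch: a plain dynamic-programming Levenshtein distance plus a tolerance test on the last line of a page. The expected mark text in the usage example, the function names and the 30% tolerance are assumptions, not values taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def looks_like_signature_mark(last_line: str, expected: str, max_ratio: float = 0.3) -> bool:
    """True when the OCR'd last line is within max_ratio edits of the expected mark."""
    a = " ".join(last_line.upper().split())
    b = " ".join(expected.upper().split())
    return levenshtein(a, b) <= max_ratio * max(len(b), 1)


# Hypothetical mark text, with typical OCR confusions (0 for O, 1 for I).
print(looks_like_signature_mark("MEM0IRES DE SAINT-S1MON. I", "MÉMOIRES DE SAINT-SIMON. I"))
```

Running this on every page, rather than every 16th, is what makes missing scans or inserts show up in the log.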
- bambax 4 days ago
  Yes, that's one of the (many) benefits of logging!
  And in fact, there is a hiatus, because the introduction at the beginning is from a different "sub-book", where the pages are numbered using roman numerals. Typically the introduction would be written and typeset after the main book had been typeset, so its number of pages would not be known in advance and that's why it uses a different numbering system.
  So one finds a signature mark on pages 9, 25, 41, 57, 73, 89, and then it starts again at page 93, 109, 125, 141, 157, 173, 189, etc. (those numbers come from the filenames of the scans, not the numbers printed on the pages).
  => Another reason for not starting with the first signature mark and simply adding 16, is that would miss the changing of sub-book (or any irregular number of pages, for any reason).
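  As an illustration, the hiatus can be found mechanically from the logged page numbers. This is a sketch with a made-up function name, using the page numbers quoted above:

```python
def irregular_gaps(mark_pages: list[int], expected: int = 16) -> list[tuple[int, int, int]]:
    """Return (previous_page, page, gap) wherever consecutive marks are not `expected` apart."""
    return [(a, b, b - a)
            for a, b in zip(mark_pages, mark_pages[1:])
            if b - a != expected]


pages = [9, 25, 41, 57, 73, 89, 93, 109, 125, 141, 157, 173, 189]
print(irregular_gaps(pages))  # prints [(89, 93, 4)]
```

  The single irregular gap, between scans 89 and 93, is where the second sub-book starts.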
gregschlom 4 days ago
For the human review part: maybe crowdsource it? Make the book available for reading online, with a UI to submit corrections (Wikipedia-style).
- cxr 4 days ago
  You're describing Wikisource, a Wikipedia sister project.
  Pros:
  - no setup
  - WMF foots the bill
  Cons:
  - you have to deal with the fragile egos of wikijerks
wll 4 days ago
Use a ~SoTA VLM like Gemini 2.0 Flash on the images. It’ll zero-shot de-hyphenated text in semantic HTML with linked footnotes.
- bambax 4 days ago
  Hallucinations are problematic, and they're hard to defend against, if there's only one source of truth. I was surprised by the creativity that LLMs showed for the simple task of placing footnotes references, as I explain in the post.
  .. But there's no harm in trying. At the very least it could be done in conjunction with traditional OCR to check for whole sentences of pure invention.
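  One way such a cross-check could work, sketched with Python's standard `difflib`: align the words of the LLM output with the words of the traditional OCR output, then flag any sentence whose words are mostly unmatched. The function names and the 0.6 threshold are arbitrary assumptions, and the sentence splitting is deliberately naive.

```python
import re
from difflib import SequenceMatcher


def _words(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def suspicious_sentences(llm_text: str, ocr_text: str, min_coverage: float = 0.6) -> list[str]:
    """Return sentences of llm_text whose words are mostly absent from ocr_text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", llm_text.strip()) if s]
    llm_words, owner = [], []          # owner[i] = index of the sentence holding word i
    for index, sentence in enumerate(sentences):
        tokens = _words(sentence)
        llm_words.extend(tokens)
        owner.extend([index] * len(tokens))
    # Word-level alignment between the two transcriptions of the same page.
    matcher = SequenceMatcher(None, llm_words, _words(ocr_text), autojunk=False)
    matched = [0] * len(sentences)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            matched[owner[i]] += 1
    totals = [owner.count(i) for i in range(len(sentences))]
    return [s for s, m, t in zip(sentences, matched, totals) if t and m / t < min_coverage]
```

  This only catches wholesale invention. A single hallucinated word inside an otherwise faithful sentence stays above the threshold and would need a finer-grained diff.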
  - ajcp 4 days ago
    We have been using Gemini 1.5 Flash at enterprise scale on a massively varied, form based document dataset and we have yet to see hallucinations, on either our ground-truth dataset or in our random audits for evaluation. Just to make sure though we threw some recursion on it: we take the output and give it right back to the model with the original prompt and output and ask it how accurate it is. If it thinks it's not accurate we tell it to rewrite the original prompt to provide for a more accurate output. We then stuff that right back down its own function :)
    For your use case it would be exponentially easier, as all you'd need is to provide Gemini your "zones" as the JSON schema for output, and it will quite reliably identify them.
    - bambax 3 days ago
      I just tested Gemini 1.5 Flash (interactively on Google AI Studio) and the results are far from acceptable.
      OCR seems good, on par with Google Vision.
      But the footnotes are not properly identified on most pages; they are properly identified when there is a large gap and the first line of the footnotes starts with a number; but when the footnotes block starts with text (continuing a footnote from a previous page) and/or the gap is small or almost non-existent, it fails (all text on the page is considered belonging to main text).
      But the main problem isn't even that, it's that it takes between 10 and 20 seconds per page. That would mean up to three hours or more per volume of 600 pages (600 × 20 s ≈ 3.3 hours). Google Vision takes less than one second per page.
      It's possible there is a setup cost and that doing batches or even full PDFs would be better, though. Do you have experience with this? And can you maybe share "prompt secrets" that would improve the results...?
      - staticman2 3 days ago
        Gemini 1.5 Pro worked better for me at Korean OCR on scans taken with a phone camera, so it must be better in some scenarios. You could try it, but it's certainly slow.
      - wll 3 days ago
        Could you share the page numbers where it has trouble with footnotes? I’ll give it a try.
      - ajcp 3 days ago
        - There is no reason it should take 10-20 seconds per page. In my experience with complex documents of 20-30 pages it takes 3-4 seconds.
        - Please message me for more info if you like as this should be something LLMs can handle with ease.
        This was my prompt (except for the URLs I provided the actual JPG, one as the example and the other the page to process). It took 2 seconds:
        "
        Extract the text from this page, correctly identifying the different zones the text appears in.
        Pages contain the following possible zones (colors refer to the image)
        -header (blue)
        -comments in the margins (green)
        -main text (pink)
        -footnotes (yellow)
        -signature mark (purple)
        {https://blog.medusis.com/assets/38/ss_example_zones_colors.j...}
        Extracting the text means correctly parsing the words in those different zones and reconstructing them properly, so as to produce readable text.
        - Many pages have comments or subtitles in the margins that don't belong in the main text
        - There are extensive footnotes that need to be properly tagged since they're not part of the main text
        - Each page has a header that you might need to remove for smooth reading across pages
        - On some pages, but not all, there's a 'signature mark' at the bottom. These need to be captured.
        So you need to process the output to properly identify all these elements and mark paragraphs correctly.
        {https://divers.medusis.net/boislisle/pub/tomes/t01/out/cropp...}
        "
        This is the JSON schema I told it to output in:
        { "type": "OBJECT", "properties": { "header": { "type": "string", "nullable": false }, "main_text": { "type": "STRING", "nullable": false }, "footnotes": { "type": "array", "nullable": true, "items": { "type": "string" } }, "comments": { "type": "STRING", "nullable": true }, "signature_mark": { "type": "STRING", "nullable": true } }, "required": [ "header", "main_text", "footnotes", "comments", "signature_mark" ] }
        This was the output:
        { "comments": "Guimaraes, son caractère et ses mœurs.", "footnotes": [ "1. Voyez une lettre du général Marquis de Saint-Simon, dans le Moniteur, du 18 août 1838. — Cet ouvrage, cessionnaire de Boisange, eut vingt et un volumes in-8° par Sautel, publiés par Delloye, et celle de 1883, publiée par les frères Ducharne, quarante volumes in-18.", "2. L'édition de 4820-4830, la publiée de l'édition de 1840.", "3. Mémoires complets et authentiques du duc de Saint-Simon sur le règne de Louis XIV, et la Régence, collationnés sur le manuscrit ori- ginal par M. Chéruel, et précédés d'une notice par Sainte-Beuve, de l'Académie française. — Paris, 1856, in-8° de 1840 pages. — Cette édition est imprimée en deux volumes, sans faute, et avec une exactitude parfaite, en raison des volumes de 1861 ; un autre, dans le format in-42, ac- compagné de dix-un.", "4. En treize volumes. — Un premier tirage, sans le concours ; un troi- sième, dans le format in-18, en 1883, et un quatrième en 1865, dans le format in-16.", "5. Cette maison venait d'inaugurer sa Bibliothèque des chemins de fer, qui contribua beaucoup au succès de cette publication.", "6. Sa propriété est particulièrement confirmée par des arrêts anté- rieurs à l'acquisition ; l'un du tribunal de première instance de Paris en date du 8 juin 1856, un autre de la Cour d'appel en date du 8 fé-" ], "header": "MEMOIRES DE SAINT-SIMON.", "main_text": "ce manuscrit, en y pratiquant toutefois ce qu'il appelait « les corrections et les retranchements indispensables ». Outre cette première édition, datée de 1829-1830, les Mémoires complets et authentiques du duc de Saint-Si- mon sur le siècle de Louis XIV et la Régence furent deux fois réimprimés par les soins du général de Saint-Simon en 1840 et 1856, avant que M. Chéruel obtint de faire l'édition de 1856, que depuis lors, on a considéré, non sur l'original une nouvelle revision ou d'ont sorties sans raison, comme édition principale, et plusieurs réim- pressions successives du texte sec, en moindre format, toutes faites par la maison Hachette³, qui devint propriétaire du manuscrit des Mémoires.", "signature_mark": null }
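        The sample output still contains end-of-line hyphenation such as "ori- ginal" and "Saint-Si- mon". A small post-processing pass could rejoin those words. This is a sketch with a hypothetical function name; it only joins when the continuation starts with a lowercase letter.

```python
import re

# A hyphen that follows a word character and is followed by whitespace and
# another word character: a candidate for a word split across printed lines.
_BREAK = re.compile(r"(?<=\w)-\s+(?=\w)")


def dehyphenate(text: str) -> str:
    """Rejoin words split across lines, e.g. 'ori- ginal' becomes 'original'."""
    def join(match: re.Match) -> str:
        next_char = match.string[match.end()]
        # Keep the break before capitals and digits ("Sainte- Beuve", "1829- 1830").
        return "" if next_char.islower() else match.group(0)
    return _BREAK.sub(join, text)


print(dehyphenate("du duc de Saint-Si- mon, sur le manuscrit ori- ginal"))
```

        A genuine compound broken exactly at its hyphen, with a lowercase second part, would lose the hyphen, so the result still needs checking against a word list or by a proofreader.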
        - ajcp 3 days ago
          Also, if you're wondering why it output "Guimaraes, son caractère et ses mœurs." as a comment, it's because my instructions were not clear enough and it thought the prompt was asking for it to provide comment on the text :D
  - richard_todd 3 days ago
    Yes I periodically try to get scanned images of Medieval Latin and Hebrew books ocr’d and translated by Gemini and ChatGPT… sometimes the results are amazing, but you have to proofread it all because they occasionally go off the rails. They will either skip sentences, or start regurgitating sentences from another similar text that they must have been trained on. Sometimes, after helping me with several pages, Gemini will suddenly decide to announce “I’m just an LLM, and I can’t process images”, and I have to encourage it to try anyway. It’s strange. Still overall a time saver.
    As for segmenting the images (header/footer/table/main text) I’ve been using Abbyy and it’s generally pretty good at it. It unfortunately often fails at footnotes in much the same way as described in the post, so it won’t get you past that hurdle.
- TacticalCoder 4 days ago
  I've not done it at scale but so far I've had very good experience with OCR using AI models. Maintenance bill for my car in German: OCR, boom, translation to French in no time. Works amazingly well.
fschuett 3 days ago
> After these experiments, it's clear some human review is needed for the text, including spelling fixes and footnote placement.
I just use ChatGPT for spelling fixes (i.e. when rewriting articles). You just have to instruct it to NOT auto-rephrase the article.
joeevans1000 2 days ago
I'm just trying to become literate in AI. Does anyone have any tips or links for how I could use the vectors for building a RAG?
lproven 2 days ago
> correclty parsing the words
In context: heh.
(I know, typo not OCR-o, but still...)