Show HN: Zerox – Document OCR with GPT-mini

(github.com)

228 points | by themanmaran 46 days ago

28 comments

  • serjester 46 days ago
    It should be noted that, for some reason, OpenAI prices GPT-4o-mini image requests at the same rate as GPT-4o [1]. I maintain a similar library [2], and we found OpenAI has subtle OCR inconsistencies with tables (numbers will be inaccurate). Gemini Flash, for all its faults, seems to do really well as a replacement while being significantly cheaper.

    Here’s our pricing comparison:

    *Gemini Pro* - $0.66 per 1k image inputs (batch) - $1.88 per 1k text outputs (batch API, 1k tokens) - 395 pages per dollar

    *Gemini Flash* - $0.066 per 1k image inputs (batch) - $0.53 per 1k text outputs (batch API, 1k tokens) - 1693 pages per dollar

    *GPT-4o* - $1.91 per 1k image inputs (batch) - $3.75 per 1k text outputs (batch API, 1k tokens) - 177 pages per dollar

    *GPT-4o-mini* - $1.91 per 1k image inputs (batch) - $0.30 per 1k text outputs (batch API, 1k tokens) - 452 pages per dollar

    [1] https://community.openai.com/t/super-high-token-usage-with-g...

    [2] https://github.com/Filimoa/open-parse

    • themanmaran 46 days ago
      Interesting. It didn't seem like gpt-4o-mini was priced the same as gpt-4o during our testing. We're relying on the OpenAI usage page, of course, which doesn't give request-by-request pricing. But we didn't see any huge usage spike after testing all weekend.

      For our testing we ran a 1000-page document set, all treated as images. We got to about 24M input / 0.4M output tokens for 1000 pages, which would be a pretty noticeable cost difference based on the listed token prices.

      gpt-4o-mini => (24M/1M * $0.15) + (0.4M/1M * $0.60) = $3.84

      gpt-4o => (24M/1M * $5.00) + (0.4M/1M * $15.00) = $126.00
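
      For reference, a quick sketch of the same arithmetic (prices are the listed per-1M-token rates; token counts are from the test run above):

          // Cost in dollars, given per-1M-token prices and the run's token counts.
          const inputTokens = 24e6;
          const outputTokens = 0.4e6;

          const cost = (inPerM: number, outPerM: number): number =>
            (inputTokens / 1e6) * inPerM + (outputTokens / 1e6) * outPerM;

          console.log(cost(0.15, 0.6)); // gpt-4o-mini => 3.84
          console.log(cost(5.0, 15.0)); // gpt-4o      => 126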

      • serjester 46 days ago
        The pricing is strange because the same images will use up 30X more tokens with mini. They even show this in the pricing calculator.

        [1] https://openai.com/api/pricing/

        • elvennn 46 days ago
          Indeed it does. But the price for the OCR's output tokens is also cheaper, so in total it's still much cheaper with gpt-4o-mini.
    • raffraffraff 45 days ago
      That price compares favourably with AWS Textract. Has anyone compared their performance? I ask because a recent post about OCR had Textract at or near the top in terms of quality.
      • ianhawes 45 days ago
        Can you locate that post? In my own experience, Google Document AI has superior quality but I'm looking for something a bit more objective and scientific.
      • aman2k4 45 days ago
        I'm using AWS Textract for scanning grocery receipts and I find it does the job very well and fast. Can you say which performance metric you have in mind?
    • yfontana 45 days ago
      [dead]
  • 8organicbits 46 days ago
    I'm surprised by the name choice, there's a large company with an almost identical name that has products that do this. May be worth changing it sooner rather than later.

    https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web

    • ot 46 days ago
      > there's a large company with an almost identical name

      Are you suggesting that this wasn't intentional? The name is clearly a play on "zero shot" + "xerox"

      • UncleOxidant 46 days ago
        I think they're suggesting that Xerox will likely sue them so might as well get ahead of that and change the name now.
        • 8organicbits 46 days ago
          Even if they don't sue, do you really want to deal with people getting confused and thinking you mean one of the many pre-existing OCR tools that Xerox produces? A search for "Zerox OCR" will lead to Xerox products, for example. Not worth the headache.

          https://duckduckgo.com/?q=Zerox+OCR

      • themanmaran 46 days ago
        Yup, definitely a play on the name. Also the idea of photocopying a page, since we do PDF => image => markdown.

        We're not planning to name a company after it or anything, just the open-source tool. And if Xerox sues I'm sure we could rename the repo lol.

        • ssl-3 45 days ago
          I was involved in a somewhat similar trademark issue once.

          I actually had a leg to stand on (my use was not infringing at all when I started using it), and I came out of it somewhat cash-positive, but I absolutely never want to go through anything like that ever again.

          > Yup definitely a play on the name. Also the idea of photocopying a page,

          But you? My God, man.

          With these words you have already doomed yourself.

          Best wishes.

          • neilv 45 days ago
            > With these words you have already doomed yourself.

            At least they didn't say "xeroxing a page".

        • haswell 45 days ago
          If they sue, this comment will be used to make their case.

          I guess I just don’t understand - how are you proceeding as if this is an acceptable starting point?

          With all respect, I don’t think you’re taking this seriously, and it reflects poorly on the team building the tool. It looks like this is also a way to raise awareness for Omni AI? If so, I’ve gotta be honest - this makes me want to steer clear.

          Bottom line, it’s a bad idea/decision. And when bad ideas are this prominent, it makes me question the rest of the decisions underlying the product and whether I want to be trusting those decision makers in the many other ways trust is required to choose a vendor.

          Not trying to throw shade; just sharing how this hits me as someone who has built products and has been the person making decisions about which products to bring in. Start taking this seriously for your own sake.

        • wewtyflakes 46 days ago
          It still seems reasonable that someone may be confused, especially since the one letter of the company name that was changed has an identical pronunciation (x --> z). It is like offering "Phacebook" or "Netfliks" competitors, but even less obviously different.
          • qingcharles 45 days ago
            Surprisingly, http://phacebook.com/ is for sale.
            • austinjp 44 days ago
              From personal experience, I'd wager that anyone buying that domain will receive a letter from a Facebook lawyer pretty quickly.
        • ned_at_codomain 46 days ago
          I would happily contribute to the legal defense fund.
    • HumblyTossed 45 days ago
      I'm sure that was on purpose.

      Edit: Reading the comments below, yes, it was.

      Very disrespectful behavior.

    • blacksmith_tb 46 days ago
      If imitation is the sincerest form of flattery, I'd have gone with "Xorex" myself.
    • 627467 45 days ago
      The commercial service is called OmniAI. Zerox is just the name of a component (GitHub repo, library) in a possible software stack.

      Am I the only one who finds these sorts of takes silly in a globalized world with instant communications? There are so many things to be named, everything named is instantly available around the world, and there are so many jurisdictions to cover, not all providing the same levels of protection to "trademarks".

      Are we really suggesting this issue is worth defending and spending resources on?

      What is the ground for confusion here? That a developer stumbles on this and thinks zerox is developed/maintained by Xerox? This developer gets confused but won't simply check who owns the repository? What if there's a variable called zerox?

      I mean, I get it: the whole point of IP at this point is really just to create revenue streams for the legal/admin industry, so we should all be scared and spend unproductive time naming a software dependency.

      • 8organicbits 45 days ago
        > Are we really suggesting this issue is worth defending and spending resources on?

        Absolutely.

        Sure, sometimes non-competing products have the same name. Or products sold exclusively in one country use the same name as a competitor in a different country. There are also companies that don't trademark or protect their names. Often no one even notices the common name.

        That's not what's happening here. Xerox is famously litigious about its trademark; it's often used as a case study. The product competes with Xerox OCR products in the same countries.

        It's a strange thing to be cavalier about and to openly document intent to use a sound-alike name. Besides, do you really want people searching for "Zerox OCR" to land on a Xerox page? There's no shortage of other names.

      • HumblyTossed 45 days ago
        > so we should all be scared and spend unproductive time naming a software dependency

        All 5 minutes it would take to name it something else?

    • pkaye 46 days ago
      Maybe call it ZeroPDF?
    • froh 46 days ago
      gpterox
  • jerrygenser 45 days ago
    I would categorize Azure Document AI accuracy as high, not "mid", including handwriting. However, the $1.50/1,000 pages model doesn't include layout detection.

    The $10/1,000 pages model includes layout detection (headers, etc.) as well as key-value pairs and checkbox detection.

    I have continued to do proofs of concept with Gemini and GPT, and in general with any new multimodal model that comes out, but so far none is on par with Azure's checkbox detection.

    In fact the results from Gemini/GPT4 aren't even good enough to use as a teacher for distillation of a "small" multimodal model specializing in layout/checkbox.

    I would also like to shout out Surya OCR, which is up and coming. It's source-available and free below a certain funding or revenue milestone - I think $5M. It doesn't have word-level detection yet, but it's one of the more promising OCR tools I'm aware of outside the hyperscalers and heavyweight commercial offerings.

    • ianhawes 45 days ago
      Surya OCR is great in my test use cases! Hoping to try it out in production soon.
  • hugodutka 46 days ago
    I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

    1. Prompt with examples. I included an example image with an example transcription as part of the prompt. This made GPT make fewer mistakes and improved output accuracy (see the sketch after this list).

    2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT’s output. If there was a significant difference (less than 90% overlap) I would log a warning. This helped detect cases when GPT omitted entire paragraphs of text.
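
    For (1), a minimal sketch of the few-shot message layout with the chat completions API (the URLs and example transcription are placeholders, not my actual prompt):

        // Few-shot prompting: one worked example (image + known-good transcription)
        // before the page we actually want transcribed.
        declare const exampleImageUrl: string; // placeholder example page
        declare const exampleMarkdown: string; // its known-good transcription
        declare const pageImageUrl: string; // the page to transcribe

        const messages = [
          { role: "system", content: "Convert the page image to markdown." },
          // The worked example the model should imitate.
          {
            role: "user",
            content: [{ type: "image_url", image_url: { url: exampleImageUrl } }],
          },
          { role: "assistant", content: exampleMarkdown },
          // The actual page to transcribe.
          {
            role: "user",
            content: [{ type: "image_url", image_url: { url: pageImageUrl } }],
          },
        ];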

    • themanmaran 46 days ago
      One option we've been testing is the `maintainFormat` mode. This tries to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. Especially useful if you've got tables that span pages. The flow is pretty much:

      - Request #1 => page_1_image

      - Request #2 => page_1_markdown + page_2_image

      - Request #3 => page_2_markdown + page_3_image
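
      A rough sketch of that loop (`completePage` here is a hypothetical helper that sends the messages to the model and returns the page's markdown):

          // maintainFormat chaining: feed each page's markdown into the next request.
          type Message = { role: "system" | "user"; content: unknown };

          async function convertDocument(
            pageImages: string[],
            completePage: (messages: Message[]) => Promise<string>,
          ): Promise<string[]> {
            const pages: string[] = [];
            let priorPage = "";
            for (const image of pageImages) {
              const messages: Message[] = [];
              if (priorPage) {
                // The prior page's markdown keeps tables that span pages consistent.
                messages.push({
                  role: "system",
                  content: `Markdown must maintain consistent formatting with the following page:\n\n"""${priorPage}"""`,
                });
              }
              messages.push({
                role: "user",
                content: [{ type: "image_url", image_url: { url: image } }],
              });
              priorPage = await completePage(messages);
              pages.push(priorPage);
            }
            return pages;
          }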

    • sidmitra 46 days ago
      >frequency of character triples

      What are character triples? Are they trigrams?

      • hugodutka 46 days ago
        I think so. I'd normalize the text first: lowercase it and remove all non-alphanumeric characters. E.g. for the phrase "What now?" I'd create these trigrams: wha, hat, atn, tno, now.
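
        A sketch of that check (`embeddedPdfText` and `gptOutput` are placeholders; the 90% threshold is the one mentioned above):

            // Count character trigrams after normalizing as described above.
            function trigramCounts(text: string): Map<string, number> {
              const normalized = text.toLowerCase().replace(/[^a-z0-9]/g, "");
              const counts = new Map<string, number>();
              for (let i = 0; i + 3 <= normalized.length; i++) {
                const gram = normalized.slice(i, i + 3);
                counts.set(gram, (counts.get(gram) ?? 0) + 1);
              }
              return counts;
            }

            // Fraction of source trigram occurrences that also appear in the output.
            function trigramOverlap(source: string, output: string): number {
              const src = trigramCounts(source);
              const out = trigramCounts(output);
              let total = 0;
              let matched = 0;
              for (const [gram, n] of src) {
                total += n;
                matched += Math.min(n, out.get(gram) ?? 0);
              }
              return total === 0 ? 1 : matched / total;
            }

            declare const embeddedPdfText: string; // text extracted from the PDF
            declare const gptOutput: string; // the model's transcription

            if (trigramOverlap(embeddedPdfText, gptOutput) < 0.9) {
              console.warn("Possible omitted or hallucinated text");
            }
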
    • nbbaier 45 days ago
      > I extracted the embedded text from the PDF

      What did you use to extract the embedded text during this step, other than some other OCR tech?

  • ndr_ 45 days ago
    Prompts in the background:

      const systemPrompt = `
        Convert the following PDF page to markdown. 
        Return only the markdown with no explanation text. 
        Do not exclude any content from the page.
      `;
    
    For each subsequent page:

      messages.push({
        role: "system",
        content: `Markdown must maintain consistent formatting with the following page: \n\n"""${priorPage}"""`,
      });

    Could be handy for general-purpose frontend tools.

    • markous 45 days ago
      so this is just a wrapper around gpt-4o mini?
  • beklein 46 days ago
    Very interesting project, thank you for sharing.

    Are you supporting the Batch API from OpenAI? This would lower costs by 50%. Many OCR tasks are not time-sensitive, so this might be a very good tradeoff.

    • themanmaran 46 days ago
      That's definitely the plan. Using batch requests would move this closer to the $2/1,000 pages mark, which is effectively the AWS pricing.
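
      For what it's worth, a sketch of what that could look like against OpenAI's Batch API (the prompt wording and `imageDataUrls` input are placeholders, not our implementation):

          import fs from "node:fs";
          import OpenAI from "openai";

          declare const imageDataUrls: string[]; // base64 data URLs of rendered pages
          const openai = new OpenAI();

          // One chat completion request per page, as JSONL.
          const lines = imageDataUrls.map((url, i) =>
            JSON.stringify({
              custom_id: `page-${i}`,
              method: "POST",
              url: "/v1/chat/completions",
              body: {
                model: "gpt-4o-mini",
                messages: [
                  { role: "system", content: "Convert the following PDF page to markdown." },
                  { role: "user", content: [{ type: "image_url", image_url: { url } }] },
                ],
              },
            }),
          );
          fs.writeFileSync("pages.jsonl", lines.join("\n"));

          // Upload the JSONL and start the batch; results arrive within 24 hours.
          const file = await openai.files.create({
            file: fs.createReadStream("pages.jsonl"),
            purpose: "batch",
          });
          const batch = await openai.batches.create({
            input_file_id: file.id,
            endpoint: "/v1/chat/completions",
            completion_window: "24h",
          });
          console.log(batch.id);
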
  • surfingdino 46 days ago
    Xerox tried it a while ago. It didn't end well: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
    • merb 46 days ago
      > This is not an OCR problem (as we switched off OCR on purpose)
      • yjftsjthsd-h 46 days ago
        It also says

        > This is not an OCR problem, but of course, I can't have a look into the software itself, maybe OCR is still fiddling with the data even though we switched it off.

        But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them to not make a mistake in OCR because they thought the page would probably say something different than it does.

        • mlyle 46 days ago
          > It also says...

          It was a problem with employing the JBIG2 compression codec, which cuts and pastes things from different parts of the page to save space.

          > But the point stands either way; LLMs are prone to hallucinations already, so I would not trust them to not make a mistake in OCR because they thought the page would probably say something different than it does.

          Anyone trying to solve for the contents of a page uses context clues. Even humans reading.

          You can OCR raw characters (performance is poor); use letter frequency information; use a dictionary; use word frequencies; or use even more context to know what content is more likely. More context is going to result in many fewer errors (of course, it may result in a bigger proportion of the remaining errors seeming to have significant meaning changes).

          A small LLM is just a good way to encode this kind of "how likely are these given alternatives" knowledge.

          • tensor 46 days ago
            Traditional OCR engines like Tesseract crucially have strong measures of their accuracy levels, including when they employ dictionaries or the like to help with accuracy. LLMs, on the other hand, give you zero guarantees, and have some pretty insane edge cases.

            With a traditional OCR architecture maybe you'll get a symbol or two wrong, but an LLM can give you entirely new words or numbers not in the document, or even omit sections of the document. I'd never use an LLM for OCR like this.

            • mlyle 44 days ago
              If you use an LLM stupidly, sure. But you can get pseudo-probabilities for the next symbol from the LLM and use e.g. Bayes' rule to combine them with the information of how well each alternative matches the page. You can also report the total uncertainty at the end.

              Done properly, this should strictly improve the results.
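
              A toy sketch of what "done properly" could mean (the log-probabilities and names are illustrative assumptions):

                  // Bayes' rule in log space: the posterior score for a candidate
                  // reading is the LLM prior plus the visual match likelihood.
                  interface Candidate {
                    text: string;
                    llmLogProb: number; // log P(text) from the language model
                    visualLogLik: number; // log P(pixels | text) from the OCR side
                  }

                  function rankByPosterior(candidates: Candidate[]): Candidate[] {
                    return candidates
                      .map((c) => ({ ...c, score: c.llmLogProb + c.visualLogLik }))
                      .sort((a, b) => b.score - a.score);
                  }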

          • surfingdino 46 days ago
            It's all fun and games until you need to prove something in court or to the tax office. I don't think that throwing an LLM into this mix helps.
            • wmf 46 days ago
              Generally when OCRing documents you should keep the original scans so you can refer back to them in case of any questions or disputes.
        • qingcharles 45 days ago
          It depends what your use-case is. At a low enough cost this would work for a project I'm doing where I really just need to be able to mostly search large documents. Less than 100% accuracy, and a lost or hallucinated paragraph here and there, wouldn't be a deal-killer, especially if the original page image is available to the user too.

          Additionally, this might also work if you are feeding the output to a bunch of humans to proofread.

    • ctm92 45 days ago
      That was also what first came to my mind; I guess Zerox might be a reference to this.
  • binalpatel 45 days ago
    You can do some really cool things now with these models, like asking them to extract not just the text but also figures/graphs as nodes/edges, and it works very well. Back when GPT-4 with vision came out I tried this with a simple prompt plus dumping in a pydantic schema of what I wanted, and it was spot on. Pretty much this (before JSON mode was supported):

        You are an expert in PDFs. You are helping a user extract text from a PDF.
    
        Extract the text from the image as a structured json output.
    
        Extract the data using the following schema:
    
        {Page.model_json_schema()}
    
        Example:
        {{
          "title": "Title",
          "page_number": 1,
          "sections": [
            ...
          ],
          "figures": [
            ...
          ]
        }}
    
    
    https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
  • constantinum 45 days ago
    If you want to do document OCR/PDF text extraction with decent accuracy without using an LLM, do give LLMWhisperer[1] a try.

    Try with any PDF document in the playground - https://pg.llmwhisperer.unstract.com/

    [1] - https://unstract.com/llmwhisperer/

  • bearjaws 46 days ago
    I did this for images using Tesseract for OCR + Ollama for AI.

    Check it out, https://cluttr.ai

    Runs entirely in browser, using OPFS + WASM.

  • amluto 45 days ago
    My intuition is that the best solution here would be a division of labor: have the big multimodal model identify tables, paragraphs, etc., and output a mapping between segments of the document and textual output. Then a much simpler model that doesn't try to hold entire conversations can process those segments into their contents.

    This will perform worse in cases where whatever understanding the large model has of the contents is needed to recognize indistinct symbols. But it will avoid cases where that very same understanding causes contents to be understood incorrectly due to the model’s assumptions of what the contents should be.

    At least in my limited experiments with Claude, it’s easy for models to lose track of where they’re looking on the page and to omit things entirely. But if segmentation of the page is explicit, one can enforce that all contents end up in exactly one segment.
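
    A sketch of the interface this division of labor implies (all types and helpers here are hypothetical):

        // Pass 1: the large multimodal model only segments the page.
        interface Segment {
          id: number;
          kind: "paragraph" | "table" | "figure" | "header";
          bbox: [x: number, y: number, width: number, height: number];
        }
        declare function segmentPage(pageImage: Uint8Array): Promise<Segment[]>;

        // Pass 2: a simpler model transcribes each cropped segment independently,
        // so one can enforce that all content lands in exactly one segment.
        declare function cropSegment(pageImage: Uint8Array, s: Segment): Uint8Array;
        declare function transcribeSegment(crop: Uint8Array): Promise<string>;

        async function readPage(pageImage: Uint8Array): Promise<string[]> {
          const segments = await segmentPage(pageImage);
          return Promise.all(
            segments.map((s) => transcribeSegment(cropSegment(pageImage, s))),
          );
        }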

  • aman2k4 45 days ago
    I am using AWS Textract + LLM (OpenAI/Claude) to read grocery receipts for <https://www.5outapp.com>

    So far, I have collected over 500 receipts from around 10 countries with 30 different supermarkets in 5 different languages.

    What has worked for me so far is having control over OCR and processing (for formatting/structuring) separately. I don't have the figures to provide a cost structure, but I'm looking for other solutions to improve both speed and accuracy. Also, I need to figure out a way to put a metric around accuracy. I will definitely give this a shot. Thanks a lot.

    • sleno 45 days ago
      Cool design. FYI the "Try now" card looks like it didn't render right; I'm just seeing a blank box around the button.
      • aman2k4 45 days ago
        You mean in the web version? It is supposed to look like a blank box in the shape of a rectangular grocery bill, but I suppose the design can be a bit better there. Thanks for the feedback.
  • lootsauce 45 days ago
    In my own experiments I have had major failures where much of the text was fabricated by the LLM, to the point where I just find it hard to trust even with great prompt engineering. What I have been very impressed with is its ability to take medium-quality OCR from Acrobat, with poor formatting, lots of errors, and punctuation problems, and render 100% accurate and properly formatted output simply by being asked to correct the OCR output. This approach, using traditional cheap OCR for grounding, might be a really robust and cheap option.
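
    That correction pass is easy to sketch (the model choice and prompt wording here are assumptions, not my exact setup):

        import OpenAI from "openai";

        const openai = new OpenAI();

        // Ask the model to fix OCR errors without adding or removing content.
        async function correctOcr(rawOcrText: string): Promise<string> {
          const response = await openai.chat.completions.create({
            model: "gpt-4o-mini",
            messages: [
              {
                role: "system",
                content:
                  "Correct the OCR errors in the following text. Fix spacing, " +
                  "punctuation, and obvious character mistakes, but do not add, " +
                  "remove, or reorder content.",
              },
              { role: "user", content: rawOcrText },
            ],
          });
          return response.choices[0].message.content ?? "";
        }
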
  • refulgentis 46 days ago
    FWIW, I have it on good sourcing that OpenAI supplies Tesseract output to the LLM, so you're in a great place - best of all worlds.
  • jimmyechan 45 days ago
    Congrats! Cool project! I’d been curious about whether GPT would be good for this task. Looks like this answers it!

    Why did you choose markdown? Did you try other output formats and see if you get better results?

    Also, I wonder how HTML performs. It would be a way to handle tables with groupings/merged cells.

    • themanmaran 45 days ago
      I think I'll add an optional configuration for HTML vs. Markdown, which at the end of the day will just prompt the model differently.

      I've not seen a meaningful difference between the two, except when it comes to tables. It seems like HTML tends to outperform markdown tables, especially when you have a lot of complexity (e.g. tables within tables, lots of subheaders).

  • josefritzishere 46 days ago
    Xerox might want to have a word with you about that name.
  • samuell 45 days ago
    One problem I've not found any OCR solution to handle well is complex column-based layouts in magazines. Part of the issue may be that there are often images spanning anything from one to all columns, so the text sometimes flows in funny ways. But in this day and age, this must be possible for the best AI-based tools to handle?
  • jagermo 45 days ago
    Ohh, this could finally be a great way to get my TTRPG books readable on Kindle. I'll give it a try, thanks for that.
  • 8organicbits 46 days ago
    > And 6 months from now it'll be fast, cheap, and probably more reliable!

    I like the optimism.

    I've needed to include human review when using previous-generation OCR software where I needed the results to be accurate. It's painstaking, but the OCR offered a speedup over fully-manual transcription. Have you given any thought to human-in-the-loop processes?

    • themanmaran 45 days ago
      I've been surprised so far by LLMs' capability, so I hope it continues.

      On the human-in-the-loop side, it's really use-case specific. For a lot of my company's work, it's focused on getting trends from large sets of documents.

      Ex: "categorize building permits by municipality". If the OCR was wrong on a few documents, it's still going to capture the general trend. If the use case was "pull bank account info from wire forms" I would want a lot more double checking. But that said, humans also have a tendency to transpose numbers incorrectly.

      • raisedbyninjas 45 days ago
        Our human-in-the-loop process with traditional OCR uses confidence scores from regions of interest and the page coordinates to speed up the review process. I wish the LLM could provide that, but both seem far off on the horizon.
      • 8organicbits 45 days ago
        Hmm, sounds like different goals. I don't work on that project any longer but it was a very small set of documents and they needed to be transcribed perfectly. Every typo in the original needed to be preserved.

        That said, there's huge value in lossy transcription elsewhere, as long as you can account for the errors they introduce.

    • throwthrowuknow 45 days ago
      Have you tried using the GraphRAG approach of just rerunning the same prompts multiple times and then giving the results along with a prompt to the model telling it to extract the true text and fix any mistakes? With mini this seems like a very workable solution. You could even incorporate one or more attempts from whatever OCR you were using previously.

      I think that is one of the key findings from the GraphRAG paper: the GPT can replace the human in the loop.
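
      A sketch of that rerun-and-reconcile loop (`transcribePage` is a hypothetical single-pass helper; the prompt wording is illustrative):

          import OpenAI from "openai";

          const openai = new OpenAI();
          declare function transcribePage(pageImage: string): Promise<string>;

          // Run the same transcription several times, then ask the model to
          // reconcile the drafts into the most likely true text.
          async function reconciledTranscription(pageImage: string, runs = 3): Promise<string> {
            const drafts = await Promise.all(
              Array.from({ length: runs }, () => transcribePage(pageImage)),
            );
            const response = await openai.chat.completions.create({
              model: "gpt-4o-mini",
              messages: [
                {
                  role: "system",
                  content:
                    "You are given several independent OCR transcriptions of the same page. " +
                    "Produce the most likely true text, fixing mistakes where the drafts disagree.",
                },
                {
                  role: "user",
                  content: drafts.map((d, i) => `Draft ${i + 1}:\n${d}`).join("\n\n"),
                },
              ],
            });
            return response.choices[0].message.content ?? "";
          }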

  • downrightmike 46 days ago
    Does it also produce a confidence number?
    • ndr_ 45 days ago
      The closest thing is "logprobs": https://cookbook.openai.com/examples/using_logprobs

      However, commenters around here have noted that these have likely not been fine-tuned to correlate with accuracy for plain-text LLM uses. I would be interested in hearing findings for multimodal LLM use cases!
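
      A sketch of pulling those as a crude page-level confidence proxy (whether it correlates with OCR accuracy is exactly the open question above):

          import OpenAI from "openai";

          const openai = new OpenAI();

          // Request per-token log-probabilities alongside the completion.
          const response = await openai.chat.completions.create({
            model: "gpt-4o-mini",
            logprobs: true,
            top_logprobs: 2, // also return the runner-up token at each position
            messages: [{ role: "user", content: "Transcribe the attached page." }],
          });

          // Average token logprob as a rough confidence score for the page.
          const tokens = response.choices[0].logprobs?.content ?? [];
          const avgLogProb =
            tokens.reduce((sum, t) => sum + t.logprob, 0) / Math.max(tokens.length, 1);
          console.log(avgLogProb);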

    • tensor 46 days ago
      No, there is no vision LLM that produces confidence numbers to my knowledge.
    • wildzzz 46 days ago
      The AI says it's 100% confident that its hallucinations are correct.
    • ravetcofx 46 days ago
      I don't think OpenAI's API for gpt-4o-mini has any such mechanism.
  • Dkuku 46 days ago
    Compare with gpt-4o: gpt-4o-mini uses around 20 times more tokens for the same image: https://youtu.be/ZWxBHTgIRa0?si=yjPB1FArs2DS_Rc9&t=655
  • ipkstef 45 days ago
    I think I'm missing something... why would I pay to OCR the images when I can do it locally for free? Tesseract runs pretty well on just a CPU; you wouldn't even need something crazy powerful.
    • daemonologist 45 days ago
      Tesseract works great for pure label-the-characters OCR, which is sufficient for books and other sources with straightforward layouts, but it doesn't handle weird layouts (tables, columns, tables with columns in each cell, etc.). People will do absolutely depraved stuff with Word and PDF documents, and you often need semantic understanding to decipher it.

      That said, sometimes no amount of understanding will improve the OCR output because a structure in a document cannot be converted to a one-dimensional string (short of using HTML/CSS or something). Maybe we'll get image -> HTML models eventually.

    • gregolo 45 days ago
      And OpenAI uses Tesseract in the background, as it sometimes answers me that the Hungarian language is not installed for Tesseract.
      • s5ma6n 45 days ago
        I would be extremely surprised if that were the case. There are "open-source" multimodal LLMs that can extract text from images, as proof that the idea works.

        Probably the model is hallucinating and adding "Hungarian language is not installed for Tesseract" to the response.

  • throwthrowuknow 45 days ago
    Have you compared the results to special-purpose OCR-free models that do image-to-text with layout? My intuition is that mini should be just as good, if not better.
  • ravetcofx 46 days ago
    I'd be more curious to see the performance of local models like LLaVA, etc.
  • cmpaul 46 days ago
    Great example of how LLMs are eliminating/simplifying giant swathes of complex tech.

    I would love to use this in a project if it could also caption embedded images to produce something for RAG...

    • hpen 46 days ago
      Yay! Now we can use more RAM, Network, Energy, etc to do the same thing! I just love hot phones!
      • hpen 45 days ago
        Oops guess I'm not sippin' the koolaid huh?
  • jdthedisciple 45 days ago
    Very nice, seems to work pretty well!

    Just

        maintainFormat: true
    
    did not seem to have any effect in my testing.
  • fudged71 46 days ago
    Llama 3.1 now has image support, right? Could this be adapted there as well, maybe with Groq for speed?
    • themanmaran 46 days ago
      Yup! I want to evaluate a couple different model options over time. Which should be pretty simple!

      The main thing we're doing is converting documents to a series of images, and then aggregating the response. So we should be model agnostic pretty soon.

    • daemonologist 45 days ago
      Meta trained a vision encoder (page 54 of the Llama 3.1 paper) but has not released it as far as I can tell.
  • daft_pink 46 days ago
    I would really love something like this that could be run locally.