Ask HN: What are you using to parse PDFs for RAG?

Hi, I'm looking for a simple way to convert PDFs into markdown with integrated images and tables. Tried Llamaindex, but no integrated images. Tried Langchain, but some PDFs will have the footer being parsed before the top. Tried to use Adobe PDF API, but have to pay $25K upfront!

163 points | by carlbren 461 days ago

48 comments

whakim 457 days ago
We have been using different things for text, images, and tables. I think it's worth pointing out that PDFs are extremely messy under-the-hood so expecting perfect output is a fool's errand; transformers are extremely powerful and can often do surprisingly well even when you've accidentally mashed a set of footnotes into the middle of a paragraph or something.
For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required. It is also quite flexible with regards to chunking and categorization, which is important when you start thinking about your embedding step. OTOH it can definitely be computationally expensive to process long documents which require OCR.
For images, we've used PyMuPDF. The main weakness we've found is that it doesn't seem to have a good story for dealing with vector images - it seems to output its own proprietary vector type. If anyone knows how to get it to output SVG that'd obviously be amazing.
For tables, we've used Camelot. Tables are pretty hard though; most libraries are totally fine for simple tables, but there are a ton of wild tables in PDFs out there which are barely human-readable to begin with.
For tables and images specifically, I'd think about what exactly you want to do with the output. Are you trying to summarize these things (using something like GPT-4 Vision?) Are you trying to present them alongside your usual RAG output? This may inform your methodology.
[-]
- Tade0 457 days ago
  > I think it's worth pointing out that PDFs are extremely messy under-the-hood so expecting perfect output is a fool's errand
  This.
  A while ago someone asked me why their banking solution doesn't allow to paste payment amounts (among other things) and surely there must be a way to do it correctly.
  Not with PDF. What a person reads as a single number may be any grouping of entities which may or may not paste correctly.
  Some banks simply don't want to deal with this sort of headache.
- mkaszkowiak 457 days ago
  How do you combine the outputs? Wouldn't there be data duplication between unstructured text and tables?
  [-]
  - whakim 456 days ago
    We just skip several of unstructured's categories, such as tables and images. We also do some deduplication post-ANN as we want to optimize for novelty as well as relevance. That being said, how are you planning to embed an image or a table to make it searchable? It sounds simple in theory, but how do you generate an actually good image summary (without spending huge amounts of money filling OpenAI's coffers for negligible benefit)? How do you embed a table?
    [-]
    - mkaszkowiak 456 days ago
      Thanks for answering! In my case, I don't directly use RAG; but rather post-process documents via LLMs to extract a set of specific answers. That's also why I've asked about deduplication - asking LLM to provide an answer from 2 different data sources (invalid unstructured table text & valid structured table contents) quickly ramps up errors.
infecto 457 days ago
I am surprised nobody has mentioned it yet.
If this is for anything slightly commercial related you are probably going to have the best luck using Textract/Document Intelligence/Document AI. Nothing else listed in the comments is as accurate, especially when trying to extract forms, tables and text. Multi-modal will take care of your the images. The combination of those two will get you a great representation of the PDF.
Opensource tools work and can be extremely powerful but you 1) won't have images and 2) your workflows will break if you are not building it for a specific pdf template.
[-]
- cpa 457 days ago
  I completely agree. Like the previous comment mentioned, I've explored this area over the past year, and in my tests, the offerings from Amazon, Google, and Microsoft were far superior to the open-source options, especially for long documents. It's unfortunate, but that's the way it is.
  OCR itself isn't the issue; most open-source models handle that adequately. The problem lies in the lack of comprehensive features:
  - Identification of chapters and headings
  - Segmentation of headers and footers with an easy way to filter them out
  - Handling of images
  - Correctly processing two-column or other non-standard layouts
  - Avoiding out-of-memory (OOM) errors, which, while not a flaw of the open-source software itself, is a common and frustrating issue
  - Transcription of tables and forms, which exists in open-source models but isn't as effective
  These ergonomic features are where the open-source solutions fall short.
  [-]
  - constantinum 456 days ago
    Perfectly put!
    Other challenges are:
    1. Complex layout tables, tables that span multiple pages
    2. Handwritten text - in loan processing and income tax documents
    3. Checkboxes and radio buttons are so important in insurance and loan processing to automate workflows.
    4. Scanned images
    5. Photographed documents from the field.
    6. Orientation - landscape mode vs. Portrait mode
    7. Text represented as a Bezier curve
    8. Non-aligned texts in multicolumn text layout
    9. Background images and watermarks
    Other important considerations:
    1. Privacy and security - cloud vs. On-premise
    2. Performance and speed of extraction at scale
    3. If you are ultimately feeding to LLMs to intelligence then how does the extractor help in reducing tokens
    Anyone curious about why parsing PDF is hell for RAG you can refer this - https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
    [edit] - formatting
  - vikp 456 days ago
    Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.
- equilibrium 457 days ago
  Having explored this topic over the passed month this is the correct answer. And it has been mentioned in the comments by jumploops
- CharlieDigital 457 days ago
  Azure Document Intelligence with the Document Layout Model is pretty damn amazing at this.
  Key thing is it labels titles, headers, sections, etc. This way you can stuff headers into the child chunks for much better RAG.
reerdna 457 days ago
For use in retrieval/RAG, an emerging paradigm is to not parse the PDF at all.
By using a multi-modal foundation model, you convert visual representations ("screenshots") of the pdf directly into searchable vector representations.
Paper: Efficient Document Retrieval with Vision Language Models - https://arxiv.org/abs/2407.01449
Vespa.ai blog post https://blog.vespa.ai/retrieval-with-vision-language-models-... (my day job)
[-]
- attilakun 457 days ago
  I do something similar in my file-renamer app (sort.photos if you want to check it out):
  1. Render first 2 pages of PDF into a JPEG offline in the Mac app.
  2. Upload JPEG to ChatGPT Vision and ask what would be a good file name for this.
  It works surprisingly well.
- qeternity 457 days ago
  I'm sure this will change over time, but I have yet to see an LMM that performs (on average) as well as decent text extraction pipelines.
  Text embeddings for text also have much better recall in my tests.
- infecto 457 days ago
  No multi-modal model is ready for that in reality. The accuracy from other tools to extract tables and text are far superior.
- authorfly 456 days ago
  You have detractors, but this is the future.
- cpursley 456 days ago
  Is anyone actually having success with this approach? If so, how and with what models (and prompts)?
  [-]
  - distracted_boy 456 days ago
    Claude.ai handles tables very well, at least in my tests. It could easily convert a table from a financial document into a markdown table, among other things.
jumploops 457 days ago
In my experience Azure’s Form Recognizer (now called “Document Intelligence”) is the best (cheapest/most accurate) PDF parser for tabular data.
If I were working on this problem in 2024, I’d use Azure to pre-process all docs into something machine parsable, and then use an LLM to transform/structure the processed content into my specific use-case.
For RAG, I’d treat the problem like traditional search (multiple indices, preprocess content, scoring, etc.).
Make the easy things easy, and the hard things possible.
[-]
- mkaszkowiak 457 days ago
  Did you encounter hidden costs when using Azure Document Intelligence? I processed some PDFs using the paid tier, but the resulting costs were way higher than expected, despite using a prebuilt layout model for only structured extraction. Have no clue what could cause it, no extra details on the billing page. Not sure if the price is misleading, or if it's a skill issue on my part :)
  [-]
  - jumploops 456 days ago
    We did not, I remember costs matching our expectations.
    With that said, I have only used the previous tool (Form Recognizer) in production. Not sure if the new rebrand/product suite has more opaque costs.
serjester 456 days ago
Open Source Full Featured: https://github.com/Filimoa/open-parse/ [mine]
https://docs.llamaindex.ai/en/stable/api_reference/node_pars... [text splitters lose page metadata]
https://github.com/VikParuchuri/marker [strictly to markdown]
Layout Parsers: These are collections of ML models to parse the core "elements" from a page (heading, paragraph, etc). You'll still need to work on combining these elements into queryable nodes. https://github.com/Layout-Parser/layout-parser
https://github.com/opendatalab/PDF-Extract-Kit
https://github.com/PaddlePaddle/PaddleOCR
Commercial: https://reducto.ai/ [great, expensive]
https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse... [cheapest, but buggy]
https://cloud.google.com/document-ai https://aws.amazon.com/textract/
esquivalience 457 days ago
PyMuPDF seems to be intended for this use-case and mentions images:
https://medium.com/@pymupdf/rag-llm-and-pdf-conversion-to-ma...
(Though the article linked above has the feeling, to me, of being at least partly AI-written, which does cause me to pause)
> Update: We have now published a new package, PyMuPDF4LLM, to easily convert the pages of a PDF to text in Markdown format. Install via pip with `pip install pymupdf4llm`. https://pymupdf4llm.readthedocs.io/en/latest/
[-]
- barrenko 457 days ago
  So it works for complicated tables et al.?
  [-]
  - postepowanieadm 457 days ago
    You may have to write a custom tracer.
Teleoflexuous 457 days ago
My use case is research papers. That means very clear text, combined with graphs of varying form and quality and finally occasional formulas.
Two approaches I had most, but not full, success with are: 1) converting to image with pdf2image, then reading with pytesseract 2) throwing whole pdfs into pypdf 3) experimental multimodal models
You can get more if you make content more predictable (if you know this part is going to be pure text just put it in pypdf, if you know this is going to be a math formula explain the field to the model and have it read it back for high accessibility needs audience) the better it will go, but it continues to be a nightmare and a bottleneck.
[-]
- freethejazz 446 days ago
  Depending on how much structure you want to extract before passing the pdf contents to the next step in your pipeline, this paper[1] might be helpful in surfacing more options. It's a review/benchmark of numerous tools applied to the information extraction of academic documents. I haven't been through to evaluate the solutions they examined, but it's how I discovered GROBID and IMO lays out the strengths of each approach clearly.
  [1] https://arxiv.org/pdf/2303.09957
- authorfly 456 days ago
  I have great news I wish someone delivered to me when I was in your shoes - try "GROBID". It parses papers into objects with abstract/body/figures! It will help you out a great deal. It is designed for papers and can extract the text almost flawlessly, but also give information on graphs for separate processing. I have several years experience with academic text processing (including presentations) working with an Academic Publisher if I could be helpful to anything?
  [-]
  - Teleoflexuous 456 days ago
    I have no idea how did I miss them last time I was looking around, unless they grew significantly over last half a year or so. I'll check it out when I get back to this project, thanks.
    I wish I was hiring, if that's what you're asking ;) Otherwise, if you have any ideas for processing formulas (even just for reading them out, but any extra steps towards expressing what they mean - ' 'sum divided by count' is 'mean'/'average' value ' being the most simple example I can think of) I'd love to hear them. Novel ideas in technical papers are often expressed with formulas which aren't that complicated conceptually, but are critical to understanding the whole paper and that was another piece I was having very mixed results with.
    [-]
    - authorfly 456 days ago
      No worries. Sure, as to Formulas... I suspect many of them are LaTeX. If it is possible to parse that, it could help? At sufficient picture quality, vision models can accurately parse images of formulas to photos.
      Neither will probably help you with a "readable" formula system because in my experience the readers that do this for LaTeX or normal formula text have flaws any way (it's also slightly cultural and dependent on field of study). Maybe the best bet is a prompt to a vision model with "read this formula out loud in a digestible, understandable concise way".. though this may have issues with the recall accuracy.
- siamese_puff 456 days ago
  Check out appjsonify for research papers
kkfx 457 days ago
You can start to look at pdftotext -layout and pandoc maybe.
Personally I hope a day publishers start learning about the value of data and their representations so they decide to embed them like a *sv file attached to a pdf where the tabular data are immediately available, a .gp and alike file for graphs etc. Essentially the concept of embedding pdf "sources" as an attachment. In LaTeX it's easy to attach the LaTeX itself to the final pdf but so far no one seems to be interested to do so as an habit.
[-]
- arminiusreturns 457 days ago
  Came here to say pandoc. OP, I've setup stacks for taking physical books to editable pdf, ebook, and asciidoc. You can do it to markdown too. Add in tesseract for ocr to catch misses and you are good! (be careful on foreign language sections)
longnguyen 457 days ago
My apps are native Mac apps [0] [1] so naturally I use the native SDK for that.
Apple provides PDFKit framework to work with PDFs and it works really well.
For scanned documents, I use the Vision framework to OCR the content.
Some additional content cleaning is still required but overall I don’t need any other third-party libraries.
[0]: https://boltai.com
[1]: https://pdfpals.com
[-]
- Terretta 456 days ago
  Let's apply your app to your home page, easy as 1, 2, 3!
  1. Reader mode, copy all text.
  2. Prompt ChatGPT 4o:
  Which headline or sentence structures in the following web page copy suggest a non-native English speaker, and how should each incorrect phrase be fixed for most native-sounding American English? Omit testimonials written by others, just focus on the marketing copy and faq, include reasons for revision.
  3. Response:
  Here are refined sentences focused on the marketing copy and FAQs:
  Original: "From blog outlines, to highly technical content."
  - Punctuation: "From blog outlines to highly technical content."
  Original: "Natively intergrated with your favorite apps."
  - Spelling: "Natively integrated with your favorite apps."
  Original: "All costs are estimated, please refer to your OpenAI dashboard for the most accurate cost of your API key."
  - Structure: "All costs are estimates. Please refer to your OpenAI dashboard for the most accurate pricing for your API key."
  Original: "How does license work?"
  - Non-native indicator: "How does the license work?"
  Original: "The ChatGPT Plus subscription is separate and managed by OpenAI, it does not provide an API key that you can use with BoltAI."
  - Punctuation: "The ChatGPT Plus subscription is separate and managed by OpenAI; it does not provide an API key for BoltAI."
  Original: "Do you offer team plan license?"
  - Non-native indicator: "Do you offer a team plan license?"
  Original: "Absolutely. If for any reason you're not satisfied with your purchase, you can request a refund within 30 days of purchase."
  - Fluency: "Absolutely. If you're not satisfied with your purchase for any reason, you can request a refund within 30 days."
  These edits aim to make the text sound more natural and fluent in American English, improving clarity and coherence.
  [-]
  - longnguyen 456 days ago
    Thank you. I will improve my landing page following your suggestion. You're right, I'm not a native English speaker
marcoperuano 457 days ago
Haven’t tried converting it to markdown specifically, but if you want to try a different approach, google’s DocAI has been pretty great. It provides you with the general structure of the document as blocks (paragraph and headers) with coordinates. This makes it so you can send that data to an LLM during the RAG process and get citations of where the answers were found, down to the line of text.
martincollignon 457 days ago
Have you tried https://github.com/VikParuchuri/marker ?
[-]
- mkaszkowiak 457 days ago
  For my use case, overall Marker seems to work pretty well - but it has issues with tables. Merged cells, misplaced headers, and so forth. I'm currently extracting Polish PDFs that are //not// scanned
  When compared to Azure Document Intelligence, Marker is really cheap when self-hosted (assuming you fall under the license requirements), but it does not produce high quality data. YMMV.
  [-]
  - vikp 456 days ago
    Working on improving tables soon (I'm the author of marker)
    [-]
    - mkaszkowiak 456 days ago
      Glad to hear that :) Thanks for developing Marker!
      [-]
      - chandrai 442 days ago
        2nd that. Marker work pretty well as async internal service for us! Thanks!
  - cpursley 456 days ago
    Yeah, the header stuff (and empty cells) for tables needs some work.
- cpursley 456 days ago
  Maker worked pretty well for me in my limited testing. They also have a hosted solution:
  https://www.datalab.to
screature2 457 days ago
Maybe Nougat? The examples look pretty impressive: https://facebookresearch.github.io/nougat/ https://github.com/facebookresearch/nougat
Though the model weight licenses are CC by NC
zbyforgotp 457 days ago
Tables are a hard case for RAG, even if you parse them perfectly into Markdown, the LLMs still tend to struggle with interpreting them.
[-]
- constantinum 456 days ago
  Indeed! Accuracy is only a part of the problem. One way to crack this is to maintain the layout in the extraction. Layout preservation means more context and better LLM interpretation. A write-up is here if you are curious https://unstract.com/blog/extract-table-from-pdf/
gvv 457 days ago
I've had most success using PDFMinerLoader
(https://api.python.langchain.com/en/latest/document_loaders/...)
It deals pretty well with PDF containing a lot of images.
yawnxyz 457 days ago
for web PDFs I'm using https://jina.ai/reader/ — completely free. Does most of the job fine.
Code: https://github.com/jina-ai/reader
[-]
- ipsum2 457 days ago
  Not open source, the code just calls their proprietary API.
nicoboo 457 days ago
I've experimented with GCP's Stack using Agent Builder and relying on Gemini Pro 1.5.
I also experimented with pretty large of various files (around 6000 video games full notices) where I used OCR parsing in a similar configuration with mixed results due to the visual complexity of the original content.
gautiert 457 days ago
Hi! Show HN: Zerox – Document OCR with GPT-mini | https://news.ycombinator.com/item?id=41048194
This lib converts pdf page by page to image and feed it to gpt-4o-mini The results are pretty good!
dgelks 457 days ago
Previously have used https://github.com/pdf2htmlEX/pdf2htmlEX to convert PDF to HTML at scale, could potentially try and parse the output html to markdown as second stage.
[-]
- cpursley 456 days ago
  I looked into this but the html this thing outputs is a noisy mess.
siquick 457 days ago
Llamaparse by LlamaIndex is probably SOTA at the moment and seems to have no problems with tables. Pricing is good at the moment too.
https://www.llamaindex.ai/enterprise
shauntrennery 457 days ago
https://github.com/Azure/AI-in-a-Box/tree/main/ai-services/d...
Angostura 456 days ago
Fascinating discussion- but ‘RAG’? Sorry probably obvious but can someone clue me in
[-]
- ikesau 456 days ago
  Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources
  [-]
  - Angostura 456 days ago
    Thank you. Much obliged
pookee 457 days ago
We're currently implementing this with https://mathpix.com/, it is not free but really not that expensive. It looks very promising
cm2187 457 days ago
I had some success using pdfpig, by ugly toad.
https://uglytoad.github.io/PdfPig/
Plus you get to rise the eyebrows of your colleagues.
mschwarz 457 days ago
Did you try llamaparse from Llamaindex? It’s a cloud service with a free tier. Recently switched to it from unstructured.io and it works great with the kinds of images and table graphics I feed it.
bartread 457 days ago
I need to get some data out of a table in a regularly published PDF file.
The thing is the table looks like a table when the PDF is rendered, but there's nothing within the PDF itself to semantically mark it out as a table: it's just a bunch of text and graphical elements placed on the page in an arrangement that makes them look like a table to a human being reading the document.
What I've ended up doing, after much experimentation[0], is use poppler to convert the PDF to HTML, then find the start and end of the table by matching on text that always appears at header and footer. Fortunately the row values appear in order in the markup so I can then look at the x coordinates of the elements to figure out which column they belong to or, rather, when a new row starts.
What I actually do due to #reasons is spit out the rows into a text file and then use Lark to parse each row.
Bottom line: it works well for my use case but I'd obviously recommend you avoid any situation where your API is a PDF document if at all possible.
EDIT: A little bit more detail might be helpful.
You could use poppler to convert to HTML, then from there implement a pipeline to convert the HTML to markdown. Just bear in mind that the HTML you get out of poppler is far removed from anything semantic, or at least it has been with the PDFs I'm working with: e.g., lots of <span> elements with position information and containing text, but not much to indicate the meaning. Still, you may find that if you implement a pipeline, where each stage solves one part of the problem of transforming to markdown, you can get something usable.
Poppler will spit out the images for you but, for reasons I've already outlined, tables are likely to be painful to deal with.
I notice some commenters suggesting LLM based solutions or services. I'd be hesitant about that. You might find an LLM helpful if there is a high degree of variability within the structural elements of the documents you're working with, or for performing specific tasks (like recognising and extracting markup for a table containing particular information of interest), but I've enough practical experience with LLMs not to be a maximalist, so I don't think a solely LLM-based approach or service will provide a total solution.
[0] Python is well served with libraries that either emit or parse PDFs but working with PDF object streams is no joke, and it turned out to be more complex and messier - for my use case - than simply converting the PDF to an easier to work with format and extracting the data that way.
constantinum 460 days ago
I would recommend give LLMWhisperer a try with the documents pertaining to your use case.
https://unstract.com/llmwhisperer/
Try demo in playground: https://pg.llmwhisperer.unstract.com/
Quick tutorial: https://unstract.com/blog/extract-table-from-pdf/
[-]
- ipsum2 457 days ago
  not open source, and OP seems to be the owner.
  [-]
  - lumos_maxima93 456 days ago
    it is open-source, the main platform is - Unstract https://github.com/Zipstack/unstract
    [-]
    - ipsum2 456 days ago
      Nope, LLMWhisperer to parse PDFs is called through an paid API.
      [-]
      - constantinum 456 days ago
        I'm not sure why the comment is downvoted! Let me see; the OP did not specifically try/ask for open-source solutions; at least, that is what I read.
        Let me break it down!
        As one of the commenters mentioned, he/she uses four different tools to parse PDFs to handle common parsing cases — tables, tables with images, OCR, layouts, handwriting, etc.
        With LLMwhisperer, you don't need that.
        Parsing is just a part of the problem. Engineers still need to figure out what LLM models work/are sufficient, reduce costs(tokens) and performance(parsing a million pages), and make the AI stack production-ready.
        LLMWhisperer at least handles most use cases and moves out of your way fast.
        Also, LLMwhisperer is not open-source; it's API is charged based on pages parsed.
simianparrot 457 days ago
I convert the PDF to images and then parse the images with tesseract OCR. That’s been the most consistent approach to run locally.
[-]
- dotancohen 456 days ago
  What resolution do you use? Any other tips?
  Thanks.
zoeyzhang 457 days ago
Check out this - https://hellorag.ai/
jwilk 457 days ago
What's RAG?
[-]
- wkat4242 457 days ago
  Retrieval Augmented Generation. It's a way to automatically provide an LLM model with the correct context based on the user's question, providing information directly into the model context, used for information that's not part of the model training.
- gforce_de 457 days ago
  from https://en.wikipedia.org/wiki/Retrieval-augmented_generation
  "Retrieval augmented generation (RAG) is a type of information retrieval process. It modifies interactions with a large language model (LLM) so that it responds to queries with reference to a specified set of documents, using it in preference to information drawn from its own vast, static training data."
- noufalibrahim 457 days ago
  It's a way of giving LLMs extra information (usually fast changing) that they were no trained with so that you can have them return relevant information. Think of asking an LLM "Who is Paul Graham" (assuming PG is relatively unknown) and it would say it doesn't know. But if you search your own knowledge base and then augment the prompt to something like "Paul Graham is a well known Venture Capitalist. Who is Paul Graham?", it can give you that information back. The idea of adding the extra information is the augmenting and you do that by retrieving relevant information from a knowledge base before you involve the LLM.
BerislavLopac 457 days ago
Have you tried Pandoc [0]?
[0] https://pandoc.org/
gsemyong 457 days ago
Checkout this https://parsedog.io
teapowered 457 days ago
Apache Tika Server is very easy to set up - it can be configured to use tesseract for OCR.
[-]
- mgkimsal 457 days ago
  Came here to mention Tika. I just set up a small POC with the 'full' tika docker container - default OCR bundled (with... 5 languages? English, Spanish, etc).
  I parsed a PDF and when looking at the output, I noticed 'united stotes of america' was in the text. Didn't make any sense... Digging further, I saw that it had also parsed the images in the PDF, and one of them was some govt logo with bad artifacting. It did indeed read more like 'stotes' than 'states'.
  Edit: That said, the OP asked about tables. I haven't tested any table stuff with tika (not something I need right now). Is the tika table support any good? Does it even exist? Seems like it might not really matter for many tika use cases (but I might be missing something obvious!)
paulluuk 457 days ago
As others have mentioned, if you have text-only PDFs then pypdf is free, fast and simple.
chewz 457 days ago
pyMuPdf and pyMuPdf4llm
https://github.com/pymupdf/PyMuPDF
https://github.com/pymupdf/RAG
postepowanieadm 457 days ago
mupdf's mutool gives access to most data of all solutions I have checked.
wcallahan 456 days ago
Jina.ai’s API is one of the best parsers I’ve seen. And better priced.
Ey7NFZ3P0nzAe 456 days ago
For my RAG projet [WDoc](https://github.com/thiswillbeyourgithub/WDoc/tree/dev) I use multiple pdf parser then use heuristics the keep the best one. The code is at https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...
And the heurstics are partly based on using fasttext to detecr languages : https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...
It's probably crap for tables but I don't want to rely on external parsers.
exe34 457 days ago
nowadays you might have some luck feeding PNG images to multimodal LLMs.
rayxi271828 457 days ago
Have you tried unstructured.io? So far seems promising.
brudgers 460 days ago
have to pay $25K upfront
That's a lot of your money.
It's not a big dose of OPM (Other People's Money).
When building a business, adequate capitalization solve a lot technical problems and it is no different when a business is built. If you aren't building a business, money is different and there's nothing wrong with not building a business. Good luck.
[-]
- muzani 458 days ago
  Idk man, it's a massive chunk of Other People's Money too. Prices like this is why Microsoft Teams has dominance over Slack.
  There's really two use cases:
  1. If you don't use the budget, you have less budget. I see this happen a lot in construction too, where each project has a set budget and it goes down or gets stolen in time. They'd rather pay $25k for the lifetime of a 2 year project than pay $500/month. (the other fear is that these startups shut down in a year)
  2. Tax exemptions, or some sort of money laundering where there's more value to pay a big name lots of money.
- manquer 457 days ago
  For me it wouldn't be about the 25k upfront,it is just working with Adobe is incredibly painful, same for Oracle. I really really don't want to work with them in any capacity if i can avoid.
  If say MS charged the same or even double that I would still work with them, they at-least they don't treat their customers as criminals.
- sshine 457 days ago
  There's a lot of valid business cases that rely on parsing PDFs but does not warrant spending $25k on it.
  I've helped prototype something using PyMuPDF. It worked as well as it had to, and it didn't cost $25k.
- Kiro 457 days ago
  Strange comment and bad advice in general. This is a problem that doesn't need $25k to be solved. You're basically telling them to give up or raise money for no reason at all.
yawnxyz 457 days ago
[dead]
nelson234 457 days ago
[dead]
oneunion12 457 days ago
[dead]
oneunion12 457 days ago
[flagged]
alexliu518 460 days ago
I understand your idea, but if you are sure that a software is excellent, then paying for it is a good habit.
[-]
- jmnicolas 457 days ago
  25k isn't exactly pocket change!
arthurcolle 457 days ago
We have an excellent solution for this at Brainchain AI that we call Carnivore.
We're only a few days away from deploying an SDK for this exactly use case among some others.
If you'd like to speak with our team, please contact us! We would love to help you get through your PDF and other file type parsing issues with our solution. Feel free to ping us at data [at] brainchain.ai