I recently bought a tablet for sheet music, mostly to replace a stack of jazz "Real Books" at jam sessions. And the phone camera scans I made are okay, but fixed in size and have a lot of artifacts. And it would be great to transpose on the fly for e.g. Bb or Eb instruments, but being a scan this is obviously not possible.
I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (actually looking at music that is; LLMs do okay at text descriptions of theory concepts where you can imagine some online texts making it in).
I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. Midi doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest for a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. Lilypond format is less verbose that MusicXML and contains a bit more information that is useful to the score creators, but most people don't create sheet music in lilypond. (As an aside, Lilypond bums me out with the state of jazz fonts. I hate looking at "legit" scores in jazz context)
I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is.
I observe that space regularly and the only really good solution is soundslice. You scan and review some edge cases and get really good results. Paid service by a small company, very worthy to be supported!
The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.
Normally, when an AI transcribes a 100 page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly O(N) until the model runs out of VRAM and crashes (or caps it) To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:
Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.
Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!
It's the opposite of shade, unless GP is being sarcastic. "Class act" is normally a compliment, and in the context here it sounds to me like they're congratulating Baidu/the researchers in being transparent about where their ideas came from.
Pretty decent might be quiet the stretch. I'd term it almost acceptable, but only if you're using commercial solutions like amazon's textract, doing it with open source tools is at best, extremely painful and vaguely accurate.
Exactly my experience. If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. Vision-llms can improve a bit on character recognition, but at the cost of harder to detect failure modes.
But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward. But you need the context of the previous page to make good decisions about the current page, which is where things quickly get janky (or slow, if you choose the naive approach)
Vision-llms also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago, all "just works" without you even having to remember that this can happen
I haven't done much long-run OCR, so unsure of the current state, but it would seem they overcome this (from their paper):
"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
> I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?
Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated as the OCR engine is now context-aware.
Say, for example, some random long ass name chemical. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.
Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...
I don't think that's a universal statement that aplies to every kind of documents and languages. Mistral OCR is able to do things no "traditional" OCR was ever able to.
I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (actually looking at music that is; LLMs do okay at text descriptions of theory concepts where you can imagine some online texts making it in).
I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. Midi doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest for a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. Lilypond format is less verbose that MusicXML and contains a bit more information that is useful to the score creators, but most people don't create sheet music in lilypond. (As an aside, Lilypond bums me out with the state of jazz fonts. I hate looking at "legit" scores in jazz context)
I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is.
It may not be necessary…a lot of the training pairs/data for this could probably be procedurally created via code.
Would be pretty fun to work on and see it come to life.
The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.
Normally, when an AI transcribes a 100 page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly O(N) until the model runs out of VRAM and crashes (or caps it) To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:
Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.
Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!
Class Act.
A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect
I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?
OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.
But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward. But you need the context of the previous page to make good decisions about the current page, which is where things quickly get janky (or slow, if you choose the naive approach)
Vision-llms also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago, all "just works" without you even having to remember that this can happen
- marker (with --force-ocr) gives me the best results
- Mistral OCR (seems really great, but I never managed to get it work)
- Mathpix (tried a long time ago)
- docling (gives me garbage, I must use it wrong)
- Unlimited OCR (will try it)
- ???
- AWS Textract
"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated as the OCR engine is now context-aware.
Say, for example, some random long ass name chemical. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.
Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...
[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
CJK have lots of character and high confusion rate.
Arabic scripts are complex and have lots of morphs.
Vietnamese have easily confused diacritics.
Thai have lots of non-standard fonts.
In your opinion, what is SOTA here?