Using compressors for text statistics is fun, but it's the software equivalent of discovering that speakers and microphones are, in principle, the same device.
(The KL divergence of letter frequencies is the same thing as the ratio of the lengths of their Huffman-compressed bitstreams, but you don't need to do all this bit-twiddling for real just to count the letters.)
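To be concrete, "just count the letters" really is a few lines of Python (a sketch; it assumes the second distribution covers every letter of the first, otherwise you'd smooth the counts first):

    import math
    from collections import Counter

    def letter_dist(text):
        # Empirical letter-frequency distribution.
        counts = Counter(text)
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.items()}

    def kl_divergence(p, q):
        # D(P || Q) in bits; assumes q is nonzero on p's support.
        return sum(p[ch] * math.log2(p[ch] / q[ch]) for ch in p)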
The article views compression entirely through Python's limitations.
> gzip and LZW don’t support incremental compression
This may be true of Python's APIs, but it is not true of these algorithms in general.
They absolutely support incremental compression even in APIs of popular lower-level libraries.
Snapshotting/rewinding of the state usually isn't exposed (a custom gzip dictionary is close enough in practice, though a dedicated API would reuse its internal caches). Algorithmically it is possible, and quite frequently used by the compressors themselves: Zopfli tries lots of what-if scenarios in a loop, and good LZW compression requires rewinding to a smaller symbol size and restarting compression from there once you notice the dictionary has stopped being helpful. The bitstream even has a dedicated code (the clear code) for this, so it isn't just possible, it's baked into the design.
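For instance, even Python's own zlib module exposes the incremental interface; a minimal sketch:

    import zlib

    # Feed data in chunks; the compressor carries its state across calls.
    comp = zlib.compressobj(9)
    out = []
    for chunk in (b"first chunk, ", b"second chunk, ", b"third chunk"):
        out.append(comp.compress(chunk))
    out.append(comp.flush())
    compressed = b"".join(out)

    assert zlib.decompress(compressed) == b"first chunk, second chunk, third chunk"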
This looks like a nice rundown of how to do this with Python's zstd module.
But I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and operate under different constraints).
Back in 2023, I wrote two blog posts [0,1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data).

[0] https://kenschutte.com/gzip-knn-paper/

[1] https://kenschutte.com/gzip-knn-paper2/
Concur. Zstandard is a good compressor, but it's not magical; comparing the compressed size of Zstd(A+B) to the combined size of Zstd(A) + Zstd(B) is effectively just a complicated way of measuring how many words and phrases the two documents have in common. That isn't entirely ineffective at judging whether they're about the same topic, but it's an unnecessarily complex and easily confused way of doing so.
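Concretely, the whole measure is a few lines; a sketch of the usual normalized compression distance, with zlib standing in for Zstd:

    import zlib

    def csize(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    def ncd(a: bytes, b: bytes) -> float:
        # Normalized compression distance: near 0 for near-identical texts,
        # near 1 for texts sharing almost no words or phrases.
        ca, cb, cab = csize(a), csize(b), csize(a + b)
        return (cab - min(ca, cb)) / max(ca, cb)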
Good on you for attempting to reproduce the results & writing it up, and reporting the issue to the authors.
> It turns out that the classification method used in their code looked at the test label as part of the decision method and thus led to an unfair comparison to the baseline results
Python's zlib does support incremental compression, with preset dictionaries via the zdict parameter. gzip has something similar, but you have to do some hacky thing to get at it, since the regular Python API doesn't expose the entry point. I did manage to use it from Python a while back, though my memory is hazy about how; the entry point may have been exposed in the module's code but undocumented in the Python manual.
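For reference, a minimal sketch of the zdict path (the identical dictionary bytes must be passed on both sides):

    import zlib

    # A preset dictionary primes the compressor with likely substrings.
    zdict = b"parking ticket issued on street"

    comp = zlib.compressobj(zdict=zdict)
    data = comp.compress(b"parking ticket issued on Main Street") + comp.flush()

    decomp = zlib.decompressobj(zdict=zdict)
    assert decomp.decompress(data) == b"parking ticket issued on Main Street"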
Ooh, totally. Many years ago I was doing some analysis of parking-ticket data with gnuplot and had it output a chart PNG per street. Not great, but it worked well enough for the next step of that project: sorting the directory by file size. The most dynamic streets were by far the largest files.
Another way I've used it: image compression to identify cops who cover their body cameras while recording -- the filesize-to-length ratio reflects that not much activity is going on.
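The check itself is tiny; e.g., a sketch using ffprobe for the duration (assumes ffmpeg is installed and the paths are video files):

    import os
    import subprocess

    def bytes_per_second(path: str) -> float:
        # Low values suggest a static scene, e.g. a covered lens.
        duration = float(subprocess.check_output([
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ]).decode())
        return os.path.getsize(path) / duration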
Sweet! I love clever information theory things like that.
It goes the other way too. Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...
EDIT: lossless when they're used as the probability estimator and paired with something like an arithmetic coder.
I do not agree with the "lossless" adjective. And even if it is lossless, it is certainly not deterministic.
For example, I would not want a zip of an encyclopedia that decompresses to unverified, approximate, and sometimes even wrong text. According to https://www.wikiwand.com/en/articles/Size%20of%20Wikipedia, a compressed Wikipedia without media (just text) is ~24 GB. What's the median size of an LLM, 10 GB? 50 GB? 100 GB? Even if it's less, it's not an accurate and deterministic way to compress text.
With a temperature of zero, LLM output will always be the same. Then it becomes a matter of getting it to output an exact replica of the input: if we can do that, it will always produce it, and the fact that it can also be used as a bullshit machine becomes irrelevant.
With the usual interface it’s probably inefficient: giving just a prompt alone might not produce the output we need, or it might be larger than the thing we’re trying to compress. However, if we also steer the decisions along the way, we can probably give a small prompt that gets the LLM going, and tweak its decision process to get the tokens we want. We can then store those changes alongside the prompt. (This is a very hand-wavy concept, I know.)
There's an easier and more effective way of doing that - instead of trying to give the model an extrinsic prompt which makes it respond with your text, you use the text as input and, for each token, encode the rank of the actual token within the set of tokens that the model could have produced at that point. (Or an escape code for tokens which were completely unexpected.) If you're feeling really crafty, you can even use arithmetic coding based on the probabilities of each token, so that encoding high-probability tokens uses fewer bits.
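A sketch of the encoder side of that with a Hugging Face causal LM (gpt2 purely as a stand-in; ties in the logits would need deterministic tie-breaking for an exact round trip):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def token_ranks(text: str) -> list[int]:
        # Rank of each actual token among the model's predictions at that
        # position. Mostly small numbers, which an entropy coder packs
        # tightly. The first token has no prediction and is stored as-is.
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        ranks = []
        for pos in range(ids.shape[1] - 1):
            order = torch.argsort(logits[0, pos], descending=True)
            ranks.append((order == ids[0, pos + 1]).nonzero().item())
        return ranks

Decompression runs the same model: at each step, sort the logits the same way, take the token at the stored rank, append it to the context, and continue.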
From what I understand, this is essentially how ts_zip (linked elsewhere) works.
(To be clear, this is not me arguing for any particular merits of LLM-based compression, but) you appear to have conflated one particular nondeterministic LLM-based compression scheme that you imagined with all possible such schemes. Many of them would easily fit any reasonable definition of lossless and deterministic, by losslessly doing deterministic things with the probability distributions the LLM outputs at each step along the input sequence being compressed.
LLMs are good at predicting the next token. Basically, you use them to predict the probabilities of the next token being a, b, or c, and then use arithmetic coding to store which one matched. So the LLM is used during both compression and decompression.
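And the "easy to calculate" part upthread really is easy, at least as an ideal bound; a sketch, given the model's probability for each actual next token:

    import math

    def ideal_bits(token_probs: list[float]) -> float:
        # An ideal arithmetic coder spends -log2(p) bits to encode a token
        # the model assigned probability p, so the total is just a sum.
        return sum(-math.log2(p) for p in token_probs)

Divide by 8 and compare against the length of the zstd output for the same text.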