Compression Dictionary Transport

(datatracker.ietf.org)

70 points | by tosh 121 days ago

8 comments

  • freeqaz 121 days ago
    This is the link you want if you're after the actual meaty contents of this page. Took me a sec to find. https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-com...
  • bcoates 121 days ago
    Any evidence this actually causes material performance improvement?

    Pre-shared compression dictionaries are rarely seen in the wild because they rarely provide meaningful benefit, particularly in the cases you care about most.

    • vitus 121 days ago
      The one example I can think of with a pre-seeded dictionary (for web, no less) is Brotli.

      https://datatracker.ietf.org/doc/html/rfc7932#appendix-A

      You can more or less see what it looks like (per an older commit): https://github.com/google/brotli/blob/5692e422da6af1e991f918...

      Certainly it performs better than gzip by itself.

      Some historical discussion: https://news.ycombinator.com/item?id=19678985

    • patrickmeenan 121 days ago
      Absolutely, for the use cases where it makes sense. There are some examples here: https://github.com/WICG/compression-dictionary-transport/blo...

      In the web case, it mostly only makes sense if users visit more than one page on a site (or return over a long period).

      Some of the common places where it can have a huge impact:

      - Delta-updating wasm or JS/CSS code between releases. Like the youtube player JavaScript or Adobe Web's WASM code. Instead of downloading the whole thing again, the version in the user's cache can be used as a "dictionary" and just the deltas for the update can be delivered (see the sketch at the end of this comment). Typically this is 90-99% smaller than using Brotli with no dictionary.

      - Lazy-loading a site-specific dictionary for the HTML content. Pages after the first one can use the dictionary and just load the page-specific content (compresses away the headers, template, common phrases, logos, inline SVGs or data URIs, etc). This usually makes the HTML 60-90% smaller depending on how much unique content is in the HTML (there is a LOT of site boilerplate).

      - JSON APIs can load a dictionary that has the keys and common values and basically yield a binary format on the wire for JSON data, compressing out all of the verbosity.

      I expect we're still just scratching the surface of how they will be used but the results are pretty stunning if you have a site with regular user engagement.

      FWIW, they are not "pre-shared" so it doesn't help the first visit to a site. They can use existing requests for delta updates or the dictionaries can be loaded on demand, but it is up to the site to load them (and create them).

      It will probably fall over if it gets hit too hard, but there is some tooling here that can generate dictionaries for you (using the brotli dictionary generator) and let you test the effectiveness: https://use-as-dictionary.com/
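
      To make the delta-update case above concrete, here is a rough sketch of the underlying idea using the python-zstandard bindings, with the previous release treated as a raw content dictionary. The "bundles" below are synthetic stand-ins, and this shows only the compression step, not the actual dcb/dcz wire format (which also carries a hash identifying the dictionary):

          import zstandard

          # Stand-ins for two releases of a JS bundle that differ only slightly.
          old_bundle = b"function render(state) { return template(state); }\n" * 2000
          new_bundle = old_bundle + b"function renderV2(state) { return template2(state); }\n"

          # The copy already in the user's cache acts as a raw content dictionary.
          zdict = zstandard.ZstdCompressionDict(
              old_bundle, dict_type=zstandard.DICT_TYPE_RAWCONTENT)

          # Server side: compress the new release against the old one.
          delta = zstandard.ZstdCompressor(dict_data=zdict).compress(new_bundle)

          # Client side: the same dictionary reconstructs the full new bundle.
          assert zstandard.ZstdDecompressor(dict_data=zdict).decompress(delta) == new_bundle
          print(f"delta is {len(delta) / len(new_bundle):.2%} of the full download")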

      • jiggawatts 121 days ago
        > if you have a site with regular user engagement.

        Ah, gotcha: this is a new Google standard that helps Google sites when browsed using Google Chrome.

        Everyone else will discover that keeping the previous version of a library around at build time doesn’t fit into the typical build process and won’t bother with this feature.

        Only Facebook and maybe half a dozen similar other orgs will enable this and benefit.

        The Internet top-to-bottom is owned by the G in FAANG with FAAN just along for the ride.

        • patrickmeenan 120 days ago
          Even a simple case of loading more than 2-3 pages from a given site over a few weeks could benefit from a dictionary for the HTML content (compressing out all of the common template code).

          Just about any e-commerce site will involve a few pages to go through search, product pages and checkout.

          Most news sites will likely see at least 2-3 pages loaded by a given user over the span of a few months. Heck, even this site could shave 20-50% for each of the pages a user visits by compressing out the common HTML: https://use-as-dictionary.com/generate/result.php?id=d639194...

          Any place you'd justify building a SPA, by definition, could also be a multi-page app with dictionaries.

          If you have a site where visitors always bounce immediately and only ever visit one page, it can't help, but those tend to be a lot less common.

          It doesn't have to be a huge site to benefit; it just needs users who visit regularly to show large gains.

          As CDNs implement support, the effort involved in supporting it will also drop, likely to the point where it can just be a checkbox to generate and use dictionary-based compression.

        • uf00lme 120 days ago
          Cloudflare and similarly placed CDNs will likely make enabling it for sites a checkbox-like option, e.g., https://blog.cloudflare.com/this-is-brotli-from-origin/ That's where it will have the most savings globally.
    • fnordpiglet 121 days ago
      If you care about latencies or are on very low bandwidth or noisy connections, preshared dictionaries matter a lot. By and large they help, but they come with complexity, so they are often avoided in favor of a simpler approach whose compression is acceptable rather than optimal. But if there’s a clear, well-implemented standard that’s widely adopted, I would always choose preshared dictionaries. Likewise, secure connections are hard, and without TLS and other standards most people wouldn’t try unless they really needed them. But with a good standard and broad implementation it’s basically ubiquitous.
    • therein 121 days ago
      Yup, we did experiment with SDCH at a large social network around 2015 and it didn't massively outperform gzip. Outperforming it required creating a pipeline for dynamic dictionary generation and distribution.
    • CaptainOfCoit 121 days ago
      I guess you should outline the cases you care most about, so anyone can answer whether there are any material performance improvements.
      • hansvm 121 days ago
        The only requirements are serving a lot of the same "kind" of content, enough of it (or some other business reason) that it doesn't make sense to send it all at once, and somebody willing to spend a bit of time to implement it. Map and local geocoding information comes to mind as a decent application, and if the proposed implementation weren't so hyper-focused on common compression standards you could probably do something fancy for, e.g., meme sites (with a finite number of common or similar background images).

        Assuming (perhaps pessimistically) it won't work well for most sites serving images, other possibilities include:

        - Real-estate searches (where people commonly tweak tons of parameters and geographies, returning frequently duplicated information like "This historical unit has an amazing view in each of its 2 bedrooms").

        - Expanding that a bit, any product search where you expect a person to, over time, frequently query for the same sorts of stuff.

        - Local business information (how common is it to open at 10am and close at 6pm every weekday in some specific locale for example).

        ...

        If the JS, image, and structural HTML payloads dwarf everything else then maybe it won't matter in practice. I bet somebody could make good use of it though.
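
        A rough sketch of what that looks like for the JSON-ish cases above, using the python-zstandard bindings with a hand-assembled raw dictionary of recurring keys and phrases (all payloads and the dictionary contents here are invented; a real dictionary would be generated from representative traffic):

            import json
            import zstandard

            # Boilerplate that recurs across responses: keys, stock phrases, common values.
            boilerplate = (b'{"type": "listing", "bedrooms": 2, "hours": "10:00-18:00", '
                           b'"blurb": "This historical unit has an amazing view"}')
            zdict = zstandard.ZstdCompressionDict(
                boilerplate, dict_type=zstandard.DICT_TYPE_RAWCONTENT)

            response = json.dumps({"type": "listing", "bedrooms": 3,
                                   "hours": "10:00-18:00",
                                   "blurb": "This historical unit has an amazing view"}).encode()

            with_dict = zstandard.ZstdCompressor(dict_data=zdict).compress(response)
            without = zstandard.ZstdCompressor().compress(response)
            print(len(with_dict), "bytes with the dictionary vs", len(without), "without")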

    • itsthecourier 121 days ago
      Bro, this will be wild for our IoT stuff
  • dexterdog 121 days ago
    Why can't we implement caching keyed on the integrity sha of a library so it can be shared across sites? Sure, there is technically an attack vector there, but that is pretty easily scannable for tampering by verifying against trusted CDNs.

    With something like that you could preload all of the most common libraries in the background.
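
    For reference, the "integrity sha" in question is the Subresource Integrity value, which is just a base64-encoded digest of the file bytes; a minimal sketch of computing one (the script contents here are a stand-in):

        import base64
        import hashlib

        data = b"console.log('hello');\n"  # would be the library file's bytes
        digest = base64.b64encode(hashlib.sha384(data).digest()).decode()
        print(f"sha384-{digest}")  # the value that goes in <script integrity="...">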

    • demurgos 121 days ago
      The problem is timing attacks leaking navigation history. An attacker can load a lib shared with some other site; based on the load time, they may learn whether the user visited the target site recently.
      • drdaeman 121 days ago
        Can't we record the original download timings and replicate them artificially, with some noise added for privacy? Of course, with a reasonable upper latency threshold (derived from overall network performance metrics) to prevent DoS on clients.

        While this won't improve load times, it can save bandwidth, improving user experience on slow connections. And most popular hand-selected library versions can get preloaded by the browser itself and listed as exceptions with zero added latency.

        Also, this way the networking telemetry would gain a meaningful purpose for end users, not just developers.

        • manwe150 121 days ago
          My understanding had been that you get a double whammy: there are too many versions of the “common” dependencies, so it is already in effect a per-site cache (or nearly so), but that uniqueness also means a fingerprint can be established with fairly few latency tests.
      • dexterdog 121 days ago
        How would you know the original site if it's a common asset?
        • patrickmeenan 120 days ago
          There is some level of data exposure even if you can't tell which site, specifically, a user has visited. You can tell that "this user has been to at least one page that used X".

          Depending on what the common resource is, most people wouldn't necessarily care but there may be some cases where they would.

          If I can tell that you used facebook login at one point, even though it is used by a good chunk of the web, I could tailor a phish targeted at FB credentials for example.

        • josephg 121 days ago
          No need. Just make a custom asset referenced only by those sites, and get the browser to cache it.
    • pornel 120 days ago
      Apart from tracking, the other concern is reliability — the server may lose or change the file, and this may not be noticed when clients use the hash, until the hash stops being popular, and you're left with a broken site.
      • dexterdog 120 days ago
        If you change the file the hash changes
        • pornel 119 days ago
          The hash is sent separately from the file. For it to change in the HTML and other places linking to it, there has to be some mechanism to update the links, and that is the part that can get out of sync or bitrot.
          • dexterdog 119 days ago
            Which is part of the security of the hash. When I embed a script on my site, I get the hash at the time I test it. If the content changes without my knowledge, the browser will not accept the replacement from the source. Assets are not very cacheable anyway if their location doesn't change when the file changes.
  • kevincox 121 days ago
    It is interesting that a proxy won't be able to see the complete response anymore. It will see the dictionary ID and hash, but without a copy of the dictionary the server's response won't be fully intelligible to it.
    • patrickmeenan 121 days ago
      If the proxy is correctly handling the Accept-Encoding (rewriting it with only encodings that it understands), it can either remove the `dcb` and `dcz` encodings or it can check if it knows the announced dictionary and only allow them through if it has the dictionary.

      MITM devices that just inspect and fail on unknown content-encoding values will have a problem and will need to be updated (there is an enterprise policy to disable the feature in Chrome for that situation until the proxies can be updated).
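
      A tiny sketch of the Accept-Encoding rewrite such a proxy might do when it doesn't have the announced dictionary (illustrative only, not taken from any real proxy's codebase):

          def strip_dictionary_encodings(accept_encoding: str) -> str:
              """Drop dcb/dcz so the origin falls back to encodings the proxy can read."""
              kept = [token.strip() for token in accept_encoding.split(",")
                      if token.strip().split(";")[0].strip() not in ("dcb", "dcz")]
              return ", ".join(kept)

          print(strip_dictionary_encodings("gzip, br, zstd, dcb, dcz"))  # -> gzip, br, zstd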

      • kevincox 121 days ago
        That's a good point. I meant to say proxies that don't alter the headers. These methods could easily be stripped if the proxy wants to ensure that it understands the response.
  • hedora 121 days ago
    I’m guessing the primary use case for this will be setting cookies. (Google fonts or whatever sends different dictionaries to different clients; the decompression result is unique per client).

    I wonder what percentage of http traffic is redundant compression dictionaries. How much could this actually help in theory?

    • derf_ 121 days ago
      > I wonder what percentage of http traffic is redundant compression dictionaries. How much could this actually help in theory?

      A lot of times you will send a (possibly large) blob of text repeatedly with a few minor changes.

      One example I've used in practice is session descriptions (SDP) for WebRTC. Any time you add/remove/change a stream, you have to renegotiate the session, and this is done by passing an SDP blob that describes all of the streams in a session in and out of the Javascript Session Establishment Protocol (JSEP) API on both sides of the connection. A video conference with dozens of participants each with separate audio and video streams joining one at a time might require exchanging hundreds or even thousands of SDP messages, and in large sessions these can grow to be hundreds of kB each, even though only a tiny portion of the SDP changes each time.

      Now, you could do a lot of work to parse the SDP locally, figure out exactly what changed, send just that difference to the other side, and have it be smart enough to patch its local idea of the current SDP with that difference to feed into JSEP, test it on every possible browser, make it robust to future changes in the SDP the browser will generate, etc.

      OR

      You could just send each SDP message compressed using the last SDP you sent as the initial dictionary. It will compress really well. Even using gzip with the first ~31 kB of the previous SDP will get you in the neighborhood of 200:1 compression[0]. Now your several-hundred kB SDP fits in a single MTU.

      I'm sure WebRTC is not the only place you will encounter this kind of pattern.

      [0] Not that either gzip or using part of a document as a dictionary is supported by this draft.
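
      For anyone curious what the "previous SDP as dictionary" trick looks like, here is a minimal sketch using Python's zlib and deflate's preset-dictionary feature (per the footnote, not what the draft itself uses); the SDP generator is a toy stand-in and the exact ratio will vary:

          import zlib

          def make_sdp(n_streams):
              # Toy SDP-ish blob: lots of boilerplate repeated per stream.
              lines = ["v=0", "o=- 4611731 2 IN IP4 127.0.0.1", "s=-", "t=0 0"]
              for i in range(n_streams):
                  lines += [f"m=audio {10000 + i} UDP/TLS/RTP/SAVPF 111",
                            "a=rtpmap:111 opus/48000/2", f"a=mid:{i}"]
              return "\r\n".join(lines).encode()

          prev_sdp, new_sdp = make_sdp(40), make_sdp(41)  # one stream added

          # Use (up to 32 kB of) the previous SDP as the deflate dictionary.
          comp = zlib.compressobj(zdict=prev_sdp[-32768:])
          wire = comp.compress(new_sdp) + comp.flush()

          decomp = zlib.decompressobj(zdict=prev_sdp[-32768:])
          assert decomp.decompress(wire) + decomp.flush() == new_sdp
          print(len(wire), "bytes vs", len(zlib.compress(new_sdp)), "without a dictionary")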

      • toast0 121 days ago
        > Now, you could do a lot of work to parse the SDP locally, figure out exactly what changed, send just that difference to the other side, and have it be smart enough to patch its local idea of the current SDP with that difference to feed into JSEP, test it on every possible browser, make it robust to future changes in the SDP the browser will generate, etc.

        I work with WebRTC outside a browser context, and we signal calls with simpler datastructures and generate the SDP right near the WebRTC Api. Our SFU doesn't ever see an SDP, because SDP is a big messy format around simpler parts --- it's easier to just communicate the simpler parts, and marshal into SDP at the API border. Even for 1:1 calls, we signal the parts, and then both ends generate the SDP to feed to WebRTC.

        IMHO, you're going to have to test your generated SDPs everywhere anyway, regardless of whether clients or servers generate them.

        Well managed compression would certainly help reduce SDP size in transit, but reduce, reuse, recompress in that order.

    • JoshTriplett 121 days ago
      > I’m guessing the primary use case for this will be setting cookies.

      There's a huge difference between "this could potentially be used to fingerprint a client" and "this will primarily be used to fingerprint a client".

      There are many potential fingerprinting mechanisms that rely on tracking what the client already has in its cache. Clients could easily partition their cached dictionaries by origin, just as they could partition other cached information by origin, and that would prevent using this for cross-origin tracking.

      This proposal would help substantially with smaller requests, where a custom dictionary would provide better compression, but not enough better to offset the full size of the dictionary.

    • SahAssar 121 days ago
      This follows the same origin rules as cookies and caching, so it is not useful for tracking any more than those already are.

      As for when this is useful, there are many examples:

      * Map tiles (like OSM, Mapbox, Google Maps, etc.) often have a lot of repeated tokens/data, but are served as individual requests

      * Code splitting of CSS/JS, where the chunks share a lot of the same tokens but you don't want to have to load the full bundle on every load since much of it won't be used

      * Any site where the same terms and/or parts are recurring across multiple pages (for many sites like IMDB or similar I'd guess 80% of the HTML is non-unique per request)

      This has been tried before with SDCH, which unfortunately died; hopefully this goes better.

      • patrickmeenan 121 days ago
        FWIW, this addresses the BREACH/CRIME issues that killed SDCH by only operating on CORS-readable content.

        It also solves the problem that SDCH had where the dictionary would be forced on the response and the client would have to go fetch it if it didn't have it before it could process the response.

        The tooling for generating dictionaries is also WAY better than it was back in the SDCH days (and they are a much cleaner format, being arbitrary byte strings that Brotli and ZStandard can back-reference).

        Lots of people involved in working on it were also involved with SDCH and have been trying to find a workable solution since it had to be turned down.

    • magicalist 121 days ago
      > I’m guessing the primary use case for this will be setting cookies.

      The cache is partitioned by document and resource origins, so you might as well just use first-party cookies at that point (or ETags if you insist on being sneaky).

    • bawolff 121 days ago
      Sheesh, the Google conspiracy theories are getting out of hand. Why would you use this for setting cookies when you could just use cookies? If for some reason you don't want to use real cookies, why wouldn't you just use the cache side channel directly?

      This doesn't really add any fingerprinting that doesn't already exist.

      • giantrobot 121 days ago
        Because groups want to track people without cookies. This re-introduces the problem of shared caches. Site A can know if someone visited Site B by whether or not they have the dictionary for some shared resource. Site A never has to show a cookie banner to do this tracking because there's no cookie set. This is just reintroducing the "feature" of cross-site tracking with shared caches in the browser.
        • bawolff 121 days ago
          Are you sure? I would assume that available dictionaries would be partitioned by site (eTLD+1).
        • patrickmeenan 121 days ago
          Nope. The dictionaries are partitioned the same way as the caches and cookies (whichever is partitioned more aggressively for a given browser), usually by site and frame, so it opens no cross-site vectors.
    • jepler 121 days ago
      This was my first thought as well. The authors just acknowledge it and move on; it's not like Shopify and Google care whether there's another way to successfully track users online.

          10.  Privacy Considerations
      
             Since dictionaries are advertised in future requests using the hash
             of the content of the dictionary, it is possible to abuse the
             dictionary to turn it into a tracking cookie.
      • patrickmeenan 121 days ago
        Which is why they are treated as if they are cookies and are cleared any time the cache or cookies are cleared so that they can not provide an additional tracking vector beyond what cookies can do (and when 3rd party cookies are partitioned by site/frame, they are also partitioned the same).

        There are LOTS of privacy teams within the respective companies, W3C and IETF that have looked it over to make sure that it does not open any new abuse vectors. It's worth noting that Google, Mozilla and Apple are all supportive of the spec and have all been involved over the last year.

        • patrickmeenan 121 days ago
          Sorry, I should provide more context. The language in the IETF draft is a bit generic because it is an HTTP spec intended to be used more broadly than just web content in browsers, and each consumer should evaluate the risks for their use case.

          For browsers specifically, the fetch spec changes will be explicit about the cache clearing and partitioning (partitioned by both top-level document site and frame origin). You can see Chrome's implementation here: https://source.chromium.org/chromium/chromium/src/+/main:net...

          The fetch spec changes are in progress (just documenting, the discussions have already happened). You can follow along here if you'd like: https://github.com/whatwg/fetch/issues/1739

  • wmf 121 days ago
    Interesting that they removed SDCH in 2017 and now they're adding it back. Let's hope this version sticks.
    • devinplatt 121 days ago
      I was curious about this given what happened with SDCH.

      Here is what Wikipedia[0] says

      > Due to the diffing results and the data being compressed with the same coding, SDCH dictionaries aged relatively quickly and compression density became quickly worse than with the usual non-dictionary compression such as GZip. This created extra effort in production to keep the dictionaries fresh and reduced its applicability. Modern dictionary coding such as Shared Brotli has a more effective solution for this that fixes the dictionary aging problem.

      This new proposal uses Brotli.

      [0]: https://en.m.wikipedia.org/wiki/SDCH

      • patrickmeenan 121 days ago
        SDCH was removed when SPECTRE became a thing (CRIME/BREACH) because it was open to side-channel attacks.

        Yes, it had other problems, not the least of which was that it would block the processing of a response while a client fetched the dictionary, but the side-channel attacks were what killed it.

        The compression dictionary transport work addresses all of the known issues that we had with SDCH and we're cautiously optimistic that this will be around for a long time.

  • londons_explore 121 days ago
    This is going to go the same way as HTTP/2 Server Push (removed from Chrome in 2022).

    Great tech promising amazing efficiencies and speedups, but programmers are too lazy to use it correctly, so it sees barely any use, and the few places where it is used correctly are overshadowed by the places where it is used wrongly and hurts performance.

    Lesson: Tech that leads to the same page loading slightly faster generally won't be used unless it is fully automatic and enabled by default.

    Http push required extra config on the server to decide what to push. This requires extra headers to determine what content to compress against (and the server to store that old/common content).

    Neither will succeed because the vast majority of developers don't care about that last little bit of loading speed.

    • magicalist 121 days ago
      > Http push required extra config on the server to decide what to push. This requires extra headers to determine what content to compress against (and the server to store that old/common content).

      This does seem like an advanced use case you won't generally want to set up manually, but it is pretty different than push. Push required knowing what to push, which depends on what a page loads but also when the page loads it, which itself depends on the speed of the client. Mess that up and you can actually slow down the page load (like pushing resources the client already started downloading).

      Compression dictionaries could be as simple as vercel or whoever supporting deltas for your last n deployments, at which point the header to include is trivial.

      • londons_explore 121 days ago
        > Push required knowing what to push,

        Could have been as simple as a once-per-resource webpack build step that detected all resources downloaded per page (easy with a headless Chromium) and then pushed all of those (or all of the non-cacheable ones if the client sends a recent cookie).

        Yet frameworks and build systems didn't do that, and the feature was never really used.

        > you can actually slow down the page load (like pushing resources the client already started downloading).

        Any semi-smart server can prevent that, because it already knows what is currently being transferred to the client, or was already transferred in this session. There is no race condition as long as the server refuses to double-send resources. Granted, some server side architectures make this hard without the load balancer being aware.

        This compression dictionary will see the exact same fate IMO.