Caching secrets of the HTTP elders, part 1

(csvbase.com)

141 points | by calpaterson 13 days ago

13 comments

  • zer00eyz 13 days ago
    This makes me feel old.

    I'm amused by some of the JS community acting like server-side rendering and hydration are akin to discovering fire, when they've just brought back progressive enhancement from circa 2009.

    Next week we're going to have a lesson on semaphores.

    On a more serious note, there are a lot of places where we could take some lessons from the past to heart (read: stop reinventing the wheel). Permissions systems spring to mind... Something unix/LDAP-like would fit a lot of use cases and be much clearer than some of the awful things I have seen. Good database design needs to make a comeback (and people need to stop being afraid of SQL). Being able to let go of data: why do some people have so much trouble purging old logs?

    I could go on but... damn, you kids get off my lawn!

    • simonjgreen 13 days ago
      Absolutely spot-on regarding the cyclical nature of server-side techniques. These were once the backbone of high-performance web hosting and architecture. Remember Steve Souders’ “High Performance Web Sites”? It feels like a prophetic read today, as it delved deep into what’s now being rediscovered.

      Full stack development has indeed morphed into a daunting field. The complexity has soared with numerous frameworks and PaaS solutions, which, while they offer a lot for “free,” tend to obscure the foundational principles that once were essential knowledge. This shift might not be detrimental as it allows developers to specialize or concentrate on business logic. However, it does make it challenging for small teams or solo developers to build high-performance applications due to the breadth of skills required.

      This democratization of technology might make the field more accessible, lowering the barriers to entry for newcomers. While some might view this perspective as elitist, I think it’s just being realistic about the skills and knowledge that defined a ‘good’ developer in the early 2000s compared to today.

      I’m nostalgic for the 2000-2010 era too, not just for the technologies and paradigms we used, but for the spirit of exploration and understanding that pervaded our approaches to problems.

      • chrisldgk 13 days ago
        The good thing about all this is that a lot of it has also consolidated into single languages, and you can still spin up everything you need on a cheap VPS or even a small computer at home.

        Need to build a backend? JavaScript and NodeJS. Need to query a database or store data? JavaScript and an ORM of your choice. Need to build a frontend? JavaScript and a frontend library of your choice.

        My point is the only thing you actually need to know is a single language and you can build anything you want. This extends to PHP with Laravel, C# and Python as well. Getting into developing things got a lot easier because you don't need to get into the nitty-gritty of how a web or database server functions.

        Obviously there's stuff like CI/CD, logging and error tracking that's a lot harder to do without PaaS solutions, not to speak of scalability. But these are things that should be left to the professionals, who should absolutely learn about the core concepts of the software they're building and deploying. But that's what they're (read: we're) being paid for.

    • mlinhares 12 days ago
      Nah man, half the jobs in tech exist because someone decided to reinvent the wheel and learn everything from scratch again. We need the jobs!
    • calpaterson 13 days ago
      One of the things I really need to sort on csvbase is permissions. Currently it has something very basic: you own your table and it's either public (meaning everyone has read) or private (meaning only you have read).

      I really don't want to implement, e.g., RBAC by myself, but equally the state of the art seems to be integration against pretty complicated external stuff (/cloud services), which is also undesirable when it's your side project.

      Advice appreciated. I suspect LDAP is not a fit but I will investigate it...

      • zer00eyz 13 days ago
        Funny we're talking about this: I have spent the last two weeks deep in Postgres roles. The whole system is, to say the least, convoluted. I think I need to write "Postgres roles for dummies"... because I feel like one.

        LDAP would be a great fit. And go read here: https://www.zytrax.com/books/ldap/ if you want a dive that you can wrap your head around.

        Candidly, I would NOT go for LDAP or RBAC now. The two things I would try to bite off are shared ownership, and then revocable tokens at the table level. It's up to the token creator what permissions to give them (read/write), how long they should last, and whether they want to just publish them or create one per "user".
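
        A rough sketch of what such a table-level token could look like (the names and fields here are purely illustrative, not csvbase's actual model):

          from dataclasses import dataclass
          from datetime import datetime, timezone

          @dataclass
          class TableToken:
              token: str            # random secret handed out by the table owner
              table: str            # e.g. "user-name/table-name"
              can_write: bool       # read is implied; write is opt-in
              expires_at: datetime  # when the token stops working (timezone-aware)
              revoked: bool = False

          def allows(tok: TableToken, table: str, want_write: bool) -> bool:
              """Check a presented token against a table and the requested access."""
              if tok.revoked or tok.table != table:
                  return False
              if datetime.now(timezone.utc) > tok.expires_at:
                  return False
              return tok.can_write or not want_write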

      • _factor 12 days ago
        You probably want to grant full rights to an outside authentication/authorization service that handles the fine-grained rights for private tables.

        That way you don’t lock the main functionality up with the rights management aspect and 1,000+ custom permissions and role sets.

        Authorization is the front door to a hallway with keyed doors behind. Have a peep hole to authenticate where necessary, but don’t complicate the core product with it.

        Heck, roll your own version if you have time. With this system you can swap it in and out and not have to rewrite your base to accommodate. An oversimplification, as are most things.

        My .02c

      • withinboredom 13 days ago
        Heh, I suspect you are on the right path:

        https://news.ycombinator.com/item?id=40054085

    • jgrahamc 13 days ago
      Server-side rendering! You mean CGI written in Perl spitting out fully-formed HTML in 1994?
      • calpaterson 13 days ago
        csvbase is written in Python instead of Perl and FCGI instead of "trad" CGI but otherwise is a lot like a site from 1994...
      • barryrandall 12 days ago
        Did you mean serverless?
    • mhuffman 12 days ago
      >server side rendering and hydration is akin to discovering fire when they just brought back progressive enhancement from circa 2009.

      Active Server Pages and PHP from the mid-'90s enter the chat...

      My favorite is "static site generation", as if that is new. There were complex Perl and bash scripts that generated full sites with templates in the '90s as well. I wrote one in C back then just for fun!

    • hyggetrold 12 days ago
      > Good database design needs to make a come back (and people need to stop being afraid of sql)

      Aye - folks should learn third normal form. They should learn SQL as well.

    • imetatroll 12 days ago
      There is something weirdly naive and arrogant about the way this works.
  • vvoyer 13 days ago
    I am surprised this doesn't list two major resources for learning how caching works:

    - https://www.mnot.net/cache_docs/ for a long time this was the best online resource

    - https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching is extremely detailed, and based on the previous link too from what I can tell

  • kstrauser 12 days ago
    ETags are also a built-in way to avoid conflicts when multiple clients are mutating the same data. Instead of each client sending

      PUT /doc
    
    And blindly overwriting what’s there, they send

      PUT /doc
      If-Match: <etag they fetched>
    
    If the current server-side ETag has a different value, the server can return a 412 Precondition Failed. Then the client can re-fetch the document, re-apply the changes, and re-PUT it in a loop until it succeeds.

    You wouldn’t do that for huge, frequently changing docs like a collaborative spreadsheet or such. It’s perfect for small documents where you’d expect success most of the time but failure is frequent enough that you want some sort of smart error handling.
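
    A rough sketch of that loop in Python with the requests library (the URL and the mutate function are made up for illustration):

      import requests

      DOC_URL = "https://example.com/doc"  # hypothetical document endpoint

      def update_doc(mutate):
          """Optimistic concurrency: re-fetch and retry if someone else won the race."""
          while True:
              current = requests.get(DOC_URL)
              current.raise_for_status()
              etag = current.headers["ETag"]

              new_body = mutate(current.text)       # apply our change to the latest copy
              put = requests.put(
                  DOC_URL,
                  data=new_body,
                  headers={"If-Match": etag},       # only overwrite the version we read
              )
              if put.status_code in (409, 412):     # someone changed it first; go around again
                  continue
              put.raise_for_status()
              return put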

  • repelsteeltje 13 days ago
    ETags are brilliant at reducing bandwidth and work especially well if the origin webserver is close to the client. Unfortunately, there isn't a lot they can do about round-trip latency. Even if the payload resides on a proxy cache close to the client, that proxy cannot instantly answer 304 Not Modified because it needs to revalidate its cache with its upstream as well (using If-None-Match).

    So, serving a (relatively small) csvbase table from the Bay Area to Australia will still be slow unless you're willing to accept stale data (i.e. Cache-Control: max-age / Expires headers).
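
    For reference, a conditional GET in Python with requests (table URL reused from elsewhere in the thread as a placeholder); the revalidation round trip still happens even when the answer is 304:

      import requests

      url = "https://csvbase.com/user-name/table-name"  # placeholder URL

      first = requests.get(url)
      etag = first.headers.get("ETag")

      # Later: revalidate instead of re-downloading the body
      second = requests.get(url, headers={"If-None-Match": etag} if etag else {})
      if second.status_code == 304:
          body = first.text   # cached copy still good; only headers crossed the wire
      else:
          body = second.text  # content changed; we paid for a full transfer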

    • mrighele 13 days ago
      If the main concern is latency, you could have the main store send push notifications to the caches (not the clients) whenever some data has been modified.

      Not something trivial, but I think it is doable if you have control of both the origin server and the intermediate cache.

    • zer00eyz 13 days ago
      Ohh, this is an interesting case. You could abuse the hell out of ETags, URLs and diffs to get a much better long-haul response.

      https://csvbase.com/user-name/table-name/ would return you an ETag. The ETag needs to be a hash of the file (SHA-256).

      https://csvbase.com/user-name/table-name/etag could return a 204 for an unchanged document. If the two don't match, then return a diff + new ETag. Apply the diff as a patch, and check the ETag.

      Yes, you still have the latency on 204s, but the moment there is a change you might be getting a 200 from cache, and only a diff at that. Smaller payload and potentially faster response.

      On the server side the only thing that you're adding in is the diff of the last change....

      If the data is faster-moving, then "cache" should let you catch up. If the caches are stale, then the first response will give you a patch and a new ETag that won't have the correct hash value, and you know to grab a fresh copy as you have no path forward.
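
      A client-side sketch of that flow (the /etag endpoint, passing the old ETag via If-None-Match, and the patch format are all hypothetical; apply_patch stands in for whatever diff scheme you pick):

        import hashlib
        import requests

        BASE = "https://csvbase.com/user-name/table-name"  # placeholder from above

        def refresh(local_body: bytes, local_etag: str, apply_patch):
            resp = requests.get(f"{BASE}/etag", headers={"If-None-Match": local_etag})
            if resp.status_code == 204:
                return local_body, local_etag               # unchanged, nothing to do

            new_etag = resp.headers["ETag"].strip('"')      # diff + new ETag in the response
            patched = apply_patch(local_body, resp.content)
            if hashlib.sha256(patched).hexdigest() == new_etag:
                return patched, new_etag                    # the patch brought us up to date

            full = requests.get(BASE)                       # stale beyond repair: full refetch
            return full.content, full.headers["ETag"].strip('"')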

      • repelsteeltje 13 days ago
        Ultimately, the root cause of latency is the round-trip time to figure out if the cache is valid. For large distances between client and source of truth there is not much you can do in terms of caching, diffs and other throughput-conserving cleverness; it's the speed of light.

        Sometimes, however, it is possible to anticipate when content will become stale: when changes to the origin don't happen arbitrarily. For instance, with live media streaming, segment chunks are usually of fixed duration (say 2s or 8s for Apple HLS) and so you know your content won't / shouldn't change until...

        In those cases, clients (and caching proxies) can rely on expiration and do not need to revalidate, which saves network traffic. But more importantly, it allows an edge server to instantly serve a cached copy to the client. The round trip is essentially between the client device and the edge server. Making that snappy might even reduce air time and save your phone battery.

    • RenThraysk 12 days ago
      Cache-Control: immutable will prevent round trips in at least Firefox. This does mean that the path should be unique per version of the content, like inserting a cryptographic hash in the path.
  • 8organicbits 13 days ago
    The Varnish caching lifecycle [1] has some great additional features: coalesce and hold multiple requests, refresh the cache while immediately serving cached content to the requester, serve stale items when the backend is down.

    [1] https://docs.varnish-software.com/tutorials/object-lifetime/

    • kstrauser 12 days ago
      I’ve loved Varnish so much when I’ve used it. I worked for a gaming company where players competed to guess the next play in a live sports game. The simplest and most resilient architecture is to have those client apps run a GET every 5 seconds to see if there’s a new opportunity to guess. Easy, but now scale that up to the Super Bowl or World Cup.

      We were looking at a ferocious front-end web server horizontal scaling bill until we plopped Varnish in front of it, made sure the backend was setting caching headers correctly for 0.5 seconds, and let it rip.

      Our backend traffic dropped from a potential 20,000,000 requests per second to… 2.

      Side lesson there: it’s amazing how much easier it is to scale the machine that doesn’t exist. Computer science is your friend when engineering throws you to the wolves.

  • cryptonector 12 days ago
    For a file-based HTTP server's weak ETags I've used a concatenation of st_dev, st_ino, and inode generation number. For strong ETags I've used a SHA-512 hash.

    In combination with If-Match: and If-None-Match: this is very powerful.
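
    Roughly, in Python (the inode generation number isn't exposed by os.stat, so this sketch substitutes st_mtime_ns; the strong variant hashes the whole file):

      import hashlib
      import os

      def weak_etag(path: str) -> str:
          st = os.stat(path)
          # st_dev + st_ino identify the file; mtime stands in for the
          # inode generation number, which os.stat doesn't expose
          return f'W/"{st.st_dev:x}-{st.st_ino:x}-{st.st_mtime_ns:x}"'

      def strong_etag(path: str) -> str:
          h = hashlib.sha512()
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(65536), b""):
                  h.update(chunk)
          return f'"{h.hexdigest()}"'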

  • baggy_trough 12 days ago
    I really wish HTTP had a no-build solution for caching static assets that didn't require either a request or a stale-asset period. For example, the ability to declare that static assets under a path won't change unless a build version at some other path changes.
    • remram 12 days ago
      The usual solution is: have your frontend use different URLs for the assets (example: put a hash of the content in the asset's URL) or add a build number to the asset's URL in a query parameter (example: a hash of the built frontend file, e.g. /static/chunk.min.js?v=1235abcde)
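
      For example, a small sketch of the query-parameter variant (paths and names are made up; any content hash works):

        import hashlib

        def busted_url(path: str) -> str:
            """Append a content hash so the URL changes whenever the file does."""
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()[:12]
            name = path.rsplit("/", 1)[-1]
            return f"/static/{name}?v={digest}"

        # e.g. busted_url("build/chunk.min.js") -> "/static/chunk.min.js?v=<12 hex chars>"
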
      • baggy_trough 12 days ago
        Yes, I'd like to not have to modify the asset names. That requires a build step for assets.
        • remram 12 days ago
          Use the query parameter trick then.
          • baggy_trough 12 days ago
            You can't for image assets in CSS for example.
  • alganet 13 days ago
    csvbase looks cool.

    Apache mod_cache is pretty good on correctness, not so much on speed. I once did layering with Varnish on top of mod_cache on top of the real backend. It was enough. It could even handle moderate-traffic WebDAV (that was my main use case).

    If Windows hadn't dropped native support for WebDAV, I would recommend you take a look at it. If I'm not mistaken, macOS still supports it out of the box, as does GNOME through gvfs.

  • remram 12 days ago
    Is `no-cache` necessary if using `must-revalidate`?
  • 1oooqooq 13 days ago
    HTTP/1.1 was a mistake, mostly driven by audience analytics (in other words, advertising).

    No complex cache (i.e. aggressively cache everything), with a user who can discern how to operate a simple refresh button, was the best solution.

    Caching today is a joke. You cannot press back after going offline anywhere.