37 comments

  • bschwindHN 10 hours ago
    Rust with Clap solved this forever ago.

    Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.

    • MathMonkeyMan 9 hours ago
      Almost every command line tool has runtime dependencies that must be installed on your system.

          $ ldd /usr/bin/rg
          linux-vdso.so.1 (0x00007fff45dd7000)
          libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000070764e7b1000)
          libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000070764e6ca000)
          libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000070764de00000)
          /lib64/ld-linux-x86-64.so.2 (0x000070764e7e6000)
      
      The worst is compiling a C program with a compiler that uses a more recent libc than is installed on the installation host.
      • craftkiller 5 hours ago
        Don't let your dreams be dreams

          $ wget 'https://github.com/BurntSushi/ripgrep/releases/download/14.1.1/ripgrep-14.1.1-x86_64-unknown-linux-musl.tar.gz'
          $ tar -xvf 'ripgrep-14.1.1-x86_64-unknown-linux-musl.tar.gz'
          $ ldd ripgrep-14.1.1-x86_64-unknown-linux-musl/rg
          ldd (0x7f1dcb927000)
          $ file ripgrep-14.1.1-x86_64-unknown-linux-musl/rg
          ripgrep-14.1.1-x86_64-unknown-linux-musl/rg: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), static-pie linked, stripped
        • 3836293648 1 hour ago
          Which only works on Linux. No other OS allows static binaries; you always need to link to libc for syscalls.
          • craftkiller 15 minutes ago
            Also works on FreeBSD. FreeBSD maintains ABI compatibility within each major version (so 14.0 is compatible with 14.1, 14.2, and 14.3 but not 15.0). You can also install compatibility packages that make binaries compiled for older major versions run on newer major versions.

              $ pkg install git rust
              $ git clone https://github.com/BurntSushi/ripgrep.git
              $ cd ripgrep
              $ RUSTFLAGS='-C target-feature=+crt-static' cargo build --release
              
              $ ldd target/release/rg
              ldd: target/release/rg: not a dynamic ELF executable
              $ file target/release/rg
              target/release/rg: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), statically linked, for FreeBSD 14.3, FreeBSD-style, with debug_info, not stripped
      • Sharlin 6 hours ago
        Sure, but Rust specifically uses static linking for everything but the very basics (i.e. libc) in order to avoid DLL hell.
      • bschwindHN 6 hours ago
        Yes but I've never had a native tool fail on a missing libc. I've had several Python tools and JS tools fail on missing the right version of their interpreter. Even on the right interpreter version Python tools frequently shit the bed because they're so fragile.
        • mjevans 5 hours ago
          I have. During system upgrades, usually along unsupported paths.

          If you're ever living dangerously, bring along busybox-static. It might not be the best, but you'll thank yourself later.

      • sestep 6 hours ago
        I statically link all my Linux CLI tools against musl for this reason. Or use Nix.
      • 1718627440 41 minutes ago
        > The worst is compiling a C program with a compiler that uses a more recent libc than is installed on the installation host.

        This is only a problem when the program USES a symbol that was only introduced in the newer libc. In other words, when the program made a deliberate choice to need that newer symbol.

      • dboon 7 hours ago
        That’s the first rule anyone writing portable binaries learns. Compile against an old libc, and stuff tends to just work.
        • delta_p_delta_x 6 hours ago
          > Compile against an old libc

          This clause is abstracting away a ton of work. If you want to compile the latest LLVM and get 'portable C++26', you need to bootstrap everything, including CMake, from that old-hat libc on some ancient distro like CentOS 6 or Ubuntu 12.04.

          I've said it before, I'll say it again: the Linux kernel may maintain ABI compatibility, but the fact that GNU libc breaks it anyway makes it a moot point. It is a pain to target older Linux with a newer distro, which is by far the most common development use case.

          • dboon 5 hours ago
            Definitely, and I know this sounds like ignoring the problem, but in my experience the best solution is to just not use the bleeding edge.

            Write your code such that you can load it onto (for example) the oldest supported Ubuntu and compile cleanly and you’ll have virtually zero problems. Again, I know that if your goal is to truly ship something written in e.g. C++26 portably then it’s a huge pain. But as someone who writes plain C and very much enjoys it, I think it’s better to skip this class of problem.

            • delta_p_delta_x 2 hours ago
              > I think it’s better to skip this class of problem.

              I'll keep my templates, smart pointers, concepts, RAII, and now reflection, thanks. C and its macros are good for compile times but nothing much else. Programming in C feels like banging rocks together.

    • majorbugger 10 hours ago
      I will keep writing my CLI programs in the languages I want, thanks. Has it crossed your mind that these programs might be for yourself or for internal consumption? When you know the runtime will be installed anyway?
      • dcminter 10 hours ago
        You do you, obviously, but "now let npm work its wicked way" is an offputting step for some of us when narrowing down which tool to use.

        My most comfortable tool is Java, but I'm not going to persuade most of the HN crowd to install a JVM unless the software I'm offering is unbearably compelling.

        Internal to work? Yeah, Java's going to be an easy sell.

        I don't think OP necessarily meant it as a political statement.

        • goku12 9 hours ago
          There should be some way to define the CLI argument format and its constraints in some sort of DSL that can be compiled into the target language before the final compilation of the application. This way, it can be language-agnostic (though I don't know why you would need this) without the need for another runtime. The same interface specification should be able to represent a customizable help/usage message with sane defaults, generate dynamic tab-completion code for multiple shells, generate code for good-quality customizable error messages in case of CLI argument errors, and generate a neatly formatted man page with provisions for additional content, etc.
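
          Sketching what such a spec might look like (purely hypothetical; this is not any real project's format):

            // a hypothetical, language-agnostic interface spec, written here as TypeScript data
            const spec = {
              program: "mytool",
              options: {
                port: { flag: "--port", type: "integer", min: 1, max: 65535 },
                format: { flag: "--format", type: "enum", values: ["json", "xml", "yaml"] },
              },
              conflicts: [["--json", "--xml", "--yaml"]],
            } as const;
            // a generator would compile this into parser code, --help text,
            // shell completions, and a man page for the target language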

          In fact, I think something like this already exists. I just can't recollect the project.

        • vvillena 6 hours ago
          This is not an issue with Java and the other JVM languages; it's simple to use GraalVM and package a static binary.
        • lazide 2 hours ago
          Most Java CLIs (well, non-shitty ones), and most distributed Java programs in general, package their own JVMs in a hermetic environment. It's just saner.
      • bschwindHN 6 hours ago
        That's fine, I'll be avoiding using them :)
        • perching_aix 6 hours ago
          You'll avoid using his personal tooling he doesn't share, and his internal tooling he shares where you don't work?

          Are you stuck in write-only mode or something? How does this make any sense to you?

    • jampekka 57 minutes ago
      > Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.

      And don't write programs in languages that depend on CMake and random tarballs to build, and/or shared libraries to run.

      I usually have far fewer issues dragging a runtime around than fighting with builds.

    • rs186 8 hours ago
      Apparently that ship has sailed. Claude Code and Gemini CLI require a Node.js installation, and the Gemini README reads as if npm is a tool that everybody knows and has already installed.

      https://www.anthropic.com/claude-code

      https://github.com/google-gemini/gemini-cli

      • dboon 4 hours ago
        Opencode is a great model-agnostic alternative which does not require a separate runtime.
      • Sharlin 6 hours ago
        That's terrible, but at the very least there's the tiny justification that those are web API clients rather than standalone/local tools.
    • perching_aix 6 hours ago
      Like shell scripts? Cause I mean, I agree, I think this world would be a better place if starting tomorrow shell scripts were no longer a thing. Just probably not what you meant.
      • ycombobreaker 4 hours ago
        Shell scripts are a byproduct of the shell existing. Generations of programmers have cut their teeth in CLI environments. Anything that made shell scripts "no longer a thing" would necessarily destroy the interactive environment, and sounds like a ladder-pull to the curiosity of future generations.
      • bschwindHN 5 hours ago
        > I think this world would be a better place if starting tomorrow shell scripts were no longer a thing.

        Pretty much agreed - once any sort of complicated logic enters a shell script it's probably better off written in C/Rust/Go or something akin to that.

    • dcminter 10 hours ago
      The declarative form of clap is not quite as well documented as the programmatic approach (but it's usually not too bad to figure out).

      One of the things I love about clap is that you can configure it to automatically spit out --help info, and you can even get it to generate shell autocompletions for you!

      I think there are some other libraries that are challenging it now (fewer dependencies or something?) but clap sets the standard to beat.

    • LtWorf 2 hours ago
      > Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.

      Go programs compile to native executables, but they're still rather slow to start, especially if you just want --help.

    • ndsipa_pomu 3 hours ago
      > don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.

      Well, that's confused me. I write a lot of scripts in Bash specifically to make it easy to move them to different architectures etc. without requiring a custom runtime. Interpreted scripts also have the advantage that they're human-readable/editable.

  • jmull 21 hours ago
    > Think about it. When you get JSON from an API, you don't just parse it as any and then write a bunch of if-statements. You use something like Zod to parse it directly into the shape you want. Invalid data? The parser rejects it. Done.

    Isn’t writing code and using zod the same thing? The difference being who wrote the code.

    Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.

    • MrJohz 16 hours ago
      I think the key part, although the author doesn't quite make it explicit, is that (a) the parsing happens all up front, rather than weaving validation and logic together, and (b) the parsing creates a new structure that encodes the invariants of the application, so that the rest of the application no longer needs to check anything.

      Whether you do that with Zod or manually or whatever isn't important; the important thing is having a preprocessing step that transforms the data and doesn't just validate it.
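
      For example, a minimal sketch of that shape (my code, not the article's):

        // the parse step returns a type that encodes the invariant
        // "server mode implies a port", so nothing downstream re-checks it
        type Config =
          | { mode: "local" }
          | { mode: "server"; port: number };

        function parseArgs(argv: string[]): Config {
          if (!argv.includes("--server")) return { mode: "local" };
          const i = argv.indexOf("--port");
          const port = i >= 0 ? Number(argv[i + 1]) : NaN;
          if (!Number.isInteger(port)) throw new Error("--server requires --port <int>");
          return { mode: "server", port };
        }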

      • 1718627440 38 minutes ago
        But when you parse all arguments first before throwing error messages, you can create much better error messages, since they can be more holistic. To do that you need to represent the invalid configuration as a type.
      • makeitdouble 13 hours ago
        The base assumption is that parsing upfront costs less than validating along the way. I think it's a common case, but not common enough to apply as a generic principle.

        For instance if validating parameter values requires multiple trips to a DB or other external system, weaving the calls in the logic can spare duplicating these round trips. Light "surface" validation can still be applied, but that's not what we're talking about here I think.

        • MrJohz 13 hours ago
          It's not about costing less, it's about program structure. The goal should be to move from interface type (in this case a series of strings passed on the command line) to internal domain type (where we can use rich data types and enforce invariants like "if server, then all server properties are specified") as quickly as possible. That way, more of the application can be written to use those rich data types, avoiding errors or unnecessary defensive programming.

          Even better, that conversion from interface type to internal type should ideally happen at one explicit point in the program - a function call which rejects all invalid inputs and returns a type that enforces the invariants we're interested in. That way, we have a clean boundary point between the outside world and the inside one.

          This isn't a performance issue at all, it's closer to the "imperative shell, functional core" ideas about structuring your application and data.

        • lmm 8 hours ago
          > if validating parameter values requires multiple trips to a DB or other external system, weaving the calls in the logic can spare duplicating these round trips

          Sure, but probably at the cost of leaving everything in a horribly inconsistent state when you error out partway through. Which is almost always not worth it.

    • bigstrat2003 17 hours ago
      Yeah, the "parse, don't validate" advice seems vacuous to me because of this. Someone is doing that validation. I think the advice would perhaps be phrased better as "try to not reimplement popular libraries when you could just use them".
      • lock1 13 hours ago
        When I first saw the "Parse, don't validate" title, it struck me as perhaps unnecessarily clever. It's catchy, yes, but it felt too ambiguous to be meaningful for anyone outside the target audience (Haskellers in this case).

        That said, I fully agree with the article content itself. It basically just boils down to:

        When you create a program, eventually you'll need to process & check whether input data is valid or not. In a C-like language, you have two options:

          void validate(struct Data d);
        
        or

          struct ValidatedData;
          struct ValidatedData validate(struct Data d);
        
        "Parse, don't validate" is just trying to say don't do `void validate(struct Data d)` (procedure with `void`), but do `ValidatedData validate(struct Data d)` (function returning `ValidatedData`) instead.

        It doesn't mean you need to explicitly create or name everything as a "parser". It also doesn't mean "don't validate" either; in `ValidatedData validate(struct Data d)` you'll eventually have "validation" logic similar to the procedure `void` counterpart.

        Specifically, the article tries to teach folks to utilize the type system to their advantage. Rather than praying you never forget to invoke `validate(d)` at every single call site, make the type signature only accept the `ValidatedData` type so the compiler will complain loudly if future maintainers try to shove the `Data` type into it. This strategy offloads the mental burden of remembering things from the dev to the compiler.

        I'm not exactly sure why the "Parse, don't validate" catchphrase keeps getting reused in other language communities. It's not clear to the non-FP community what the distinction between "parse" and "validate" is, let alone what a "parser combinator" is. Yet somehow other articles keep reusing this same catchphrase.

        • Lvl999Noob 4 hours ago
          The difference, in my opinion, is that you receive the CLI args in the form

            some_cli <some args> --some-option --no-some-option

          Before parsing, the argument array contains both the flag to enable and the flag to disable the option. Validation would either throw an error or accept it as either enabled or disabled. But importantly, it wouldn't change the arguments. If the assumption is that the last option overwrites anything before it, then the CLI command is valid with the option disabled.

          And now, correct behaviour relies on all the code using that option to always make the same assumption.

          Parsing, on the other hand, would create a new config where `option` is an enum - either enabled or disabled or not given. No confusion about multiple flags or anything. It provides a single view for the rest of the program of what the input config was.
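
          Something like this sketch (my code, not from the article):

            type OptionState = "enabled" | "disabled" | "unset";

            // last flag wins; downstream code only ever sees OptionState
            function parseToggle(argv: string[]): OptionState {
              let state: OptionState = "unset";
              for (const arg of argv) {
                if (arg === "--some-option") state = "enabled";
                if (arg === "--no-some-option") state = "disabled";
              }
              return state;
            }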

          Whether that parsing is done by a third-party library or first-party code, declaratively or imperatively, is beside the point.

        • andreygrehov 4 hours ago
          What is ValidatedData? A subset of the Data that is valid? This makes no sense to me. The way I see it is you use ‘validate’ when the format of the data you are validating is the exact same format you are gonna be working with right after, meaning the return type doesn’t matter. The return type implies transformation – a write operation per se, whereas validation is always a read operation only.
      • dwattttt 16 hours ago
        Sibling says this with code, but to distil the advice: reflect the result of your validation in the type system.

        Then instead of validating a loose type & still using the loose type, you're parsing it from a loose type into a strict type.

        The key point is you never need to look at a loose type and think "I don't need to check this is valid, because it was checked before"; the type system tracks that for you.

        • 8n4vidtmkvmk 13 hours ago
          Everyone seems hung up on the type system, but I think the validity of the data is the important part. I'd still want to convert strings to ints, trim whitespace, drop extraneous props and all of that jazz even if I was using plain JS without types.

          I still wouldn't need to check the inputs again because I know it's already been processed, even if the type system can't help me.

          • dwattttt 12 hours ago
            The type isn't just there to make it easy to understand when you do it, it's for you a year later when you need to make a change further inside a codebase, far from where it's validated. Or for someone else who's never even seen the validation section of code.

            I'm hung up on the type system because it's a great way to convey the validity of the data; it follows the data around as it flows through your program.

            I don't (yet) use Typescript, but jsdoc and linting give me enough type checking for my needs.

            • k3vinw 6 hours ago
              jsdoc types are better than nothing. You could switch to using Typescript today and it will understand them.
          • Lvl999Noob 2 hours ago
            Pure js without typescript also has "types". Typescript doesn't give you nominal types either. It's only structural. So when you say that you "know it's already been processed", you just have a mental type of "Parsed" vs "Raw". With a type system, it's like you have a partner dedicated to tracking that. But without that, it doesn't mean you aren't doing any parsing or type tracking of your own.
      • remexre 17 hours ago
        The difference between parse and validate is

            function parse(x: Foo): Bar { ... }
        
            const y = parse(x);
        
        and

            function validate(x: Foo): void { ... }
        
            validate(x);
            const y = x as Bar;
        
        Zod has a parser API, not a validator API.
      • yakshaving_jgt 12 hours ago
        Parsing includes validation.

        The point is you don’t check that your string only contains valid characters and then continue passing that string through your system. You parse your string into a narrower type, and none of the rest of your system needs to be programmed defensively.

        To describe this advice as “vacuous” says more about you than it does about the author.

    • akoboldfrying 18 hours ago
      Yes, both are writing code. But nearly all the time, the constraints you want to express can be expressed with zod, and in that case using zod means you write less code, and the code you do write is more correct.

      > Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.

      Yes, judgement is required to make depending on zod (or any library) worthwhile. This is not different in principle from trusting those same things hold for TypeScript, or Node, or V8, or the C++ compiler V8 was compiled with, or the x86_64 chip it's running on, or the laws of physics.

      • jmull 15 hours ago
        Sure... the laws of physics last broke backwards compatibility at the Big Bang; Zod last broke backwards compatibility a few months ago.
  • 12_throw_away 18 hours ago
    I like this advice, and yeah, I always try to make illegal states unrepresentable, possibly even to a fault.

    The problem I run into here is - how do you create good error messages when you do this? If the user has passed you input with multiple problems, how do you build a list of everything that's wrong with it if the parser crashes out halfway through?

    • ffsm8 15 hours ago
      I think you're looking at it too literally - what people usually mean by "making invalid state unrepresentable" applies to the main application which holds your domain code - which should be separate from your inputs.

      He even gives the example of zod, which is a validation library he defines to be a parser.

      What he wants to say: "I don't want to write my own validation in a CLI, give me a good API already that first validates and then converts the inputs into my declared schema"

      • MrJohz 7 hours ago
        > I don't want to write my own validation in a CLI, give me a good API already that first validates and then converts the inputs into my declared schema

        But that _is_ parsing, at least in the sense of "parse, don't validate". It's about turning inputs into real objects representing the domain code that you're about to be working with. The result is still going to be a DTO of some description, but it will be a DTO with guaranteed invariants that are useful to you. For example, a post request shouldn't be parsed into a user object just because it shares a lot of fields in common with a user. Instead it should become a DTO with the invariants fulfilled that make sense for a DTO. Some of those invariants are simple (like "dates should be valid" -> the DTO contains Date objects, not strings), and some will be more complex, like the "if the server is active, then the port also needs to be provided" restriction from the article.

        This is one of the key ideas behind Zod - it isn't just trying to validate whether an object matches a certain schema, but it converts the result into a type that accurately expresses the invariants that must be in place if the object is valid.
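
        Concretely, something like this sketch (the schema is invented, though discriminatedUnion, coerce, and infer are real Zod APIs):

          import { z } from "zod";

          const ConfigSchema = z.discriminatedUnion("mode", [
            z.object({ mode: z.literal("standalone") }),
            z.object({ mode: z.literal("server"), port: z.coerce.number().int() }),
          ]);

          // the inferred type carries the invariant "server mode implies a port"
          type Config = z.infer<typeof ConfigSchema>;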

        • ffsm8 7 hours ago
          I don't disagree with the desire to get a good API like that. I was just pointing out that this was the core of the author's desire, as 12_throw_away correctly pointed out that _true_ parsing and making invalid state unrepresentable force you to error out on the first mismatch, which makes it impossible to raise multiple issues. The only way around that is to allow invalid state during the input phase.

          zod also allows invalid state as input, then attempts to shoehorn it into the desired schema, which still runs these validations the author was complaining about - just not in code he wrote.

          • Lvl999Noob 2 hours ago
            Why does "true" parsing have to error out on the very first problem? It is more than possible (though maybe not easy) to keep parsing and collecting errors as they appear. Zod, as the given example in the post, does it.
            • 1718627440 36 minutes ago
              Because then it would need to represent invalid data in its output type.
          • MrJohz 2 hours ago
            I don't know that I understand why parsing necessarily has to error out on the first mismatch. Good parsers will collect errors as they go along.

            Zod does take in invalid state as input, but that is what a parser does. In this case, the parser is `any -> T` as opposed to `string -> T`, but that's still a parsing operation.

      • 8n4vidtmkvmk 13 hours ago
        Zod might be a validation library, but it also does type coercion and transforms. I believe that's what the author means by a parser.
        • goku12 9 hours ago
          Apparently not. The author cites the example of JSON parsing for APIs. You usually don't split it into a generic parse into native data types followed by validating the result in memory (unless you're in a dynamically typed language and don't use a validation schema). Instead, the expected native data type of the result (composed using structs, enums, unions, vectors, etc.) is defined first, and then you try to parse the JSON into that data type. Any JSON errors and schema violations will error out in a single step.
    • mark38848 12 hours ago
      Just use optparse-applicative in PureScript. Applicatives are great for this and the library gives it to you for free.
      • bradrn 10 hours ago
        > Just use optparse-applicative in PureScript.

        Or in Haskell!

    • adinisom 14 hours ago
      If talking about UI, the flip side is not to harm the user's data. So despite containing errors, it needs to be representable, even if it can't be passed further along to back-end systems.

      For parsing specifically, there's literature on error recovery to try to make progress past the error.

    • ambicapter 18 hours ago
      Most validation libraries worth their salt give you options to deal with this sort of thing? They'll hand you an aggregate error with an 'errors' array, or they'll let you write an error message "prettify-er" to make a particular validation error easier to read.
      • pmarreck 15 hours ago
        Right, but that's validation, and this article is talking about parsing (not validating) into an already-correct structure by making invalid inputs unrepresentable.

        So maybe the reason why they were able to reduce the code is because they lost the ability to do good error reporting.

        • jpc0 10 hours ago
          How is getting an error array not making invalid input unrepresentable?

          You either get the correctly parsed data or you get an error array. The incorrect input was never represented in code, vs a 0 value being returned or, even worse, random gibberish.

          A trivial example: 1/0 should return DivisionByZero not 0 or infinity or NaN or whatever else. You can then decide in your UI whether that is a case you want to handle as an error or as an edge case but the parser knows that is not possible to represent.

        • lmm 8 hours ago
          You parse into an applicative validation structure, combine those together, and then once you've brought everything together you handle that as either erroring out with all the errors or continuing with the correct config. It's easier to do that with a parsing approach than a validating approach, not harder.
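
          A minimal sketch of that combining step (my own types, not any particular library's):

            type Result<T> = { ok: true; value: T } | { ok: false; errors: string[] };

            // applicative-style: combine two independent parses, accumulating all errors
            function both<A, B>(a: Result<A>, b: Result<B>): Result<[A, B]> {
              if (a.ok && b.ok) return { ok: true, value: [a.value, b.value] };
              return {
                ok: false,
                errors: [...(a.ok ? [] : a.errors), ...(b.ok ? [] : b.errors)],
              };
            }
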
        • Ygg2 8 hours ago
          Parsers can be made to not fail on the first error. You return either a parsed structure or an array of the errors found.

          The HTML5 parser is notoriously friendly to errors. See the adoption agency algorithm.

      • Thaxll 17 hours ago
        This works if all errors are self-contained; stopping at the first one is fine too.
    • geysersam 13 hours ago
      Maybe you can use his `or` construct to allow a `--server` without `--port`, but then also add a default `error_message` property.

      After parsing you check if `error_message` exists and raise that error.

    • akoboldfrying 18 hours ago
      Agree. It should definitely be possible to get error messages on par with what TypeScript gives you when you try to assign an object literal to an incompatibly typed variable; whether that's currently the case, and how difficult it would be to get there if not, I don't know.
  • nine_k 21 hours ago
    This is a recurring idea: "Parse, don't validate". Previously:

    https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va... (2019, using Haskell)

    https://www.lelanthran.com/chap13/content.html (April 2025, using C)

    • jetrink 20 hours ago
      The author credits Alexis King at the beginning and links to that post.
  • SloopJon 19 hours ago
    I don't see anything in the post or the linked tutorial that gives a flavor of the user experience when you supply an invalid option. I tried running the example, but I've forgotten too much about Node and TypeScript to make it work. (It can't resolve the @optique references.) What happens when you pass --foo, --target bar, or --port 3.14?
    • macintux 18 hours ago
      I had a similar question: to me, the “or” statement in the output format looks like it might deterministically pick one winner instead of alerting the user that they erred. A good parser is terrific, but it needs to give useful feedback.
      • Dragging-Syrup 8 hours ago
        Absolutely; I think calling the function xor would be more appropriate.
  • esafak 16 hours ago
    The "problem" is that some languages don't have rich enough type systems to encode all the constraints that people want to support with CLI options. And many programmers aren't that great at wielding the type systems at their disposal.
  • andrewguy9 18 hours ago
    Docopt!

    http://docopt.org/

    Make the usage string be the specification!

    A criminally underused library.
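
    To give the flavor (abridged from the naval_fate example on docopt.org) - the help text itself is the parser definition:

      Usage:
        naval_fate.py ship new <name>...
        naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
        naval_fate.py (-h | --help)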

    • tomjakubowski 15 hours ago
      A great example of "declaration follows use" outside of C syntax.
    • fragmede 17 hours ago
      My favorite. A bit too much magic for some, but it seems well specified to me.
  • kiliancs 4 hours ago
    Great project. Clear goal, well executed, very nice API (safe, terse, clear).

    I use Effect CLI https://github.com/Effect-TS/effect/tree/main/packages/cli for the same reasons. It has the advantage of fitting within the ecosystem. For example, I can reuse existing schemas.

  • baroninthetrees 3 hours ago
    I too got tired of dealing with CLI arg parsing and am experimenting with passing a natural language description of the program and its args to a tiny LLM to sort it out, offer suggestions (did you mean?), do type conversions, etc. So far, it's working great and, given enough detail, is deterministic.
  • SoftTalker 19 hours ago
    I like just writing functions for each valid combination of flags and parameters. Anything that isn’t handled is default rejected. Languages like Erlang with pattern matching and guards make this a breeze.
  • foundart 5 hours ago
    The author of the article also wrote a CLI parser library for Typescript, called Optique. I really appreciate them including a "When Optique makes sense" section in the docs. It would be great if more projects did that.

    https://optique.dev/why#when-optique-makes-sense

  • bsoles 20 hours ago
    >> // This is a parser

    >> const port = option("--port", integer());

    I don't understand. Why is this a parser? Isn't it just a way of enforcing a type in a language that doesn't have types?

    I was expecting something like a state machine that takes the command line text and parses it to validate the syntax and values.

    • hansvm 17 hours ago
      The heavy lifting happens in the definitions of `option` and `integer`. Those will take in whatever arguments they take in and output some sort of `Stream -> Result<Tuple<T, Stream>>` function.

      That might sound messy but to the author's point about parser combinators not being complicated, they really don't take much time to get used to, and they're quite simple if you wanted to build such a library yourself. There's not much code (and certainly no magic) going on under the hood.
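
      A stripped-down sketch of that shape (my code, not Optique's):

        type Stream = string[]; // here, the stream is just the argv list
        type Result<T> = { ok: true; value: T; rest: Stream } | { ok: false; error: string };
        type Parser<T> = (input: Stream) => Result<T>;

        const integer = (): Parser<number> => (input) => {
          const n = Number(input[0]);
          return Number.isInteger(n)
            ? { ok: true, value: n, rest: input.slice(1) }
            : { ok: false, error: `expected an integer, got ${input[0]}` };
        };

        const option = <T>(name: string, value: Parser<T>): Parser<T> => (input) =>
          input[0] === name
            ? value(input.slice(1))
            : { ok: false, error: `expected ${name}` };

        // option("--port", integer())(["--port", "3000"])
        //   -> { ok: true, value: 3000, rest: [] }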

      The advantage of that parsing approach:

      It's reasonably declarative. This seems like the author's core point. Parser-combinator code largely looks like just writing out the object you want as a parse result, using your favorite combinator library as the building blocks, and everything automagically works, with amazing type-checking if your language has such features.

      The disadvantages:

      1. Like any parsing approach, you have to actually consider all the nuances of what you really want parsed (e.g., conditional rules around whitespace handling). It looks a little to me (just from the blog post, not having examined the inner workings yet) like this project side-stepped that by working with the `Stream` type as just the `argv` list, allowing you to be able to say things like "parse the next blob as a string" without also having to encode whitespace and blob boundaries.

      2. It's definitely slower (and more memory-intensive) than a hand-rolled parser, and usually also worse in that regard than other sorts of "auto-generated" parsing code.

      For CLI arguments, especially if they picked argv as their base stream type, those disadvantages mostly don't exist. I could see it performing poorly for argv parsing for something like `cp` though (maybe not -- maybe something like `git cp`, which has more potential parse failures from delimiters like `--`?), which has both options and potentially ginormous lists of files; if you're not very careful in your argument specification then you might have exponential backtracking issues, and where that would be blatantly obvious in a hand-rolled parser it'll probably get swept under the rug with parser combinators.

  • jappgar 6 hours ago
    I really think "parse, don't validate" gives people a false sense of security (particularly false in dynamic languages like JavaScript and Python).

    "Well, I already know this is a valid uuid, so I don't really need to worry about sql injection at this point."

    Sure, this is a dumb thing to do in any case, but I've seen this exact thing happen.

    Typesafety isn't safety.

    • yakshaving_jgt 5 hours ago
      Type safety is absolutely some degree of safety. And I don’t know why anyone would think parsing a value into a type that has fewer inhabitants would absolve them of having to prevent SQL injection — these are orthogonal things.

      The quote here — which I suspect is a straw man — is such a weird non sequitur. What would logically follow from “I already know this is a valid UUID” is “so I don’t need to worry about this not being a UUID at this point”.

      • jappgar 3 hours ago
        In Python or TypeScript, the most popular languages in the world, it offers no runtime safety.

        Even in languages like Haskell, "safety" is an illusion. You might create a NumberGreaterThanFive type with smart constructors but that doesn't stop another dev from exporting and abusing the plain constructor somewhere else.

        For the most part it's fine to assume the names of types are accurate, but for safety critical operations it absolutely makes sense to revalidate inputs.

        • yakshaving_jgt 3 hours ago
          > that doesn't stop another dev from exporting and abusing the plain constructor somewhere else.

          That seems like a pretty unfair constraint. Yes, you can deliberately circumvent safeguards and you can deliberately write bad code. That doesn't mean those language features are bad.

  • nickdothutton 4 hours ago
    It’s been about 30 years but I seem to remember the compiler taking care of this for me (in Ada) with types.
  • AndrewDucker 8 hours ago
    This is one of the things that makes me glad that PowerShell does all of this intrinsically. I define the parameters, it makes sure that the arguments make sense and match them (and their validation).
  • m463 16 hours ago
    This kind of stuff is what makes me appreciate python's argparse.

    It's a genuine pleasure to use, and I use it often.

    If you dig a little deeper into it, it does all the type and value validation, file validation, it does required and mutually exclusive args, it does subargs. And it lets you do special cases of just about anything.

    And of course it does the "normal" stuff like short + long args, boolean args, args that are lists, default values, and help strings.

    • MrJohz 15 hours ago
      Actually, I think argparse falls into the same trap that the author is talking about. You can define lots of invariants in the parser, and say that these two arguments can't be passed together, or that this argument, if specified, requires these arguments to also be specified, etc. But the end result is a namespace with a bunch of key-value pairs on it, and argparse doesn't play well with typing systems like mypy or pyright. So the rest of the tool has to assume that the invariants were correctly specified up-front.

      The result is that you often still see this kind of defensive programming, where argparse ensures that an invariant holds, but other functions still check the same invariant later on, because they might have been called a different way or just because the developer isn't sure whether everything was checked where they are in the program.

      What I think the author is looking for is a combination of argparse and Pydantic, such that when you define a parser using argparse, it automatically creates the relevant Pydantic classes that define the type of the parsed arguments.

      • bvrmn 11 hours ago
        In the general case, generating CLI options from app models leads to horrible CLI UX. The opposite is also true: working with "nice" CLI options as direct app models is horrendous.

        You need a boundary to convert nice opts into nice types. For example, Pydantic models could take the argparse namespace and convert it into something manageable.

        • MrJohz 9 hours ago
          I mean, that's much the same as working with web APIs or any other kind of interface. Your DTO will probably be different from your internal models. But that doesn't mean it can't contain invariants, or that you can't parse it into a meaningful type. A DTO that's just a grab-bag of optional values is a pain to work with.

          Although in practice, I find clap's approach works pretty well: define an object that represents the parsed arguments as you want them, with annotations for details that can't be represented in the type system, and then derive a parser from that. It works because Rust has ADTs and other tools for building meaningful types, and because the derive process can do so much. That creates an arguments object that you can quite easily pass to a function which runs the command.

      • js2 13 hours ago
        > What I think the author is looking for is a combination of argparse and Pydantic

        Not quite that, but https://typer.tiangolo.com/ is fully type driven.

      • sgarland 15 hours ago
        Precisely my thought. I love argparse, but you can really back yourself into a corner if you aren’t careful.
      • hahn-kev 14 hours ago
        It's almost like you want compile time type safety
        • MrJohz 13 hours ago
          You can have that with Mypy and friends in Python, and Typescript in the JS world. The problem is that older libraries often don't utilise that type safety very well because their API wasn't designed for it.

          The library in the original post is essentially a Javascript library, but it's one designed so that if you use it with Typescript, it provides that type safety.

  • lihaoyi 19 hours ago
    That's basically what my MainArgs Scala library does: take either a method definition or a class structure and use its structure to parse your command line arguments. You get the final fields you want immediately, without needing to imperatively walk the args array (and probably getting it wrong!)

    https://github.com/com-lihaoyi/mainargs

  • dvdkon 22 hours ago
    I, for one, do think the world needs more CLI argument parsers :)

    This project looks neat, I've never thought to use parser combinators for something other than left-to-right string/token stream parsing.

    And I like how it uses Typescript's metaprogramming to generate types from the parser code. I think that would be much harder (or impossible) in other languages, making the idiomatic design of a similar library very different.

  • dcre 17 hours ago
    Some other libraries I’ve been enjoying building CLIs with in TS that do more or less the same thing, though perhaps with slightly worse composability than Optique:

    https://cliffy.io/

    https://github.com/tj/commander.js

  • globular-toast 9 hours ago
    Not all of this validation belongs in the same layer. A lot of the problems people seem to have are due to thinking it all has to be done in the I/O layer.

    A CLI and an API should indeed occupy the same layer of a program architecture, namely they are entry points that live on the periphery. But really all you should be doing there is lifting the low-level byte stream you are getting from users to something higher-level you can use to call your internals.

    So "CLI validation" should be limited to just "I need an int here, one of these strings here, optionally" etc. Stuff like "is this port out of range" or "if you give me this I need this too" should be handled by your internals by e.g. throwing an exception. Your CLI can then display that as an error message in a nice way.

  • adamddev1 14 hours ago
    Yay for parser combinators in the JS/TS wild!
  • slifin 11 hours ago
    So use Clojure Spec or better yet Malli to parse your input data at the edges of your program

    Makes sense, I think a lot of developers would want to complect this problem with their runtime type system of choice without considering the set of downsides for the users

  • panzi 9 hours ago
    No mention of yargs?
  • thealistra 22 hours ago
    Isn’t this like argparse from Python for typescript?
  • yakshaving_jgt 22 hours ago
    I've noticed that many programmers believe that parsing is some niche thing that the average programmer likely won't need to contend with, and that it's only applicable in a few specific low-level cases, in which you'll need to reach for a parser combinator library, etc.

    But this is wrong. Programmers should be writing parsers all the time!

    • WJW 21 hours ago
      Last week my primary task was writing a GitHub action that needed to log in to Heroku and push the current code on the main and development branches to the production and staging environments respectively. The week before that, I wrote some code to make sure the type of the object was included in the filters passed to an API call.

      Don't get me wrong, I actually love writing parsers. It's just not required all that often in my day-to-day work. 99% of the time when I need to write a parser myself it's for an Advent of Code problem; usually I just import whatever JSON or YAML parser is provided for the platform and go from there.

      • yakshaving_jgt 21 hours ago
        Do you not write validation? Or handle user input? Or handle server responses? Surely there’s some data processing somewhere.
    • eska 21 hours ago
      I think most security issues are just due to people not parsing input at all/properly. Then security consultants give each one a new name as if it was something new. :-)
    • dkubb 21 hours ago
      The three most common things I think about when coding are DAGs, State Machines and parsing. The latter two come up all the time in regexps which I probably write at least once a day, and I’m always thinking about state transitions and dependencies.
    • nine_k 21 hours ago
      I'd say that engineers should use the highest-level tools that are adequate for the task.

      Sometimes it's going down to machine code, or rolling your own hash table, or writing your own recursive-descent parser from first principles. But most of the time you don't have to reach that low, and things like parsing are but a minor detail in the grand scheme. The engineer should not spend time on building them, but should be able to competently choose a ready-made part.

      I mean, creating your own bolts and nuts may be fun, but most of the time, if you want to build something, you just pick a few from an appropriate box, and this is exactly right.

      • yakshaving_jgt 17 hours ago
        I don’t understand. Every mainstream language has libraries for parsing into general types, but none of them will have libraries for parsing values specific to your application.

        TFA links to Alexis King’s Parse, Don’t Validate article, which explains this well. Did you not read it?

  • ThinkBeat 20 hours ago
    And that is why there are plenty of parser generators, so you don't have to write the parser yourself every time.
  • einpoklum 8 hours ago
    Exactly the opposite of this. We should parse the command line using _no_ strict types. Not even integers. Nothing beyond parsing its structure, e.g. which option names get which (string) values, and which flags are enabled. This can be done without knowing _anything_ about the application domain, and it provides a generic options structure which is no longer a sequence of characters.

    This approach IMNSHO is much cleaner than the entanglement of cmdline parser libraries with application logic and application-domain-related types.

    Then one can specify validation logic declaratively, and apply it generically.

    This has the added benefit - for a compiled rather than an interpreted library - of not having to recompile the CLI parsing library for each different app and each different definition of options.
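
    A sketch of the kind of generic, domain-free structure meant here (names are mine):

      // knows nothing about any particular application
      interface RawArgs {
        flags: Set<string>;              // e.g. --verbose
        options: Map<string, string[]>;  // e.g. --port -> ["3000"]
        positional: string[];
      }

      // validation rules are then specified declaratively and applied generically
      interface Rule {
        option: string;
        check: (values: string[]) => string | null; // null = ok, string = error message
      }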

    • MrJohz 7 hours ago
      Can you give some examples of this working well? It certainly goes against all of my experience working with CLIs and with parsing inputs in general (e.g. web APIs etc). In general, I've found that the quicker I can convert strings into rich types, the easier that code is to work with and the less likely I am to have troubles with invalid data.
  • sudahtigabulan 18 hours ago
    Is there no getopt implementation for Typescript? The input this library tries to handle better looks to me like bad design.

    "options that depend on options" should not be a thing. Every option should be optional. Even if you have working code that can handle some complex situation, this doesn't make the situation any less unintuitive for the users.

    If you need more complex relationships, consider using arguments as well. Top level, or under an option. Yes, they are not named, but since they are mandatory anyway, you are likely to remember their meaning (spaced repetition and all that). They can still be optional (if they come last). Sometimes an argument may need to have multiple parts, like user@host:port. You can still parse it instead of validating, if you want.

    > mutually exclusive --json, --xml, --yaml.

    Use something like -t TYPE instead, where TYPE can be one of json, xml, or yaml. (Make illegal states unrepresentable.)

    > debug: optional(option("--debug")),

    Again, I believe it's called an "option" because it's meant to be optional already.

      optional(optional(option("--common-sense")))
    
    EOR
    • dwattttt 16 hours ago
      > options that depend on options

      What would you do for "top level option, which can be modified in two other ways"?

        (--option | --option-with-flag1 | --option-with-flag2 | --option-with-flag1-and-flag2)
      
      would solve invalid representation, but is unwieldy.

      Something that results in the usage string

        [--option [--flag1 --flag2]]
      
      
      doesn't seem so bad at that point.
      • sudahtigabulan 15 hours ago
        I think I've seen it done like that

          --option flag1,flag2
        
        (Maybe with another separator, as long as it doesn't need to be escaped.)

        Another possibility is to make the main option an argument, like the subcommands in git, systemctl, and others:

          command option --flag1 --flag2
        
        This depends on the specifics, though.
        • dwattttt 12 hours ago
          > --option flag1,flag2

          Embedding a second parse step that the first parser doesn't deal with is done, but it's a rough compromise.

          It feels like the difficulty in dealing with

            [--option [--flag1 --flag2]]
          
          
          Is more to do with its expression in the language parsed to, than CLI elegance.
    • Spivak 16 hours ago
      I think ultimately you're trying to tell a river that it's going the wrong way. Programs have had required options for decades at this point. I think they can make sense as alternatives to heterogeneously typed positional arguments. By making the user name them explicitly you remove ambiguity and let the user specify them in whatever order they please.

      In Python this was a motivating factor for letting functions demand that their arguments be passed as named keywords. Something like send("foo", "bar") is easier to understand and call correctly when you have to write send(channel="foo", message="bar").

  • bvrmn 12 hours ago
    A valid type for server and port should be a single value. Stop parsing them separately, please.

    ":3000" -> use port 3000 with a default host.

    "some-host" -> use host with a default port.

    "some-host:3000" -> you guess it.

    It also allows extending it to other sources/destinations, like Unix domain sockets and other stuff, without cluttering your CLI options.
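
    A sketch of such a combined parse (my code):

      interface Endpoint { host: string; port: number }

      // accepts "host", ":port", or "host:port", falling back to defaults
      function parseEndpoint(s: string, defaults: Endpoint): Endpoint {
        const i = s.lastIndexOf(":");
        if (i < 0) return { host: s, port: defaults.port };
        const host = s.slice(0, i) || defaults.host;
        const port = Number(s.slice(i + 1));
        if (!Number.isInteger(port) || port < 1 || port > 65535)
          throw new Error(`invalid port in ${JSON.stringify(s)}`);
        return { host, port };
      }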

    Also, please consider using a DSN or URI to define database configurations. Host, port, dbname, and credentials as separate options or environment variables are quite painful to use.

  • parhamn 21 hours ago
    > Try to access it and TypeScript yells at you. No runtime validation needed.

    I was recently thinking about how type safety and validation strategies are particularly thorny in languages where the typings are just annotations. E.g. the Typescript/Zod or Python/Pydantic universes. Especially in IO cases where the data doesn't originate in the same type system.

    In a language like Go (just an example, not endorsing) if you parse something into say a struct you know worst case you're getting that struct with all the fields set to zero, and you just have to handle the zero values. In typescript-likes you can get a totally different structure and run into all sorts of errors.

    All that is to say, the runtime validation is always somewhere (perhaps in the library, as it often is?), and the feature here isn't no runtime validation but typed CLI arguments. Which is cool and great.

    • metaltyphoon 20 hours ago
      > worst case you're getting that struct with all the fields set to zero, and you just have to handle the zero values

      In the field I work in, zero values are valid, and doing it in Go would be a nightmare.

      • mjevans 5 hours ago
        Database NULL is a valid pattern that any parser SHOULD support, and I do consider that a design bug in every parser Go has. Offhand, most of them effectively 'update' an object, but make it difficult or impossible to tell whether something was __set__ with a value or merely inherited a default.
      • parhamn 18 hours ago
        Agreed, the pointer or "<field>_empty: bool" patterns are annoying. The point still stands, though: you always get the structure you ask for.
  • jiggawatts 14 hours ago
    This is one of the many reasons I like PowerShell: it parses strongly typed parameters for you and outputs human readable error messages for every kind of validation failure.
  • throwaway984393 12 hours ago
    [dead]
  • suff 16 hours ago
    [dead]
  • curtisszmania 17 hours ago
    [dead]
  • HL33tibCe7 22 hours ago
    Stopped reading after realising this is written by ChatGPT
    • bfung 22 hours ago
      Looked human-ish to me, what signs did you see?
      • bobbiechen 14 hours ago
        I thought the style was like ChatGPT in a "clever, casual, snarky" prompt flavor as well. I see it a lot on LinkedIn especially in sentence structures like these:

        "Invalid data? The parser rejects it. Done."

        "That validation logic that used to be 30% of my CLI code? Gone."

        "Mutually exclusive groups? Sure. Context-dependent options? Why not."

        For me this really piled on at the end of the blog post. But maybe it's just personal style too.

    • akoboldfrying 18 hours ago
      I found the content novel and helpful (applying a known but underappreciated technique (Parse, Don't Validate) to a common problem where I hadn't thought to use it before) and the tone very enjoyable. In fact, it's so idiomatically written that I can't even believe it's just a machine translation of something written in another language.

      In short, a great article.

    • cazum 21 hours ago
      What makes you think that and not that it's just an average auto-translate job from the author's native language (Korean)?
      • urxvtcd 21 hours ago
        I’ll go one step further: what makes you think it’s an average auto-translate job? I didn’t notice anything weird, felt like your average, slightly ranty HN post. I’m not a native speaker though.
  • AfterHIA 19 hours ago
    You've got to be careful; if you validate the CLI too much you might get URA in your validator. #chugalug #house