An analysis of module names inside top PyPI packages

(joshcannon.me)

41 points | by thejcannon 108 days ago

7 comments

nicwolff 103 days ago
I've got a fun issue right now – two packages with dashes in the package names but underscores in the module names:
https://pypi.org/project/xml-from-seq/ → xml_from_seq
https://pypi.org/project/cast-from-env/ → cast_from_env
Simple normalization, right? But `pip` installs one with underscores and one with dashes:
```
    >>> from importlib.metadata import metadata
    >>> metadata('xml_from_seq')['Name']
    'xml_from_seq'
    >>> metadata('cast_from_env')['Name']
    'cast-from-env' 
```
so that's what ends up in `pip freeze`.
I _think_ it's because there's a bdist on PyPI for one, and not the other, so `pip` is using different "backends" that normalize the names into `METADATA` differently... ugh.
- woodruffw 103 days ago
  > I _think_ it's because there's a bdist on PyPI for one, and not the other, so `pip` is using different "backends" that normalize the names into `METADATA` differently... ugh.
  That isn't why: it's because `cast-from-env`'s sdist is from March 2023, while PEP 625 (which strongly stipulates package name normalization) was adopted in setuptools a year later[1].
  But to take a step back: why does the difference in `pip freeze` affect you? It shouldn't matter to `pip`, since PyPI will happily serve from both the normalized and unnormalized names.
  [1]: https://github.com/pypa/setuptools/issues/3593
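  For comparing names like these, the PEP 503 normalization rule is short enough to sketch. The `normalize` helper name is mine and not part of any library:

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name per PEP 503.

    Runs of '-', '_' and '.' collapse into a single dash, and the
    result is lowercased.
    """
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize("xml_from_seq"))   # xml-from-seq
print(normalize("cast-from-env"))  # cast-from-env
```

  Both names from the `pip freeze` output compare equal to their PyPI project names once normalized this way.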
woodruffw 103 days ago
This is a great writeup on a perennially misunderstood topic in Python packaging (and namespacing/module semantics)! A lot of (bad) security tools begin with the assumption that a top-level module name can always be reliably mapped back to its PyPI package name, and this post's data concretely dispels that assumption.
It's a shame that there isn't (currently) a reliable way to perform this backwards link: the closest current things are `{dist}.dist-info/METADATA` (unreliable, entirely user controlled) and `direct_url.json` for URL-installed packages, which isn't present for packages resolved from indices.
Edit: PEP 710[1] would accomplish the above, but it's still in draft.
[1]: https://peps.python.org/pep-0710/
- staticautomatic 103 days ago
  It took me what seemed like ages to figure out how to auth into Google cloud because the name of the module in their example code isn’t the name of the package. You shouldn’t have to be a detective to figure out what to pip install from looking at an import.
  - woodruffw 103 days ago
    I don't necessarily disagree, although note that this is true for just about every packaging ecosystem: Rust, Ruby, etc. are similar in making no guarantee that the index name is even remotely related to the importable/module name.
    Python gets the "worst" of it in the sense that it's big and has a large diversity of packages, but it's a general consequence of having a packaging ecosystem that's distinct from a given language's import machinery.
    - Timon3 103 days ago
      This is one thing I really, really like about JavaScript - you explicitly import everything from packages using the same name you install them with.
      When viewing source code without a code editor, many modern languages have no way to know what comes from where. I don't understand why this seems to be the standard for new languages like Rust.
- ggm 103 days ago
  > This is a great writeup on a perennially misunderstood topic in Python packaging (and namespacing/module semantics)! A lot of (bad) security tools begin with the assumption that a top-level module name can always be reliably mapped back to its PyPI package name, and this post's data concretely dispels that assumption.
  The whole model of naming of apt install <thing> vs port install <thing> is a wargame all of its own.
  Your general point is well made: how you get a distribution, and unpack and install it, is quite distinct from how it is named inside the language/system namespace it installs into.
  Even at the level of ssh vs sshd, there can be confusion. The daemon is configured from sshd_ files, but they live inside /etc/ssh alongside the /etc/ssh/ssh_ files configuring the client side.
dheera 103 days ago
I hate this shit.
```
    yaml -> pip install pyyaml
    cv2 -> pip install opencv-contrib-python
    PIL -> pip install pillow (wtf, this should be a misdemeanor punishable by being forced to use Windows for a year)
```
And can we please ban "py" and "python" from appearing inside the name of python packages?
Or else I'm going to start writing some python packages with ".js" in their name.
- woodruffw 103 days ago
  Banning "py" would catch "mypy" and "pydantic", both of which you probably don't intend to catch.
  pillow is imported as `PIL` because it's a fork of the original PIL[1]. There's a very strong argument that Python's ability to retain the same import name across package name changes like that is a valuable source of flexibility that has benefited the ecosystem as a whole.
  [1]: https://pypi.org/project/PIL/
  - throw-the-towel 103 days ago
    > Python's ability to retain the same import name across package name changes...
    As in, `import pillow as PIL`?
    - woodruffw 103 days ago
      > As in, `import pillow as PIL`?
      As in, not changing your imports at all, and just changing your dependency from PIL to pillow. This has two substantial advantages:
      1. You only have to change one line (the dependency), not an indefinite number of source files. This is less of an issue now that the Python community has high-quality refactoring tools, but it's still the path of least resistance.
      2. More importantly: `import pillow as PIL` is not referentially transparent: the `PIL` binding that it introduces is a `module` object, but that object can't be used in subsequent imports. In other words, blindly performing an `import X as Y` refactor would break code like this:
          import PIL
          from PIL import whatever
      You can observe this for yourself locally:
          >>> import ssl as lol
          >>> from lol import CERT_NONE
          ModuleNotFoundError: No module named 'lol'
          >>> from ssl import CERT_NONE
      This is arguably a defect in Python's import and module machinery, but that's how it currently is. Renaming the dependency and keeping the module name is far less fraught.
      - dheera 103 days ago
        The related thing that bothers me deeply is that
        import PIL
        does not make PIL.Image available. What the hell else do you expect me to do with PIL? Why isn't PIL.Image included in importing PIL? You have to explicitly do either of these
            import PIL.Image
            from PIL import Image
        woodruffw 103 days ago
        That’s because it’s a module within the PIL module, not an attribute of PIL. But that doesn’t really have anything to do with the original comment; that’s a different quirk of Python’s import machinery.
        (Understanding the difference between packages, module hierarchies, and module attributes is table stakes for architecting a large Python package correctly. PIL almost certainly does this to prevent hard-to-debug circular imports elsewhere in their codebase.)
        Jasper_ 103 days ago
        It's a strange distinction, because the standard library sometimes eschews this. `os.path` is accessible through just `import os`, because they made os.py import it into the local namespace.
        I wish it was clearer sometimes what was a module, and what was an attribute in the core import syntax. `import foo; foo.bar` only breaks if it's a module, and `import foo.bar` only breaks if it's an attribute. If you do `from foo import bar`, the syntax works with both.
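        The three cases can be reproduced with a throwaway package. This is only a sketch: `demo_pkg`, `attr` and `sub` are made-up names, and the package is written to a temporary directory so that nothing else has imported the submodule beforehand:

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. `demo_pkg`, `attr` and `sub` are
# made-up names for this illustration.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("attr = 'plain attribute'\n")
(pkg / "sub.py").write_text("value = 'submodule'\n")
sys.path.insert(0, str(root))

import demo_pkg

outcomes = {}

# Attribute access to a submodule that nobody has imported yet fails.
try:
    demo_pkg.sub
    outcomes["demo_pkg.sub"] = "ok"
except AttributeError:
    outcomes["demo_pkg.sub"] = "AttributeError"

# `import a.b` insists that b is a module, so a plain attribute fails.
try:
    import demo_pkg.attr
    outcomes["import demo_pkg.attr"] = "ok"
except ModuleNotFoundError:
    outcomes["import demo_pkg.attr"] = "ModuleNotFoundError"

# `from a import b` accepts both an attribute and a submodule.
from demo_pkg import attr, sub
outcomes["from demo_pkg import attr, sub"] = "ok"

print(outcomes)
```

        Once any code has imported the submodule, plain attribute access starts working too, which is part of why the distinction is easy to miss.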
        ciupicri 103 days ago
        Just because `os.path` is accessible through just `import os`, doesn't mean that you shouldn't import it explicitly. As the Zen of Python says, explicit is better than implicit. After all it's documented separately at https://docs.python.org/3/library/os.path.html
        If you see `os.path.basename` what could `os.path` be? It would be a module most of the time because it's written with lowercase. `itertools.chain.from_iterable` [1] would be a notable exception.
        [1]: https://docs.python.org/3/library/itertools.html#itertools.c...
- ziml77 103 days ago
  I have to look up PIL every time I use it to remember if I install PIL and import pillow or install pillow and import PIL.
  Imports can be aliased, so why allow this mismatch at all? PyPI should have enforced that each package contains one top-level module whose name is identical to the name used to install it.
  - woodruffw 103 days ago
    Imports can be aliased as bindings; they can't be aliased at the import machinery layer, which makes the PIL/pillow distinction necessary. The adjacent subthread has an example of this.
  - cqqxo4zV46cp 103 days ago
    Starting any sentence in 2024 with “PyPI should have…” is a pretty ridiculous premise. We learn things over time, and PyPI itself wasn’t exactly operating on a green field.
  - remram 103 days ago
    There used to be a PIL, someone made a new compatible distribution. They had to use the same import name to be compatible with existing code, they had to pick another name on PyPI that wasn't taken. It's kind of an extreme case.
- cozzyd 103 days ago
  Unless something is a binding, naming a package after the programming language is super weird. Like what if you change the implementation language later?
  - rty32 103 days ago
    > what if you change the implementation language later?
    I don't think that is a thing that happens in real life.
    * Practically, one package is associated with exactly one GitHub repository, sometimes a few. You would see an implementation switching from JavaScript to TypeScript, but almost never from Python to Go. Normally people start a brand new project for that kind of thing.
    * The reality is that each language has its own library ecosystem, and people reinvent the wheel at least once for each language. I wish we lived in a world where you could save the effort and instead implement everything only once, with it running efficiently and having idiomatic APIs everywhere. But that's not how it works. If you create a package for a language, that's it. You could reimplement the same thing line by line in another language, but that would be a different package for that language.
    - cozzyd 103 days ago
      It's pretty common for e.g. old scientific software to get rewritten from Fortran to C++ with a version bump.
    - dheera 103 days ago
      Yeah but what is common in real life is writing multiple parallel libraries for {Python, NodeJS, ...} with a nearly identical API. In this case I would think that if the Python command is `pip install foo`, the NodeJS command should be `npm install foo`. It's redundant to do `pip install foo-python` when pip is only for Python, and opens the door for stealthy attacks where someone else creates `pip install foo` on PyPI that is forked from your repo and mirrors your API exactly but steals data and credentials and sends it to malicious servers.
      - kortex 103 days ago
        > when pip is only for Python
        That's the neat part, it's not! You can distribute basically any kind of data with pip, within reason. IIRC CMake can be pip-installed.
        rented_mule 103 days ago
        `pip install nodejs-bin` gets you node, including npm, in your venv along with bindings for calling it all from Python.
- nilamo 103 days ago
  Pillow is a special case, in that it was always meant as a drop-in replacement for PIL, so you only had to change requirements.txt
  - ziml77 103 days ago
    Feels to me like that was a deficiency in the package management tools. Like if your requirements file could define a global alias, it would allow people who want that easy one-line change to install pillow as PIL. But everyone else who was starting fresh or who was okay with doing a few edits to their Python files could install pillow and use it as pillow.
    I guess though that there could be an issue with some dependencies being written against PIL and others being written against pillow?
- RockRobotRock 103 days ago
  It's funny and sad how you remember the stupid aliases after a while.
formerly_proven 103 days ago
> There are 210 packages which include a top-level test or tests directory
Now there's a somewhat useful "make a pull request to an open source project" exercise.
- jononor 103 days ago
  That does not seem useful? Unless there is a bug in where the files end up, i.e. they are not namespaced by the package? Shipping tests is great; it allows downstream to verify the package works. Linux distributions nowadays often run test suites during packaging.
  - formerly_proven 103 days ago
    The top-level directories in a wheel are packages, so this means they all clobber the top-level tests package name. If the wheel contains a "test" package, it even clobbers the "test" package from the standard library (which contains tests for Python itself, the built-in testing package is "unittest").
    I think that's just a misconfiguration due to the relatively common layout of
```
  - .git
  - pyproject.toml / setup.py / setup.cfg etc.
  - src/mypackage
  - tests/test_module1.py
  - tests/test_module2.py
```
    Depending on how you configure stuff you might accidentally include the tests directory as a separate top-level package next to all packages under "src". If you stick to the legacy ways, this does not happen as long as you use the usual
```
    setup(
        ...,
        packages=find_packages('src'),
        package_dir={'': 'src'},
    )
```
    I think this is the default behavior of setuptools nowadays if you do not say anything at all in any of the config files about where your code is.
    If you actually intend to ship the tests, because they don't require a specialized environment to run, then the project layout should really be
```
  - .git
  - pyproject.toml / setup.py / setup.cfg etc.
  - src/mypackage
  - src/mypackage/tests/test_module1.py
```
    Downstream consumers who might want to ship this as part of something larger should ideally be able to just delete mypackage/tests without anything breaking.
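    To check a built wheel for the clobbering described above, listing its top-level names is enough. A minimal sketch, assuming a wheel is just a zip archive whose first path component is what gets installed at the top level; `top_level_names` is a hypothetical helper, and the fake wheel below is built in memory purely for illustration:

```python
import io
import zipfile


def top_level_names(wheel_bytes: bytes) -> set[str]:
    """Return the top-level package/module names a wheel would install."""
    names = set()
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        for path in wheel.namelist():
            first = path.split("/", 1)[0]
            # Skip the wheel's own metadata and .data directories.
            if first.endswith((".dist-info", ".data")):
                continue
            # Treat a single-file module like foo.py as the name "foo".
            names.add(first.removesuffix(".py"))
    return names


# Build a tiny fake wheel in memory that mimics the misconfigured layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as wheel:
    for path in (
        "mypackage/__init__.py",
        "tests/test_module1.py",
        "mypackage-1.0.dist-info/METADATA",
    ):
        wheel.writestr(path, "")

found = top_level_names(buffer.getvalue())
clobbering = found & {"test", "tests"}
print(sorted(found))       # ['mypackage', 'tests']
print(sorted(clobbering))  # ['tests']
```

    For a real project you would pass the bytes of the file produced by your build instead of the in-memory archive.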
    - jononor 103 days ago
      Ah, right you are. Yeah, then packages really should not ship such directories.
      The practice of having tests inside the package being tested I remember as being discouraged, because it makes it hard to run one version of the tests against another version of the package. Which I guess can be useful for regression testing, though I have not really used it. An alternative layout that would preserve that would be a mypackage_tests top-level package.
      - formerly_proven 103 days ago
        That's another good option, though I guess yodafying that (tests_mypackage) would have the added benefit that downstream consumers don't get mypackage_tests as an autocomplete suggestion.
bangaladore 103 days ago
Every single language with centralized dependency managers should, without a doubt, require namespacing for package names.
    user/package-name
    group/package-name
etc...
- remram 103 days ago
  That doesn't fix the problem, that just makes it so every package now has a random prefix. Instead of having to know that "yaml" is provided by "pyyaml", you will have to know it's "ingy/yaml".
  - bangaladore 103 days ago
    Sure, but combined with other methods, you get something much better.
    Maybe I invent a protocol today called "hitta" and make a new package called
    "hitta"
    I'm pretty much automatically going to be the de facto standard, even if better, more updated implementations exist. Names matter.
    But if my implementation is called
    hittaorg/hitta
    Organizations and users (publishers) can be verified, and the tools integrate correctly; you gain better package context, increase trust, and reduce supply chain risks.
    Now, if user123 has a better version, they might make
    user123/hitta
    Instead of
    pyhitta-with-new-features
    or whatever garbage is used today
    - remram 103 days ago
      You mean to encourage other users to make other packages with the same import name? Big no from me. This is taking us backwards!
      And I don't understand what's preventing users and organizations from being verified now?
doctorpangloss 103 days ago
On the one hand, you could say it's a security issue: an installed Python package can make any module names importable, which would have surprising effects if, say, it overwrote stuff like aiohttp or your postgres client or whatever.
On the other hand, you know, it's already source code, it can do whatever it wants...
wodenokoto 103 days ago
Shame there weren’t examples of the most different package and import names.