https://pypi.org/project/xml-from-seq/ → xml_from_seq

https://pypi.org/project/cast-from-env/ → cast_from_env

Simple normalization, right? But `pip` installs one with underscores and one with dashes, so that's what ends up in `pip freeze`. I _think_ it's because there's a bdist on PyPI for one and not the other, so `pip` is using different "backends" that normalize the names into `METADATA` differently... ugh.
> I _think_ it's because there's a bdist on PyPI for one and not the other, so `pip` is using different "backends" that normalize the names into `METADATA` differently... ugh.
That isn't why: it's because `cast-from-env`'s sdist is from March 2023, while PEP 625 (which strongly stipulates package name normalization) was adopted in setuptools a year later[1].
But to take a step back: why does the difference in `pip freeze` affect you? It shouldn't matter to `pip`, since PyPI will happily serve from both the normalized and unnormalized names.

[1]: https://github.com/pypa/setuptools/issues/3593
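For context on what "normalized" means here: PEP 503 defines the rule both `pip` and PyPI use — lowercase the name and collapse every run of `-`, `_`, and `.` into a single `-`. A minimal sketch:

```python
import re

def normalize(name: str) -> str:
    """PEP 503 project-name normalization."""
    # Collapse runs of "-", "_", "." into a single "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize("xml_from_seq"))    # xml-from-seq
print(normalize("Cast.From--Env"))  # cast-from-env
```

PyPI resolves requests under the normalized name, which is why the spelling difference in `pip freeze` shouldn't matter to `pip` itself.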
This is a great writeup on a perennially misunderstood topic in Python packaging (and namespacing/module semantics)! A lot of (bad) security tools begin with the assumption that a top-level module name can always be reliably mapped back to its PyPI package name, and this post's data concretely dispels that assumption.
It's a shame that there isn't (currently) a reliable way to perform this backwards link: the closest current things are `{dist}.dist-info/METADATA` (unreliable, entirely user controlled) and `direct_url.json` for URL-installed packages, which isn't present for packages resolved from indices.
Edit: PEP 710[1] would accomplish the above, but it's still in draft.

[1]: https://peps.python.org/pep-0710/
It took me what seemed like ages to figure out how to auth into Google Cloud, because the name of the module in their example code isn’t the name of the package. You shouldn’t have to be a detective to figure out what to pip install from looking at an import.
I don't necessarily disagree, although note that this is true for just about every packaging ecosystem: Rust, Ruby, etc. are similar in making no guarantee that the index name is even remotely related to the importable/module name.
Python gets the "worst" of it in the sense that it's big and has a large diversity of packages, but it's a general consequence of having a packaging ecosystem that's distinct from a given language's import machinery.
This is one thing I really, really like about JavaScript - you explicitly import everything from packages using the same name you install them with.
When viewing source code without a code editor, many modern languages give you no way to know what comes from where. I don't understand why this seems to be the standard for new languages like Rust.
> This is a great writeup on a perennially misunderstood topic in Python packaging (and namespacing/module semantics)! A lot of (bad) security tools begin with the assumption that a top-level module name can always be reliably mapped back to its PyPI package name, and this post's data concretely dispels that assumption.
The whole model of naming in apt install <thing> vs port install <thing> is a wargame all of its own.
Your general point is well made: how you get a distribution, and unpack and install it, is quite distinct from how it is named inside the language/system namespace it installs into.
Even at the level of ssh vs sshd, there can be confusion. The daemon is configured from sshd_* files, but they live inside /etc/ssh alongside the /etc/ssh/ssh_* files configuring the client side.
yaml -> pip install pyyaml
cv2 -> pip install opencv-contrib-python
PIL -> pip install pillow (wtf, this should be a misdemeanor punishable by being forced to use Windows for a year)
And can we please ban "py" and "python" from appearing inside the name of python packages?
Or else I'm going to start writing some python packages with ".js" in their name.
Banning "py" would catch "mypy" and "pydantic", both of which you probably don't intend to catch.
pillow is imported as `PIL` because it's a fork of the original PIL[1]. There's a very strong argument that Python's ability to retain the same import name across package name changes like that is a valuable source of flexibility that has benefited the ecosystem as a whole.

[1]: https://pypi.org/project/PIL/
> As in, `import pillow as PIL`?

As in, not changing your imports at all, and just changing your dependency from PIL to pillow. This has two substantial advantages:
1. You only have to change one line (the dependency), not an indefinite number of source files. This is less of an issue now that the Python community has high-quality refactoring tools, but it's still the path of least resistance.
2. More importantly: `import pillow as PIL` is not referentially transparent: the `PIL` binding that it introduces is a `module` object, but that object can't be used in subsequent imports. In other words, blindly performing an `import X as Y` refactor would break code like this:
import PIL
from PIL import whatever
You can observe this for yourself locally:
>>> import ssl as lol
>>> from lol import CERT_NONE
ModuleNotFoundError: No module named 'lol'
>>> from ssl import CERT_NONE
This is arguably a defect in Python's import and module machinery, but that's how it currently is. Renaming the dependency and keeping the module name is far less fraught.
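To make that concrete with the same stdlib example: `import X as Y` creates an ordinary variable binding, while the import system continues to index the module only under its real name in `sys.modules`, which is why later `from Y import ...` statements can't find it:

```python
import sys
import ssl as lol

# The alias is just a name in this namespace, bound to the module object...
print(type(lol).__name__)         # module

# ...but the import machinery only knows the module by its real name:
print("ssl" in sys.modules)       # True
print("lol" in sys.modules)       # False
print(lol is sys.modules["ssl"])  # True
```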
`import PIL` does not make PIL.Image available. What the hell else do you expect me to do with PIL? Why isn't PIL.Image included in importing PIL? You have to explicitly do either `import PIL.Image` or `from PIL import Image`.
That’s because it’s a module within the PIL module, not an attribute of PIL. But that doesn’t really have anything to do with the original comment; that’s a different quirk of Python’s import machinery.
(Understanding the difference between packages, module hierarchies, and module attributes is table stakes for architecting a large Python package correctly. PIL almost certainly does this to prevent hard-to-debug circular imports elsewhere in their codebase.)
It's a strange distinction, because the standard library sometimes eschews this. `os.path` is accessible through just `import os`, because they made os.py import it into the local namespace.
I wish it was clearer sometimes what was a module, and what was an attribute in the core import syntax. `import foo; foo.bar` only breaks if it's a module, and `import foo.bar` only breaks if it's an attribute. If you do `from foo import bar`, the syntax works with both.
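The `os.path` case mentioned above can be checked directly: in CPython, os.py imports `posixpath` (or `ntpath`) at import time, binds it as `os.path`, and registers it in `sys.modules` under that name, so a plain `import os` is enough:

```python
import os
import sys

# No `import os.path` needed anywhere:
print(os.path.join("a", "b"))

# os.py registered the submodule alias itself:
print("os.path" in sys.modules)           # True
print(os.path is sys.modules["os.path"])  # True
```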
Just because `os.path` is accessible through just `import os` doesn't mean that you shouldn't import it explicitly. As the Zen of Python says, explicit is better than implicit. After all, it's documented separately at https://docs.python.org/3/library/os.path.html
If you see `os.path.basename`, what could `os.path` be? It would be a module most of the time, because it's written in lowercase. `itertools.chain.from_iterable`[1] would be a notable exception.

[1]: https://docs.python.org/3/library/itertools.html#itertools.c...
I have to look up PIL every time I use it to remember if I install PIL and import pillow or install pillow and import PIL.
Imports can be aliased, so why allow this mismatch at all? PyPI should have enforced that each package contains one top-level module whose name is identical to the name used to install it.
Imports can be aliased as bindings; they can't be aliased at the import machinery layer, which makes the PIL/pillow distinction necessary. The adjacent subthread has an example of this.
Starting any sentence in 2024 with “PyPI should have…” is a pretty ridiculous premise. We learn things over time, and PyPI itself wasn’t exactly operating on a green field.
There used to be a PIL, someone made a new compatible distribution. They had to use the same import name to be compatible with existing code, they had to pick another name on PyPI that wasn't taken. It's kind of an extreme case.
Unless something is a binding, naming a package after the programming language is super weird. Like, what if you change the implementation language later?
> what if you change the implementation language later?
I don't think that is a thing that happens in real life.
* Practically, one package is associated with exactly one GitHub repository, sometimes a few. You would see an implementation switching from JavaScript to TypeScript, but almost never from Python to Go. Normally people start a brand new project for that kind of thing.
* The reality is that each language has its own library ecosystem, and people reinvent the wheel at least once for each language. I wish we lived in a world where you could save the effort and instead implement everything only once, with it running efficiently and having idiomatic APIs everywhere. But that's not how it works. If you create a package for a language, that's it. You could reimplement the same thing line by line in another language, but that would be a different package for that language.
Yeah but what is common in real life is writing multiple parallel libraries for {Python, NodeJS, ...} with a nearly identical API. In this case I would think that if the Python command is `pip install foo`, the NodeJS command should be `npm install foo`. It's redundant to do `pip install foo-python` when pip is only for Python, and opens the door for stealthy attacks where someone else creates `pip install foo` on PyPI that is forked from your repo and mirrors your API exactly but steals data and credentials and sends it to malicious servers.
That's the neat part, it's not! You can distribute basically any kind of data with pip, within reason. IIRC CMake can be pip-installed.
Feels to me like that was a deficiency in the package management tools. Like if your requirements file could define a global alias, it would allow people who want that easy one-line change to install pillow as PIL. But everyone else who was starting fresh or who was okay with doing a few edits to their Python files could install pillow and use it as pillow.
I guess though that there could be an issue with some dependencies being written against PIL and others being written against pillow?
Now there's a somewhat useful "make a pull request to an open source project" exercise.
That does not seem useful? Unless there is a bug in where the files end up, i.e. they are not namespaced by the package?
Shipping tests is great; it allows downstream to verify the package works. Linux distributions nowadays often run test suites during packaging.
The top-level directories in a wheel are packages, so this means they all clobber the top-level "tests" package name. If the wheel contains a "test" package, it even clobbers the "test" package from the standard library, which contains tests for Python itself (the built-in testing framework is "unittest").
I think that's just a misconfiguration due to the relatively common flat project layout. Depending on how you configure stuff, you might accidentally include the tests directory as a separate top-level package next to all the packages under "src"; if you stick to the usual src layout, this does not happen. I think this is the default behavior of setuptools nowadays if you do not say anything at all in any of the config files about where your code is. If you actually intend to ship the tests, because they don't require a specialized environment to run, then the tests should really live inside the package.
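To make the misconfiguration concrete, here's an illustrative sketch (package name and paths hypothetical, not taken from the post): with setuptools' automatic package discovery on a flat layout, a top-level tests/ directory can be discovered and shipped as an importable top-level package, while being explicit in pyproject.toml avoids it:

```toml
# Hypothetical pyproject.toml fragment.
[tool.setuptools.packages.find]
# src layout: only discover packages under src/, so a sibling tests/
# directory is never picked up.
where = ["src"]

# Flat layout alternative: discover from the project root, but exclude
# the tests explicitly.
# include = ["mypackage*"]
# exclude = ["tests*"]
```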
Downstream consumers who might want to ship this as part of something larger should ideally be able to just delete mypackage/tests without anything breaking.
Ah, right you are. Yeah, then packages really should not ship such directories.
The practice of having tests inside the package being tested I remember as being discouraged, because it makes it hard to run one version of the tests against another version of the package. Which I guess can be useful for regression testing, though I have not really used it.
An alternative layout that would preserve that would be a mypackage_tests top-level package.
That's another good option, though I guess yodafying that (tests_mypackage) would have the added benefit that downstream consumers don't get mypackage_tests as an autocomplete suggestion.
That doesn't fix the problem, that just makes it so every package now has a random prefix. Instead of having to know that "yaml" is provided by "pyyaml", you will have to know it's "ingy/yaml".
Sure, but combined with other methods, you get something much better.
user/package-name
group/package-name
etc...
Maybe I invent a protocol today called "hitta" and make a new package called
"hitta"
I'm pretty much automatically going to be the de facto standard, even if better, more updated implementations exist. Names matter.
But if my implementation is called
hittaorg/hitta
Organizations and users (publishers) can be verified, and the tools integrate correctly; you gain better package context, increase trust, and reduce supply chain risks.
Now, if user123 has a better version, they might make

user123/hitta

Instead of

pyhitta-with-new-features

or whatever garbage is used today
And I don't understand what's preventing users and organizations from being verified now?
On the one hand, you could say it's a security issue: an installed Python package can make any module names importable, which would have surprising effects if, say, it overwrote stuff like aiohttp or your postgres client or whatever.
On the other hand, you know, it's already source code, it can do whatever it wants...