We know how to calculate the necessary sample size because the Central Limit Theorem tells us that the sampling distribution of the mean (or proportion) becomes approximately normal as the sample size grows. This justifies using Z-scores from the normal distribution to derive confidence intervals and, from there, to solve for the sample size needed to achieve a desired margin of error and confidence level.
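To make that concrete, here is a minimal Python sketch (the function name `sample_size` is my own, not from any particular library) of the standard formula n = z^2 * p(1-p) / E^2 for estimating a proportion to within margin E:

```python
from math import ceil
from statistics import NormalDist

def sample_size(margin, confidence=0.95, p=0.5):
    """Sample size needed to estimate a proportion to within `margin`
    at the given confidence level, via the normal approximation
    n = z^2 * p(1-p) / margin^2 (worst case at p = 0.5)."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size(0.05))        # 5% margin, 95% confidence -> 385
print(sample_size(0.05, 0.90))  # lower confidence needs fewer people
```

A 5% margin at 95% confidence lands near 385, which is plausibly why so many quoted studies have samples in the 300–400 range.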
The article used a Python library without really explaining the reason or science behind the result. Knowing this can help when you read an article or watch a news report that quotes “a study of 300 people…”. Why 300 people? You can reasonably assume that the researchers used the CLT.
The real-world constraints of time and money are non-trivially involved in sample-size decisions. The CLT may be invoked, but at this point I do not give studies the benefit of the doubt when they are announced.
You are right that time, cost, and feasibility often drive sample-size decisions more than statistical ideals. My point was just that when researchers cite a specific number (like 300), there's often a statistical basis tied to confidence levels and margin of error. Skepticism is healthy, since not all studies follow best practices.
There is one key element missing from this explanation: the statistical model that you would like to use when you’re estimating. Without this you cannot do sample size estimation.
If you go to the extreme, say you sample 100M people out of a population of 100M, and then 100M out of 8B: the first number is exact and the second has a very small error. So the error is a function of both sample size and population size.
In non-extreme cases, when your population is much bigger than the sample size, you're correct that it doesn't really make any difference.
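A quick way to see both regimes is the finite population correction. This sketch (my own, assuming simple random sampling of a proportion) shows that for a fixed sample size the population size barely matters, until the sample becomes a large fraction of the population:

```python
from math import sqrt

def stderr_fpc(n, N, p=0.5):
    """Standard error of a sample proportion, with the finite
    population correction factor sqrt((N - n) / (N - 1))."""
    return sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))

# Same sample of 1000, wildly different population sizes:
print(stderr_fpc(1000, 100_000_000))    # ~0.0158
print(stderr_fpc(1000, 8_000_000_000))  # ~0.0158, indistinguishable

# The extreme case: sampling almost everyone drives the error to ~0.
print(stderr_fpc(99_999_999, 100_000_000))
```

The correction factor is essentially 1 whenever N is much larger than n, which is why population size usually drops out of the textbook formulas.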
The whole point of sample sizing is to try to make the sample unbiased. If you're an alien, you don't know anything about your sample. As an example, say the alien wants to know the ratio of female to male lions. They sample one hunting pack and conclude that all lions are female.
You could, for example, be sampling in a country or geography that favors male over female offspring for cultural and social reasons. Then you have to refine your research question to further clarify what you are really trying to estimate.
When I was an undergrad first learning statistics I asked my stats instructor (a grad student) about this issue and they responded with something like "the population size doesn't matter because for the assumptions of the test to be met... such and such..." I kind of accepted that answer — we were talking about asymptotic inferences — but it never seemed quite right to me.
The example I gave was actually motivated in part by a sort of real-world problem I was dealing with: let's say you only want to make inferences about a population of 20 individuals. Certainly, if you have a sample of 19 out of those 20, your confidence about the population will be much stronger than if you had drawn the same 19 from a population of 100 million.
One thing he did say, which is probably right, is that the 1/20 you didn't sample might throw things off, so it's more influential, in a sense, than a single member of a population of 100 million.
At the time I hadn't learned about exact and Jaynesian-permutation statistics, but that's probably the right way to think about finite populations. That is, something like "what are all the outcomes you could observe, and what proportion of those does my observed result represent?"
It's just that usually our population is so large that the exact test approach becomes infeasible to deal with without approximations, and you end up with the typical classical asymptotic statistics.
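The enumerate-everything view can be sketched for a tiny population. Assuming simple random sampling without replacement (the code and names here are mine), the hypergeometric distribution lists every outcome you could observe; for a population of 20 sampled 19 at a time, there are almost no outcomes left to be uncertain about:

```python
from math import comb

def sample_probs(N, F, n):
    """Exact sampling distribution: probability of seeing k 'successes'
    in a sample of n drawn without replacement from a population of N
    containing F successes (the hypergeometric distribution)."""
    return {k: comb(F, k) * comb(N - F, n - k) / comb(N, n)
            for k in range(max(0, n - (N - F)), min(n, F) + 1)}

# Population of 20 with 10 females, sample of 19: the observable
# female count is pinned to 9 or 10 -- almost no uncertainty left.
print(sample_probs(20, 10, 19))  # {9: 0.5, 10: 0.5}
```

For large N this table of outcomes becomes astronomically wide, which is where the normal approximations of classical asymptotic statistics take over.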
It's all maybe a moot point, but it's always a good idea to think about the population you're trying to make inferences about. I think that probably includes the population size, and sometimes that population is bigger than you might initially think.
As for your last question: obtaining an unbiased sample gets harder as the number of attributes you want to be unbiased with regard to increases. It's a permutation problem again, usually implicit in judgments about sampling representativeness.
How does this relate to things like nationwide elections? For a population of 100 million, 99% confidence with a 0.1% margin of error needs a sample size of only about a million. Does this mean it's not really important that everyone votes? Is even an abysmal voter turnout ultimately representative of the entire population?
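The arithmetic behind a figure like that can be sketched with the finite population correction (my own code; at 99% confidence and a 0.1% margin I get roughly 1.6 million rather than one million, but the point stands: a tiny fraction of 100M suffices):

```python
from math import ceil
from statistics import NormalDist

def sample_size_fpc(margin, N, confidence=0.99, p=0.5):
    """Sample size for a proportion, shrunk by the finite population
    correction: n = n0 / (1 + (n0 - 1) / N), where n0 is the
    infinite-population sample size."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    return ceil(n0 / (1 + (n0 - 1) / N))

# 99% confidence, 0.1% margin, 100M electorate:
print(sample_size_fpc(0.001, 100_000_000))  # ~1.6M, under 2% of voters
```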
A good sample needs to be random and representative of the population as a whole, otherwise you introduce sampling bias. Imagine trying to do a survey of what people's favourite fast food restaurants are, but doing it inside a McDonald's — it doesn't matter how large your sample is, it's going to be heavily biased. This is why survey companies spend a lot of effort trying to find random, representative samples of the population, and often weight their samples so that they match the target population even more closely.
If we treat elections like a survey, then they have a massive inherent bias to the sampling method: the people who will get "surveyed" are the ones who are engaged enough to get registered, and then willing to go to a physical polling station and vote. This will naturally bias towards certain types of people.
In practice, we don't treat elections like a survey. If we did, we'd spend a lot of time afterwards weighting the results to figure out what the entire country really thought. But that has its own flaws, and ultimately voting is a civic exercise. You can do it, you can avoid it: that choice is yours, and ultimately part of your vote. In a way, you could argue that the sample size for an election is 100% of the population, where "for whatever reason, I didn't cast a vote" is a valid box to check on this survey.
That said, the whole "samples can be biased" thing is very much relevant for elections because many political groups have an incentive to add additional bias to the samples. That could be as simple as organising pick-ups to allow their voters to get to the polls, or teaching people how to register to vote if they're eligible, but it could also involve making it significantly harder or slower for certain groups (or certain regions) to register or vote.
A 100% sample is unattainable, not just practically, but fundamentally. Even if you made voting mandatory and ensured collection of every single vote, there will always be people who will fudge their vote because they are not interested in the process. I argue that any election is only representative of people engaged with the process and that fundamentally cannot change. Within that subset, you shouldn't need 100% sampling for high confidence.
I agree that random distribution is key to this, but I don't see how that would change between messaging that everyone must vote versus saying to vote only if you're interested.
I mean that an election is (theoretically) a 100% sample, because every eligible voter has the ability to interact with the voting process at the level that they choose. So the decision of some people to invalidate their vote, or to vote tactically, or not to vote at all, or whatever else: that's part of the act of taking part in an election. In that sense, you can't not take part in an election if you're eligible to vote.
This is important, because normally, once you take a sample, you need to analyse that sample to ensure that it is representative, and potentially weight different responses if you want to make it more representative. For example, if you got a sample that was 75% women, you might weight the male responses more strongly to match the roughly 50/50 split between men and women in the general population. But in an election, we don't do this, because the assumption is that if you spoil your ballot or don't take part, that is part of your choice as a citizen.
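That weighting step is simple to sketch. Assuming a hypothetical 1000-person sample that came back 75% women (all numbers made up for illustration), post-stratification weights each group by its population share over its sample share:

```python
# Hypothetical sample counts vs. known population shares.
sample = {"women": 750, "men": 250}      # a 75/25 sample of 1000 people
population = {"women": 0.5, "men": 0.5}  # the roughly 50/50 target split

total = sum(sample.values())
# Post-stratification weight: population share / sample share.
weights = {g: population[g] / (sample[g] / total) for g in sample}
print(weights)  # women weighted down (~0.67), men weighted up (2.0)

# After weighting, each group contributes equal effective respondents.
effective = {g: sample[g] * weights[g] for g in sample}
print(effective)  # 500 and 500
```

Real pollsters weight across many attributes at once (age, region, education, …), which is exactly the permutation explosion mentioned earlier in the thread.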
But I think we're saying the same sort of thing, but in different ways: you can either see "the sample of an election is every citizen, regardless of whether they voted" or "the population of an election is everyone who voted", and in either case the sample is the same as the population, and we can therefore assume that it is representative of the population.
Beyond simple statistics and random sampling, everyone voting is important because if, say, only 1000 people were required to determine the result beyond reasonable uncertainty, those 1000 would hold too much power: they could be bribed, and the result affected, far more easily than if everyone voted.
Of course, psychologically, everyone needs to vote to have a say. But beyond even that psychological thing, everyone voting is really a security measure against tampering.
Democratic systems already have the problem of not accounting for how strongly each voter feels about something. Is it really fair for 51 people weakly in favour of something to overrule 49 who are very strongly against and consider the issue extremely important? That's surely a net negative decision.
Forcing people to vote who aren't interested only makes this effect even worse.
I never said anything about forcing. Only giving people the chance to vote. As for your comments about democracy, well, I don't think democracy really works on a large scale. It's just probably the best system we have at the moment.
This is a problem even with a large turnout, because swing voters are a thing and are generally the target of manipulation. You may only need to target a thousand or so key swing voters to get your nose ahead of the other contestants. In fact, I would argue that making bribery easy would level out the playing field for contestants; otherwise the one with the deepest pockets will tend to have the advantage.
Say you are an alien, and you want to know roughly the male-to-female ratio of people. Let's say the true ratio is 50%.
Wouldn't this be done by an unbiased sample that's quite small, regardless of whether there's 100M or 8B people on the planet?