Wikipedia:Reference desk/Archives/Mathematics/2017 March 8

From Wikipedia, the free encyclopedia
Mathematics desk
Welcome to the Wikipedia Mathematics Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


March 8

Local or national statistics as predictor of future events

A is a geographical region (say, a country) with population p (for the sake of simplicity, let's assume that the population is constant). It consists of a number of subregions Bi, each with population pi.

The number of times a certain event occurs each year in A follows Po(a), where a is an unknown constant.

The number of times a certain event occurs each year in Bi follows Po(bi), where bi is an unknown constant. We can assume that Po(a) = ∑i Po(bi) (this is not exactly true, but probably the best available model – maybe a binomial distribution would be just as correct).

Statistical data are available: the number of times the event occurred in A during n years are α0, α1, ..., αn–1, and the number of times the event occurred in Bi during the same years are βi,0, βi,1, ..., βi,n-1.

The best estimate of a clearly is â = (α0 + α1 + ... + αn−1)/n. But what is the best estimate of bi?

Intuitively, local statistics should be better at predicting local events. But if pi << p, the values of βi,j can be very small (perhaps even many of them zero) and subject to relatively large random fluctuations, so at what point might this uncertainty dominate over the difference between subregions? —JAOTC 09:22, 8 March 2017 (UTC)[reply]

The crucial point is your assertion that "We can assume that Po(a) = ∑i Po(bi)", which (as written) does not mean much, but I suspect what you meant is bi = a·pi/p, which simply means bi/pi = a/p for all i (i.e. the rate of the event is proportional to population, with the same proportionality constant in all subregions; cf. Poisson_distribution#Sums_of_Poisson-distributed_random_variables).
If you know that from theoretical arguments, then indeed using the global estimate is better because of the law of large numbers. You don't even care about the sampling by subregions.
However, if you are looking at the reporting by subregions at all, it is likely that this assertion is merely the null hypothesis waiting to be disproved. In that case, there is a famous quote that applies (Judging Books by Their Covers, Richard P. Feynman):

Nobody was permitted to see the Emperor of China, and the question was, What is the length of the Emperor of China's nose? To find out, you go all over the country asking people what they think the length of the Emperor of China's nose is, and you average it. And that would be very "accurate" because you averaged so many people. But it's no way to find anything out; when you have a very wide range of people who contribute without looking carefully at it, you don't improve your knowledge of the situation by averaging.

The situation is a bit more complex here, but if you have 10 estimates from people who saw the emperor's nose and 1000 from people who did not, adding the last 1000 will not "improve" your estimate by any reasonable meaning of the word "improve". TigraanClick here to contact me 12:15, 8 March 2017 (UTC)[reply]
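The claim that the scaled national estimate beats the local average when the rate really is proportional to population can be checked with a quick simulation. Everything below is a hypothetical sketch: the rates, the population share, and the number of years are made up, and for simplicity the local counts are drawn independently of the national ones rather than as a subset of them.

```python
import math
import random

random.seed(0)

# Hypothetical numbers: national rate a, a subregion holding 1% of the
# population, and the proportionality assumption b_i = a * p_i / p.
a = 200.0          # national events per year
share = 0.01       # p_i / p
b_i = a * share    # true local rate (= 2.0 under the assumption)
n = 20             # years of observations

def poisson(lam):
    # Knuth's multiplication method; adequate for modest lam
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

trials = 500
mse_local = mse_pooled = 0.0
for _ in range(trials):
    alphas = [poisson(a) for _ in range(n)]     # national counts alpha_j
    betas = [poisson(b_i) for _ in range(n)]    # local counts beta_{i,j}
    est_local = sum(betas) / n                  # local sample mean
    est_pooled = (sum(alphas) / n) * share      # a-hat scaled by p_i / p
    mse_local += (est_local - b_i) ** 2
    mse_pooled += (est_pooled - b_i) ** 2

mse_local /= trials
mse_pooled /= trials
print(mse_pooled < mse_local)   # the pooled estimate has far smaller error
```

Under the proportionality assumption the theoretical variances are bi/n = 0.1 for the local mean and a·(pi/p)²/n = 0.001 for the scaled national mean, so pooling wins by roughly the factor p/pi. The simulation only illustrates this best case; it says nothing about subregions where the assumption fails.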
Actually, Feynman uses the wrong term here. If you have a large sample size, then your answer is very precise, but without good data to sample, the answer would not be very accurate. Precision is the closeness of a set of measurements to their average value; a larger sample size should become closer and closer to an ideal distribution, so larger sample sizes are more precise. Accuracy is the closeness of a set of measurements to the true value (not the average value), so since no one of the billion polled Chinese people actually knew the size of the Emperor's nose, the average is not likely to be very accurate (even if it were very precise). Most introductory textbooks in the sciences or statistics will cover the difference between accuracy and precision, but even very smart people confuse the two concepts, as the usually astute Mr. Feynman does above. --Jayron32 18:31, 8 March 2017 (UTC)[reply]
[1]. Bo Jacoby (talk) 21:02, 8 March 2017 (UTC).[reply]
To whom, and in what context, is your self-aggrandizing link being provided? --Jayron32 02:52, 9 March 2017 (UTC)[reply]
To mr JAO, in the context of statistical prediction, which is what his question is about. Bo Jacoby (talk) 06:31, 9 March 2017 (UTC).[reply]
Thank you for your answer. You are quite right that the crucial point is my third line. I had difficulty wording it, which in my experience probably means I had difficulty thinking it. For the record, bi = a·pi/p does not in general hold – there are systematic demographic differences in the subregions. But they are also not uncorrelated. The heart of the problem is that we don't really know the nature or strength of that correlation.
What you are saying makes a lot of sense. But there's still something unsettling about the results, if it leads to predicting that the risk of a certain event is 0 in a subregion just because it hasn't happened yet in that particular subregion. Of course, this can (and should) be alleviated by computing confidence intervals, making it clear that the risk is not exactly 0, so possibly this is not really a problem. But where does the argument end? Presumably, if you've lived in a house since it was built, you know how frequent fires have been in that house historically, which is probably 0 times per year. You have seen the Emperor's nose, but is this really a better measurement of the fire risk in your house than the statistics from your local fire brigade?
Also, your idea of looking at bi = a·pi/p as a null hypothesis may be a better way forward than guessing something about the distribution of bi. If, for a certain i, local statistics disprove this hypothesis, then we know for sure that a is irrelevant. —JAOTC 07:36, 9 March 2017 (UTC)[reply]
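Treating the proportionality as a null hypothesis can be made concrete with an exact one-sided Poisson test: under the null, the total local count over n years follows Po(n·a·pi/p). All the numbers below are made up for illustration.

```python
import math

def poisson_cdf(k, mu):
    # P(X <= k) for X ~ Po(mu), summing the pmf directly
    return sum(math.exp(-mu) * mu ** j / math.factorial(j) for j in range(k + 1))

# Hypothetical data: p_i / p = 0.02, estimated national rate a_hat = 150
# events per year, n = 10 years, and 45 events observed locally in total.
n, a_hat, share = 10, 150.0, 0.02
mu0 = n * a_hat * share        # expected local total under the null: 30
observed = 45

# one-sided p-value for an excess of local events
p_value = 1.0 - poisson_cdf(observed - 1, mu0)
print(p_value < 0.05)          # here the null would be rejected at 5%
```

One caveat: with many subregions, some i will cross any fixed significance threshold by chance alone, so a multiple-comparisons correction would be needed before declaring a irrelevant for a particular subregion.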
there are systematic demographic differences in the subregions, [but] they are also not uncorrelated. The heart of the problem is that we don't really know the nature or strength of that correlation.
It all depends on what "correlated" means. If you have only one data point (how many events) per subregion, you cannot do much – at best you can reject the null hypothesis that the rate is the same everywhere, but you cannot correct the observed occurrence in some region to deduce the distribution parameter. (You can compute confidence intervals, but you make assumptions along the way – in effect you are usually silently assuming a prior probability distribution.)
As for your second question, statistics of low-probability events are notoriously difficult to estimate (in some cases, you can handwave an argument about how the central limit theorem is a bad approximation on the tails). The natural hypothesis would be that your house and its inhabitants have nothing particular about them, and thus that the house has the same statistics as the ensemble-average house. But if (for instance) the witch-doctor gives you a powerful anti-fire charm, your house may well have different statistics (it is more likely to catch fire because you will cook carelessly). By how much, data from other (non-witched) houses will not tell you... TigraanClick here to contact me 10:17, 9 March 2017 (UTC)[reply]
If a variable i has a Poisson distribution with mean value m, then the standard deviation is √m. So estimate i ≃ m ± √m. If you know the value i, but not the mean value m, then m has a gamma distribution, m ≃ (i+1) ± √(i+1). So if you observe the value i = 0, estimate m ≃ 1 ± 1. Bo Jacoby (talk) 09:47, 9 March 2017 (UTC).[reply]
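The estimate m ≃ (i+1) ± √(i+1) corresponds to the Gamma(i+1, 1) posterior that a flat (improper) prior on m yields after observing i events, since the posterior density is then proportional to exp(−m)·m^i. A short numerical check of the mean and standard deviation, with the flat prior as the stated assumption:

```python
import math

def posterior_mean_sd(i, dm=0.001, upper=80.0):
    # Midpoint-rule integration of the unnormalised posterior exp(-m) * m**i,
    # which under a flat prior on m is the Gamma(i+1, 1) density.
    norm = mean = second = 0.0
    m = dm / 2
    while m < upper:
        w = math.exp(-m) * m ** i * dm
        norm += w
        mean += m * w
        second += m * m * w
        m += dm
    mean /= norm
    sd = math.sqrt(second / norm - mean * mean)
    return mean, sd

mean, sd = posterior_mean_sd(0)
print(round(mean, 2), round(sd, 2))  # close to 1 and 1, matching m ≃ 1 ± 1
```

The same check with i = 3 gives mean ≈ 4 and standard deviation ≈ 2, in line with (i+1) ± √(i+1).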