Monday, February 26, 2007

Guinness is good for you!

Students taking my course on Introductory Econometrics probably won’t recognise this slogan, but it was to be found on all bottles of Guinness when I was a student first studying econometrics at the end of the nineteen-sixties. In fact there is a connection between Guinness and econometrics, and an interesting story to go with it.

The link is Student’s t-distribution and its originator, Mr W S Gosset, who worked as a chemist and mathematician for the Guinness brewery a century ago.

In econometrics we regularly use the t-distribution to assess the significance of individual regression coefficients or to compute confidence intervals based on our sample estimates. We divide the coefficient estimate by its standard error and then check whether this calculated value exceeds (in absolute value) the relevant critical t value from the tables (based on the available degrees of freedom and the agreed significance level for the test – usually 5%). To get a 95% confidence interval for a parameter we take the point estimate plus or minus the estimated standard error multiplied by the appropriate t-value from the table with 0.025 in each tail.
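The two calculations just described can be sketched in a few lines of Python. The coefficient, standard error and degrees of freedom below are hypothetical numbers chosen only for illustration; the critical value 2.101 is the usual table entry for 18 degrees of freedom with 0.025 in each tail.

```python
# Hypothetical regression output, for illustration only.
coef = 0.85      # estimated coefficient
std_err = 0.30   # its estimated standard error
t_crit = 2.101   # table value: 18 degrees of freedom, 0.025 in each tail

# Significance test: compare the absolute t-ratio with the critical value.
t_ratio = coef / std_err
significant = abs(t_ratio) > t_crit

# 95% confidence interval: point estimate plus or minus t_crit standard errors.
ci_lower = coef - t_crit * std_err
ci_upper = coef + t_crit * std_err

print(f"t-ratio = {t_ratio:.3f}, significant at 5%: {significant}")
print(f"95% confidence interval: ({ci_lower:.3f}, {ci_upper:.3f})")
```

Note that the interval excludes zero exactly when the t-ratio exceeds the critical value in absolute terms, so the test and the interval always tell the same story.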

As you may know, the t-distribution is rather like the normal distribution: it is symmetrical around its mean, most of the values fall quite close to the mean, and only a small amount of the area lies out in the extreme tails. Its tails are, however, a bit “fatter” than those of the normal distribution. So if you were to use the normal distribution critical values of + and – 1.96 to separate extreme values in the tails from those in the middle of the distribution, your test would reject a little more often than the nominal 5% and your 95% confidence limits for parameter estimates would be slightly too narrow. The exact t-values also depend on the number of degrees of freedom available to you in estimating the parameter(s), and the t-distribution approaches the normal distribution as the number of degrees of freedom increases, so with a big enough sample size it wouldn’t make much difference.
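To see how quickly the t-values close in on 1.96, here is a rough sketch that computes the critical values directly rather than looking them up. It uses only the Python standard library, integrating the t density numerically with Simpson's rule and finding the critical value by bisection; the function names are my own, and a statistics package would normally do this for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: 0.5 plus Simpson's rule applied on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, tail=0.025):
    """Value c with P(T > c) = tail, located by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > tail:
            lo = mid   # too much area still to the right: move up
        else:
            hi = mid
    return (lo + hi) / 2

for df in (5, 10, 30, 120):
    print(f"df = {df:3d}: critical t = {t_critical(df):.3f}")

# Area actually left in both tails if 1.96 is used with only 5 degrees of freedom.
tail_area_5 = 2 * (1 - t_cdf(1.96, 5))
print(f"two-tail area beyond 1.96 with 5 df = {tail_area_5:.3f}")
```

The results should match the printed tables (about 2.571 for 5 degrees of freedom and 2.228 for 10), and by 120 degrees of freedom the value is within a couple of hundredths of 1.96.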

Where does this t-distribution come from and why is it important? Let’s suppose first of all that a factory production line is filling bottles with an amount of liquid (Guinness maybe!) supposed to be 1 pint. Now it is not really possible for the technology to guarantee an exact amount of 1 pint each time. In reality the quantity dispensed will be a random variable: sometimes the machine puts a bit more than a pint into a bottle and sometimes a bit less. It may be OK to assume that this random variable has a constant mean and variance, and even that the distribution is normal. In that case, if these two parameters were known in advance, it would be possible to set up the machinery so that, say, 95% of the time there would be at least 1 pint in each bottle. The mean amount dispensed would have to exceed 1 pint, and the gap between this figure and 1 pint would be smaller the lower the variance of the amount dispensed.

The problem is that in most cases we won’t know either the mean of the distribution or its variance in advance. We will have to take a sample of values, use the sample mean to estimate the population mean, and base our estimate of the population standard deviation on the standard deviation of our sample. (Actually the estimate of the standard deviation of the sampling distribution of the mean, called the standard error, will be the sample standard deviation divided by the square root of the sample size; see any basic statistics text.) In a situation like this it is quite likely that the sample size we work with will be small. After all, if we have to remove bottles from the production line in order to measure how much liquid is in them, those bottles and their contents cannot be sold. Taking a large sample would just be too costly.
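The bottling example involves two small calculations, sketched below with the Python standard library. All the figures (standard deviations of 0.01 and 0.02 pints, and the six measured fills) are hypothetical, and the function name is my own. The first part assumes the variance is known and the fills are normal; the second part is what we are reduced to when it is not.

```python
from statistics import NormalDist, mean, stdev

def required_mean_fill(target=1.0, sigma=0.02, coverage=0.95):
    """Mean fill needed so that a share `coverage` of bottles holds at least `target`."""
    z = NormalDist().inv_cdf(coverage)  # one-sided standard normal quantile
    return target + z * sigma

# Known-parameter case: the lower the spread, the smaller the gap above 1 pint.
for sigma in (0.01, 0.02):
    print(f"sigma = {sigma}: set the mean fill to {required_mean_fill(sigma=sigma):.4f} pints")

# Unknown-parameter case: a small hypothetical sample of measured fills, in pints.
fills = [1.03, 1.01, 1.05, 0.99, 1.04, 1.02]
sample_mean = mean(fills)
standard_error = stdev(fills) / len(fills) ** 0.5  # sample s.d. over root n
print(f"sample mean = {sample_mean:.4f}, standard error = {standard_error:.4f}")
```

With only six bottles the standard error is itself a fairly rough estimate, which is exactly the difficulty the t-distribution addresses.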
Fortunately Gosset discovered how the sampling distribution would be affected by having to use an estimate of its variance from a small sample rather than working with the known population value or an estimate based on a very large sample. This distribution has now come to be known as the t-distribution, or more formally, Student’s t-distribution. It is worth noting that Gosset had no computers to help him undertake the necessary simulations to arrive at his result. All his calculations had to be done by hand.
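With a computer, the effect of estimating the variance from a small sample is easy to check by simulation. The sketch below is only an illustration and not Gosset's own procedure: it draws many samples of five hypothetical bottle fills from a normal distribution, builds an interval around each sample mean, and counts how often the interval contains the true mean. The value 2.776 is the table entry for 4 degrees of freedom with 0.025 in each tail; the mean, standard deviation and seed are arbitrary choices.

```python
import random

def coverage(crit, n=5, trials=20000, mu=1.0, sigma=0.02, seed=1):
    """Share of samples whose interval mean +/- crit * s.e. contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(sample) / n
        # Sample variance with n - 1 in the denominator, then the standard error.
        var = sum((x - m) ** 2 for x in sample) / (n - 1)
        se = (var / n) ** 0.5
        if abs(m - mu) <= crit * se:
            hits += 1
    return hits / trials

cover_normal = coverage(crit=1.96)   # pretending the normal tables apply
cover_t = coverage(crit=2.776)       # using the t value for 4 degrees of freedom

print(f"coverage using 1.96:  {cover_normal:.3f}")
print(f"coverage using 2.776: {cover_t:.3f}")
```

With five bottles per sample, the intervals built with 1.96 should contain the true mean noticeably less often than 95% of the time, while those built with 2.776 should come out close to 95%.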

Having completed this work, Gosset naturally wanted to share his findings with other members of the statistics community through the usual method of a published journal article. However, his employers at Guinness were not at all keen on this, and he had to resort to the ruse of publishing under the pseudonym “Student”. Hence “Student’s” t-distribution. (Actually he didn’t originally call the distribution the t-distribution. It was his fellow statistician Fisher who gave it this name.)

You can read the original paper online if you wish:
Student [W S Gosset] (1908) The probable error of a mean. Biometrika, (1): 1–25.

And you can read more on William Sealy Gosset and his work for Guinness (as well as other leading statisticians of the early days such as Fisher, Pearson and others) in a fascinating book called The Lady Tasting Tea by David Salsburg.
