Excel's Easiest and Most Robust Normality Test - The Chi-Square Goodness-Of-Fit Test
This article shows you, in step-by-step, easy-to-follow instructions, exactly how to do the Chi-Square Goodness-of-Fit Test in Excel. Anytime that you are running a t Test, a regression, a correlation, or ANOVA, you should make sure you're working with normally distributed data, or your analysis will probably not be valid. The easiest and most robust Excel test for normality is the Chi-Square Goodness-Of-Fit Test. Here's how to do it.
Introduction to the Chi-Square Goodness-Of-Fit Test
As a marketer, anytime that you are running a t Test, a regression, a correlation, or ANOVA, you should make sure you're working with normally distributed data, or your test results might not be valid. The quick-and-dirty Excel test is simply to throw the data into an Excel histogram and eyeball the shape of the graph. If there is still a question, the next (and easiest) normality test is the Chi-Square Goodness-Of-Fit test.
The Chi-Square Goodness-Of-Fit test is less well known than some other normality tests such as the Kolmogorov-Smirnov test, the Anderson-Darling test, or the Shapiro-Wilk test. The Chi-Square Goodness-Of-Fit test is, however, a lot less complicated, every bit as robust, and by far easier to implement in Excel than any of the better-known normality tests. Let's run through an example:
The Initial Step of Normality Testing Is To Graph the Data In an Excel Histogram - Here is the initial data that we are testing for normality:
Creating an Excel Histogram From the Data - The Excel Histogram From the Above Data Is As Follows:
Excel Histogram Numerical Output
The histogram above somewhat resembles a normal distribution, but we should still apply a more robust test to it to be sure. The Chi-Square Goodness-of-Fit test in Excel is both robust and easy to perform, understand, and explain to others. Here is how to perform this test on the above data.
The 1st Step of the Chi-Square Goodness-Of-Fit Test In Excel - Apply Excel's "Descriptive Statistics" Function to the Sample Data
We need to know the mean, standard deviation, and sample size of the data that we are about to test for normality. Use the Descriptive Statistics Excel tool to obtain this information. In Excel 2003, this tool can be found at Tools / Data Analysis / Descriptive Statistics. The resulting output for this test is as follows:
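For readers who want to cross-check Excel's output outside the spreadsheet, the same three figures can be sketched in Python. This is only an illustration: the `sample` values below are hypothetical placeholders, not the data used in this article.

```python
import statistics

# Hypothetical sample values, for illustration only (not the article's data).
sample = [6.2, 7.9, 8.4, 9.1, 10.3, 11.7]

n = len(sample)                    # sample size
mean = statistics.mean(sample)     # sample mean
stdev = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)

print(n, round(mean, 3), round(stdev, 3))
```

`statistics.stdev` uses the n - 1 denominator, which is the sample standard deviation that Descriptive Statistics reports.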
How the Chi-Square Goodness-Of-Fit Test Works
Now that we have the sample mean, standard deviation, and sample size, we are ready to perform the Chi-Square Goodness-Of-Fit test on the data in Excel.
The Chi-Square Goodness-Of-Fit test is a hypothesis test. The Null and Alternative Hypotheses being tested are:
H0: The data follows the normal distribution.
H1: The data does not follow the normal distribution.
A quick summary of the test is as follows:
We divide the observed samples into groups that have the same boundaries as the bins that were established when the Histogram was created in Excel. In this case, the observed samples fell into the following bins:
3 to 4 - 1 sample had a value in this range
4 to 5 - 1 sample had a value in this range
5 to 6 - 2 samples had a value in this range
6 to 7 - 4 samples had a value in this range
7 to 8 - 6 samples had a value in this range
8 to 9 - 7 samples had a value in this range
9 to 10 - 7 samples had a value in this range
10 to 11 - 4 samples had a value in this range
11 to 12 - 4 samples had a value in this range
12 to 13 - 3 samples had a value in this range
13 to 14 - 1 sample had a value in this range
The figures above represent the observed number of samples in each bin range. We now need to calculate how many samples we would expect to occur in each bin if the sample was normally distributed with the same mean and standard deviation as the sample taken (mean = 8.634 and standard deviation = 2.5454).
The expected number of samples in each bin is calculated by the following formula:
(Area of the normal curve bounded by the bin's upper and lower boundaries) x (Total number of samples taken)
For example, if there were only 2 bins that meet at the mean, then the corresponding normal curve would have 2 regions with a boundary at the mean of the normal curve. Each of the two regions of the normal curve would contain 50% of the area under the entire normal curve. We would therefore expect 50% of the total number of samples taken to fall in each bin. If, for example, 42 samples were taken, we would expect 21 samples to occur in each bin if the samples were normally distributed.
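The two-bin case can be checked numerically. The sketch below uses Python's standard-library `NormalDist`; the standard normal parameters are an arbitrary choice, since the split at the mean is 50/50 for any mean and standard deviation.

```python
from statistics import NormalDist

total_samples = 42
dist = NormalDist(mu=0.0, sigma=1.0)  # arbitrary parameters; the split at the mean is always 50/50

area_below_mean = dist.cdf(dist.mean)   # area under the curve to the left of the mean
area_above_mean = 1.0 - area_below_mean

expected_low = area_below_mean * total_samples
expected_high = area_above_mean * total_samples

print(expected_low, expected_high)
```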
Given the bin ranges we have established for the Excel Histogram and the number of observed samples in each bin, we now need to calculate the number of samples we would expect to find in each bin. We assume that the samples are normally distributed with the same mean and standard deviation as measured from the actual sample. Given these assumptions, we use the method described above to calculate how many samples would be expected to occur in each bin.
Once we know the observed and expected number of samples in each bin, we calculate the Chi-Square Statistic.
A Chi-Square Statistic is created from the data using this formula:
Chi-Square Statistic = Σ [ [ ( Expected num. - Observed num.)^2 ] / (Expected num.) ]
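As a rough illustration of the formula above, here is a minimal Python sketch. The function name and the observed and expected counts are made up for the example and are not the article's figures.

```python
def chi_square_statistic(observed, expected):
    """Sum of (expected - observed)^2 / expected over all bins."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((e - o) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three bins, for illustration only.
observed = [8, 12, 10]
expected = [10.0, 10.0, 10.0]

stat = chi_square_statistic(observed, expected)  # (4 + 4 + 0) / 10
print(stat)
```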
A p Value is calculated in Excel from this Excel formula:
p Value = CHIDIST ( Chi-Square Statistic, Degrees of Freedom )
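CHIDIST returns the area under the Chi-Square distribution to the right of the statistic. For readers without Excel at hand, the same right-tail area can be sketched in Python using only the standard library. This is an illustrative sketch, not Excel's own algorithm: the function name is hypothetical, and it assumes the Degrees of Freedom is a positive whole number.

```python
import math

def chi_square_right_tail(stat, df):
    """Right-tail area of the chi-square distribution (integer df only)."""
    if df < 1 or int(df) != df or stat < 0:
        raise ValueError("df must be a positive integer and stat must be >= 0")
    y = stat / 2.0
    # Closed-form starting values for the upper regularized gamma function Q(a, y).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

p_value = chi_square_right_tail(4.653, 9)
print(round(p_value, 3))
```

With the Chi-Square Statistic of 4.653 and 9 Degrees of Freedom used later in this article, the sketch gives a p Value of about 0.863.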
The groups that the samples are divided into are called bins. We use the same bins as were used when creating the Histogram in Excel, which are the bin ranges listed above.
The size of the p Value determines whether or not we go with the assumption that the samples are normally distributed.
The Decision Rule
If the resulting p Value is less than the Level of Significance, we reject the Null Hypothesis and conclude that the data is not normally distributed. The Level of Significance = 1 - Required Degree of Certainty, so a required Degree of Certainty of 95% gives a Level of Significance of 5%. If the resulting p Value is greater than 0.05, we do not reject the Null Hypothesis. This means the test has not found sufficient evidence that the data departs from the normal distribution. It does not mean we can state with 95% certainty that the data is normally distributed.
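The decision rule is simple enough to write down as a few lines of Python. The function name and return strings below are hypothetical and only meant to make the comparison explicit.

```python
def decide(p_value, required_certainty=0.95):
    """Compare the p Value with the Level of Significance."""
    significance = 1.0 - required_certainty  # e.g. 1 - 0.95 = 0.05
    if p_value < significance:
        return "reject H0"
    return "do not reject H0"

print(decide(0.8634))  # a large p Value
print(decide(0.01))    # a small p Value
```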
Breaking the Normal Curve into Regions
The Chi-Square Goodness-Of-Fit test requires that the normal distribution be broken into sections. In each section we count how many samples occur. This is our Observed # for each section. The Excel Histogram function has already done this for us. Once again, here is the Excel Histogram output:
The Resulting Excel Histogram
When we created the Excel Histogram from the data, we had to specify how many "bins" the samples would be divided into. Excel counted the number of observed samples in each bin and then plotted the results in the above histogram.
Since Excel has already counted how many observed samples are in each bin, we will also use the bins as our sections for the Chi-Square Goodness-Of-Fit test. We know how many actual samples have been observed in each bin. We now need to calculate how many samples would have been expected to occur in each bin.
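The counting that Excel does for us can be sketched in Python as follows. The sketch assumes each bin number is an inclusive upper edge (a value equal to the bin number is counted in that bin) and that values above the last bin number go into an extra "More" slot, which is our understanding of how the Histogram tool counts. The function name and sample values are hypothetical.

```python
from bisect import bisect_left

def count_in_bins(values, upper_edges):
    """Count values per bin; upper_edges must be sorted ascending.

    counts[i] holds values v with upper_edges[i-1] < v <= upper_edges[i];
    the final extra slot holds values above the last edge ("More").
    """
    counts = [0] * (len(upper_edges) + 1)
    for v in values:
        counts[bisect_left(upper_edges, v)] += 1
    return counts

# Hypothetical values and bin numbers, for illustration only.
counts = count_in_bins([3.5, 4.0, 4.2, 5.9, 7.0], [4, 5, 6])
print(counts)  # [2, 1, 1, 1]
```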
Calculating the Expected Number of Samples in Each Bin
The size of each bin determines how many samples would have been expected to occur in that bin. Each bin represents a percentage of the total area under the distribution curve that we are evaluating. That percentage of the total area that is associated with a bin represents the probability that each observed sample will be drawn from that bin.
Here is a simple example that will hopefully clarify the above paragraph. If we were evaluating a data set for normality, we would be trying to determine whether the data fits the normal curve. We first have to decide on the bin ranges into which we will divide the data. The simplest bin arrangement would be to place all the data into only two bins, one on either side of the sample's mean. If the data were normally distributed, we would expect half of the samples to occur in each bin.
In other words, if the bins were placed along the x-axis relative to the sample's mean so each bin would be directly under 50% of a normal curve with the same mean, then we would expect 50% of the samples to occur in each bin. If there were 60 total samples taken, we would expect 30 samples to occur in each bin.
The expected number of samples for a single bin = Exp.
Exp. = (Area under the normal curve over the top of the bin) x (Total number of samples)
Calculating the CDF
We can obtain the normal curve area over each bin by using the Cumulative Distribution Function (CDF). The CDF at any point on the x-axis is the total area under the curve to the left of that point. We can obtain the percentage of area in the normal curve for each bin by subtracting the CDF at the x-Value of the bin's lower boundary from the CDF at the x-Value of the bin's upper boundary.
The normal distribution that we are trying to fit to the data has only two parameters: the sample's mean and standard deviation.
The CDF of this normal distribution at any point on the x-Axis can be determined by the following Excel formula:
CDF = NORMDIST ( x Value, Sample Mean, Sample Standard Deviation, TRUE )
Once again, this formula calculates the CDF at that x Value, which is the area under the normal curve to the left of the x Value. That normal curve has as its parameters the sample's mean and standard deviation.
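A Python counterpart to the NORMDIST formula with its last argument set to TRUE can be sketched with the standard-library `NormalDist` class. The wrapper name `normdist_cdf` is hypothetical, and the mean and standard deviation are the sample figures quoted in the next section.

```python
from statistics import NormalDist

def normdist_cdf(x, mean, stdev):
    """Area under the normal curve to the left of x."""
    return NormalDist(mu=mean, sigma=stdev).cdf(x)

mean, stdev = 8.643, 2.5454  # sample figures quoted in this article

at_mean = normdist_cdf(mean, mean, stdev)               # half the area lies left of the mean
one_sd_above = normdist_cdf(mean + stdev, mean, stdev)  # roughly 0.84

print(at_mean, round(one_sd_above, 4))
```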
Graphical Interpretation of the CDF
The CDF at the bin's upper boundary (65% of the curve area) minus the CDF at the bin's lower boundary (25% of the curve area) equals the curve area inside the bin (40%).
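The subtraction illustrated here can be reproduced numerically. In the sketch below, the bin boundaries are placed, purely for illustration, at the points of a standard normal curve where the CDF equals 25% and 65%. These boundaries are assumptions chosen to match the percentages in the illustration, not bins from the sample data.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)

# Hypothetical bin boundaries chosen so the CDF is 0.25 and 0.65 there.
lower = dist.inv_cdf(0.25)
upper = dist.inv_cdf(0.65)

# Area inside the bin = CDF at upper boundary - CDF at lower boundary.
area_in_bin = dist.cdf(upper) - dist.cdf(lower)
print(round(area_in_bin, 2))  # 0.4
```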
Calculating Area in Bins - Excel Calculations of Area in Bins
Above are these calculations performed in Excel using the Histogram bin ranges and a sample mean of 8.643 and standard deviation of 2.5454.
Calculating Expected Number of Samples in Each Bin - Excel Calculations for Expected Number of Samples in Each Bin
We can now calculate the Expected number of samples in each bin by the following formula:
Exp. number of samples in each bin =
( Percentage of Curve Area in that Bin ) x Total number of samples
This calculation for each bin is completed in the 1st column below. There are 42 total samples taken for this exercise.
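The same calculation can be sketched in Python for the bin ranges listed earlier (3 to 4 through 13 to 14). The sketch assumes the sample mean of 8.643, standard deviation of 2.5454, and total of 42 samples quoted in the text. It is a cross-check of the method, not a reproduction of the spreadsheet.

```python
from statistics import NormalDist

mean, stdev, total_samples = 8.643, 2.5454, 42  # figures quoted in the text
edges = list(range(3, 15))                       # boundaries 3, 4, ..., 14 (11 bins)

dist = NormalDist(mu=mean, sigma=stdev)

expected = []
for lower, upper in zip(edges, edges[1:]):
    area = dist.cdf(upper) - dist.cdf(lower)     # fraction of curve area over the bin
    expected.append(area * total_samples)        # expected number of samples in the bin

for (lower, upper), exp in zip(zip(edges, edges[1:]), expected):
    print(f"{lower} to {upper}: {exp:.2f}")
```

In this sketch the expected counts sum to slightly less than 42, because the curve area below 3 and above 14 is not assigned to any bin.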
Calculation of the Chi-Square Statistic - Excel Calculations of the Chi-Square Statistic
The end result of the above Excel calculations is the final column of (Exp. - Obs.)^2 / Exp. for each bin. These figures are then summed to give us the overall Chi-Square Statistic for the sample data. In this case, the sample data's Chi-Square Statistic is 4.653.
Degrees of Freedom - Excel Calculation of Degrees of Freedom
The Chi-Square Goodness-Of-Fit test requires that the number of Degrees of Freedom be calculated for the specific test being run. The formula for this is as follows:
Degrees of Freedom = df = (number of filled bins) - 1 - (number of parameters calculated from the sample)
The number of filled bins = 12
We calculated the mean and standard deviation from the sample. This is 2 parameters.
df = 12 - 1 - 2 = 9
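The Degrees of Freedom arithmetic can be written as a small helper. The function name is hypothetical, and the check that the result is at least 1 is an added safeguard rather than part of the formula above.

```python
def degrees_of_freedom(filled_bins, estimated_parameters):
    """df = (number of filled bins) - 1 - (parameters estimated from the sample)."""
    df = filled_bins - 1 - estimated_parameters
    if df < 1:
        raise ValueError("not enough filled bins for this many estimated parameters")
    return df

# 12 filled bins; mean and standard deviation estimated from the sample.
df = degrees_of_freedom(filled_bins=12, estimated_parameters=2)
print(df)  # 9
```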
We can now calculate the p Value from the Chi-Square Statistic and the Degrees of Freedom as shown directly above.
The p Value's Graphical Interpretation - An Excel Graph Showing the p Value
The p Value's graphical interpretation is shown below. The p Value represents the percentage of area (in red) to the right of X = 4.653 under a Chi-Square distribution with 9 Degrees of Freedom. Because the p Value (0.8634) is greater than the Level of Significance (0.05), we do not reject the Null Hypothesis.
In this case, we state that we do not reject the Null Hypothesis and do not have sufficient evidence that the data is not normally distributed.
Your Opinions, Questions, and Comments Are Very Important To Us. We Are Looking Forward To Hearing From You !
Nik on April 26, 2019:
I'm not sure how you came up with the Lower and Upper Bin Ranges. It would make more sense to me if the lowest bin range started at a large negative number and the uppermost bin range ended with a large positive number (e.g. -10^7 and 10^7). Then, the actual bin numbers would be used to construct the intermediate bin ranges. For example, BR_1 would read [-10^7, 3], BR_2 would read [3, 4], and so on until the final row BR_13 read [14, 10^7]. Why is this not the case? It seems to me that the prescribed method slightly distorts the normal area each bin would be expected to contain.