Being NORMAL is Overrated!

Normality

Normality is a major assumption for many statistical analyses, but I'm going to tell you why worrying about whether your data are normally distributed is overrated. First, let's cover a few of the basics.


The Normal Distribution

Many biological variables, when measured and plotted as a frequency distribution, form a bell shape. The normal distribution is a continuous probability distribution that describes a bell-shaped curve, which can be used as an approximation of the frequency distribution. Two parameters describe the shape of a normal distribution: the mean (µ) and the standard deviation (σ), which describe the location and the spread of the curve, respectively. There are an infinite number of normal distributions, but they all share several properties:

1) Values range from -∞ to ∞.

2) They are continuous, so probability is measured by the area under the curve rather than the height of the curve.

3) They are symmetric around the mean.

4) They have a single mode.

5) The mean = median = mode.

6) The probability density is highest at the mean.

7) 68.3% of the area under the curve lies within one standard deviation of the mean, and 95% of the area lies within 1.96 standard deviations of the mean (a quick SAS check of these areas appears right after this list).
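Property 7 is easy to verify yourself. Here is a minimal sketch, assuming nothing beyond Base SAS; the data set name NORMAL_AREAS is a hypothetical placeholder. It uses the CDF function to compute the area under a standard normal curve between ±1 and ±1.96 standard deviations.

DATA NORMAL_AREAS;
    /* Area within 1 standard deviation of the mean (expect ~0.683) */
    WITHIN_1SD = CDF('NORMAL', 1) - CDF('NORMAL', -1);
    /* Area within 1.96 standard deviations of the mean (expect ~0.95) */
    WITHIN_196SD = CDF('NORMAL', 1.96) - CDF('NORMAL', -1.96);
RUN;

PROC PRINT DATA=NORMAL_AREAS;
RUN;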

The Sampling Distribution of Sample Means (SDSM)

The normal distribution can also be used to describe the sampling distribution of some estimates, including the mean. The sampling distribution of sample means (SDSM) is the distribution of the means of all possible samples of size n taken from the population. When the variable being measured has a normal distribution in the population, the SDSM will also be normal. The mean of the SDSM will equal the mean of the population, but the spread of the distribution will be smaller: σ_means = σ_population/√n. If the variable being measured does not have a normal distribution in the population, the SDSM will still be more nearly normal than the population distribution, and when n (sample size) is large, the SDSM will be approximately normal. This phenomenon is described by the Central Limit Theorem.
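To see the SDSM in action, here is a minimal simulation sketch (all data set and variable names are hypothetical): it draws 1,000 samples of size n = 25 from a normal population with µ = 50 and σ = 10, and checks that the mean of the sample means is close to 50 while their standard deviation is close to σ/√n = 10/√25 = 2.

DATA SDSM;
    CALL STREAMINIT(123);            /* fix the seed so results are repeatable */
    DO SAMPLE = 1 TO 1000;
        TOTAL = 0;
        DO I = 1 TO 25;
            TOTAL = TOTAL + RAND('NORMAL', 50, 10);  /* one observation */
        END;
        SAMPLE_MEAN = TOTAL / 25;    /* mean of one sample of size n = 25 */
        OUTPUT;
    END;
    KEEP SAMPLE SAMPLE_MEAN;
RUN;

PROC MEANS DATA=SDSM MEAN STD;
    VAR SAMPLE_MEAN;                 /* expect mean ~50 and std ~2 */
RUN;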

The Central Limit Theorem (CLT)

As sample size increases, the probability distribution of means of samples from any distribution will approach a normal distribution.
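Here is a minimal simulation sketch of the CLT (all names hypothetical): sample means are drawn from a strongly right-skewed exponential population at two sample sizes, and the comparative histograms show that the means for n = 30 look far more normal than those for n = 2.

DATA CLT_DEMO;
    CALL STREAMINIT(456);
    DO N = 2, 30;                            /* two sample sizes to compare */
        DO SAMPLE = 1 TO 1000;
            TOTAL = 0;
            DO I = 1 TO N;
                TOTAL = TOTAL + RAND('EXPONENTIAL');  /* skewed population */
            END;
            SAMPLE_MEAN = TOTAL / N;
            OUTPUT;
        END;
    END;
    KEEP N SAMPLE SAMPLE_MEAN;
RUN;

PROC UNIVARIATE DATA=CLT_DEMO NORMAL;
    CLASS N;
    VAR SAMPLE_MEAN;
    HISTOGRAM SAMPLE_MEAN / NORMAL;          /* overlay a fitted normal curve */
RUN;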

The Normality Assumption

When you analyze data using any type of parametric linear model, one of the major assumptions for that analysis is "normality". I believe that most researchers are far too concerned about meeting this assumption. The majority of these analyses are robust to departures from normality. Most often, hypotheses are being tested about means, so it is the distribution of the sample means that needs to be normal. Well, if you have designed your study well and you have a sufficient sample size, you don't have to be concerned about a variable that has a non-normal distribution, because the distribution of sample means will be more nearly, if not approximately, normal!

Data Transformation

It is commonly a knee-jerk reaction to transform data that come from a non-normal distribution, but because of the CLT, it is often not actually necessary. Of course, there are some exceptions, cases when data transformation may be required (e.g., when the variable has a Poisson distribution). An important caveat to remember is that any results from statistics performed on transformed data apply only to the transformed version of the variable, and should only be discussed as such. P-values do not hold for back-transformed values. This can make discussing the results complicated and is one reason why it is not advantageous to transform your data when it is not needed. Unfortunately, even data transformation cannot fix the distribution of data that contain mostly zeros (few non-zero points). No matter what kind of transformation you attempt, a peak will always remain in the distribution at the transformed value of zero. In these cases, non-parametric data analysis could be performed, but even then all of the zero values will be assigned the same (tied) rank, so this is often still not particularly helpful. Other possible solutions are to treat the variable as binary (zero vs. non-zero), or to compare only the level of response among instances where a response was measured (non-zero values only).
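As an illustration, here is a minimal sketch of a log transformation in SAS; the input data set MYDATA and variable Y are hypothetical placeholders. Keep the caveat above in mind: any test run on LOG_Y applies to the log scale only, and its p-value does not carry over to back-transformed values.

DATA TRANSFORMED;
    SET MYDATA;          /* assumed input data set with a positive variable Y */
    LOG_Y = LOG(Y);      /* natural log; requires Y > 0 */
RUN;

PROC UNIVARIATE DATA=TRANSFORMED NORMAL;
    VAR LOG_Y;           /* normality tests now refer to the log scale */
RUN;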

Tests for Normality

When your sample size is large, plotting and visually inspecting the frequency distribution of your sample data can give you an indication of whether your variable of interest tends to follow a normal distribution. Again, this is not the real assumption you need to meet. If the sample data do appear to form a normal distribution, that is ideal, because you then know the SDSM will also be normal. But if they do not, once again, it does not necessarily mean you need to transform your data, because the CLT has got your back.

It is important to note that if you have multiple groups or treatments, and those groups/treatments differ, you would not expect the pooled data to be normally distributed, because the individual values will not be centered around the same µ. In this case, you would want to plot the data as their deviations from their predicted means (the predicted means are the estimates of µ(s) based on the sample data). These deviations are otherwise known as residuals; a sketch of this residual check in SAS follows below.

There are also statistical tests available, such as the Shapiro-Wilk test, which test the null hypothesis that the data follow a normal distribution. These tests are only valid if you have a sufficient sample size (and they, too, should be performed on the residuals). If you fail to reject the null hypothesis, you can feel very comfortable that you have met the normality assumption. But I will emphasize once again: even if such a test rejects the null hypothesis, it only gives you information about the distribution of the variable in the parent population, not about the distribution of sample means, so it does not automatically indicate a need for data transformation.
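Here is a minimal sketch of that residual check (the data set MYDATA, response Y, and factor TREATMENT are hypothetical): fit a one-way model with PROC GLM, save the residuals, and run the normality tests on the residuals rather than on the raw response.

PROC GLM DATA=MYDATA;
    CLASS TREATMENT;
    MODEL Y = TREATMENT;
    OUTPUT OUT=DIAG R=RESID;   /* residuals = deviations from predicted means */
RUN; QUIT;

PROC UNIVARIATE DATA=DIAG NORMAL;
    VAR RESID;                 /* Shapiro-Wilk and related tests reported here */
RUN;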

Basic Descriptive Statistics and Tests for Normality in SAS®


In SAS®, PROC UNIVARIATE is used to obtain descriptive statistics for any numeric variable in a data set. It operates in a similar manner to PROC MEANS but produces, by default, 16 measures, listed under "Moments" and "Basic Statistical Measures". Many other pieces of information are provided by default as well, including a t-test for the mean under "Tests for Location: H0: μ=0". You may also use PROC UNIVARIATE to check whether the variable(s) you are studying come from a normal distribution.

Copy and paste the following code into your SAS program editor:

DATA SNAKES;
    /* Read 28 snake lengths from the in-stream data below */
    DO SNAKE = 1 TO 28;
        INPUT LENGTH @@;   /* @@ holds the line so several values are read per record */
        OUTPUT;
    END;
CARDS;
44.3 41 45.8 42.4 40.3 44 39.8
43.7 40.8 38.9 44.6 40.9 46.9 43.9
50.1 48 47.1 49.2 47 45.2 45.2
47.7 49.7 50 48.3 51.3 44.1 46.6
;
RUN;

/* NORMAL requests the tests of normality; PLOT requests diagnostic plots */
PROC UNIVARIATE DATA=SNAKES NORMAL PLOT;
    VAR LENGTH;
RUN;

You will see that there is a section of the output that provides four statistical tests of normality. In all of these tests, the null hypothesis is that the variable has a normal distribution; the alternative hypothesis is that it does not (or that it has another distribution). The Shapiro-Wilk test is generally the test of choice when the data set has 2000 records or fewer, because it has good power compared to the alternative tests. With samples larger than 2000, the Kolmogorov-Smirnov test is recommended as a substitute. In the present example, both yield the same result, i.e. failure to reject the null hypothesis, which suggests that the variable LENGTH does have a normal distribution.

If you want more supporting information, gather the values of skewness (a3) and kurtosis (a4) from the first section of the PROC UNIVARIATE output. For a normal distribution, a3 (skewness) = 0 and a4 (kurtosis) = 3. Important: SAS automatically subtracts 3 from a4, so you should use a4 = 0 as your reference. This is a common source of mistakes, so make sure that you are aware of this difference when interpreting the output from PROC UNIVARIATE.

Another useful option for PROC UNIVARIATE is PLOT, which creates three diagrams: a stem-and-leaf diagram (or a horizontal bar chart in ODS graphics), a box plot, and a normal probability plot. In the latter, a straight line, sometimes indicated by + symbols, shows the theoretical placement of the data if the sample comes from a normal distribution; the * or ° symbols show the actual data. In our example, there is good agreement between the theoretical and actual placement of the data, suggesting the data come from a normal distribution.
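If you prefer modern ODS graphics to the legacy PLOT option, PROC UNIVARIATE's HISTOGRAM and QQPLOT statements produce comparable diagnostics for the SNAKES data from the example above; here is a sketch of one way to request them.

PROC UNIVARIATE DATA=SNAKES NORMAL;
    VAR LENGTH;
    HISTOGRAM LENGTH / NORMAL;                  /* histogram with fitted normal curve */
    QQPLOT LENGTH / NORMAL(MU=EST SIGMA=EST);   /* points near the line suggest normality */
RUN;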
