Statistical data analysis is a tool to help us better understand the world around us and make sense of the infinite data with which we are constantly bombarded. In the business world, statistical data analysis can be used across the organization to help managers and others make better decisions and optimize the effectiveness and profitability of the organization. In particular, inferential statistical techniques are used to test hypotheses to determine if the results of a study occur at a rate that is unlikely to be due to chance. Statistical techniques are applied to the analysis of data collected through various types of research paradigms. However, although it would be comforting to assume that the application of statistical tools to the analysis of empirical data would yield unequivocal answers to aid in decision making, it does not. Without understanding the principles behind statistical methods, it is difficult to analyze data or to correctly interpret the results.
Mathematical statistics is a branch of mathematics that deals with the analysis and interpretation of data and provides the theoretical underpinnings for various applied statistical disciplines. Although statistical manipulation for the sake of learning more about stochastic processes or expanding the understanding of statistical principles is fine for theorists or the classroom, most people use statistics as a tool, a means rather than an end. In particular, statistics is a way to help us better understand the world around us and make sense of the infinite data with which we are constantly bombarded. To that end, in the business world, statistics tends to be used to organize and analyze data so that it can be interpreted and applied to solving business problems. Through statistical data analysis, marketing analysts can better predict future trends in the marketplace or understand how best to market to specific market segments. Through statistical data analysis, logisticians can better understand how to manage the supply chain so that it is both more effective and efficient, with supplies, raw materials, and components being received just before they are needed and products finished just before they are to be delivered in order to cut down on wasted time, money, and storage. Through statistical data analysis, engineers can determine ways to better control the quality of manufacturing processes or design products that will meet the needs of the marketplace while lowering costs for the organization.
The best way to perform these and other tasks is determined through the application of inferential statistics to the analysis of empirical data. Inferential statistics is a collection of techniques that allow one to make inferences about data, including drawing conclusions about a population from a sample. Inferential statistics is used to test hypotheses to determine if the results of a study have statistical significance, meaning they occur at a rate that is unlikely to be due to chance. A hypothesis is an empirically testable declarative statement that the independent and dependent variables and their corresponding measures are related in a specific way as proposed by the theory. The independent variable is manipulated by the researcher. For example, an organization might want to determine which of two new designs it should bring to market. The independent variable is the design of the product. The dependent variable, so called because its value depends on which level of the independent variable the subject receives, is the subject's response to the independent variable -- in this case, whether people prefer Design A or Design B. The researcher may set up an experiment to test the hypothesis that one design is preferred over the other. The results of the analysis would give the company support for making an empirically based decision about which product to bring to market.
For purposes of data analysis, a hypothesis is stated in two ways. The null hypothesis (H0) is a statement that there is no statistical difference between the status quo and the experimental condition. For example, a null hypothesis about people's preference for the two new product designs would be that there is no preference for one design over the other. The alternative hypothesis (H1) would be that there is, in fact, a preference for one design over the other. After the hypothesis has been formulated, an experimental design is developed that allows the hypothesis to be empirically tested. Data is then collected and statistically analyzed to determine whether the null hypothesis should be rejected or retained.
There are a number of different statistical methods for testing hypotheses, each appropriate for a different type of experimental design. One frequently used technique is the t-test, which is used to test hypotheses about the mean of a single population or to compare the means of two populations. When the samples are large or the population variance is known, a z statistic may be used instead. Another useful technique is analysis of variance (ANOVA), a family of techniques used to analyze the joint and separate effects of multiple independent variables on a single dependent variable to determine the statistical significance of the effect. Other statistical tools allow the prediction of one variable from the knowledge of another variable. Correlation coefficients allow analysts to determine whether two variables are positively related (e.g., the older people become, the more they prefer a certain brand of cereal), negatively related (e.g., the older people become, the less they prefer that brand of cereal), or not related at all. Regression is a family of techniques that are used to develop mathematical models for use in predicting one variable from the knowledge of another variable. In general, statistical techniques can be applied to a wide range of business problems, including marketing research, quality control, prediction of marketplace trends or sales volume, and comparing the relative efficiency of the various operations in a multinational organization.
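The correlation and regression ideas above can be illustrated with a minimal pure-Python sketch. The age/preference numbers are hypothetical, invented purely to show a positive relationship; in practice a statistics library would also report significance.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient: +1 = perfect positive, -1 = perfect negative."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def least_squares(x, y):
    """Intercept and slope of the least-squares prediction line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

# Hypothetical data: age versus preference score for a cereal brand
age = [20, 30, 40, 50, 60]
preference = [2, 3, 5, 6, 9]
r = pearson_r(age, preference)         # close to +1: positively related
a, b = least_squares(age, preference)  # model for predicting preference from age
```

A positive r corresponds to the first cereal example in the text, a negative r to the second, and r near zero to no relationship.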
It would be comforting to assume that the application of statistical tools to the analysis of empirical data would yield definitive answers that would unequivocally indicate what decision should be made. Unfortunately, it does not. Without understanding the principles behind statistical methods, it is difficult to analyze data or to correctly interpret the results.
Limitations to Real-World Statistical Data Analysis
Even if these limitations could be overcome, there are also practical limitations to real-world statistical data analysis that need to be taken into account. As complicated as human behavior is and as confounding as extraneous variables can be, the data that is collected in a laboratory is pristine compared with the data that can be collected in...
Descriptive statistics try to describe the relationship between variables in a sample or population. Descriptive statistics provide a summary of data in the form of the mean, median and mode. Inferential statistics use a random sample of data taken from a population to describe and make inferences about the whole population. It is valuable when it is not possible to examine each member of an entire population. Examples of descriptive and inferential statistics are illustrated in Table 1.
The extent to which the observations cluster around a central location is described by the central tendency and the spread towards the extremes is described by the degree of dispersion.
Measures of central tendency
The measures of central tendency are mean, median and mode. Mean (or the arithmetic average) is the sum of all the scores divided by the number of scores. The mean may be influenced profoundly by extreme values. For example, the average stay of organophosphorus poisoning patients in ICU may be influenced by a single patient who stays in ICU for around 5 months because of septicaemia. Such extreme values are called outliers. The formula for the mean is
Mean = Σx / n
where x = each observation and n = number of observations. Median is defined as the middle of a distribution in ranked data (with half of the variables in the sample above and half below the median value), while mode is the most frequently occurring variable in a distribution. Range defines the spread, or variability, of a sample. It is described by the minimum and maximum values of the variables. If we rank the data and, after ranking, group the observations into percentiles, we get a better picture of the pattern of spread of the variables. In percentiles, we rank the observations into 100 equal parts. We can then describe the 25th, 50th, 75th or any other percentile. The median is the 50th percentile. The interquartile range is the middle 50% of the observations about the median (25th–75th percentile). Variance is a measure of how spread out the distribution is. It gives an indication of how closely an individual observation clusters about the mean value. The variance of a population is defined by the following formula:
σ² = Σ(Xi − X)² / N
where σ² is the population variance, X is the population mean, Xi is the ith element from the population and N is the number of elements in the population. The variance of a sample is defined by a slightly different formula:
s² = Σ(xi − x)² / (n − 1)
where s² is the sample variance, x is the sample mean, xi is the ith element from the sample and n is the number of elements in the sample. Note that the formula for the population variance has 'N' as the denominator, whereas the sample variance divides by 'n − 1'. The expression 'n − 1' is known as the degrees of freedom and is one less than the number of observations: once the sample mean is fixed, each observation is free to vary except the last one, which is determined by the others. The variance is measured in squared units. To make the interpretation of the data simple and to retain the basic unit of observation, the square root of the variance is used. The square root of the variance is the standard deviation (SD). The SD of a population is defined by the following formula:
σ = √[Σ(Xi − X)² / N]
where σ is the population SD, X is the population mean, Xi is the ith element from the population and N is the number of elements in the population. The SD of a sample is defined by a slightly different formula:
s = √[Σ(xi − x)² / (n − 1)]
where s is the sample SD, x is the sample mean, xi is the ith element from the sample and n is the number of elements in the sample. An example of the calculation of variance and SD is illustrated in Table 2.
Example of mean, variance, standard deviation
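The measures above can be computed directly with Python's standard library; the manual lines mirror the population (÷N) and sample (÷(n − 1)) variance formulas. The ICU stay values are hypothetical, with 150 days standing in for the outlier described in the text.

```python
import statistics

# Hypothetical ICU stays in days; 150 is the outlier (the 5-month stay)
stay_days = [2, 3, 3, 4, 5, 150]

mean = statistics.mean(stay_days)       # pulled up sharply by the outlier
median = statistics.median(stay_days)   # robust to the outlier
mode = statistics.mode(stay_days)       # most frequently occurring value

# Population variance divides by N; sample variance divides by n - 1
# (the degrees of freedom), matching the formulas above.
n = len(stay_days)
m = sum(stay_days) / n
pop_var = sum((x - m) ** 2 for x in stay_days) / n
samp_var = sum((x - m) ** 2 for x in stay_days) / (n - 1)
samp_sd = samp_var ** 0.5               # SD = square root of the variance
```

Note how the single outlier drags the mean far above the median, which is why the median is often preferred for skewed data such as length of stay.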
Normal distribution or Gaussian distribution
Most biological variables cluster around a central value, with symmetrical positive and negative deviations about this point. The standard normal distribution curve is a symmetrical, bell-shaped curve. In a normal distribution, about 68% of the scores are within 1 SD of the mean, around 95% are within 2 SDs and 99.7% are within 3 SDs [Figure 2].
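The 68–95–99.7 rule can be verified numerically: for a normal distribution, the fraction within k SDs of the mean equals erf(k/√2), which the standard library provides.

```python
import math

def within_k_sd(k):
    """Fraction of a normal distribution lying within k SDs of the mean.

    For a standard normal Z, P(|Z| < k) = erf(k / sqrt(2)).
    """
    return math.erf(k / math.sqrt(2))

print(f"within 1 SD: {within_k_sd(1):.4f}")  # ~0.6827
print(f"within 2 SD: {within_k_sd(2):.4f}")  # ~0.9545
print(f"within 3 SD: {within_k_sd(3):.4f}")  # ~0.9973
```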
A skewed distribution is one with asymmetry of the variables about its mean. In a negatively skewed distribution [Figure 3], the mass of the distribution is concentrated on the right of the figure, leading to a longer left tail. In a positively skewed distribution [Figure 3], the mass of the distribution is concentrated on the left of the figure, leading to a longer right tail.
Curves showing negatively skewed and positively skewed distribution
In inferential statistics, data from a sample are analysed to make inferences about the larger population from which the sample was drawn. The purpose is to answer or test hypotheses. A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. Hypothesis tests are thus procedures for making rational decisions about the reality of observed effects.
Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty).
In inferential statistics, the term ‘null hypothesis’ (H0 ‘H-naught,’ ‘H-null’) denotes that there is no relationship (difference) between the population variables in question.
The alternative hypothesis (H1 or Ha) denotes that the proposed relationship (difference) between the variables is expected to be true.
The P value (or the calculated probability) is the probability of obtaining the observed (or a more extreme) result by chance if the null hypothesis is true. The P value is a number between 0 and 1 and is interpreted by researchers in deciding whether to reject or retain the null hypothesis [Table 3].
P values with interpretation
If the P value is less than the arbitrarily chosen value (known as α or the significance level), the null hypothesis (H0) is rejected [Table 4]. However, if the null hypothesis (H0) is incorrectly rejected when it is actually true, this is known as a Type I error. Further details regarding alpha error, beta error and sample size calculation and the factors influencing them are dealt with in another section of this issue by Das S et al.
Illustration for null hypothesis
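The decision rule is mechanical once α is fixed; a minimal sketch (the P values below are invented examples, not results from any study):

```python
def decide(p_value, alpha=0.05):
    """Reject H0 only when the P value falls below the significance level alpha.

    Alpha is the researcher's accepted Type I error rate, chosen in advance.
    """
    return "reject H0" if p_value < alpha else "fail to reject H0"

decide(0.003)  # strong evidence against H0  -> "reject H0"
decide(0.27)   # insufficient evidence       -> "fail to reject H0"
```

Because the comparison is strict, a P value exactly equal to α does not lead to rejection under this convention.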
PARAMETRIC AND NON-PARAMETRIC TESTS
Numerical data (quantitative variables) that are normally distributed are analysed with parametric tests.
The two most basic prerequisites for parametric statistical analysis are:
The assumption of normality which specifies that the means of the sample group are normally distributed
The assumption of equal variance which specifies that the variances of the samples and of their corresponding population are equal.
However, if the distribution of the sample is skewed towards one side or the distribution is unknown due to the small sample size, non-parametric statistical techniques are used. Non-parametric tests are used to analyse ordinal and categorical data.
The parametric tests assume that the data are on a quantitative (numerical) scale, with a normal distribution of the underlying population. The samples have the same variance (homogeneity of variances). The samples are randomly drawn from the population, and the observations within a group are independent of each other. The commonly used parametric tests are the Student's t-test, analysis of variance (ANOVA) and repeated measures ANOVA.
Student's t-test is used to test the null hypothesis that there is no difference between the means of the two groups. It is used in three circumstances:
To test if a sample mean (as an estimate of a population mean) differs significantly from a given population mean (the one-sample t-test). The formula for the one-sample t-test is
t = (X − μ) / SE
where X = sample mean, μ = population mean and SE = standard error of the mean.
To test if the population means estimated by two independent samples differ significantly (the unpaired t-test). The formula for the unpaired t-test is:
t = (X1 − X2) / SE
where X1 − X2 is the difference between the means of the two groups and SE denotes the standard error of the difference.
To test if the population means estimated by two dependent samples differ significantly (the paired t-test). A usual setting for paired t-test is when measurements are made on the same subjects before and after a treatment.
The formula for the paired t-test is:
t = d̄ / SE
where d̄ is the mean difference and SE denotes the standard error of this difference.
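The three t statistics can be sketched in pure Python directly from the formulas in the text. This is a minimal illustration, not a replacement for a statistics library: it returns only the t statistic (not the P value), and the unpaired version assumes equal variances and uses the pooled standard error.

```python
import math

def se_mean(sample):
    """Standard error of the mean: s / sqrt(n)."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    return s / math.sqrt(n)

def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / SE of the mean."""
    return (sum(sample) / len(sample) - mu0) / se_mean(sample)

def unpaired_t(x, y):
    """t = difference of means / SE of that difference (pooled variance)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sp2 = (sum((v - mx) ** 2 for v in x) +
           sum((v - my) ** 2 for v in y)) / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def paired_t(before, after):
    """One-sample t on the within-subject differences, tested against zero."""
    d = [a - b for a, b in zip(after, before)]
    return one_sample_t(d, 0)
```

Note that the paired t-test reduces to a one-sample t-test on the differences, which is exactly why it handles before/after measurements on the same subjects.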
The group variances can be compared using the F-test. The F-test is the ratio of variances (F = var1 / var2). If F differs significantly from 1.0, then it is concluded that the group variances differ significantly.
Analysis of variance
The Student's t-test cannot be used for comparison of three or more groups. The purpose of ANOVA is to test if there is any significant difference between the means of two or more groups.
In ANOVA, we study two variances – (a) between-group variability and (b) within-group variability. The within-group variability (error variance) is the variation that cannot be accounted for in the study design. It is based on random differences present in our samples.
However, the between-group (or effect variance) is the result of our treatment. These two estimates of variances are compared using the F-test.
A simplified formula for the F statistic is:
F = MSb / MSw
where MSb is the mean squares between the groups and MSw is the mean squares within groups.
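The F ratio for a one-way ANOVA can be computed from first principles, following the between-group and within-group decomposition described above (a sketch for balanced or unbalanced groups; a real analysis would also compare F against the F distribution for a P value):

```python
def one_way_anova_f(*groups):
    """F = MSb / MSw for a one-way ANOVA over two or more groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(x for g in groups for x in g) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: treatment effect
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares: error variance
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)          # df = k - 1
    ms_within = ss_within / (n_total - k)      # df = N - k
    return ms_between / ms_within
```

A large F means the between-group variability dominates the within-group (error) variability, suggesting a real treatment effect.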
Repeated measures analysis of variance
As with ANOVA, repeated measures ANOVA analyses the equality of means of three or more groups. However, a repeated measures ANOVA is used when all members of a sample are measured under different conditions or at different points in time.
As the variables are measured from a sample at different points of time, the measurement of the dependent variable is repeated. Using a standard ANOVA in this case is not appropriate because it fails to model the correlation between the repeated measures: The data violate the ANOVA assumption of independence. Hence, in the measurement of repeated dependent variables, repeated measures ANOVA should be used.
When the assumptions of normality are not met and the sample means are not normally distributed, parametric tests can lead to erroneous results. Non-parametric tests (distribution-free tests) are used in such situations as they do not require the normality assumption. Non-parametric tests may fail to detect a significant difference when compared with a parametric test; that is, they usually have less power.
As is done for the parametric tests, the test statistic is compared with known values for the sampling distribution of that statistic and the null hypothesis is rejected or retained. The types of non-parametric analysis techniques and the corresponding parametric analysis techniques are delineated in Table 5.
Analogue of parametric and non-parametric tests
Median test for one sample: The sign test and Wilcoxon's signed rank test
The sign test and Wilcoxon's signed rank test are used for median tests of one sample. These tests examine whether the median of a single sample differs from a reference value.
The sign test examines a hypothesis about the median θ0 of a population, testing the null hypothesis H0: θ = θ0. When the observed value (Xi) is greater than the reference value (θ0), it is marked with a + sign. If the observed value is smaller than the reference value, it is marked with a − sign. If the observed value is equal to the reference value (θ0), it is eliminated from the sample.
If the null hypothesis is true, there will be an equal number of + signs and − signs.
The sign test ignores the actual values of the data and only uses + or − signs. Therefore, it is useful when it is difficult to measure the values.
Wilcoxon's signed rank test
A major limitation of the sign test is that we lose the quantitative information in the data and merely use the + or − signs. Wilcoxon's signed rank test not only examines the observed values in comparison with θ0 but also takes the relative sizes of the differences into consideration, adding more statistical power to the test. As in the sign test, if there is an observed value that is equal to the reference value θ0, this observed value is eliminated from the sample.
Wilcoxon's rank sum test ranks all data points in order, calculates the rank sum of each sample and compares the difference in the rank sums.
It is used to test the null hypothesis that two samples have the same median or, alternatively, whether observations in one sample tend to be larger than observations in the other.
The Mann–Whitney test compares all data (xi) belonging to the X group with all data (yi) belonging to the Y group and calculates the probability of xi being greater than yi: P (xi > yi). The null hypothesis states that P (xi > yi) = P (xi < yi) = 1/2, while the alternative hypothesis states that P (xi > yi) ≠ 1/2.
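The pairwise comparison that defines the Mann–Whitney U statistic can be written out directly (a sketch computing only the statistic; real use would convert U to a P value via tables or a normal approximation):

```python
def mann_whitney_u(x, y):
    """U for group x: count of (xi, yj) pairs with xi > yj; ties count 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Under H0, U is expected to be near len(x) * len(y) / 2,
# i.e. P(xi > yj) = 1/2 as stated above.
```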
The two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test whether two random samples are drawn from the same distribution. The null hypothesis of the KS test is that both distributions are identical. The statistic of the KS test is a distance between the two empirical distributions, computed as the maximum absolute difference between their cumulative curves.
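The KS statistic itself is easy to compute from the definition above: evaluate both empirical cumulative curves and take the maximum absolute gap (a sketch of the statistic only; the significance threshold depends on the sample sizes).

```python
import bisect

def ks_statistic(x, y):
    """Two-sample KS statistic: max absolute gap between empirical CDFs."""
    sx, sy = sorted(x), sorted(y)

    def ecdf(sorted_sample, t):
        # Fraction of observations <= t
        return bisect.bisect_right(sorted_sample, t) / len(sorted_sample)

    # The maximum gap is attained at one of the observed data points
    return max(abs(ecdf(sx, t) - ecdf(sy, t)) for t in sorted(set(x) | set(y)))
```

Identical samples give a distance of 0; completely non-overlapping samples give the maximum distance of 1.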
The Kruskal–Wallis test is the non-parametric analogue of analysis of variance. It analyses whether there is any difference in the median values of three or more independent samples. The data values are ranked in increasing order, the rank sums are calculated and the test statistic is then computed.
In contrast to the Kruskal–Wallis test, the Jonckheere test assumes an a priori ordering of the groups, which gives it more statistical power than the Kruskal–Wallis test.
The Friedman test is a non-parametric test for testing the difference between several related samples. It is an alternative to repeated measures ANOVA, used when the same parameter has been measured under different conditions on the same subjects.
Tests to analyse the categorical data
Chi-square test, Fisher's exact test and McNemar's test are used to analyse categorical or nominal variables. The Chi-square test compares frequencies and tests whether the observed data differ significantly from the expected data if there were no differences between groups (i.e., under the null hypothesis). It is calculated as the sum of the squared difference between the observed (O) and the expected (E) data (the deviation, d) divided by the expected data, by the following formula:
χ² = Σ (O − E)² / E = Σ d² / E
A Yates correction factor is used when the sample size is small. Fisher's exact test is used to determine if there are non-random associations between two categorical variables. It does not assume random sampling, and instead of referring a calculated statistic to a sampling distribution, it calculates an exact probability. McNemar's test is used for paired nominal data. It is applied to a 2 × 2 table with paired-dependent samples and is used to determine whether the row and column frequencies are equal (that is, whether there is ‘marginal homogeneity’). The null hypothesis is that the paired proportions are equal. The Mantel–Haenszel Chi-square test is a multivariate test, as it analyses multiple grouping variables. It stratifies according to the nominated confounding variables and identifies any that affect the primary outcome variable. If the outcome variable is dichotomous, then logistic regression is used.
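The Chi-square statistic follows directly from the Σ(O − E)²/E formula once the expected counts are derived from the table margins. A minimal sketch for a 2 × 2 table with invented counts (the statistic only; the P value would come from the Chi-square distribution with 1 degree of freedom):

```python
def expected_2x2(table):
    """Expected counts for a 2x2 table [[a, b], [c, d]] under independence:
    E(cell) = row total * column total / grand total."""
    (a, b), (c, d) = table
    total = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return [rows[i] * cols[j] / total for i in range(2) for j in range(2)]

def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical 2x2 counts (e.g., treatment vs outcome), flattened a, b, c, d
obs = [30, 10, 20, 40]
exp = expected_2x2([[30, 10], [20, 40]])
stat = chi_square(obs, exp)
```

With small expected counts (conventionally any cell below 5), the Yates correction or Fisher's exact test mentioned above would be preferred to this uncorrected statistic.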