## Friday, 23 March 2012

### Four important metrics to be reported in any research finding

Any research work or study should report four essential metrics:

(1) Statistical Significance (p value or α level): A result is termed "statistically significant" if it is unlikely to have occurred by chance when the null hypothesis is true. The following six common misperceptions about the meaning of statistical significance should be clearly noted; each is followed by its correction.
• Misperception: it means the result is important. Correction: the effect size conveys importance. A significant result may not be socially important, and a change in sample size alone can change whether a result reaches significance.
• Misperception: it informs the likelihood that the given result is replicable. Correction: replicability is at best informed by cross-validation, jackknifing, bootstrapping and similar methods.
• Misperception: it informs the likelihood that the given results are due to chance. Correction: it gives the probability of results this extreme occurring by chance in the long run, under the null hypothesis and with random sampling. It provides no basis for a conclusion about the probability that the given result is attributable to chance. For example, r = 0.40 significant at the 5% level means that, if the population correlation were zero and the sample were representative, a sample correlation at least this large would turn up in fewer than 5% of samples. It would be inappropriate to conclude that there is a 95% likelihood that the correlation is 0.40 in the population, or that there is only a 5% likelihood that the result is due to chance.
• Misperception: it informs the likelihood that the sample is representative of the population. Correction: the only way to make a sample representative is to select it carefully. Statistical significance only tells us, assuming the sample represents the population, how likely the obtained result is under the null hypothesis.
• Misperception: it is the best way to evaluate statistical results. Correction: it is only one of the essential metrics.
• Misperception: statistically significant reliability and validity coefficients will yield reliable and valid scores with a different sample. Correction: reliability and validity coefficients are characteristics of the test scores, not of the test per se. Significance tests on these coefficients say little, because their outcome depends on the sample size and offers no assurance that the coefficients generalize to other samples.
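In short, statistical significance only tells us how likely the obtained result (or a more extreme one) would be if the null hypothesis were true in the population. It does not tell us whether the null hypothesis itself is true or false.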
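
The long-run reading of the p value can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the function names, the sample size of 30, the number of trials and the seed are all made up for this example, and a z-test with known σ = 1 is used for simplicity. Every sample is drawn from a population in which the null hypothesis is true, so the share of "significant" results should sit near α.

```python
import random
from statistics import NormalDist


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: population mean = 0, sigma known."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n=30, trials=4000, alpha=0.05, seed=0):
    """Share of tests that come out significant when H0 is actually true."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    hits = 0
    for _ in range(trials):
        # Every sample comes from N(0, 1), so the null hypothesis holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("share significant under a true null:", false_positive_rate())
```

The printed share describes the procedure over many samples; it says nothing about whether any single significant result is "due to chance".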

(2) Effect size: It is the magnitude of the result. Researchers estimate effect sizes by observing representative samples, so what a study reports is an effect size estimate. The sign of the effect size reveals the direction of the effect. No knowledge of statistical significance is necessary to find the effect size. The effect size does not determine the significance level, or vice versa. With a sufficiently large sample, a statistical result will always be significant unless the population effect size is exactly zero. Without an estimate of the effect size, no meaningful interpretation can take place. Effect sizes matter even more when different studies of the same effect are compared quantitatively, as in meta-analysis. For example, r = 0.40 gives the magnitude of the relationship between two variables. The primary result of a research process is one or more measures of effect size, not p values. There are more than a hundred effect size metrics, divided broadly into two groups: the d-family (Cohen's d, odds ratio, etc.) and the r-family (r, R², etc.).
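
To make the two families concrete, here is a minimal sketch that computes one metric from each: Cohen's d (with a pooled standard deviation) and Pearson's r. The function names and the tiny data sets are hypothetical, invented only for illustration; note that no p value is needed at any point.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """d-family: standardized mean difference using the pooled sample SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def pearson_r(x, y):
    """r-family: Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical scores, not taken from any real study.
    treated = [5, 6, 7, 8, 9]
    control = [3, 4, 5, 6, 7]
    print("Cohen's d:", round(cohens_d(treated, control), 3))

    hours = [1, 2, 3, 4, 5]
    marks = [2, 4, 5, 4, 5]
    print("Pearson r:", round(pearson_r(hours, marks), 3))
```

The sign of each value gives the direction of the effect and its size gives the magnitude.
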
• Example 1: Consider two studies measuring the relationship between class marks and intelligence score.
• Study 1: N = 20, r = 0.44, p > .05
• Study 2: N = 21, r = 0.44, p < .05

The non-significance (i.e. p > .05) of Study 1 may lead the researcher to conclude, erroneously, that there is no relation between the two variables, whereas Study 2 says that intelligence and class marks are positively correlated. Yet both studies give identical estimates of the effect size. The conclusion drawn from the first study ignores the effect size and looks only at the p value. A statistically non-significant result does not mean no effect; it only means that the study is inconclusive and may have lacked the statistical power to detect the effect. Looking carefully, the second study has one more observation, which may have pushed its statistical power just past the threshold needed to detect the effect. When we see this kind of inconsistency, we should look into the design of the studies. So a non-significant result does not mean there is no effect, and a significant result does not necessarily mean that the effect is real: when the null hypothesis is true, about five in a hundred tests at the 5% level will come out significant by chance alone.
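
The dependence of the p value on sample size, for one and the same effect size, can be sketched numerically. The snippet below holds r = 0.44 fixed and varies N. It is an approximation under stated assumptions: it uses the Fisher z transformation with a normal reference distribution rather than the exact t-test, and the function name `approx_p_for_r` and the list of sample sizes are made up for this illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_p_for_r(r, n):
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z method)."""
    # atanh(r) is Fisher's z; its standard error is 1 / sqrt(n - 3).
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Same effect size throughout; only the sample size changes.
    for n in (10, 20, 21, 50, 100):
        p = approx_p_for_r(0.44, n)
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"N = {n:3d}  r = 0.44  p = {p:.4f}  ({verdict} at 5%)")
```

The effect size estimate never changes across the rows; only the verdict does, which is why the p value alone cannot be the basis for interpretation.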

(3) Confidence Interval (CI): It is an interval estimate (as opposed to a point estimate) of a population parameter, used to indicate the reliability of the estimate. The interval differs from sample to sample, and how frequently it contains the population parameter depends on the confidence level. Common choices for the confidence level (C) are 0.90, 0.95, and 0.99. For a 95% CI we set C = 0.95, which leaves an area of (1-C)/2 = 0.025 in each tail for a two-tailed test. The critical z value (z*) such that P(Z > z*) = 0.025 and P(Z < z*) = 1 - 0.025 = 0.975 is 1.96, assuming a normal distribution. The 95% CI = (x̄ - 1.96*σm, x̄ + 1.96*σm), where x̄ is the sample mean and σm is the standard error of the mean.

If we repeatedly draw samples and construct a 95% CI from each one, about 95% of those intervals will contain the population mean. This is often stated simplistically, but erroneously, as "the probability that this sample's CI contains the population parameter is 0.95". The population parameter is a fixed quantity, not a random variable, so any one computed CI either contains it or does not. The parameter does not jump from experiment to experiment; it is the CI that does.

Note that if we reject the null hypothesis at the 5% level (two-tailed), the 95% CI cannot contain the parameter value specified by the null hypothesis. Equivalently, if the CI contains that value, the null hypothesis cannot be rejected. A confidence interval that contains the value of no difference between treatments indicates that the treatment under investigation is not significantly different from the control.
• Example: If the sample mean is 100 and the standard error of the mean is 20, the 95% CI is (100 - 1.96*20, 100 + 1.96*20), i.e. (60.8, 139.2).
As the level of confidence decreases, the size of the corresponding interval will decrease. An increase in sample size will decrease the length of the confidence interval without reducing the level of confidence. This is because the standard error decreases as n increases.
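
The interval formula and the repeated-sampling interpretation can both be sketched in a few lines. This is an illustration under stated assumptions, not a general-purpose routine: it uses the normal critical value with a known standard error, and the helper names, the population values (mean 50, SD 10), the sample size, the number of trials and the seed are all invented for the example.

```python
import random
from statistics import NormalDist, mean


def normal_ci(sample_mean, std_error, confidence=0.95):
    """Normal-theory confidence interval (lower, upper) around a mean."""
    tail = (1 - confidence) / 2               # area left in each tail
    z_star = NormalDist().inv_cdf(1 - tail)   # about 1.96 when C = 0.95
    margin = z_star * std_error
    return sample_mean - margin, sample_mean + margin


def coverage(mu=50.0, sigma=10.0, n=25, confidence=0.95,
             trials=4000, seed=1):
    """Share of repeated-sample CIs that contain the fixed population mean."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    se = sigma / n ** 0.5      # standard error of the mean, sigma known
    hits = 0
    for _ in range(trials):
        xbar = mean(rng.gauss(mu, sigma) for _ in range(n))
        lower, upper = normal_ci(xbar, se, confidence)
        hits += lower <= mu <= upper
    return hits / trials


if __name__ == "__main__":
    lower, upper = normal_ci(100, 20)
    print("95% CI for mean 100, SE 20:", round(lower, 1), round(upper, 1))
    for c in (0.90, 0.95, 0.99):
        lo, hi = normal_ci(100, 20, c)
        print(f"C = {c}: width = {hi - lo:.1f}")
    print("share of intervals covering the mean:", coverage())
```

In the simulation the population mean stays fixed at 50 while the interval moves from sample to sample, which is the sense in which "95%" describes the procedure rather than any single interval.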

(4) Statistical Power (1 - β): The power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis. This is numerically equal to 1 - β, where β is the probability of making a Type II error. Statistical power depends on the size of the effect and the sample size; besides these, the chosen α level also affects the power calculation. Given these values, a power analysis is conducted either to determine the chance of getting a statistically significant result or to calculate the minimum sample size required to detect the effect. Bigger effects are easier to detect than smaller effects, and large samples give better test sensitivity than small samples. So statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected.

In an underpowered study genuine effects can go undetected, leading to a Type II error, whereas in an overpowered study even trivially small effects become statistically significant. Studies should be designed so that they have an 80% probability of detecting an effect when there is an effect there to be detected, and therefore no more than a 20% probability of making a Type II error (recall that power = 1 - β). With α = 0.05 and β = 0.20 this keeps a reasonable balance between the two error rates: Type I errors are treated as four times as serious as Type II errors. Alpha and beta levels are related: for a given effect size and sample size, as one goes up the other goes down. The three ways to increase statistical power are to search for bigger effects, to increase the sample size, and to relax (increase) the α significance level. When a statistically non-significant result is erroneously interpreted as evidence of no effect, and there really is an effect, a Type II error is said to have been committed.
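
The relationships among effect size, sample size, α and power can be sketched with a normal approximation. The example below assumes a two-sided, two-sample comparison of means with equal group sizes and an effect expressed as Cohen's d. The function names are hypothetical, and an exact calculation based on the t distribution would give slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for effect size d."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability that the statistic falls in either rejection region.
    return _STD.cdf(shift - z_crit) + _STD.cdf(-shift - z_crit)


def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Smallest n per group reaching the target power (normal approximation,
    ignoring the negligible rejection region on the opposite side)."""
    z_alpha = _STD.inv_cdf(1 - alpha / 2)
    z_beta = _STD.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


if __name__ == "__main__":
    print("d = 0.5, n = 64 per group:", round(power_two_sample(0.5, 64), 3))
    print("d = 0.2, n = 64 per group:", round(power_two_sample(0.2, 64), 3))
    print("d = 0.5, n = 64, alpha = 0.01:",
          round(power_two_sample(0.5, 64, alpha=0.01), 3))
    print("n per group for 80% power at d = 0.5:", n_per_group_for_power(0.5))
```

Comparing the printed lines shows the three levers named above: a bigger effect, a larger sample, or a more lenient α each raises power, while tightening α from 0.05 to 0.01 lowers it.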