Thursday, February 11, 2010

### Correlation and Regression Coefficients

Correlation coefficient:

It measures the closeness of the linear (or “straight line”) association between two continuous variables. The correlation coefficient is always a number between –1 and +1. It is zero if there is no linear association between the variables. The extreme values of +1 or –1 are obtained only if all the points in the scatter plot lie exactly on a straight line (rising for +1, falling for –1). The association is positive if the X and Y values tend to be high or low together (positive relationship). Conversely, the association is negative if high Y values tend to go with low X values (i.e., an inverse relationship).
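As a minimal sketch of how such a coefficient can be computed, the Python snippet below uses the usual Pearson formula (sum of cross-products divided by the square root of the product of the sums of squares). The function name `pearson_r` and the small data sets are made up for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_rising = pearson_r(x, [2, 4, 6, 8, 10])   # all points on a rising line
r_falling = pearson_r(x, [10, 8, 6, 4, 2])  # all points on a falling line
print(r_rising, r_falling)
```

With points lying exactly on a line, the result sits at one of the two extremes; real data will normally fall somewhere in between.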

Whether the correlation coefficient (i.e., the “r” value) is significantly different from zero can be tested. The significance depends on the size of the r value and the number of observations (n). The larger the absolute value of r, the stronger the association. A weak correlation may be statistically significant if the number of observations is large.
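To illustrate how both r and n enter the test, here is a small sketch. It assumes the standard t-test for a Pearson correlation, where t = r·√((n − 2)/(1 − r²)) with n − 2 degrees of freedom; the post does not name a specific test, so treat this choice as an assumption. The helper name `t_statistic` is hypothetical.

```python
import math

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# The same weak correlation (r = 0.3) with two different sample sizes
t_small = t_statistic(0.3, 10)    # 8 degrees of freedom
t_large = t_statistic(0.3, 100)   # 98 degrees of freedom

# Two-sided 5% critical values from a standard t table:
# 2.306 for 8 degrees of freedom, about 1.98 for 98 degrees of freedom.
print(t_small > 2.306)  # False: not significant with 10 observations
print(t_large > 1.99)   # True: significant with 100 observations
```

The r value is identical in both cases; only the number of observations changes the verdict.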

Sometimes, the r value may be artificially low (if the relationship between the two variables is curved) or artificially high (due to a few extreme observations). For this reason, it is desirable to draw a scatter plot of the data before drawing conclusions on the significance or importance of the correlation coefficient value.
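Both pitfalls can be reproduced with a few invented numbers. The sketch below is illustrative only: the data are made up, and `pearson_r` is a hypothetical helper implementing the usual Pearson formula.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# 1) A perfect but curved relationship (y = x squared): r is zero
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x_curve]
r_curved = pearson_r(x_curve, y_curve)

# 2) Five points with no linear trend, then the same points plus one extreme value
x_base = [1, 2, 3, 4, 5]
y_base = [2, 1, 3, 1, 2]
r_base = pearson_r(x_base, y_base)
r_with_outlier = pearson_r(x_base + [20], y_base + [20])

print(r_curved, r_base, r_with_outlier)
```

In the first case y is completely determined by x, yet r reports no linear association; in the second, a single extreme point pushes r close to 1. A scatter plot would reveal both situations at a glance.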

It should be remembered that a correlation between two variables does not necessarily suggest a “cause and effect” relationship. Correlation tests are normally used for forming a hypothesis or suggesting areas of further research.

This test
• assesses the strength of association between two variables
• is suitable for assessing the linear correlation only
• requires x and y variables to be normally distributed
• requires a scatter plot to be made for visual assessment of the linearity (to rule out a curved relationship)
• does not indicate cause and effect
• is used to form a hypothesis rather than to test it.

Simple regression analysis:

It gives the equation of the straight line and enables prediction of the value of one variable from the other. Normally, the dependent variable is plotted on the Y axis and the independent variable on the X axis. There are 3 major assumptions. First, for any value of x, the values of y are normally distributed. Second, the variability of y should be the same for each value of x. Third, the relationship between the two variables is linear.

The equation of a regression line is: “y = a + bx” where ‘a’ is the intercept, ‘b’ is the slope, ‘x’ is the independent variable and ‘y’ is the dependent variable. The slope ‘b’ is sometimes called the regression coefficient and it has the same sign as the correlation coefficient (i.e., ‘r’).
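A minimal sketch of how ‘a’ and ‘b’ can be obtained, assuming the ordinary least-squares fit (the post does not say how the line is fitted, so this is an assumption). The function name `fit_line` and the data are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    b = sxy / sxx            # slope (regression coefficient)
    a = mean_y - b * mean_x  # intercept: the line passes through the two means
    return a, b

x = [1, 2, 3, 4, 5]
a_up, b_up = fit_line(x, [3, 5, 7, 9, 11])      # points on y = 1 + 2x
a_down, b_down = fit_line(x, [11, 9, 7, 5, 3])  # a falling line
print(a_up, b_up, a_down, b_down)
```

The slope and the correlation coefficient share the same numerator (the sum of cross-products), and their denominators are always positive, which is why the two always have the same sign.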

The above equation can be used for predicting the ‘y’ variable from the ‘x’ variable. Some improper uses of the equation are: predicting a ‘y’ value for an ‘x’ outside the range of the original data set (i.e., extrapolation), fitting a straight line when the data show curvature, predicting the ‘x’ value from ‘y’, and using simple regression where there are heterogeneous subgroups.
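One way to guard against extrapolation in practice is to refuse predictions outside the observed range of x. The sketch below is only one possible design; the function name `predict_within_range`, the coefficients and the data are hypothetical.

```python
def predict_within_range(a, b, x_observed, x_new):
    """Predict y = a + b*x_new, but only inside the observed range of x."""
    lowest, highest = min(x_observed), max(x_observed)
    if not lowest <= x_new <= highest:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lowest}, {highest}]"
        )
    return a + b * x_new

x_observed = [1, 2, 3, 4, 5]
a, b = 1.0, 2.0  # assumed to come from a line fitted to x_observed

y_inside = predict_within_range(a, b, x_observed, 3.5)  # interpolation is allowed

try:
    predict_within_range(a, b, x_observed, 10)  # extrapolation is rejected
except ValueError as err:
    print(err)
```

Raising an error is a deliberately strict choice; a warning could be used instead when cautious extrapolation is acceptable.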

This test
• is used to estimate a dependence relationship
• is used to predict one variable (dependent) from another (independent) within a range
• is suitable if the relationship is linear
• requires the y values to be normally distributed for each value of x
• requires the variability of y to be similar for all values of x
• is not suitable where there are heterogeneous subgroup(s)