Monday, June 12, 2017

Assumptions of linear regression

There are some assumptions that OLS (ordinary least squares), or linear regression, makes about the data, and if these assumptions are not satisfied, the results of the linear regression are not valid. In this post, we will look into the assumptions made by linear regression and how R helps us check whether these assumptions are met by our model.

We will use the same model that we created in the last post between "female literacy rate" and "age at first marriage in females".

Assumptions of linear model

The assumptions are about the data, but in OLS we typically talk about them in terms of the residuals produced by our model.

1. Normality of residuals: Linear regression assumes that the residuals are normally distributed.

2. Linearity of residuals: The relation between our predictor and the response variable must be linear. In other words, the points (xi, yi) fall along an almost straight line. If there exists a non-linear relation between the response and the predictor, it will not be captured by our LINEAR model, and it leaks into the residuals in the form of patterns.

3. Homoscedasticity of residuals, or homogeneous variance of y over the entire range of x: the spread of the residuals should be roughly constant across all values of the predictor (and hence across all fitted values).

4. Independence of data: This assumption requires that all observations in the data frame be independent of each other. As an example, let's say we regress the health of cows on the food eaten by them and generate a fitted regression line. The regression line represents the relation between health of cows (response variable) and food eaten (predictor variable).
If our target population is quite narrow, say a tiny town, the chances are high that all the cows eat the same kind of food. The observations so collected are not independent of each other and thus violate this assumption.

Let's learn about these assumptions in a little more depth and see how R helps us figure out violations.

To recall, our model has the following specification:
p2 <- lm(female_litrecy_rate ~ Age, data = litrecy_and_ageatmarriage_grouped_by_Country)


Let's plot the residuals for the above model:

par(mfrow=c(2,2)) ## This will arrange the plots in 2*2 matrix

plot(p2)

The above command gives the residual plots shown in Figure 1, below.
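Since the literacy dataset itself isn't included in this post, here is a reproducible sketch of the same workflow on simulated data (the variable names and numbers below are made up for illustration):

```r
# Simulate a roughly linear relationship, fit a model, and draw
# the same four diagnostic plots that plot() produces for an lm object.
set.seed(42)
Age <- runif(100, 18, 35)                        # hypothetical ages at first marriage
literacy <- 20 + 2.5 * Age + rnorm(100, sd = 5)  # linear response plus normal noise
m <- lm(literacy ~ Age)

par(mfrow = c(2, 2))  # arrange the plots in a 2x2 matrix
plot(m)               # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
```

Because the noise here is normal and the relation is truly linear, these diagnostic plots show what "assumptions satisfied" looks like.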


1. Check the "normality of residuals" assumption: This assumption is checked using the top-right graph. Also called a probability plot or Q-Q (quantile-quantile) plot, it compares the residuals to ideal normal observations. Assuming our observations come from a normal distribution with mean μ and standard deviation δ, the standardized data satisfy

(yi − μ)/δ ≈ qi  ⇒  yi ≈ qi·δ + μ,

where qi is taken from the theoretical standard normal distribution.
This graph should be an approximately straight line. As the Q-Q plot of our model p2 looks almost straight, it conforms to the normality assumption.
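As a numeric complement to eyeballing the Q-Q plot, base R's stats package offers the Shapiro-Wilk test, which tests the null hypothesis that a sample is normally distributed. The model below is a stand-in fitted on simulated data, not the actual p2:

```r
# Fit a model whose residuals are normal by construction,
# then run the Shapiro-Wilk normality test on them.
set.seed(1)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)   # normal noise, so residuals should pass
fit <- lm(y ~ x)

shapiro.test(residuals(fit))  # a large p-value means no evidence against normality
```

A small p-value here would be a red flag that the normality assumption is violated.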

2. Check the "linearity of residuals" and "homoscedasticity" assumptions: These assumptions are checked using the top-left plot, which has the fitted (predicted) values on the x-axis and the residuals on the y-axis.
If the data points are scattered in a curve or parabola, it indicates a non-linear relationship between the predictor (independent variable) and the response (dependent variable). In our plot, the data points don't seem to follow any specific pattern; rather, the spread is random. This randomness in the spread of the data confirms the linear relationship between the predictor, i.e. age at first marriage of females, and the response variable, i.e. female literacy rate.
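To see what a violation looks like, here is a small simulated example (purely for demonstration) where the true relation is quadratic but we deliberately fit a straight line:

```r
# A mis-specified linear fit to a curved relation leaves a clear
# pattern in the residuals instead of random scatter.
set.seed(7)
x <- seq(-3, 3, length.out = 120)
y <- x^2 + rnorm(120, sd = 0.5)    # truly quadratic relation
fit <- lm(y ~ x)                   # linear model cannot capture the curvature

cor(residuals(fit), x^2)           # strongly positive: curvature leaked into residuals
plot(fitted(fit), residuals(fit))  # shows a parabola, not a random cloud
```

The strong correlation between the residuals and the squared predictor is exactly the "pattern leaking into residuals" described above.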

Now, to validate the assumption of homoscedasticity, we look for roughly the same deviation of each data point from the line at zero. The data points in the center of our plot seem to deviate more than the ones towards the ends. This indicates slight heteroscedasticity (unequal variance) in our data.
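A rough base-R way to quantify this, on simulated data for illustration, is to split the residuals by low versus high fitted values and compare their spread (formal alternatives such as the Breusch-Pagan test live in add-on packages like lmtest):

```r
# Simulate data whose noise grows with x (heteroscedastic on purpose),
# then compare residual spread in the low and high halves of the fit.
set.seed(11)
x <- runif(200, 1, 10)
y <- 2 * x + rnorm(200, sd = x)    # noise standard deviation grows with x
fit <- lm(y ~ x)

low  <- residuals(fit)[fitted(fit) <  median(fitted(fit))]
high <- residuals(fit)[fitted(fit) >= median(fitted(fit))]
c(sd_low = sd(low), sd_high = sd(high))  # a large gap signals unequal variance
```

For a homoscedastic model the two standard deviations would be close; here the high-fitted half is markedly more spread out.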

3. Check the "independence of observations" assumption: This assumption is not checked using any plot; instead, it is fulfilled during the research design (in simple terms, the way the data has been collected).

