Monday, June 12, 2017

Assumptions of linear regression

There are some assumptions that OLS (ordinary least squares), or linear regression, makes about the data, and if these assumptions are not satisfied, the results of the linear regression are not valid. In this post, we will look into the assumptions made by linear regression and how R helps us check whether these assumptions are met by our model.

We will use the same model that we created in the last post between "female literacy rate" and "age at first marriage in females".

Assumptions of linear model

The assumptions are about the data, but in OLS we typically talk about them in terms of the residuals produced by our model.

1. Normality of residuals: Linear regression assumes that the residuals are normally distributed.

2. Linearity of residuals: The relation between our predictor and the response variable must be linear. In other words, the points (xi, yi) fall on an almost straight line. If there exists a non-linear relation between the response and the predictor, it will not be captured by our LINEAR model, and it leaks into the residuals in the form of patterns.

3. Homoscedasticity of residuals, or homogeneous variance of y over the entire range of x: The spread of the residuals should stay roughly constant over the whole range of fitted values rather than growing or shrinking with x.

4. Independence of data: This assumption requires that all observations in the data frame be independent of each other. As an example, let's say we regress the health of cows on the food eaten by them and generate a fitted regression line. The regression line represents the relation between health of cows (response variable) and food eaten (predictor variable).
In case our target population is quite narrow, say a tiny town, the chances are high that all the cows eat the same kind of food. The observations so collected are not independent of each other and thus violate this assumption.

Let's learn about these assumptions in a little more depth and see how R helps us figure out violations.

To recall, our model has the following specification:
p2<-lm(data=litrecy_and_ageatmarriage_grouped_by_Country,female_litrecy_rate~Age )


Let's plot residuals for the above model:

par(mfrow=c(2,2)) ## This will arrange the plots in 2*2 matrix

plot(p2)

The above gives the residual plots shown in figure 1 below.


1.Check "normality of residuals" assumptions: This assumption is checked using top right graph.Also called probability graph or Q-Q graph or quantile-quantile graph, it compares residuals to ideal normal observation.Assuming our observation comes from normal distribution with mean μ and standard deviation δ, the standardized data yi - μ  ≈ qi  ⇒ yi=qδ + μ
                                                                                           δ
where qi is taken from theoretical standard normal data.
This graph should be an approximate straight line.
As the Q-Q plot of our model p2 looks like an almost straight line, it conforms to the normality assumption.
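The same comparison can be drawn directly with base R's qqnorm/qqline on any fitted model. A minimal sketch on simulated data (the gapminder data frame from the post isn't reproduced here, so x, y and fit are hypothetical stand-ins):

```r
set.seed(42)
x <- runif(53, 15, 35)                 # hypothetical predictor, e.g. age
y <- -80 + 6 * x + rnorm(53, sd = 24)  # linear signal plus normal noise
fit <- lm(y ~ x)

qqnorm(resid(fit))  # sample quantiles vs theoretical normal quantiles
qqline(resid(fit))  # reference line; points hugging it support normality
```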

2.Check "linearity of residuals" and "homoscedasticity" assumption: These assumptions are checked using top-left plot.The graph plots fitted values(predicted values) on x-axis and residuals on y-axis.
If the data points are scattered around in curve or parabola, it indicates a non-linear relation ship between the predictor or independent variable  and response variable or dependent variable.In our plot, the data points don't seem to follow any specific pattern, rather the spread is random.This randomness in the spread of the data conforms the linear relationship between predictor i.e age at first marriage of females and the response variable i.e. female literacy rate.

Now, to validate the assumption of homoscedasticity, we look for roughly the same deviation of each data point from the line at zero. The data points in the center of our plot seem to have more deviation than the ones towards the ends. This indicates slight heteroscedasticity (unequal variance) in our data.
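A sketch of this diagnostic by hand on simulated data (hypothetical x and y; the noise here is deliberately made to grow with x, so the plot shows the fan shape that signals heteroscedasticity):

```r
set.seed(1)
x <- runif(100, 15, 35)
y <- -80 + 6 * x + rnorm(100, sd = 0.5 * x)  # variance grows with x
fit <- lm(y ~ x)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # a fan/funnel shape around this line = unequal variance
```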

3.Check "assumption of independence of observations" assumption: This assumption is not checked using any plot, instead fulfilled during the research design (in simple terms, the way the data has been collected).

Simple regression model

A linear regression model that involves one predictor and one response variable is called a simple regression model. This statistical method allows us to understand the relationship between two variables in greater depth and thereby helps us make models for predictions. The predictor variable on the x-axis is also called the independent variable, and the response variable on the y-axis is also called the dependent variable.

In this post, we are going to use the same data as we had used for our previous post-Bivariate analysis of female literacy rate and age at first marriage.

##Data frame litrecy_and_ageatmarriage contains our data.

p2<-lm(data=litrecy_and_ageatmarriage,female_litrecy_rate~Age )
summary(p2)
In the above code, lm is the R function for a linear model, and we assign the result to p2. The predictor variable in our example is Age in years (age at first marriage in females) and the response variable is female literacy rate as a percentage of the general literate population. Let's check the details of p2 using summary(p2), which gives the following results:

Call:

lm(formula = female_litrecy_rate ~ Age, data = litrecy_and_ageatmarriage)

Residuals:

    Min      1Q  Median      3Q     Max

-47.545 -20.735  -3.359  21.140  52.330

Coefficients:

             Estimate Std. Error t value Pr(>|t|)   

(Intercept)   -79.49      26.11  -3.044  0.00368 **

Age             6.23       1.18   5.280 2.69e-06 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.85 on 51 degrees of freedom

Multiple R-squared:  0.3534, Adjusted R-squared:  0.3407

F-statistic: 27.87 on 1 and 51 DF,  p-value: 2.687e-06

Let's understand the output.


Residuals: Residuals are the differences between the observed values of the response variable and the values predicted by our model. In our example, it is the difference between the value of female literacy rate in our data frame (observed value) and the value predicted by our model. The simple regression model minimizes these differences, because a model that predicts values closer to our observed values will also predict better for unobserved values. We should look for an almost symmetric distribution of residuals around the mean. Any asymmetry indicates that some values predicted by our model are too high or too low, or that there might be something else going on that is not captured by our model.
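As a sketch (on simulated data, since the post's data frame isn't reproduced here), the residuals R reports are exactly observed minus fitted values:

```r
set.seed(7)
x <- runif(53, 15, 35)
y <- -80 + 6 * x + rnorm(53, sd = 24)
fit <- lm(y ~ x)

manual <- y - fitted(fit)      # residual = observed - fitted
all.equal(unname(resid(fit)), unname(manual))  # TRUE
summary(resid(fit))  # look for a roughly symmetric spread around 0
```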

Coefficient estimates: The simple regression line is the one which minimizes the residuals and is represented mathematically by y = b1x + b0. Here y is the response variable, female literacy rate in our case. The two coefficients of the simple regression line are:
1. b0, the intercept.
2. b1, the slope.

If we substitute the values in this equation from coefficient estimates in our output, we get
        y=-79.49 + 6.23x

This regression line has an intercept of -79.49 and a slope of 6.23.
Let's first understand the intercept.

The regression line passes through the point (x̄, ȳ), i.e. the mean of x and the mean of y. If the predictor (x) in simple linear regression is set to 0, the predicted value of y is equal to the intercept. (As a side note, the model with no predictors predicts the mean of y and is also called the intercept-only model.)
     y= -79.49 + 6.23(0)
     y=-79.49

In other words, the intercept is the estimated value of y when x equals zero. If, however, it is not possible for x to take the value zero, this estimated value of y has no meaning at all. Our data frame has no observation where age at first marriage in females is zero, therefore y = -79.49 has no meaning on its own. The intercept in our example just fixes the regression line on the y-axis.

Let's understand the slope.

The slope of a regression line gives the rate of change in the conditional mean of y when we change x by one unit. In our example, if we change the "age at first marriage of females" by one unit, the estimated change in the conditional mean of "female literacy rate" is 6.23. That is, if the "age at first marriage in females" is 20, the mean "female literacy rate" at x = 20 would be 6.23 percentage points more than if the "age at first marriage in females" were 19.
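This one-unit interpretation can be checked with plain arithmetic (b0 and b1 are the estimates taken from the summary output above, not re-fitted here):

```r
b0 <- -79.49   # reported intercept estimate
b1 <- 6.23     # reported slope estimate

pred_at_19 <- b0 + b1 * 19
pred_at_20 <- b0 + b1 * 20
pred_at_20 - pred_at_19  # = 6.23, i.e. exactly the slope
```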

Coefficient standard error: Our sample output gives point estimates of the slope of the true regression line of the population. But it is better to consider the standard error of the estimate as well, because if we ran the model again and again with different samples of the same size from the same population, the estimates would vary. The standard error measures how precisely the coefficients estimate the true population slope (or, for that matter, any population parameter). In other words, the standard error gives the amount of variation we might see in the coefficients if we drew the sample again.
For a large sample, roughly 95% of the time the true population parameter lies within 1.96 standard errors of the estimate. So, from our sample output, the slope coefficient should lie within 6.23 ± (1.96)(1.18).
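A rough normal-based sketch of that interval from the reported numbers (on the fitted model itself, confint(p2) would use the exact t quantile instead of 1.96):

```r
est <- 6.23   # reported slope estimate
se  <- 1.18   # reported standard error

c(lower = est - 1.96 * se,
  upper = est + 1.96 * se)  # roughly 3.92 to 8.54
```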

t-value: The t-value in the output is the value of the t-statistic. It gives the number of standard errors our coefficient lies away from 0. The farther it is from 0, the smaller the chance that it falls inside a 95% confidence interval around 0, and the greater the chance of rejecting the null hypothesis. In our sample output, the t-statistic of our slope coefficient is 6.23/1.18 = 5.280, which is well away from 0 in the positive direction.
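The ratio can be checked directly from the reported estimate and standard error:

```r
est <- 6.23   # reported slope estimate
se  <- 1.18   # reported standard error
est / se      # about 5.2797, matching the t value of 5.280 in the output
```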

p-value: The t-value and p-value are used in significance testing of the coefficients of our sample. The p-value gives the probability of observing a coefficient at least as extreme (in absolute t-value) as ours if the true coefficient were zero. The probability of observing a t-value of 5.280 is quite low, 2.69e-06 in our case. Under the null hypothesis the coefficient is 0, or in other words, there is no relation between the predictor and response; but such a low probability overthrows that assumption and indicates the coefficient is not zero and that there must be a relation between the predictor (age at first marriage in females) and the response (female literacy rate).

Residual standard error: Residuals of a model are unexplained and random; a residual is the difference between the predicted value and the observed value. The residual standard error is the square root of the sum of the squared residuals divided by the residual degrees of freedom. Residual degrees of freedom = total degrees of freedom − model degrees of freedom. In our example, the number of observations is 53, so the total degrees of freedom is 52, and since there are two coefficients including the intercept, the model degrees of freedom is 2 − 1 = 1. So the residual degrees of freedom is 52 − 1 = 51.
In other words, the residual standard error tells us how wrong, on average, our model's predictions are. The smaller the value, the better our model. It is worth noting that it has the same unit as the response variable.
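A sketch of that arithmetic on simulated data, checking the manual computation against the sigma value R reports:

```r
set.seed(3)
x <- runif(53, 15, 35)
y <- -80 + 6 * x + rnorm(53, sd = 24)
fit <- lm(y ~ x)

n <- length(y)          # 53 observations
p <- length(coef(fit))  # 2 coefficients, so residual df = 53 - 2 = 51
rse <- sqrt(sum(resid(fit)^2) / (n - p))
all.equal(rse, summary(fit)$sigma)  # TRUE
```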

Multiple R-squared: R² is also called the coefficient of determination. In a simple regression model like our example's, it is equal to the square of the correlation between the predictor and response variable. In other words, R-squared gives the proportion of the variation in the response variable that is explained by its regression on the predictor. According to ANOVA (ANalysis Of VAriance), total sum of squares (total variation in the response variable) = regression sum of squares (variation explained by the predictor) + error sum of squares (random variation due to error). R² = 1 − error sum of squares / total sum of squares. In our example, R-squared = 0.3534, i.e. 35.34% of the variation in our response variable is explained by the predictor variable.
There is no universal threshold for how high R² should be; it depends on the system we are analyzing.
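That decomposition can be verified on simulated data (hypothetical x and y):

```r
set.seed(5)
x <- runif(53, 15, 35)
y <- -80 + 6 * x + rnorm(53, sd = 24)
fit <- lm(y ~ x)

sst <- sum((y - mean(y))^2)  # total sum of squares
sse <- sum(resid(fit)^2)     # error (residual) sum of squares
r2  <- 1 - sse / sst
all.equal(r2, summary(fit)$r.squared)  # TRUE
all.equal(r2, cor(x, y)^2)  # TRUE: in simple regression, R^2 = r^2
```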

Adjusted R-squared: This measure is the same as R-squared except that it adjusts for the number of predictor variables used. R-squared increases with each addition of a predictor variable to the model. Adjusted R-squared, on the other hand, increases ONLY if the added predictor improves the fit of the model more than we would expect by chance.
When one has more than one predictor, it is better to look at this measure than at R-squared.

F-statistic: The F-statistic of overall significance compares the fit of our model to the intercept-only model. If the p-value of the F-statistic is less than the chosen significance level, our model provides a better fit than the mean model, also called the intercept-only model.
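A sketch of that comparison on simulated data, fitting the intercept-only model explicitly and checking that anova() reproduces the overall F from summary():

```r
set.seed(9)
x <- runif(53, 15, 35)
y <- -80 + 6 * x + rnorm(53, sd = 24)

null_model <- lm(y ~ 1)  # intercept-only (mean) model
full_model <- lm(y ~ x)

a <- anova(null_model, full_model)
all.equal(a$F[2], unname(summary(full_model)$fstatistic["value"]))  # TRUE
```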

It is very important to check that the assumptions of linear regression are met before we trust these results.


Sunday, May 21, 2017

Bivariate analyses of age at the time of first marriage of females and female literacy rate

The data for these examples is taken from gapminder. We are going to perform (and learn to perform) a bivariate analysis of female literacy rate (aged 15+) and age of females at the time of their first marriage.

We have data in two separate Excel files, and we are going to read each file into a data frame and join them to make a single data frame fit to be analyzed.

library(readxl)   ## for read_excel
library(tidyr)    ## for gather
library(dplyr)    ## for %>% and inner_join
library(ggplot2)  ## for ggplot

female_litrecy<-read_excel('indicatorSE_ADT_LITR_FE_ZS.xls.xlsx')

colnames(female_litrecy)[1]<-"Country"

gatheredfemalelitrecy <- female_litrecy %>% gather(Years,female_litrecy_rate,-Country,na.rm = TRUE,convert=TRUE)

In the above code, na.rm = TRUE removes the rows with NA values. This makes sense because rows which have no value will not contribute to the analysis. convert = TRUE converts the data to the correct type; the default is NOT to convert. Since we are reading from Excel, some columns may be mistakenly stored as type character, but gathering them with convert = TRUE automatically converts them to the correct type. E.g., in the above code, the column "Years" would be read as character if we didn't set convert = TRUE.
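A toy illustration of convert = TRUE on a small hand-made frame (hypothetical countries and values, standing in for the Excel data):

```r
library(tidyr)

## Wide frame: year column names would otherwise become character keys
wide <- data.frame(Country = c("India", "Brazil"),
                   `1990` = c(40.1, 79.2),
                   `2000` = c(47.8, 85.6),
                   check.names = FALSE)

long <- gather(wide, Years, female_litrecy_rate, -Country,
               na.rm = TRUE, convert = TRUE)
str(long)  # Years is integer, not character, because convert = TRUE
```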
Now let's read the file with ages at the time of first marriage.

ageofmarrigeinexcel<-read_excel("indicator age of marriage.xlsx")

## name the first column

colnames(ageofmarrigeinexcel)[1]<-"Country"

gatheredAgeDatafromExcel <- ageofmarrigeinexcel %>% gather(Years,Age,-Country,na.rm = TRUE,convert=TRUE)

Let's inner_join the two data frames on the "Country" and "Years" fields.

litrecy_and_ageatmarriage<-inner_join(gatheredfemalelitrecy,gatheredAgeDatafromExcel,by=c('Country','Years'))

Now that we have our data frame ready, let's plot a graph using ggplot, which is in the package ggplot2. Please ensure it is loaded first.

ggplot(data=litrecy_and_ageatmarriage, aes(x=female_litrecy_rate,y=Age)) +
  geom_point(aes(color=Country,size=Years)) +
  scale_y_continuous(breaks=seq(10,35,2)) +
  scale_x_continuous(breaks=seq(0,100,5)) +
  xlab('%age of literate females aged 15 and above') +
  ylab('Age at first marriage of females') +
  geom_text(aes(label=Country)) +
  geom_smooth(method='lm', formula=y~x)

In the last line of the above code, I have added a regression line to analyze the relation between the two variables. The above code generates the following graph.

The above graph indicates that  as the female literacy rate increases, the age at which the females marry also increases.
Let's quantify this relation between the two variables using Pearson's coefficient r. The null hypothesis for our analysis is r = 0, indicating there is no relation between the two variables. The alternative hypothesis is that r is not equal to 0. We use Pearson's coefficient r of our sample to estimate the true correlation between the variables in the population. We start the analysis with the assumption that the null hypothesis is true.

with(litrecy_and_ageatmarriage,cor.test(female_litrecy_rate,Age))
The above code gives the following result
Pearson's product-moment correlation

data:  female_litrecy_rate and Age
t = 5.2795, df = 51, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3862388 0.7450490
sample estimates:
     cor 
0.594471 

Let's understand the above output.

Pearson's correlation coefficient r: The value of r is 0.594471. This indicates a positive, albeit not very strong, relation between female literacy rate and age at first marriage of females. The correlation coefficient can take a value from -1, which indicates a strong negative relationship, to 1, which indicates a strong positive relationship. The value of 0 indicates no relationship.

95% confidence interval: In a normal distribution, about 95% of the data lies within roughly 2 standard deviations of the mean. The 95% confidence interval gives just that: the range within which the parameter we are analyzing should fall. The confidence interval in our case is between 0.3862388 and 0.7450490. We began with the assumption that the null hypothesis is true and r = 0, i.e. there is NO relation between female literacy rate and age at first marriage of females. But the confidence interval we got does NOT contain 0, which is the value of r under our null hypothesis. Therefore there is something going on, and there is likely a relation between these two variables.
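The reported interval can be reproduced from r and n alone via the Fisher z-transform, which is the standard construction behind cor.test's interval for Pearson's r:

```r
r <- 0.594471  # reported sample correlation
n <- 53        # sample size (df + 2)

z  <- atanh(r)           # transform r to an approximately normal scale
se <- 1 / sqrt(n - 3)    # standard error on the z scale
ci <- tanh(z + c(-1.96, 1.96) * se)
round(ci, 3)  # 0.386 0.745, matching the cor.test output
```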

df: Degrees of freedom is the number of data values in a sample that are free to vary when computing a statistic. For example, if the sample is 10 integers and we want them to add up to 100, we have the freedom to assign any value to 9 of the numbers, but the last number must take a specific value so that they all add up to 100. So the degrees of freedom is 10 − 1 = 9. The df for Pearson's r is n − 2 because we have a pair of variables. In this output, df = 51, so the number of values in our sample is 51 + 2 = 53. The bigger the sample size, the more precise our estimates are.

t (statistic): The value of t gives the number of standard errors our r lies away from 0. The value of 5.2795 indicates that it is that far from 0 in the positive direction. The farther its value is from 0 in either direction, the greater the chance of rejecting the null hypothesis, which states there is no relation.
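The t statistic itself can be recovered from r and the degrees of freedom (a small arithmetic sketch using the reported values):

```r
r  <- 0.594471  # reported sample correlation
df <- 51        # reported degrees of freedom (n - 2)

t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat  # about 5.28, agreeing with the reported t = 5.2795
```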

p-value: This gives the probability of observing a value at least as extreme as the absolute value of t. The p-value of 2.687e-06 indicates that the probability of observing a value of r with a t-statistic of 5.2795 is quite low.

Based on the above result from our sample, we can conclude that there likely exists a relationship between "age at first marriage of females" and "female literacy rate" in the population as well.

Let's now make a simple regression model to help us predict the female literacy rate based on the age at first marriage of females. Simple regression also helps us understand the relation between these two variables in more detail.
