Monday, June 12, 2017

Simple regression model

A linear regression model that involves one predictor and one response variable is called a simple regression model. This statistical method lets us understand the relationship between two variables in greater depth and thereby build models for prediction. The predictor variable, plotted on the x-axis, is also called the independent variable, and the response variable, plotted on the y-axis, is also called the dependent variable.

In this post, we are going to use the same data as in our previous post, Bivariate analysis of female literacy rate and age at first marriage.

# Data frame litrecy_and_ageatmarriage contains our data.
p2 <- lm(female_litrecy_rate ~ Age, data = litrecy_and_ageatmarriage)
summary(p2)
In the above code, lm is the R function for fitting a linear model, and we assign the result to p2. The predictor variable in our example is Age in years (age at first marriage in females) and the response variable is female literacy rate as a percentage of the general literate population. Let's check the details of p2 using summary(p2), which gives the following results:

Call:
lm(formula = female_litrecy_rate ~ Age, data = litrecy_and_ageatmarriage)

Residuals:
    Min      1Q  Median      3Q     Max
-47.545 -20.735  -3.359  21.140  52.330

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -79.49      26.11  -3.044  0.00368 **
Age             6.23       1.18   5.280 2.69e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.85 on 51 degrees of freedom
Multiple R-squared:  0.3534, Adjusted R-squared:  0.3407
F-statistic: 27.87 on 1 and 51 DF,  p-value: 2.687e-06

Let's understand the output.


Residuals: Residuals are the differences between the observed values of the response variable and the values predicted by our model. In our example, a residual is the difference between the value of female literacy rate in our data frame (observed value) and the value predicted by our model. Simple regression minimizes these differences, because a model that predicts values close to the observed values should also predict well for unobserved values. We should look for a roughly symmetric distribution of residuals around zero. Any asymmetry indicates that some values predicted by our model are systematically too high or too low, or that something else is going on that is not captured by our model.
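As a quick illustration of what a residual is (a Python sketch with made-up numbers, not our actual data frame):

```python
# Hypothetical observed responses and model predictions (illustrative only)
observed = [55.0, 70.0, 42.0, 88.0]
predicted = [50.0, 72.0, 45.0, 85.0]

# Residual = observed value - value predicted by the model
residuals = [obs - pred for obs, pred in zip(observed, predicted)]
print(residuals)  # [5.0, -2.0, -3.0, 3.0]
```

A symmetric spread of positive and negative residuals, as in the Min/1Q/Median/3Q/Max line above, is what we hope to see.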

Coefficient estimates: The line of simple regression is the one that minimizes the residuals and is represented mathematically by y = b1x + b0. Here y is the response variable, female literacy rate in our case. The two coefficients of the simple regression line are:
1. b0, the intercept
2. b1, the slope

If we substitute the coefficient estimates from our output into this equation, we get
        y=-79.49 + 6.23x

This regression line has an intercept of -79.49 and a slope of 6.23.
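Plugging an illustrative age into this fitted equation shows how a prediction works (a Python sketch; the input value 25 is hypothetical, not taken from our data):

```python
# Fitted coefficients from the summary output above
b0 = -79.49  # intercept
b1 = 6.23    # slope

# Predicted female literacy rate at an illustrative age of 25
age = 25
predicted_literacy = round(b0 + b1 * age, 2)
print(predicted_literacy)  # 76.26
```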
Let's first understand the intercept.

The regression line always passes through the point (x-bar, y-bar), i.e. the mean of x and the mean of y. If the predictor (x) in simple linear regression is set to 0, the predicted value of y equals the intercept. (As a side note, the model with no predictors just predicts the mean of y and is also called the intercept-only model.)
     y= -79.49 + 6.23(0)
     y=-79.49

In other words, the intercept is the estimated value of y when x equals zero. If, however, it is not possible for x to take the value zero, the estimated value of y at the intercept has no real meaning. Our example has no data for x = 0, i.e. our data frame has no observation where age at first marriage in females is zero, so y = -79.49 has no meaning on its own. The intercept in our example just fixes the position of the regression line on the y-axis.

Let's now look at the slope.

The slope of a regression line gives the change in the conditional mean of y when x changes by one unit. In our example, if "age at first marriage of females" changes by one unit, the estimated change in the conditional mean of "female literacy rate" is 6.23. That is, if the age at first marriage is 20, the mean female literacy rate at x = 20 would be 6.23 percentage points higher than if the age at first marriage were 19.
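We can verify that a one-unit change in age shifts the prediction by exactly the slope (a Python sketch using the fitted coefficients; the ages 19 and 20 are just illustrative inputs):

```python
b0 = -79.49  # intercept from the output
b1 = 6.23    # slope from the output

def predict(age):
    """Predicted female literacy rate from the fitted line."""
    return b0 + b1 * age

# The difference between predictions at consecutive ages equals the slope
diff = round(predict(20) - predict(19), 2)
print(diff)  # 6.23
```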

Coefficient standard error: Our sample output gives point estimates of the slope and intercept of the true population regression line. But we should also consider the standard error of each estimate, since running the model again on different samples of the same size from the same population would produce somewhat different estimates. The standard error measures how precisely a coefficient estimates the true population parameter; in other words, it tells us how much variation we might see in the coefficient across repeated samples.
For a large sample, about 95% of the time the true population parameter lies within 1.96 standard errors of the estimate. So, from our output, an approximate 95% confidence interval for the slope is 6.23 ± (1.96)(1.18).
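Working out that interval numerically (Python, using the estimate and standard error printed in the output above):

```python
# Slope estimate and its standard error from the summary output
estimate = 6.23
std_error = 1.18

# Approximate 95% confidence interval: estimate +/- 1.96 * SE
lower = round(estimate - 1.96 * std_error, 4)
upper = round(estimate + 1.96 * std_error, 4)
print(lower, upper)  # 3.9172 8.5428
```

Since the interval does not contain zero, the slope is unlikely to be zero.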

t-value: The t-value in the output is the value of the t-statistic, which gives the number of standard errors the coefficient estimate lies away from 0. The farther it is from 0, the less likely it is that the true coefficient is actually 0, and the stronger the evidence for rejecting the null hypothesis. In our sample output, the t-statistic of our slope coefficient is 6.23/1.18 = 5.280, well away from 0 in the positive direction.
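The t-statistic is just the ratio of the estimate to its standard error, which we can check against the output (Python; the small rounding difference comes from the estimates being printed to two decimals):

```python
# Slope estimate and standard error from the summary output
estimate = 6.23
std_error = 1.18

# t-statistic: how many standard errors the estimate is from 0
t_value = round(estimate / std_error, 3)
print(t_value)  # 5.28, matching the printed 5.280 up to rounding
```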

p-value: The t-value and p-value are used in significance testing of the coefficients. The p-value gives the probability, assuming the null hypothesis is true, of observing a t-statistic at least as extreme as ours. The probability of observing a t-value of 5.280 or more extreme is quite low: 2.69e-06 in our case. Under the null hypothesis the coefficient is 0, in other words there is no relation between the predictor and the response; such a low probability undermines that assumption and indicates the coefficient is not zero, so there is likely a relation between the predictor (age at first marriage in females) and the response (female literacy rate).

Residual standard error: Residuals of a model are the unexplained, random part: the difference between the observed value and the predicted value. The residual standard error is the square root of the sum of the squared residuals divided by the residual degrees of freedom. The residual degrees of freedom equal the total degrees of freedom minus the model degrees of freedom. In our example the number of observations is 53, so the total degrees of freedom are 52, and since there are two coefficients including the intercept, the model degrees of freedom are 2 - 1 = 1. So the residual degrees of freedom are 52 - 1 = 51.
In other words, the residual standard error tells us how far off, on average, our model's predictions are. The smaller the value, the better our model. It is worth noting that it has the same unit as the response variable.
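A sketch of the computation in Python (with made-up residuals, since the post's data frame is not reproduced here):

```python
import math

# Hypothetical residuals for illustration (not from our actual model)
residuals = [5.0, -2.0, -3.0, 3.0, -1.0, 4.0]

n_obs = len(residuals)  # number of observations
n_coef = 2              # intercept + slope
df_residual = n_obs - n_coef

# Residual standard error: sqrt(residual sum of squares / residual df)
rss = sum(r ** 2 for r in residuals)
rse = math.sqrt(rss / df_residual)
print(rse)  # 4.0
```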

Multiple R-squared: R² is also called the coefficient of determination. In a simple regression model like ours, it equals the square of the correlation between the predictor and the response variable. In other words, R-squared gives the proportion of the variation in the response variable that is explained by its regression on the predictor. According to ANOVA (ANalysis Of VAriance), the total sum of squares (total variation in the response variable) = the regression sum of squares (variation explained by the predictor) + the error sum of squares (random variation due to error). R² = 1 - (error sum of squares / total sum of squares). In our example, R-squared = 0.3534, i.e. 35.34% of the variation in our response variable is explained by the predictor variable.
There is no universal rule for how high R² should be; it depends on the system we are analyzing.
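A small numeric sketch (Python, toy numbers) of the ANOVA decomposition behind R²:

```python
# Toy observed values and model predictions (illustrative only)
observed = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 12.0, 14.0, 16.0, 17.0]

mean_y = sum(observed) / len(observed)

# Total sum of squares: variation of the response around its mean
ss_total = sum((y - mean_y) ** 2 for y in observed)

# Error sum of squares: variation left unexplained by the model
ss_error = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))

# R-squared: proportion of total variation explained
r_squared = 1 - ss_error / ss_total
print(round(r_squared, 3))  # 0.95
```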

Adjusted R-squared: This measure is the same as R-squared except that it adjusts for the number of predictor variables used. R-squared increases with each predictor added to the model. Adjusted R-squared, on the other hand, increases ONLY if the added predictor improves the fit of the model more than we would expect by chance.
With more than one predictor, it is better to look at this measure than at R-squared.
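We can recover the adjusted R-squared in our output from the plain R-squared (Python, using the numbers printed above):

```python
# Values from the summary output
r_squared = 0.3534
n = 53  # observations
p = 1   # predictors (Age)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))  # 0.3407, matching the output
```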

F-statistic: The F-statistic of overall significance compares the fit of our model to the intercept-only model. If the p-value of the F-statistic is less than the chosen significance level, our model provides a better fit than the mean model, also called the intercept-only model.
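In simple regression with a single predictor, the F-statistic is just the square of the slope's t-statistic, which we can check against the output (Python; the tiny discrepancy comes from the t-value being printed to three decimals):

```python
# t-statistic of the slope from the summary output
t_value = 5.280

# With one predictor, F = t squared
f_statistic = t_value ** 2
print(round(f_statistic, 2))  # 27.88, close to the printed 27.87
```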

It is very important to check whether the assumptions of linear regression are met before we trust these results.


