We have data in two separate Excel files. We are going to read each file into a data frame and join them into a single data frame that is ready to be analyzed.
library(readxl)  # read_excel()
library(tidyr)   # gather()
library(dplyr)   # %>% and inner_join()
female_litrecy <- read_excel('indicatorSE_ADT_LITR_FE_ZS.xls.xlsx')
colnames(female_litrecy)[1] <- "Country"
gatheredfemalelitrecy <- female_litrecy %>% gather(Years, female_litrecy_rate, -Country, na.rm = TRUE, convert = TRUE)
In the above code, na.rm = TRUE means remove the rows with an NA value. This makes sense because rows which have no value will not contribute to the analysis. convert = TRUE converts the data type of the new key column to the correct form. The default is NOT to convert. The "Years" column is built from the column names of the Excel sheet, and column names are always character strings, so "Years" will be read as char if we don't set convert = TRUE. With convert = TRUE it is automatically converted to a number.
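To see what the two arguments do, here is a minimal sketch on a made-up table. The data frame `wide`, its values and the column name `rate` are assumptions for illustration only; they are not taken from the Excel files.

```r
library(tidyr)

# Hypothetical wide table shaped like the Excel sheets:
# one row per country, one column per year (values are made up).
wide <- data.frame(
  Country = c("A", "B"),
  `1990` = c(50.5, NA),
  `2000` = c(61.0, 72.3),
  check.names = FALSE  # keep "1990" and "2000" as column names
)

# Without convert, the key column keeps the type of the column names.
long_chr <- gather(wide, Years, rate, -Country, na.rm = TRUE)

# With convert = TRUE, the key column is converted to a suitable type.
long_int <- gather(wide, Years, rate, -Country, na.rm = TRUE, convert = TRUE)

class(long_chr$Years)  # type of Years without convert
class(long_int$Years)  # type of Years with convert = TRUE
nrow(long_int)         # the row holding the NA has been dropped
```

Comparing the two class() results shows why convert = TRUE matters if "Years" is later used as a number, for example as the point size in a plot.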
Now let's read the file with the age of females at the time of their first marriage.
ageofmarrigeinexcel<-read_excel("indicator age of marriage.xlsx")
## name the first column
colnames(ageofmarrigeinexcel)[1]<-"Country"
gatheredAgeDatafromExcel <- ageofmarrigeinexcel %>% gather(Years,Age,-Country,na.rm = TRUE,convert=TRUE)
Let's inner_join the two data frames on the "Country" and "Years" fields.
litrecy_and_ageatmarriage<-inner_join(gatheredfemalelitrecy,gatheredAgeDatafromExcel,by=c('Country','Years'))
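An inner join keeps only the Country and Years combinations that appear in both data frames, so years for which only one of the two indicators was recorded are dropped. A small sketch with made-up data (the data frames `literacy` and `marriage` and their values are hypothetical) shows the effect:

```r
library(dplyr)

# Hypothetical long-format tables shaped like the gathered data.
literacy <- data.frame(
  Country = c("A", "A", "B"),
  Years = c(1990L, 2000L, 2000L),
  female_litrecy_rate = c(50.5, 61.0, 72.3)
)
marriage <- data.frame(
  Country = c("A", "B", "C"),
  Years = c(2000L, 2000L, 2000L),
  Age = c(21.4, 23.0, 19.8)
)

# Only (Country, Years) pairs present in BOTH tables survive the join.
joined <- inner_join(literacy, marriage, by = c("Country", "Years"))
joined
```

Here country A in 1990 has no marriage data and country C has no literacy data, so neither ends up in `joined`.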
Now that we have our data frame ready, let's plot a graph using ggplot, which is in the package ggplot2. Please make sure that package is loaded first.
ggplot(data=litrecy_and_ageatmarriage, aes(x=female_litrecy_rate,y=Age)) +
geom_point(aes(color=Country,size=Years)) +
scale_y_continuous(breaks=seq(10,35,2)) +
scale_x_continuous(breaks=seq(0,100,5))+
xlab('%age of literate females aged 15 and above') +
ylab('Age at first marriage of females') + geom_text(aes(label=Country)) +
geom_smooth(method='lm', formula=y~x)
In the last line of the above code, I have added a regression line to analyze the relation between the two variables. The above code generates the following graph.
The above graph indicates that as the female literacy rate increases, the age at which the females marry also increases.
Let's quantify this relation between the two variables using Pearson's coefficient r. The null hypothesis for our analysis is r=0, indicating there is no relation between the two variables. The alternative hypothesis is that r is not equal to 0. We use Pearson's coefficient r of our sample to estimate the true correlation between the variables in the population. We start the analysis with the assumption that the null hypothesis is true.
with(litrecy_and_ageatmarriage,cor.test(female_litrecy_rate,Age))
The above code gives the following result
Pearson's product-moment correlation
data: female_litrecy_rate and Age
t = 5.2795, df = 51, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3862388 0.7450490
sample estimates:
cor
0.594471
Let's understand the above output.
Pearson's correlation coefficient r: The value of r is 0.594471 (the "cor" value under sample estimates). This indicates a positive, moderately strong relation between female literacy rate and age at first marriage of females. The correlation coefficient can take a value from -1, which indicates a perfect negative linear relationship, to 1, which indicates a perfect positive linear relationship. The value of 0 indicates no linear relationship.
95% confidence interval: This is the range of plausible values for the parameter we are estimating, which here is the true correlation in the population. If we repeated the sampling many times and built the interval the same way each time, about 95% of those intervals would contain the true value. The confidence interval in our case is between 0.3862388 and 0.7450490. We began with the assumption that the null hypothesis is true, i.e. r=0 and there is NO relation between female literacy rate and age at first marriage of females. But the confidence interval we got does NOT contain 0, which is the value of r under our null hypothesis, so this assumption does not hold up. Therefore there is something going on, and there is likely a relation between these two variables.
df: Degrees of freedom is the number of data values in a sample that are free to vary once a specific result is fixed. For example: if a sample consists of 10 integers and we want them to add up to 100, we have the freedom to assign any value to 9 of the numbers, but the last number has to take a specific value so that they all add up to 100. So the degrees of freedom are 10-1=9. df for Pearson's r is n-2 because we have a pair of variables and lose one degree of freedom for each. In this output df=51, so the number of paired observations in our sample is 51+2=53. The bigger the sample size, the more precise our estimates are.
t (statistic): The value of t gives the number of standard errors our r lies away from 0. The value of 5.2795 indicates that it is this far away from 0 in the positive direction. The farther its value is from 0 in either direction, the stronger the evidence against the null hypothesis, which states there is no relation.
p-value: This gives the probability, assuming the null hypothesis is true, of getting a t statistic whose absolute value is equal to or greater than the one we observed. The p-value of 2.687e-06 is the probability of observing a t statistic at least as extreme as 5.2795 when there is in fact no relation, and this probability is very low.
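The numbers in the output are tied together, and they can be rebuilt from r and df alone. The sketch below uses the standard formulas for Pearson's r: the t statistic is r*sqrt(df)/sqrt(1-r^2), the two-sided p-value comes from the t distribution, and the confidence interval uses Fisher's z transformation. The variable names are my own, and the inputs are simply copied from the printed output, so the results should agree with it only up to rounding.

```r
# Values copied from the cor.test output above.
r  <- 0.594471  # sample correlation ("cor")
df <- 51        # degrees of freedom
n  <- df + 2    # number of paired observations

# t statistic: how many standard errors r lies away from 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value: probability of a t at least this extreme under r = 0.
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via Fisher's z transformation:
# transform r, build a normal interval, transform back.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 4)
signif(p_value, 4)
round(ci, 4)
```

Working through it this way makes it clear that t, the p-value and the confidence interval are three views of the same evidence, not three independent tests.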
Based on the above result of our sample, we reject the null hypothesis and conclude that there is a relationship between "age at first marriage of females" and "female literacy rate" in the population as well.
Let's now make a simple regression model to help us predict the female literacy rate based on the age at first marriage of females. Simple regression also helps us understand the relation between these two in more detail.
