## Statistics concepts of Correlation & Regression with some R Programming

**Converting quantitative data into ordinal** When you have a lot of quantitative data, one way to analyse it is to convert it into categorical data by building a frequency table. For example, if you have the weights of 100 students in a class, it is good practice to present them categorically in a frequency table, as shown below.

| Weight | Frequency | Percentage |
| ------ | --------- | ---------- |
| 40-45  | 4         | 5          |
| 45-50  | 7         | 9          |
| 50-55  | 14        | 16         |
| ...    | ...       | ...        |

So, in this way, a large quantitative dataset can be analysed more meaningfully by converting it into categorical data with a frequency table.
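As a sketch of how this conversion can be done in R (the weights are simulated here, since the original data are not given):

```r
# Simulated weights for 100 students (hypothetical data for illustration)
set.seed(1)
weights <- round(runif(100, min = 40, max = 80), 1)

# Bin the quantitative values into ordered 5-unit classes
bins <- cut(weights, breaks = seq(40, 80, by = 5),
            right = FALSE, ordered_result = TRUE)

# Frequency table with percentages
freq <- table(bins)
pct  <- round(100 * prop.table(freq), 1)
cbind(Frequency = freq, Percentage = pct)
```

`cut()` turns the numeric vector into an ordered factor, after which `table()` gives the frequency column directly.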

**Types of graphs for different types of data**

**For Nominal/Ordinal data:** It is generally advisable to use a pie chart or bar graph for nominal and ordinal data.

**For Interval & Ratio data:** For data in this category, one should plot a histogram and check the skewness of the curve.
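A quick R sketch of this check (with simulated data, since none is given): draw the histogram, and compare mean and median to judge the direction of the skew.

```r
set.seed(3)
x <- rexp(200, rate = 1)   # exponential data are right-skewed

hist(x, breaks = 20, main = "Right-skewed sample", xlab = "value")

# For right-skewed data the mean is dragged above the median
mean(x) > median(x)
```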

**Note:** One cannot apply the mean or median to nominal data because it cannot be arranged in ascending order. For example, donkey, monkey, cat, rabbit, and mouse cannot be arranged in ascending order, nor is their mean meaningful.

**Note:** When there are outliers that can seriously distort the central tendency, the mean is not a good choice; in the presence of outliers we prefer the median.

**Range:** The range is a measure of variability in data: the larger the range, the greater the variability. To describe variability more informatively we can use a boxplot, which shows the spread of the data in each quartile.

**Z-score:** A z-score is a number that tells us how many standard deviations a value lies from the mean of the data.

To learn more, refer to this website, which discusses the concept in detail:

http://www.statisticshowto.com/probability-and-statistics/z-score/
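A minimal R sketch of the idea, using made-up numbers:

```r
# z-score: how many standard deviations a value lies from the mean
x <- c(50, 55, 60, 65, 70)
z <- (x - mean(x)) / sd(x)

# the built-in scale() computes the same standardised values
stopifnot(all.equal(as.vector(scale(x)), z))

z  # the middle value (60) equals the mean, so its z-score is 0
```

Standardised values always have mean 0 and standard deviation 1, which is what makes z-scores comparable across datasets.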

**Correlation Analysis:** There are three common types of correlation analysis:

a. Pearson's correlation: a parametric measure of the linear association between two numeric variables.

b. Spearman's correlation: a non-parametric measure of the monotonic association between two numeric variables.

c. Kendall's rank correlation: a non-parametric measure of association based on the concordance or discordance of X-Y pairs.

Analysis using R Programming

> cor(Age,LungCap,method = "pearson")

[1] 0.8196749

> cor(Age,LungCap,method = "spearman")

[1] 0.8172464

> cor(Age,LungCap,method = "kendall")

[1] 0.639576

Pearson's r correlation:

Pearson's correlation test measures the linear relationship between two variables; it is a parametric test. The value of r lies between -1 and 1: the closer it is to 1, the stronger the positive linear correlation, and the closer it is to -1, the stronger the inverse relationship.

If you want the correlation coefficient along with a hypothesis test, you can use:

> cor.test(Age,LungCap,method = "pearson")

Pearson's product-moment correlation

data: Age and LungCap

t = 38.476, df = 723, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.7942660 0.8422217

sample estimates:

cor

0.8196749

To read about this concept in detail, we can refer to a great article on the topic, linked below:

https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

To learn the basic calculation steps, one can refer to this link:

http://www.statisticshowto.com/how-to-compute-pearsons-correlation-coefficients

Correlation tells us two important things: the strength of the relationship between variables, and whether the relationship is positive or negative.

For Pearson's r correlation, both variables should be normally distributed.
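One quick way to check this assumption in R is the built-in Shapiro-Wilk test (shown here on simulated data for illustration):

```r
# Shapiro-Wilk normality test on simulated samples
set.seed(7)
x <- rnorm(50)   # drawn from a normal distribution
y <- rexp(50)    # clearly non-normal (exponential) data

shapiro.test(x)$p.value  # typically large: no evidence against normality
shapiro.test(y)$p.value  # small: normality is rejected
```

A small p-value means the normality assumption is doubtful, in which case Spearman's or Kendall's correlation is the safer choice.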

Various research problems tackled with correlation analysis in R:


- How are different stocks related to each other?
- Is there a statistically significant relationship between height in feet and age in years?
- What is the relationship between job satisfaction and salary?

**Covariance** Covariance is a measure of how two random variables vary together. It is like variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together. Its value ranges from -infinity to +infinity; a larger positive value means the variables tend to move together. Unlike correlation, covariance is not bounded between -1 and +1.

> cov(Age,Height)

[1] 24.10498
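The link between covariance and correlation can be seen directly: Pearson's r is just the covariance rescaled by the two standard deviations, which is what bounds it to [-1, 1]. A small sketch with made-up numbers:

```r
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)

# correlation = covariance / (sd_x * sd_y)
r_manual <- cov(x, y) / (sd(x) * sd(y))
stopifnot(all.equal(r_manual, cor(x, y)))
```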

**Regression Analysis** - It is the process of estimating the relationship between a **dependent** (target) variable and one or more **independent** (predictor) variables.

- Used in forecasting and prediction of the dependent variable. For example, suppose that from certain data we have identified that growth in sales is around two and a half times the growth in the economy; using this relation we can easily predict future sales.
- Used to identify the strength of the effect that the independent variables have on the dependent variable.

**a. Simple Linear Regression**- In which there is one dependent variable and one independent variable

**b. Multiple Linear Regression** - In which there are many independent variables and one dependent variable; for example, the impact of rain, soil, fertilisers, temperature, and sunlight (all independent variables) on the yield of rice (the dependent variable).

It is used to estimate the relationship between a dependent variable and two or more independent variables; for example, the relationship between employees' salaries and their experience and education.

Note: Dependent Variable is Y and Independent variable is X.

Once a and b are calculated, we can put them into slope-intercept form (y = bx + a), which gives the line of regression. Then, for any value of x, the corresponding y can be evaluated.
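A small sketch of this in R, with made-up x-y data: fit the line, read off a and b, and evaluate y for a new x.

```r
# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit <- lm(y ~ x)
a <- coef(fit)[["(Intercept)"]]  # intercept
b <- coef(fit)[["x"]]            # slope

# For any new value of x, y = a + b * x
y_hat <- a + b * 6

# predict() gives the same answer
stopifnot(all.equal(y_hat, unname(predict(fit, newdata = data.frame(x = 6)))))
```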


**Difference between Correlation & Regression**

Correlation describes the relationship between variables: it emphasises how strongly the dependent and independent variables move with each other, i.e. the strength of their linear relationship,

while

Regression is a way to estimate the value of the dependent variable from the independent variable(s) using a mathematical equation, usually a line in slope-intercept form, so that if one variable is known the other can be computed.

**Calculation of Correlation in R**

It's again pretty simple; here are the commands:

> cor(CO2$conc, CO2$uptake) # basic method: correlation between two variables

[1] 0.4851774

> cor.test(CO2$conc, CO2$uptake) # complete method: gives all details related to the correlation

Pearson's product-moment correlation

data: CO2$conc and CO2$uptake

t = 5.0245, df = 82, p-value = 2.906e-06

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.3022189 0.6336595

sample estimates:

cor

0.4851774

Correlation among several variables:

> cor(data[,5:9]) # just choose all the columns that need to be correlated

Expense.Ratio Return.2006 Z.Values X3.Year.Return X5.Year.Return

Expense.Ratio 1.00000000 -0.1335501 NA -0.1099471 -0.05741696

Return.2006 -0.13355013 1.0000000 NA 0.6975430 0.59339807

Z.Values NA NA 1 NA NA

X3.Year.Return -0.10994715 0.6975430 NA 1.0000000 0.83728490

X5.Year.Return -0.05741696 0.5933981 NA 0.8372849 1.00000000

Note: the NA entries appear because the Z.Values column contains missing values; passing use = "complete.obs" to cor() would drop the incomplete rows first.

Normality Test

>install.packages("nortest") # Anderson-Darling normality test

>library(nortest)

> ad.test(data[,7])

Anderson-Darling normality test

data: data[, 7]

A = 1.2491, p-value = 0.002968

**Skewness / Kurtosis**

>library(moments) # load the moments package (install it first with install.packages("moments"))

>skewness(time)

[1] -0.01565162

>kurtosis(time)

[1] 2.301051
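For interpretation: a normal distribution has skewness 0 and kurtosis 3, so the sample above is roughly symmetric and slightly flatter than normal. If the moments package is unavailable, the same population moments can be computed by hand (sketched here with made-up numbers):

```r
x <- c(2, 3, 5, 8, 13, 21)   # hypothetical, right-skewed values
d  <- x - mean(x)
m2 <- mean(d^2)              # second central moment

skew <- mean(d^3) / m2^1.5   # what moments::skewness() computes
kurt <- mean(d^4) / m2^2     # what moments::kurtosis() computes

skew > 0  # the long right tail gives positive skewness
```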

Some of the best YouTube videos explaining the above concepts are linked here:

https://www.youtube.com/watch?v=XaNKst8ODEQ&list=PLNPQb2RADnZbl-dyI5fb40uBD8eFJHImr&index=

**Calculating Regression in R**

To calculate regression in R we use the lm() function. It is used for simple linear regression as well as multiple regression.

For example, to predict endurance from age:

> lm(yahoo$endurance~yahoo$age)

Call:

lm(formula = yahoo$endurance ~ yahoo$age) # the variable left of ~ is the dependent variable (Y); those to the right are the independent variables (X)

Coefficients:

(Intercept) yahoo$age

33.1567 -0.1347

**Multiple linear regression exercise in R**

> # grades = f(absences, SAT score); grades is the dependent variable & the other two are independent variables

> #equation will be- grades(y) = a + b1(absence) + b2(SAT Score)

> # let us do the stuff in R itself

> # Dependent variable

> Grades <- c(82,98,76,68,84,99,67,58,50,78)

>

> # Independent variables

> Absences <- c(4,2,2,3,1,0,4,8,7,3)

> SAT_Score <- c(620,750,500,520,540,690,590,490,450,560)

>

> # Creating Regression Equation

> Regression <- lm(Grades ~ Absences + SAT_Score)

>

> #Show the results

> summary(Regression)

Call:

lm(formula = Grades ~ Absences + SAT_Score)

Residuals:

Min 1Q Median 3Q Max

-8.791 -1.809 1.060 2.691 5.016

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 33.42231 13.58416 2.460 0.04344 *

Absences -3.34018 0.77323 -4.320 0.00348 **

SAT_Score 0.09446 0.02067 4.569 0.00258 **

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.729 on 7 degrees of freedom

Multiple R-squared: 0.9308, Adjusted R-squared: 0.911

F-statistic: 47.07 on 2 and 7 DF, p-value: 8.724e-05

Let us discuss the result -

The estimates are the coefficients of the equation grades(y) = a + b1(absences) + b2(SAT score),

where a = 33.42231, b1 = -3.34018, b2 = 0.09446. Each coefficient has a p-value followed by asterisks showing its significance level.

There is also a p-value for the entire model, "p-value: 8.724e-05", which is very small here, indicating that the model is statistically significant.
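To make the interpretation concrete, the fitted equation can be used to predict a grade for hypothetical values of the predictors:

```r
# Coefficients taken from the summary above
a  <- 33.42231
b1 <- -3.34018   # Absences
b2 <- 0.09446    # SAT_Score

# Predicted grade for a student with 2 absences and an SAT score of 600
grade_hat <- a + b1 * 2 + b2 * 600
round(grade_hat, 1)   # 83.4
```

Note the signs: each extra absence lowers the predicted grade by about 3.3 points, while each extra SAT point adds about 0.09.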

Some more examples in R:

https://www.r-bloggers.com/simple-linear-regression-2/

A nice example showing multiple regression analysis

http://www.stat.columbia.edu/~martin/W2024/R6.pdf

A full discussion of R-squared and how to interpret it in regression results:

http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
