# Introduction to Linear Regression

When you think of **prediction**, think regression. Regression is a technique that uses the historical relationship between an independent variable (e.g. years of experience) and a dependent variable (e.g. salary) to predict future values of the dependent variable. In a simple case, you could predict salary from the number of years of experience by using linear regression.

**Application of Linear Regression**

Businesses use regression to predict many things like future sales, stock prices, currency exchange rates, and productivity gains resulting from a training program.

**Types of Regressions**

A regression models the past relationship between variables to predict their future behaviour. As an example, how can we formally test whether there is a relationship between wages and the number of years spent in education? More importantly, how much can we expect our wage to increase for every additional year spent on education, i.e. is it even worth studying in high school?

The **dependent** variable in this instance is Wages and the **independent** variable is Education.

Usually, more than one independent variable influences the dependent variable. In the example above, wages are influenced by education but also by other factors such as age, gender, work experience, and sector. When one independent variable is used in a regression, it is called a **simple regression**; when two or more independent variables are used, it is called a **multiple regression**.

The general formulas for simple and multiple linear regression are given below.

Simple linear regression:

Wages (dependent variable) = Y-intercept + $\beta_1$(Education, the independent variable)

$$Y = \beta_0 + \beta_1 X$$

Multiple regression equation:

Wages (dependent variable) = Y-intercept + $\beta_1$(Education) + $\beta_2$(Age) + $\beta_3$(Gender) + $\beta_4$(Work Experience) + $\beta_5$(Sector)

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5$$
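As a minimal sketch of how a multiple regression equation is evaluated, the Python snippet below adds the intercept to the sum of each coefficient times its variable. The function name `predict_multiple` and every numeric value are hypothetical, chosen only for illustration; they are not estimates from any data set.

```python
def predict_multiple(intercept, coefficients, features):
    """Return intercept + sum(coefficient_i * feature_i)."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical values, for illustration only
intercept = 2.0
coefficients = [1.0, 0.5]  # made-up weights for education and age
features = [12, 30]        # 12 years of education, 30 years old

wage = predict_multiple(intercept, coefficients, features)
print(wage)  # 2.0 + 1.0*12 + 0.5*30 = 29.0
```

With a single coefficient and a single feature, the same function reduces to the simple regression equation.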

The best way to see the relationship between the independent and dependent variables is a scatter plot.

**Scatter plot:**

Consider the wages and education example above.

Suppose we have data on 20 professionals: their years of education and their wages in dollars per hour.

**Note: Make sure the collected data is representative of the population.**

In statistics, we must ensure that our sample of individuals represents the population. That means using random sampling, which allows us to make inferences about the population at large.
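One simple way to draw such a sample is simple random sampling, where every individual has the same chance of being picked. The sketch below uses Python's standard `random` module; the population of 1,000 ID numbers is hypothetical and stands in for a real list of professionals.

```python
import random

# Hypothetical population: ID numbers 1..1000 standing in for professionals
population = list(range(1, 1001))

# A fixed seed is used only so that this sketch is repeatable
rng = random.Random(42)

# Draw 20 distinct individuals; each one is equally likely to be chosen
sample = rng.sample(population, k=20)
print(sorted(sample))
```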

To display these individuals' wages and education together, the best tool is the scatter plot.

A scatter plot accommodates every individual as one point, positioned by years of education and wage.

To see the relationship or pattern between the variables, we use the **line of best fit.**

The line of best fit is the line that represents the general pattern of the sample. **A regression line** is simply the line of best fit for a given sample.
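The text does not say how the line of best fit is found. The most common choice is ordinary least squares, which picks the intercept and slope that minimise the sum of squared vertical distances between the points and the line. The sketch below assumes that method; the function name `fit_line` and the five data points are made up for illustration and are not the 20 professionals from the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data, for illustration only
education = [8, 10, 12, 14, 16]           # years of education
wages = [12.0, 15.5, 17.0, 21.0, 22.5]    # dollars per hour

intercept, slope = fit_line(education, wages)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The fitted line always passes through the point (mean of x, mean of y), which is why the intercept is computed from the two means once the slope is known.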

We know that the equation of a line is

**Y = mx + c**

where m = slope and

c = intercept of the line.

In regression analysis, we write the line of best fit as

$$Y = \beta_0 + \beta_1 X$$

$\beta_0$ (pronounced "beta naught") = intercept

$\beta_1$ (pronounced "beta one") = slope of the line

Here Y = Wages and X = Education.

If $\beta_1 > 0$, there is a **positive relationship.**

Wages = $\beta_0$ + $\beta_1$(Education), i.e. $Y = \beta_0 + \beta_1 X$

This shows a positive relationship between wages and education: the more education a person attains, the higher the wage they earn.

If $\beta_1 < 0$, there is a **negative relationship**. The regression line slopes downward.

In that case there is a negative relationship between wages and education: the general trend is that the more educated an individual is, the less pay they get.

The slope of the regression line, $\beta_1$, is negative.

If $\beta_1 = 0$, there is **no relationship**. The regression line is horizontal.

There may be no relationship between wages and education. In that case the slope of the regression line, $\beta_1$, is zero.
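The three cases can be summarised in a few lines of Python. This is only a sketch: the function name `describe_relationship` and the small `tolerance` are assumptions of this example, added because a slope estimated from real data is rarely exactly zero.

```python
def describe_relationship(slope, tolerance=1e-9):
    """Classify the direction of a regression line from the sign of its slope."""
    if slope > tolerance:
        return "positive"   # line rises from left to right
    if slope < -tolerance:
        return "negative"   # line falls from left to right
    return "none"           # line is horizontal


for slope in (1.267, -0.8, 0.0):
    print(slope, "->", describe_relationship(slope))
```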

**Estimation of regression line:**

Suppose we obtain the following estimated regression line:

**Y = 2.372 + 1.267x**

That is: Wages = 2.372 + 1.267(Education)

This means that the line cuts the Y-axis at 2.372 (dollars per hour) and the slope of the line is 1.267 (dollars per hour for each additional year of education).

**Now let's make a prediction:**

Suppose a professional has 12 years of education and we want to know that person's wage in dollars per hour. We simply replace x with 12 in the equation above:

Wages = 2.372 + 1.267 × 12

Wages ≈ $17.58 per hour

Let's take another example:

What is the wage of a person who has 14 years of education?

Wages = 2.372 + 1.267 × 14

Wages = $20.11 per hour
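The same arithmetic can be written as a small Python function. The intercept and slope are the estimates quoted in the text; the names `INTERCEPT`, `SLOPE`, and `predict_wage` are chosen here for illustration.

```python
INTERCEPT = 2.372  # estimated intercept from the text, dollars per hour
SLOPE = 1.267      # estimated slope from the text, dollars per hour per year


def predict_wage(years_of_education):
    """Predicted hourly wage from the estimated regression line."""
    return INTERCEPT + SLOPE * years_of_education


for years in (12, 14):
    print(f"{years} years of education -> ${predict_wage(years):.2f} per hour")
```

Rounded to cents, this gives $17.58 for 12 years and $20.11 for 14 years.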

**Inference from Prediction:**

1) For every additional year of education, the wage is expected to increase by about $1.27 per hour (the slope, 1.267).

2) When education is zero (i.e. X = 0), the wage is expected to be $2.372 per hour.
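Both readings of the coefficients can be checked directly from the estimated equation. As a short worked check, write $\hat{y}(x)$ for the predicted wage at $x$ years of education (a notation assumed here, not used elsewhere in this article):

$$\hat{y}(x+1) - \hat{y}(x) = \bigl[2.372 + 1.267(x+1)\bigr] - \bigl[2.372 + 1.267x\bigr] = 1.267$$

$$\hat{y}(0) = 2.372 + 1.267 \cdot 0 = 2.372$$

So each extra year of education adds exactly the slope to the predicted wage, and the intercept is the predicted wage at zero years of education.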

**Residuals:**

A residual is the difference between the actual value and the predicted value.

Suppose our prediction for a professional with 12 years of education (say, #11 in the table) is roughly $17 per hour (the prediction above, truncated to whole dollars), while that professional's actual wage is $20 per hour.

The difference between the actual and the predicted wage, $3, is the residual.

Thus Residual = Actual Value - Predicted Value

Residual = $20 - $17 = $3

The residual captures the other factors that are not included in the regression equation: factors that do affect wages but are not contained in the model.

$$\text{Wages} = \beta_0 + \beta_1(\text{Education}) + \mu$$

where µ is the residual term.
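A minimal Python sketch of the residual calculation follows. The actual wage ($20) and the rounded prediction ($17) come from the example in the text; the function name `residual` is chosen here for illustration.

```python
def residual(actual, predicted):
    """Residual = actual value minus predicted value."""
    return actual - predicted


actual_wage = 20.0     # actual hourly wage from the example
predicted_wage = 17.0  # prediction as rounded in the example

print(residual(actual_wage, predicted_wage))  # 3.0
```

A positive residual means the model under-predicted the wage; a negative one means it over-predicted.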

**Summary:**

1) The regression line is the “line of best fit”.

2) $\beta_1$ is the slope of the line. A 1-unit increase in X leads to a $\beta_1$ increase in Y.

3) $\beta_0$ is the value of Y when X equals zero.

4) $\beta_1 > 0$ means there is a positive relationship between X and Y.

5) The estimated regression can be used to predict Y given X. For example, 12 years of education gives a predicted wage of about $17.6 per hour.

6) The residual is the actual value of Y minus the predicted value.

7) The residual term contains all the factors (other than X) that affect Y.