# Regression Part 1: Linear Regression

Before we discuss regression, let us review a few technical terms:

Dependent variable: This is the outcome variable. It is called the dependent variable because its value depends upon (is influenced by) other variables (in the equation).

Independent variable: This is so called because its values do not depend upon the outcome variable; it is independent of the outcome variable.

So what is regression?

We saw in the post on correlation that correlation is a measure of how one continuous variable changes with respect to another continuous variable. The correlation coefficient tells us three things:

1. Whether there is a correlation or not (if the correlation coefficient is zero, there is no linear correlation; any other value between -1 and +1 indicates at least some correlation)

2. Whether the correlation is positive or negative (if the sign of the correlation coefficient is +, it indicates positive correlation; if the sign is -, it indicates negative correlation)

3. The magnitude of correlation (as reflected by the absolute value of the correlation coefficient)
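To make these three points concrete, here is a minimal sketch in Python using NumPy. The ages and heights are made-up numbers chosen only for illustration, and the variable names (`ages`, `heights`, `r`) are my own, not taken from the earlier post.

```python
import numpy as np

# Hypothetical data, invented for illustration: ages in years, heights in cm.
ages = np.array([2, 4, 6, 8, 10, 12], dtype=float)
heights = np.array([86, 102, 115, 128, 138, 149], dtype=float)

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is Pearson's r.
r = np.corrcoef(ages, heights)[0, 1]

# 1. Is there a correlation at all? (r different from zero)
has_correlation = not np.isclose(r, 0.0)

# 2. Is it positive or negative? (sign of r)
direction = "positive" if r > 0 else "negative"

# 3. How strong is it? (absolute value of r)
magnitude = abs(r)

print(f"r = {r:.3f}, correlated = {has_correlation}, "
      f"direction = {direction}, magnitude = {magnitude:.3f}")
```

Note that the output is a single number describing the relationship; nothing here gives us a height for a particular age.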

However, correlation does not tell us what the value of the dependent variable would be, if the independent variable had a certain value. Let me illustrate this point using an example:

In the previous post I gave the example of age and height to illustrate positive correlation: as age increases, so does height.

What correlation does not allow us to do is predict the height for a given age. Here, age is the independent variable, and height is the dependent variable. The value of height depends upon the value of age, but not vice-versa.

Regression enables us to predict the height at a given age by expressing the relationship between age and height as an equation. This equation is called a regression equation.

The simplest form of a regression equation is:

y= a + bx

where

y= Dependent/ outcome variable

a= intercept/ constant

b= slope

x= independent variable

Substituting the variables from our example, we have:

Height= a + b(age)

The above equation, with only one independent variable, is an example of a Simple Linear Regression.

It is called ‘simple’ because there is only one independent variable.

It is called ‘linear’ because the form of the equation (y= a + bx) is the same as that for a straight line.
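As a rough sketch of how such an equation can be estimated and then used for prediction, the snippet below fits a straight line by least squares with NumPy's `polyfit`. The data are hypothetical (invented for this example), and `predict_height` is a helper name I have introduced for illustration.

```python
import numpy as np

# Hypothetical data, invented for illustration: ages in years, heights in cm.
ages = np.array([2, 4, 6, 8, 10, 12], dtype=float)
heights = np.array([86, 102, 115, 128, 138, 149], dtype=float)

# Fit Height = a + b(age) by least squares.
# With deg=1, np.polyfit returns the coefficients as [slope, intercept].
b, a = np.polyfit(ages, heights, deg=1)


def predict_height(age):
    """Return the height predicted by the fitted line for a given age."""
    return a + b * age


print(f"Height = {a:.2f} + {b:.2f}(age)")
print(f"Predicted height at age 7: {predict_height(7):.1f} cm")
```

The intercept `a` is the predicted height when age is zero, and the slope `b` is the change in predicted height for each additional year of age. Predictions are only trustworthy within the range of ages used to fit the line.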

If we introduced some more independent variables, then the equation might look like this:

Height= a + b0(age) + b1(sex) + b2(SES)

Such an equation is an example of Multiple Linear Regression.

You might have noticed that the only changes from the previous equation are:

1. There is more than one independent variable.

2. ‘b’ now has numbers attached to it. These numbers increase sequentially with each additional variable.

3. The independent variables are not all continuous: sex has only two levels, male and female (binary variable), while SES (Socio Economic Score) has more than two levels.

However, the dependent variable is a continuous variable (just as in the simple linear regression equation).

Thus, in a simple/ multiple linear regression equation the dependent variable should always be a continuous variable. The independent variables may be continuous, categorical (ordinal), or binary.
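Here is a minimal sketch of a multiple linear regression with one continuous, one binary, and one ordinal independent variable, solved by ordinary least squares with NumPy. Everything in it is an assumption made for illustration: the data are synthetic, generated without noise from coefficients I picked myself (`true_coef`), sex is coded 0/1, and SES is treated as a score from 1 to 3. Because the data are noise-free, the fit should simply recover the chosen coefficients.

```python
import numpy as np

# Synthetic, made-up predictors (not real measurements).
age = np.array([2, 4, 6, 8, 10, 12, 5, 9], dtype=float)  # continuous
sex = np.array([0, 1, 0, 1, 0, 1, 1, 0], dtype=float)    # binary: 0 or 1
ses = np.array([1, 2, 3, 1, 2, 3, 3, 1], dtype=float)    # ordinal score 1-3

# Coefficients chosen arbitrarily for the illustration: [a, b0, b1, b2].
true_coef = np.array([75.0, 6.0, 2.0, 1.0])

# Design matrix: a column of ones for the intercept, then each predictor.
X = np.column_stack([np.ones_like(age), age, sex, ses])

# Noise-free heights generated from Height = a + b0(age) + b1(sex) + b2(SES).
height = X @ true_coef

# Ordinary least squares: find the coefficients minimising squared error.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, height, rcond=None)
a, b0, b1, b2 = coef

print(f"Height = {a:.2f} + {b0:.2f}(age) + {b1:.2f}(sex) + {b2:.2f}(SES)")
```

Coding SES as 1, 2, 3 assumes that each step up the scale changes height by the same amount; that is a modelling choice, not something the equation requires.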