Linear Regression
This tool calculates a statistical linear regression. In particular, it computes the following elements:
- Linear regression line,
- Total sum of squares (TSS or SST),
- Explained sum of squares (ESS),
- Residual sum of squares (RSS),
- Mean square residual,
- Degrees of freedom,
- Residual standard deviation,
- Correlation coefficient,
- Coefficient of determination (R² or r²),
- Regression variance,
- 95% confidence interval,
- 95% prediction interval.
Simple Linear Regression Line
The purpose of a simple linear regression is to establish a linear relationship between a single variable Y, called the dependent variable, and a single variable X, called the independent variable.
Graphical representation of a linear regression:
Variable `X = {x_1, x_2,...,x_n}` on the x-axis
Variable `Y = {y_1, y_2,...,y_n}` on the y-axis
Computing a linear regression amounts to estimating two parameters `beta_0` and `beta_1` that define the regression line:
`y = beta_1 . x + beta_0`
The most commonly used method for estimating `beta_0` and `beta_1` is the least-squares method.
Estimators for `beta_0` and `beta_1`:
We denote by `bar x` the arithmetic mean of the X series, `bar x = 1/n . sum_{i=1}^{i=n} x_i`
We denote by `bar y` the arithmetic mean of the Y series, `bar y = 1/n . sum_{i=1}^{i=n} y_i`
`hat beta_1 = \frac{\text{cov}(X,Y)}{\text{var}(X)} = \frac{sum_{i=1}^{i=n} (x_i - bar x) (y_i - bar y)}{sum_{i=1}^{i=n} (x_i - bar x)^2}`
`hat beta_0 = bar y - hat beta_1 . bar x`
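As an illustration, here is a minimal Python sketch (standard library only; the function and variable names such as `fit_simple_linear_regression`, `xs`, `ys` are ours, not part of the tool) that computes the least-squares estimators from the formulas above:

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta1_hat, beta0_hat) for the line y = beta1 * x + beta0, by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n  # arithmetic mean of the X series
    y_bar = sum(ys) / n  # arithmetic mean of the Y series
    # Numerator and denominator of beta1_hat: cov(X, Y) and var(X) up to a common
    # 1/n factor, which cancels in the ratio.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta1_hat, beta0_hat
```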
Estimating y0 for a given x0
Once the regression line has been calculated as explained above, the variable Y can be estimated for any value of the variable X using the line equation and the estimators of `beta_1` and `beta_0`:
`hat y_0 = hat beta_1 . x_0 + hat beta_0`
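Continuing the Python sketch above (again, the function name is our own), the estimate for a new value `x_0` is simply:

```python
def predict(beta1_hat, beta0_hat, x0):
    """Estimate y_0 = beta1_hat * x0 + beta0_hat for a new observation x0."""
    return beta1_hat * x0 + beta0_hat
```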
ESS, RSS, TSS and coefficient of determination (R²)
To assess the quality of a linear regression, i.e. its ability to predict the dependent variable (Y), several quantities are used, including:
- ESS, or Explained Sum of Squares: the variation explained by the regression. It is calculated as follows,
`ESS = sum_{i=1}^{i=n} (hat y_i - bar y)^2`
- RSS, or Residual Sum of Squares: the variation not explained by the regression. It is calculated as follows,
`RSS = sum_{i=1}^{i=n} (y_i - hat y_i)^2`
- TSS, or Total Sum of Squares: the total variation. It is calculated as follows,
`TSS = ESS + RSS = sum_{i=1}^{i=n} (y_i - bar y)^2`
- R², or coefficient of determination, defined by
`R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}`
We see that `0 <= R^2 <= 1`.
The closer R² is to 1, the better the predictions of the linear regression model: the cloud of points is tightly clustered around the regression line. Conversely, the closer R² is to 0, the poorer the quality of the prediction.
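The following Python sketch, in the same spirit as the ones above (names are ours), computes ESS, RSS, TSS and R² from a fitted line:

```python
def regression_sums_of_squares(xs, ys, beta1_hat, beta0_hat):
    """Return (ess, rss, tss, r2) for the fitted line y = beta1_hat * x + beta0_hat."""
    n = len(ys)
    y_bar = sum(ys) / n
    y_hat = [beta1_hat * x + beta0_hat for x in xs]        # fitted values hat y_i
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained sum of squares
    rss = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # residual sum of squares
    tss = sum((y - y_bar) ** 2 for y in ys)                # total variation (= ESS + RSS)
    r2 = 1 - rss / tss                                     # coefficient of determination
    return ess, rss, tss, r2

# Illustrative usage with arbitrary toy data (not from the tool):
# b1, b0 = fit_simple_linear_regression(xs, ys)
# ess, rss, tss, r2 = regression_sums_of_squares(xs, ys, b1, b0)
```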