Simple Linear Regression and Hypothesis Testing

Published

September 24, 2025

What is the point of linear regression?

Given an experiment with some independent variables and one dependent variable, linear regression is a way to estimate how much the dependent variable relates to the independent variables. Let’s write out a simple linear regression below with an intercept, \(\beta_0\), and one independent variable (i.e. covariate), \(X_i\) with a slope, \(\beta_1\), that models the relationship to the dependent variable, \(Y_i\).

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i , \epsilon_i \sim N(0, \sigma^2) \] The four model assumptions that we have with linear regression are…

  1. Linearity. The dependent variable and independent variable have a linear relationship.

  2. Homoskedasticity. This means that the variance of the errors is constant across the range of \(X_i\).

  3. Normality. The errors are Normally distributed.

  4. Independence. This assumes that \(Y_i\) is independent of \(Y_j\) for \(i \neq j\), i.e. the errors are independent of one another.
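In practice, these assumptions are usually eyeballed with residual diagnostics rather than proven. As a minimal sketch (assuming a fitted model object such as the lm.fit we create later in this post), R’s built-in plot method for lm objects covers most of them:

# residuals vs. fitted values: checks linearity and homoskedasticity
plot(lm.fit, which = 1)
# Normal Q-Q plot of the residuals: checks Normality
plot(lm.fit, which = 2)
# independence usually has to come from the study design rather than a plot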

Let’s explore a simulated set of data to better visualize this. Simulations are nice because we can make sure our models are working as we think they should before applying them to real data. For the next bit, we will be using \(\beta_0 = 1\), \(\beta_1 = 1\), \(\sigma^2 = 20\), and 100 data points.

# simulating our data
N <- 100
Xi <- runif(min = 1, max = 10, n = N)

# getting our parameters 
beta0 <- 1
beta1 <- 1
var <- 20

# putting it together 
ei <- rnorm(N, mean = 0, sd = sqrt(var))
Yi <- beta0 + beta1*Xi + ei  

# note that this formulation is equivalent to the above. 
# Yi <- rnorm(N, mean = beta0 + beta1*Xi, sd = sqrt(var)) 

plot(Xi,Yi, pch = 16) 

What’s our goal in linear regression? We have an independent variable \(X_i\), and we want to see how it relates to our dependent variable \(Y_i\) through a linear relationship made up of an intercept and a slope. The intercept, \(\beta_0\), is the average value of our outcome, \(Y_i\), when our independent variable, or predictor, is set to 0. Our slope, on the other hand, can be understood as the change in the dependent variable given a 1 unit increase in the predictor (independent) variable.

So, our goal is to find the line that best minimizes the difference between it and all the data \((X_i, Y_i)\), given our choice of \(\beta_0\) and \(\beta_1\). One way of thinking about this is to turn it into a minimization problem. We want to minimize the distance between the line (our model-predicted values, in a sense) and the data itself. If \(Y_i\) is our dependent variable, then we can write \(\hat{Y}_i = \beta_0 + \beta_1 * X_i\) as our model-predicted values (our line)! Thus, our goal is to minimize

\[ \min \space \Sigma_{i=1}^n (Y_i - [\beta_0 + \beta_1 X_i])^2 \] The summed quantity above is known as the sum of squared errors. When we turn it into a minimization problem, we say that we are trying to solve the least squares equation.
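Just to make the minimization framing concrete, here is a small sketch that minimizes the sum of squared errors numerically over candidate \((\beta_0, \beta_1)\) values, using the simulated Xi and Yi from above (the names sse and fit.optim are mine). The answer should land very close to the closed-form least squares solution we derive below.

# the sum of squared errors as a function of a candidate (beta0, beta1) pair
sse <- function(b) sum((Yi - (b[1] + b[2]*Xi))^2)

# minimizing it numerically, starting from (0, 0)
fit.optim <- optim(c(0, 0), sse)
fit.optim$par # roughly the intercept and slope estimates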

Let’s take a quick dive into matrices for a second.

If we want to solve the least squares equation, it can be helpful to re-write our formula in terms of matrices. Let’s imagine \(Y_i\) as a column vector, where each row \(i\) contains one observation. Below, we can see the first 10 rows of the vector \(Y_i\) from our simulation. This is our observed data!

matrix(Yi[1:10])
           [,1]
 [1,]  8.241770
 [2,] 10.080694
 [3,]  3.735962
 [4,] 12.224881
 [5,]  9.111558
 [6,]  9.267149
 [7,]  3.047968
 [8,]  7.764929
 [9,] -1.390841
[10,]  4.902824

Our goal is now to find a single value for \(\beta_0\) and a single value for \(\beta_1\) such that \(\beta_0 + \beta_1 * X_i\) is our best-fit line to \(Y_i\). Note that because we simulated the data, we technically know what \(\beta_0\) and \(\beta_1\) should be. But in non-simulated data land, we won’t know either of these values, nor the extent of the variance \(\sigma^2\). We will just have our best guess at a relationship. In this case it’s a linear one: \(\beta_0 + \beta_1 * X_i\).

Let’s re-write this line in matrix form such that for row \(i\), we get that \(\hat{Y}_i = \beta_0 + \beta_1 * X_i\). (I put a hat on \(Y\) to show that it is an estimate and not a true value. It can also be thought of as our expected value of \(Y\).)

Recalling our rules of matrices, we can re-write it as \[ E[Y] = \hat{Y} = X\vec{\beta} \]

where \[ \begin{align} X &= \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \\\\ \vec{\beta} &= \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \end{align} \]

Note that \(X\) in this context is called the design matrix! It is an \(n \times p\) matrix, where \(n\) is the number of observations and \(p\) is the number of coefficients (\(\beta\)’s) to estimate.
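For our simulated data, R can build this design matrix for us with model.matrix(); the first column of 1’s corresponds to the intercept and the second column holds the \(X_i\)’s.

# the first few rows of the n x p design matrix for an intercept plus one covariate
head(model.matrix(~ Xi))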

Solving the least squares equation

Now we can rewrite the least squares equation as \[ \min (Y - X \vec{\beta})^2 = \hat{\epsilon}^2 \] Note that we can re-write this as \[ \begin{align} \min (Y - X \vec{\beta})^2 & = \min \left( (Y - X\vec{\beta})^T(Y - X\vec{\beta}) \right) \\\\ &= \min (Y^TY - Y^TX \vec{\beta} - \vec{\beta}^T X^T Y + \vec{\beta}^T X^T X \vec{\beta}) \\\\ &= \min (Y^TY - 2\vec{\beta}^T X^T Y + \vec{\beta}^T X^T X \vec{\beta}) \end{align} \] Because we already know what \(X\) and \(Y\) are, our goal is to find the right \(\beta\)’s to minimize this equation. So we differentiate and set the derivative equal to 0 to find a stationary point (either a minimum or a maximum).

\[ \begin{align} 0 &= \frac{d}{d\beta} (Y^TY - 2\vec{\beta}^T X^T Y + \vec{\beta}^T X^T X \vec{\beta}) \\\\ &= - 2 X^T Y + 2 X^T X \vec{\beta} \\\\ &= - X^T Y + X^T X \vec{\beta} \\\\ X^T Y &= X^T X \vec{\beta} \\\\ (X^T X)^{-1} X^T Y &= \vec{\beta} \end{align} \] Now we have our \(\beta\)’s that solve our least squares equation!

Going back to predictions and the hat matrix

Because we know that \((X^T X)^{-1} X^T Y = \vec{\beta}\), we can put it back into our equation from before of \(E[Y] = \hat{Y} = X\vec{\beta}\) to get that \[ E[Y] = \hat{Y} = X\vec{\beta} = X (X^T X)^{-1} X^T Y \] This matrix \(X (X^T X)^{-1} X^T\) is known as the hat matrix: when applied to the vector of observed data, \(Y\), it gives back the expected values given the covariates \(X\) (the predicted values), which by convention get a hat over the variable, \(\hat{Y}\).

Can we check this with our simulated data?

# doing linear regression in r
lm.fit <- lm(Yi ~ Xi)
# getting the coefficients from the r output
coef(lm.fit)
(Intercept)          Xi 
   1.551369    1.040747 
# creating our design matrix (add_column() comes from the tibble package)
library(tibble)
X.matrix <- as.data.frame(Xi) |>
  # adding the column of 1's for the intercept
  add_column(int = 1, .before = 1) |>
  as.matrix()

# solving for the betas from (X^T X)^{-1} X^T Y
betas <- solve(t(X.matrix)%*%X.matrix)%*%t(X.matrix)%*%as.matrix(Yi)
betas
        [,1]
int 1.551369
Xi  1.040747
# graphing our line 
plot(Xi,Yi, pch = 16) 
lines(Xi, X.matrix%*%betas)

Looks good! So to recap, we now have our beta estimates for our linear regression.

Wait, but what about our variance?

Note that the variance of \(Y\) is the variance of the errors. Because we’ve assumed that the errors are independent and identically distributed, the variance-covariance matrix is a diagonal matrix with just \(\sigma^2\) along the diagonal.

\[ \begin{align} Var(Y) &= Var(X\vec{\beta}+\vec{\epsilon}) \\\\ &= Var(X\vec{\beta}) + Var(\vec{\epsilon}) + 2Cov(X\vec{\beta}, \vec{\epsilon}) \\\\ &= Var(\vec{\epsilon}) \\\\ &= \sigma^2*I \end{align} \]

Recall that our errors are distributed as \(\epsilon_i \sim N(0, \sigma^2)\).

Adding on from the previous section, we have that our hat matrix is \(H = X(X^T X)^{-1} X^T\). So we can rewrite the linear regression, and then the residuals, as…

\[ \begin{align} Y_i &= \beta_0 + \beta_1 * X_i + \epsilon_i \\\\ Y &= X \vec{\beta} + \vec{\epsilon} \\\\ \hat{Y} &= X\vec{\beta} = HY \\\\ \vec{\epsilon} &= Y - HY \\\\ &= (I- H)Y \end{align} \] So now we have a way of getting the residuals from our hat matrix! However, we still need to find a way to estimate \(\sigma\).

How do we estimate \(\sigma^2\)? An intuitive way I like to think about the variance is that it quantifies how dispersed the data are from the mean. In other words, how far \(Y_i\) is from \(\beta_0 + \beta_1 * X_i\), averaged over the whole data set. Sound familiar? This is the same idea as our least squares equation from before. This quantity is often called the mean squared error.

\[ \begin{align} \sigma^2 \rightarrow s^2 &= \Sigma_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / (n-p) \\\\ &= \Sigma_{i=1}^{n} \epsilon_i^2 / (n-p) \\\\ &= \vec{\epsilon}^T\vec{\epsilon}/(n-p) \end{align} \] We use \(s^2\) to denote the estimated variance. Note that the last line of the above derivation is just the matrix form of the definition of the residuals that we showed earlier.

Why do we divide by \(n-p\) and not just \(n\) to average over the whole set? This is because we only have \(n-p\) degrees of freedom. An intuitive way to think about this is that each data point you have gives you information, and the more parameters you estimate, the more of that information you use up.

For example, imagine you have exactly two points and you want to fit a linear regression. You need a minimum of those two data points to estimate an intercept and a slope. If you had just one data point, you would not be able to use this type of model, and if you have more than two data points, what’s left over gives you more freedom to better model your data with more information. This is the general idea of degrees of freedom.
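We can see this play out in R with a toy fit (tiny.fit below is just for illustration): with exactly two points and two parameters, the line passes through both points, leaving 0 residual degrees of freedom and nothing left over to estimate the variance with.

# two points, two parameters: a perfect fit with 0 residual degrees of freedom
tiny.fit <- lm(c(1, 3) ~ c(0, 1))
summary(tiny.fit) # residual standard error is NaN on 0 degrees of freedom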

Let’s again check this in R.

# getting the hat matrix 
H <- X.matrix%*%solve(t(X.matrix)%*%X.matrix)%*%t(X.matrix)

# doing it with a sum 
sum((Yi - H%*%as.matrix(Yi))^2)/(N-2)
[1] 16.60914
# doing it the matrix way 
e.dot <- Yi - H%*%as.matrix(Yi)
(t(e.dot)%*%e.dot)/(N-2) # a 1x1 matrix containing our variance estimate
         [,1]
[1,] 16.60914
# looking at the variance from the linear model fit 
as.numeric(summary(lm.fit)[6])^2 
[1] 16.60914

We see that both the sum and matrix methods give the same thing, and they match the linear model output from R! Note that we square the value from the linear model fit since that is just \(\sigma\), i.e. the standard deviation (the square root of the variance).

Back to hypothesis testing

But remember our goal: to see if there is any linear relationship between our independent variable \(X_i\) and our dependent variable \(Y_i\).

We can formulate this into a hypothesis test. The general steps are

  1. Create the null and alternative hypotheses.

  2. Create a test statistic that will quantify evidence against the null hypothesis given the sample. This test statistic should have a calculable sampling distribution. Note that you can’t mix and match test statistics with distributions: each test statistic is constructed so that it follows a particular distribution under the null hypothesis.

  3. Decide a threshold or significance level, \(\alpha\). If the probability of getting a test statistic at least as extreme as the one observed, given that the null hypothesis is true, is smaller than \(\alpha\), we reject the null hypothesis. This probability is also known as the p-value.

So in this case, what are our null and alternative hypotheses? We care whether there is any linear relationship between our independent and dependent variables, i.e. whether our slope (\(\beta_1\)) is not 0. We can formulate this as

\[ \begin{align} H_0 : \beta_1 = 0 \\ H_A : \beta_1 \neq 0 \end{align} \]

A short digression on the frequentist framework. The data that one has, \(X_i\), is viewed as a sample of a larger population. Because of this, we will never get the true, fixed values of \(\beta_0\) and \(\beta_1\). We can only do our best to estimate their values, \(\hat{\beta_0}\) and \(\hat{\beta_1}\). We can make the argument that if we were to sample our population a bunch of times (i.e. get a bunch of \(X_i\)’s), we would then be able to construct a sampling distribution of different \((\hat{\beta_0}, \hat{\beta_1})\) pairings. Obviously this would take a lot of time, and usually we don’t have a cost- or time-effective way to come up with a bunch of samples. This is where the Central Limit Theorem comes in!

The Central Limit Theorem states that if one were to construct a sampling distribution of means by repeatedly sampling a population and estimating the mean each time, then as the sample size increases, that sampling distribution approaches a Normal distribution.

Thinking about this, we can state that \(\hat{\beta_1}\) will be approximately Normally distributed with a mean at \(\beta_1\). Note that we also have a sampling distribution for \(s^2\), the estimate of \(\sigma^2\), which (appropriately scaled) is \(\chi^2\) distributed.
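Since we simulated the data ourselves, we can check this empirically: re-generate the errors many times, refit the regression each time, and collect the slope estimates (beta1.hats and Yi.new below are names I made up). The histogram should look roughly Normal and be centered near the true \(\beta_1 = 1\).

# building an empirical sampling distribution of the slope estimate
beta1.hats <- replicate(1000, {
  ei.new <- rnorm(N, mean = 0, sd = sqrt(var))
  Yi.new <- beta0 + beta1*Xi + ei.new
  coef(lm(Yi.new ~ Xi))[2]
})
hist(beta1.hats, breaks = 30) # roughly Normal, centered near beta1 = 1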

Quick general facts about some test-statistic distributions

Note that there are test statistics that come from different distributions. Common test statistics are designed to come from the t-distribution, F-distribution, and \(\chi^2\) distribution.

The t-distribution with \(\nu\) degrees of freedom is defined as the distribution of a random variable \(T\), where \[ T = \frac{Z}{\sqrt{V/\nu}} = Z\sqrt{\frac{\nu}{V}} \] where \(Z\) is standard Normally distributed, \(V\) has a \(\chi^2\)-distribution with \(\nu\) degrees of freedom, and \(Z\) and \(V\) are independent. The main usage of the t-distribution in this context is testing whether the difference between two means is “significant” given a pre-determined threshold \(\alpha\) or not.

The \(\chi^2\) distribution with \(k\) degrees of freedom is defined as \[ X = \Sigma_{i=1}^k Z_i^2 \] where the \(Z_i\) are independent, standard Normal random variables. This distribution is often used in goodness of fit tests and likelihood-ratio tests for nested models, among other things. Note that the sample variance, appropriately scaled, is \(\chi^2\)-distributed.

The F-distribution is defined as \[ X = \frac{U_1/d_1}{U_2/d_2} \] where \(U_1\) and \(U_2\) are independent random variables with chi-square distributions with respective degrees of freedom \(d_1\) and \(d_2\). A usage for this distribution is when comparing variances (ANOVA).
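As a quick sanity check of the t-distribution definition above, we can build t-distributed values from a standard Normal and a \(\chi^2\) by hand and compare them against draws straight from rt() (nu, Z, V, and T.manual are just illustrative names).

# constructing t-distributed values from their definition
nu <- 5
Z <- rnorm(10000)           # standard Normal
V <- rchisq(10000, df = nu) # chi-square with nu degrees of freedom
T.manual <- Z / sqrt(V / nu)

# comparing against draws straight from the t-distribution
qqplot(T.manual, rt(10000, df = nu))
abline(0, 1)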

So what is our test statistic then?

We are testing to see if there is some linear relationship between our \(Y_i\) and \(X_i\). We have assumed a null hypothesis, \(H_0\), that states that \(\beta_1 = 0\). This is our assumed truth. So, if we take our estimated \(\hat{\beta_1}\) and standardize it around the value of \(\beta_1\) under our null hypothesis, we will get a measure of how far \(\hat{\beta_1}\) is from our assumed \(\beta_1\). Note that this measure will be t-distributed! So, our test statistic, \(T\), can be written as \[ T = \frac{\hat{\beta_1} - \beta_1}{SE(\hat{\beta_1})} \] And because in this hypothesis test we are assuming that \(\beta_1 = 0\) (i.e. that the null hypothesis is true), this will simplify to

\[ T = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} \]

Finding the standard errors

We have the solution to the least squares equation. Going back to the general case, we have a vector of \(\hat{\beta}\)’s, as well as an estimate for the variance, \(s^2\). Now we need to find the standard error for the \(\hat{\beta}\)’s so we can calculate our hypothesis test statistic of \(\hat{\beta_1} / SE(\hat{\beta_1})\) which we now know will be t-distributed.

We recall from way up above that \(\vec{\beta} = (X^TX)^{-1}X^TY\). Also note this neat trick: given a constant matrix \(A\) and a random vector \(x\), \(Var(Ax) = A*Var(x)*A^T\), and given a symmetric matrix \(A\), \(A = A^T\) and \(A^{-1}\) is also a symmetric matrix. In our case, we know that \(X^TX\) is a symmetric matrix.

\[ \begin{align} Var(\beta) &= Var((X^TX)^{-1}X^T Y) \\\\ &= (X^TX)^{-1}X^T * Var(Y) * ((X^TX)^{-1}X^T)^T \\\\ &= (X^TX)^{-1}X^T * \sigma^2 I * ((X^TX)^{-1}X^T)^T \\\\ &= \sigma^2 I * (X^TX)^{-1}X^T * ((X^TX)^{-1}X^T)^T \\\\ &= \sigma^2 I * (X^TX)^{-1}X^T * (X^T)^T ((X^TX)^{-1})^T \\\\ &= \sigma^2 I * (X^TX)^{-1}X^T * X ((X^TX)^{-1})^T \\\\ &= \sigma^2 I * (X^TX)^{-1} * (X^T X) * ((X^TX)^{-1})^T \\\\ &= \sigma^2 I * ((X^TX)^{-1})^T \\\\ &= \sigma^2 I * (X^TX)^{-1} \end{align} \] Now we only have estimates for some of these values, so in practice, we have that \[ Var(\hat{\beta}) = s^2 I * (X^TX)^{-1} \] Let’s check this out in R.

# getting our variance estimate (s^2) as before 
sigma.hat <- sum((Yi - H%*%as.matrix(Yi))^2)/(N-2)

# turning it into a diagonal matrix 
sigma.hat <- matrix(c(sigma.hat,0,0,sigma.hat), nrow = 2) # s^2 * Identity matrix

# the variance-covariance matrix of the beta estimates should be 
sigma.hat %*% solve(t(X.matrix)%*%X.matrix)
            int          Xi
[1,]  0.9643141 -0.13942541
[2,] -0.1394254  0.02435341
# checking with the linear model fit
vcov(lm.fit) # compare with above 
            (Intercept)          Xi
(Intercept)   0.9643141 -0.13942541
Xi           -0.1394254  0.02435341

We can also then check for the standard errors since the square root of the variance estimate will be the standard error.

# estimated standard error for beta0
sqrt( (sigma.hat %*% solve(t(X.matrix)%*%X.matrix))[1,1] )
     int 
0.981995 
# estimated standard error for beta1
sqrt( (sigma.hat %*% solve(t(X.matrix)%*%X.matrix))[2,2] )
       Xi 
0.1560558 
# SE from the linear model fit for beta0
sqrt(vcov(lm.fit)[1,1]) 
[1] 0.981995
# SE from the linear model fit for beta1
sqrt(vcov(lm.fit)[2,2]) 
[1] 0.1560558

We see that they match! Neat!

Calculating our test statistic

Now that we have our \(\hat{\beta_1}\) and our \(SE(\hat{\beta_1})\), we can construct our test statistic and evaluate it.

# getting our derived variance covariance matrix 
varcovar.matrix <- sigma.hat %*% solve(t(X.matrix)%*%X.matrix)

# calculating our test statistic
beta1_hat <- betas[2,1]
beta1_SE <- sqrt(varcovar.matrix[2,2])
test.statistic <- beta1_hat/beta1_SE
test.statistic
      Xi 
6.669068 

How do we evaluate this test statistic? We know that it comes from a t-distribution, and we want to evaluate the probability that we would get a value as extreme as or more extreme than our observed test statistic.

A small digression into probability distributions

For a continuous distribution, the probability that a random variable \(x\) falls between \(A\) and \(B\) can be written as \(P(A \leq x \leq B) = \int_A^B f(x;\theta) dx\), where \(\theta\) are the parameters that describe \(f(x;\theta)\), the probability density function of the distribution. In R, there are four different types of functions for probability distributions. These functions start with either d, p, q, or r.

  1. ddistr: d is for density and returns the value of \(f(y;\theta)\), i.e. the value of the probability density function (continuous distributions) or probability mass function (discrete distributions). This is the green point in the figure below. It is the density evaluated at \(x = 2\).

  2. pdistr: p is for probability and returns a value of \(F(y;\theta)\), the cumulative distribution function. This is the shaded region in the figure below. It is evaluated up until \(x = 2\).

  3. qdistr: q is for quantile and returns a value from the inverse of \(F(y; \theta)\) and is also known as the quantile function. This is the blue point. It is the inverse of pdistr. It is evaluated at 0.977, which results in a value of \(x = 2\).

  4. rdistr: r is for random and generates random values from the given distribution. These are the red points in the figure below. There are 50 points drawn from a standard Normal distribution, meaning that the mean is 0 and the variance is 1.
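Before the plot below, here are the same four functions evaluated numerically for the standard Normal, matching the points described above.

dnorm(2)     # density at x = 2 (the green point)
pnorm(2)     # P(X <= 2), about 0.977 (the shaded area)
qnorm(0.977) # the x value with cumulative probability 0.977, roughly 2 (the blue point)
rnorm(5)     # five random draws (like the red points)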

plot(seq(-3, 3, 0.01), dnorm(seq(-3, 3, 0.01), 0, 1), cex = 0.5)
# plotting the area under the curve for what pnorm(2) is
polygon(c(seq(-3, qnorm(0.977), 0.01), rev(seq(-3, qnorm(0.977), 0.01))),
        c(rep(0, length(seq(-3, qnorm(0.977), 0.01))), rev(dnorm(seq(-3, qnorm(0.977), 0.01), 0, 1))),
        col = 'gray', border = NA)
# plotting the density of the N(0, 1) at x = 2
points(2, dnorm(2, 0, 1), pch = 19, col = 'green') 
# plotting the value of x at which the cumulative probability is 0.977
points(qnorm(0.977, 0, 1), 0, pch = 19, col = 'blue')
# plotting 50 random variables drawn from a N(0, 1)
points(rnorm(50, 0, 1), rep(0, 50), pch = 20, col = 'red')

Evaluating the test statistic

In our case, we want to find the probability of getting a value that is equal to or more extreme than the test statistic value, given that the null hypothesis is true. This is known as our p-value.

Let’s look at this in R.

# we should get the same estimates of the t-value and p-value from the lm
p.val <- 2*(1-pt(abs(test.statistic), N-2)) # abs() handles the sign; equivalently 2*pt(-abs(test.statistic), N-2)
p.val
          Xi 
1.533163e-09 
# checking with the linear model fit
summary(lm.fit)[4]
$coefficients
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 1.551369  0.9819950 1.579813 1.173720e-01
Xi          1.040747  0.1560558 6.669068 1.533163e-09
# let's plot this!
plot(seq(-8, 8, 0.01), dt(seq(-8, 8, 0.01), N-2), cex = 0.2)
# plotting the location of our observed test statistic
points(test.statistic, 0, pch = 19, col = 'blue')
# we also care about the other extreme end
points(-test.statistic, 0, pch = 19, col = 'blue')

An easy follow-up question is: why do we multiply by two above to get the p-value? In this formulation of our hypothesis test, we have a two-tailed test. We are assuming that \(\beta_1 = 0\) and seeing if it is \(\neq 0\), as opposed to just \(>\) or \(<\). Therefore, we have to consider the probability that the difference between our estimated and null-assumed value is either far greater than 0 or far less than 0. In the graph above, we care about the area under the curve outside of the blue points.

How good is this value to reject our null hypothesis? Is there a set of values?

Ideally, before going through all of this, you will have chosen a threshold value, \(\alpha\). This is the probability of rejecting the null hypothesis when it is actually true that you are willing to accept. If your p-value lies below this threshold, then we reject the null hypothesis. If it is above, we simply do not reject the null hypothesis. This does not mean the null hypothesis is true; it simply means you do not have adequate evidence to reject the null.

Taking a look again at our t-distribution, if we have \(\alpha = 0.05\), the rejection region is the shaded part under the curve. We see our test statistic from our linear model fit plotted in blue!

# let's plot this!
plot(seq(-8, 8, 0.01), dt(seq(-8, 8, 0.01), N-2), cex = 0.2)
# getting the areas of the curve that match up to our threshold of alpha
polygon(c(seq(qt(0.975, N-2), 8, 0.01), rev(seq(qt(0.975, N-2), 8, 0.01))),
        c(rep(0, length(seq(qt(0.975, N-2), 8, 0.01))), rev(dt(seq(qt(0.975, N-2), 8, 0.01), N-2))),
        col = 'gray', border = NA)
polygon(c(seq(-8, qt(0.025, N-2), 0.01), rev(seq(-8, qt(0.025, N-2), 0.01))),
        c(rep(0, length(seq(-8, qt(0.025, N-2), 0.01))), rev(dt(seq(-8, qt(0.025, N-2), 0.01), N-2))),
        col = 'gray', border = NA)
# plotting our test-statistic 
points(test.statistic, 0, pch = 19, col = 'blue')

We can also construct an interval such that there is a \((1-\alpha)\) probability that the interval would contain the true parameter. This is called the confidence interval for the value. Based on our choice of \(\alpha\), we have assumed that for our test statistic \(T\), \(P(T < qt(\alpha/2) \cup T > qt(1-\alpha/2)) = 0.05\) (this is the entire shaded region in gray in the plot above). Another way to write this is…

\[ \begin{align} P(T < qt(\alpha/2) \cup T > qt(1-\alpha/2)) &= 0.05 \\\\ P(qt(\alpha/2) \leq T \leq qt(1-\alpha/2)) &= 0.95 \\\\ P(qt(\alpha/2) \leq \frac{\hat{\beta_1}-\beta_1}{SE(\hat{\beta_1})} \leq qt(1-\alpha/2)) &= 0.95 \\\\ P(SE(\hat{\beta_1})*qt(\alpha/2) \leq \hat{\beta_1}-\beta_1 \leq SE(\hat{\beta_1})*qt(1-\alpha/2)) &= 0.95 \\\\ P(- \hat{\beta_1} + SE(\hat{\beta_1})*qt(\alpha/2) \leq -\beta_1 \leq -\hat{\beta_1} + SE(\hat{\beta_1})*qt(1-\alpha/2)) &= 0.95 \\\\ P(- \hat{\beta_1} - SE(\hat{\beta_1})*qt(1-\alpha/2) \leq -\beta_1 \leq -\hat{\beta_1} + SE(\hat{\beta_1})*qt(1-\alpha/2)) &= 0.95 \\\\ P(\hat{\beta_1} + SE(\hat{\beta_1})*qt(1-\alpha/2) \geq \beta_1 \geq \hat{\beta_1} - SE(\hat{\beta_1})*qt(1-\alpha/2)) &= 0.95 \end{align} \] Therefore we have our 95% confidence interval as \(\hat{\beta_1} \pm qt(1-\alpha/2)*SE(\hat{\beta_1})\).
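We can check this formula directly in R against confint(), using the \(\hat{\beta_1}\) and \(SE(\hat{\beta_1})\) that we computed earlier.

# 95% confidence interval for beta1 by hand
alpha <- 0.05
c(beta1_hat - qt(1 - alpha/2, N-2)*beta1_SE,
  beta1_hat + qt(1 - alpha/2, N-2)*beta1_SE)

# checking with the linear model fit
confint(lm.fit)["Xi", ]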

How are test statistics and confidence intervals related? Looking up above, we see that both use the same quantities, \(\hat{\beta_1}\) and \(SE(\hat{\beta_1})\), as well as the fact that the test statistic in this case comes from the t-distribution. The test statistic just tells us where we are in the t-distribution given that the null hypothesis is true, whereas confidence intervals give us a range of values that would capture the true value \((1-\alpha) \times 100\%\) of the time.

Note that as our sample size increases, our t-distribution converges to a Normal distribution, so for \(\alpha = 0.05\) folks will often use 1.96 as a proxy instead of \(qt(1-\alpha/2)\).

# setting up the plot frame with the standard Normal (black) for reference
plot(seq(-5, 5, 0.01), dnorm(seq(-5, 5, 0.01), 0, 1), cex = 0.1, col = 'black')
# overlaying t-distributions with increasing degrees of freedom
points(seq(-5, 5, 0.01), dt(seq(-5, 5, 0.01), 1), cex = 0.1, col = 'purple')
points(seq(-5, 5, 0.01), dt(seq(-5, 5, 0.01), 2), cex = 0.1, col = 'blue')
points(seq(-5, 5, 0.01), dt(seq(-5, 5, 0.01), 5), cex = 0.1, col = 'darkgreen')
points(seq(-5, 5, 0.01), dt(seq(-5, 5, 0.01), 10), cex = 0.1, col = 'orange')
points(seq(-5, 5, 0.01), dt(seq(-5, 5, 0.01), 10^6), cex = 0.3, col = 'red')

Looking at additional information (\(R^2\))

\(R^2\) (sometimes called the coefficient of determination) is the proportion of the variation in the dependent variable that is predictable from the independent variables. But what does this mean intuitively? If we had no independent variables, our best guess for the dependent variable would just be its mean, so we could plot our total variation as the squared deviations from that mean. Note that we square the residuals to better capture large deviations from the mean; if we didn’t square them, their sum would be equal to 0.

plot((Yi - mean(Yi))^2, pch = 16)

But we have created a best-fit line using our least squares solution. The variation between our independent and dependent variables now looks like this (overlaid in gray):

plot((Yi - mean(Yi))^2, pch = 16)
points((Yi - H%*%as.matrix(Yi))^2, pch = 16, col = 'gray')

Is there a difference between these two? We see those differences in red.

plot((Yi - mean(Yi))^2, pch = 16)
points((Yi - H%*%as.matrix(Yi))^2, pch = 16, col = 'gray')
points((Yi - mean(Yi))^2 - (Yi - H%*%as.matrix(Yi))^2, pch = 16, col = 'red')

At each index, we see that sometimes the difference between the average value of the dependent variable (which is, in a way, the null hypothesis where \(\beta_1 = 0\)) and the best-fit line is very small. There is not a lot of difference. However, sometimes we see that the difference can be large. How could we quantify these differences for the whole dataset?

If we were to add up all of the squared residuals from the mean (in black) and then add up all of the squared residuals from the best-fit line (in gray), the difference would be the sum of all the squared residuals of the differences (in red). Intuitively, if the squared residuals from the mean line and the best-fit line matched, their sums would be the same and the difference would be 0 (and we would conclude that there was no point in adding predictors to our model, since it simply matched the mean line). In other words, relative to the variation around the mean line, the best-fit line was not able to capture any additional variation.

However if the squared residuals of the best-fit line were 0 (i.e. our model matched perfectly), then the difference between the squared residuals of the mean line and the best-fit line would just be the squared residuals of the mean line. In other words, relative to the variation in the mean line, the best-fit line is able to capture all the variation.

One way to standardize this relationship is by dividing these differences in squared residuals over the squared residuals from the mean line. This value (which now lies between 0 and 1) is our \(R^2\)! The sum of the squared residuals of the mean line is often called the sum of squares total (SST) while the sum of the squared residuals of the best-fit line is called the sum of squared errors (SSE, which we saw when explaining the variance of our data).

\[ R^2 = \frac{SST - SSE}{SST} \]

Let’s look at this in R. Note that the adjusted \(R^2\) takes our degrees of freedom into account (so this time our variance estimate is the same as our adjusted sum of squared errors).

SST <- sum((Yi - mean(Yi))^2) # sum of residuals from the mean, df = n-1
SSE <- sum((Yi - H%*%as.matrix(Yi))^2) # sum of residuals from best fit line, df = n-p
SSR <- SST - SSE # the difference, df = p-1.

# this should match the R-squared of the linear model fit
SSR/SST # also can be written as 1 - SSE / SST
[1] 0.3121671
# checking the linear model fit
summary(lm.fit)[8]
$r.squared
[1] 0.3121671
# Adjusted R-squared
# we can adjust using the degrees of freedom
adjSST <- SST/(N-1) # sum of residuals from the mean, df = n-1
adjSSE <- SSE/(N-2) # sum of residuals from best fit line, df = n-p
adjSSR <- adjSST - adjSSE # the difference, df = p-1.
adjSSR/adjSST
[1] 0.3051484
# checking the linear model fit 
summary(lm.fit)[9]
$adj.r.squared
[1] 0.3051484

Now you can understand where almost all the values in a summary of a model fit in R for a simple linear regression come from!

summary(lm.fit)

Call:
lm(formula = Yi ~ Xi)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.5281 -2.3597 -0.1926  2.4011  9.7745 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.5514     0.9820   1.580    0.117    
Xi            1.0407     0.1561   6.669 1.53e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.075 on 98 degrees of freedom
Multiple R-squared:  0.3122,    Adjusted R-squared:  0.3051 
F-statistic: 44.48 on 1 and 98 DF,  p-value: 1.533e-09
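The one line in the summary we have not derived is the F-statistic. As a rough sketch (reusing the SSR and SSE we computed in the \(R^2\) section), it compares the variation explained by the model to the leftover variation, and with a single predictor it works out to the square of our t test statistic.

# F-statistic: (SSR / (p - 1)) / (SSE / (n - p)), here with p = 2
F.stat <- (SSR/1) / (SSE/(N-2))
F.stat # matches the F-statistic in the summary above
test.statistic^2 # with one predictor, this equals the F-statistic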

Additional resources