# setting up our model parameters
N <- 100
beta0 <- 1
beta1 <- 2
X <- runif(N, 0, 1)
lambda <- exp(beta0 + beta1 * X)
# getting our observations
Y <- rpois(N, lambda = lambda)
# plotting
plot(Y ~ X)
Logistic and Poisson Regression
Bayes’ Theorem
Bayes’ theorem is a statement about conditional probability that follows directly from the definition of joint probability.
Given events A and B, the joint probability of A and B can be written as a conditional probability in two ways.
\[ P(A \text{ and } B) = P(A|B)P(B) = P(B|A)P(A) \]
Setting these two factorizations equal to each other and dividing by \(P(B)\), we get that
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
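As a quick numerical illustration (the numbers here are made up for this example), suppose a test detects a condition with probability 0.95 when the condition is present, the condition has prevalence 0.01, and the test comes back positive 5.8% of the time overall. Bayes’ theorem then gives the probability of the condition given a positive test:
# hypothetical numbers, purely for illustration
p_B_given_A <- 0.95   # P(B|A): positive test given the condition is present
p_A         <- 0.01   # P(A): prevalence of the condition
p_B         <- 0.058  # P(B): overall rate of positive tests
p_B_given_A * p_A / p_B   # P(A|B), roughly 0.16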
When do we use Bayes’ Theorem?
Bayes’ theorem has been revolutionary in the field of statistics. Given evidence (or data) X and parameters \(\theta\), the above theorem is often written in the form
\[ P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)} \]
Where \(P(\theta|X)\) can be viewed as the probability of the parameters given the data; in Bayesian statistics it is known as the posterior distribution. \(P(X|\theta)\) is known as the likelihood (the probability of the data given the parameters) and \(P(\theta)\) is known as the prior. \(P(X)\) is often called the normalizing constant and, thanks to the law of total probability, can be written as \(\int P(X|\theta)P(\theta)d\theta\). However, this integral is often very hard to compute, and since it is a constant with respect to \(\theta\), one will often see that
\[ P(\theta|X) \propto P(X|\theta)P(\theta) \]
Where \(\propto\) means “proportional to.” Usually, \(P(\theta)\) is set beforehand based on prior knowledge, or chosen to be non-informative to signify no additional knowledge. The likelihood \(P(X|\theta)\) can then be maximized to find the most likely parameter \(\theta\); this is maximum likelihood estimation. See my blog post on maximum likelihood here.
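As a small sketch of this proportionality, we can approximate a posterior numerically and never compute \(P(X)\) at all. Below, as an illustration only (the single rate \(\lambda\) and the Gamma prior are assumptions made just for this sketch, not part of the regression model later in the post), we put a prior on one Poisson rate for the simulated counts Y from the code chunk at the top and normalize on a grid:
# grid approximation of a posterior: prior x likelihood, normalized numerically
lambda_grid <- seq(0.1, 20, length.out = 500)
log_lik     <- sapply(lambda_grid, function(l) sum(dpois(Y, l, log = TRUE)))
prior       <- dgamma(lambda_grid, shape = 2, rate = 0.5)   # an arbitrary illustrative prior
post_unnorm <- exp(log_lik - max(log_lik)) * prior          # work on the log scale to avoid underflow
posterior   <- post_unnorm / sum(post_unnorm)               # dividing by the sum replaces P(X)
lambda_grid[which.max(posterior)]                           # posterior mode, close to mean(Y)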
Generalized Linear Models
A generalized linear model is a generalization of linear regression that allows one to model a non-linear predictive relationship between \(X\) and \(Y\) by passing the linear predictor through a link function. Two obvious examples are logistic regression (where we want to model a binary outcome) and Poisson regression (where we want to model counts).
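In R, these two cases correspond to the binomial and poisson families of glm, and each family object bundles the link function that connects the linear predictor to the mean. A quick look, just to make the idea concrete:
# each glm family carries a link function and its inverse
poisson()$link           # "log"
binomial()$link          # "logit"
poisson()$linkfun(5)     # same as log(5): maps a positive mean to the real line
poisson()$linkinv(1.6)   # same as exp(1.6): maps back to a positive mean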
Poisson Regression
Poisson regression deals with count data. The model formulation with one covariate typically looks like the one below. An intuitive explanation for why we use the log link is that, for a Poisson distribution, \(\lambda \geq 0\), but we want to allow any values of \(\beta_0\), \(\beta_1\), and \(X_i\). The log lets us switch back and forth between the positive numbers (where \(\lambda_i\) lives) and the whole real line (where \(\beta_0 + \beta_1 X_i\) can live).
\[ \begin{align} Y_i &\sim \text{Poisson}(\lambda_i) \\ \log(\lambda_i) &= \beta_0 + \beta_1 X_i \end{align} \]
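This is exactly the model that the simulation code at the top of this post draws from, with \(\beta_0 = 1\) and \(\beta_1 = 2\); a quick check of the round trip through the log link:
# the simulated lambda from the top of the post satisfies the model:
# log(lambda_i) is exactly the linear predictor beta0 + beta1 * X_i
all.equal(log(lambda), beta0 + beta1 * X)   # TRUE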
How do we estimate these parameters?
First let’s take a look at the Poisson distribution:
\[ P(Y = y \mid \lambda) = \frac{e^{-\lambda} \lambda^y}{y!} \] In maximum likelihood, we assume that our data are independent and that each point comes from a Poisson distribution. The probability of the exact realization of our data is then the product of the probability mass functions of the individual data points. \[ \mathcal{L}(\lambda | Y) = P(Y_1 = y_1, \ldots, Y_n = y_n \mid \lambda) = \prod_{i=1}^n \frac{ \lambda^{y_i} }{y_i !} e^{- \lambda} \] This is known as our likelihood function. We want to find the \(\lambda\) that maximizes it. To make this calculation easier, we usually work with the log-likelihood instead; since the logarithm is monotonic, it is maximized at the same \(\lambda\).
\[ \begin{align} \ln \mathcal{L}(\lambda | Y) &= \sum_{i=1}^n \left(y_i \log(\lambda) + \log(e^{-\lambda}) - \log(y_i!)\right) \\ &= \sum_{i=1}^n \left(y_i \log(\lambda) - \lambda - \log(y_i!)\right) \end{align} \] However, let’s first re-write this in terms of our linear model. We have defined our \(y_i\)’s to come from a Poisson distribution where \(\lambda_i = e^{\beta_0 + \beta_1 x_i}\). Thus we have that
\[ \begin{align} \ln \mathcal{L}(\beta_0, \beta_1 | Y) &= \sum_{i=1}^n \left(y_i \log(\lambda_i) - \lambda_i - \log(y_i!)\right) \\ &= \sum_{i=1}^n \left(y_i \log(e^{\beta_0 + \beta_1 x_i}) - e^{\beta_0 + \beta_1 x_i} - \log(y_i!)\right) \\ &= \sum_{i=1}^n \left(y_i (\beta_0 + \beta_1 x_i) - e^{\beta_0 + \beta_1 x_i} - \log(y_i!)\right) \end{align} \] To maximize a function, one takes its derivative, sets the derivative equal to zero, and solves. Here that means solving both of the following equations.
\[ \begin{align} \frac{\partial}{\partial \beta_0} \ln \mathcal{L}(\beta_0, \beta_1) &= \sum_{i=1}^n \left(y_i - e^{\beta_0 + \beta_1 x_i}\right) = 0 \\ \frac{\partial}{\partial \beta_1} \ln \mathcal{L}(\beta_0, \beta_1) &= \sum_{i=1}^n \left(y_i x_i - x_i e^{\beta_0 + \beta_1 x_i}\right) = 0 \end{align} \]
However, instead of trying to solve this two-dimensional system by hand, we can use the optim function in R. Note that optim minimizes a function by default, so we hand it the negative log-likelihood.
# creating our log-likelihood function
llh <- function(theta) {
b0 <- theta[1]
b1 <- theta[2]
-(sum(Y*(b0 + b1*X)) - sum(exp(b0 + b1*X)) - sum(log(factorial(Y)))) # negated because optim minimizes
}
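# sanity check: the hand-coded value equals minus the sum of Poisson
# log-densities from dpois at the same parameters
all.equal(llh(c(1, 2)), -sum(dpois(Y, lambda = exp(1 + 2*X), log = TRUE)))   # TRUE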
llh.estimates <- optim(c(1, 1), llh, hessian = TRUE)
llh.estimates$par
[1] 0.9454429 2.0202174
These are our estimates, otherwise known as our maximum likelihood estimates (MLEs).
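For completeness, the two-dimensional system above can also be solved directly with Newton–Raphson, using the gradient from before and the Hessian we derive in the next section. The sketch below is just an illustration (the starting values and the fixed number of iterations are arbitrary choices); for the canonical log link this kind of update is essentially the Fisher scoring that glm reports in its output.
# Newton-Raphson on the log-likelihood: theta <- theta - H^{-1} %*% gradient
newton_pois <- function(theta, iters = 25) {
  for (i in 1:iters) {
    eta  <- theta[1] + theta[2] * X
    grad <- c(sum(Y - exp(eta)),                    # d/d beta0 of the log-likelihood
              sum(X * (Y - exp(eta))))              # d/d beta1 of the log-likelihood
    hess <- -matrix(c(sum(exp(eta)),     sum(X * exp(eta)),
                      sum(X * exp(eta)), sum(X^2 * exp(eta))),
                    nrow = 2)
    theta <- theta - solve(hess, grad)              # one Newton step
  }
  theta
}
newton_pois(c(log(mean(Y)), 0))   # lands essentially on top of the optim estimates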
What about our standard errors?
In textbooks it is common to see that \(\hat{\theta} \sim N(\theta, I^{-1}(\theta))\) (approximately, for large samples), where \(\theta\) is a vector of parameters (in our case \(\beta_0\) and \(\beta_1\)) and \(I(\theta)\) is what is known as the information matrix: the negative of the Hessian of the log-likelihood evaluated at our MLEs. The Hessian is the matrix of second-order partial derivatives, and intuitively we can think of it as the curvature of the function. If our log-likelihood gives us a surface with \(\beta_0\) and \(\beta_1\) on the x- and y-axes and the value of the log-likelihood on the z-axis, then the Hessian gives us the curvature at each point on this surface.
b0 <- seq(0, 2, length = 30)
b1 <- seq(1, 3, length = 30)
llh_grid <- function(b0, b1) {
llh(c(b0, b1))
}
# Create a grid matrix
z <- outer(b0, b1, Vectorize(llh_grid))
# Draw the 3D surface
persp(b0, b1, -z,
theta = 10, phi = 27, # viewing angles
col = "lightblue", # surface color
shade = 0.5, # shading intensity
xlab = 'b0', ylab = 'b1', zlab = "log-likelihood",
main = "log-likelihood surface")We can construct our Hessian from our log-likelihood partial derivatives.
\[ H = \begin{bmatrix} \frac{\partial^2}{\partial \beta_0^2} \ln \mathcal{L}(\beta_0, \beta_1) & \frac{\partial^2}{\partial \beta_0 \partial \beta_1} \ln \mathcal{L}(\beta_0, \beta_1) \\ \frac{\partial^2}{\partial \beta_1 \partial \beta_0} \ln \mathcal{L}(\beta_0, \beta_1) & \frac{\partial^2}{\partial \beta_1^2} \ln \mathcal{L}(\beta_0, \beta_1) \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n -e^{\beta_0 + \beta_1 x_i} & \sum_{i=1}^n -x_i e^{\beta_0 + \beta_1 x_i} \\ \sum_{i=1}^n -x_i e^{\beta_0 + \beta_1 x_i} & \sum_{i=1}^n -x_i^2 e^{\beta_0 + \beta_1 x_i} \end{bmatrix} \]
Again, we can evaluate this at the estimates we obtained previously. We can also check it against the Hessian returned by optim.
H <- function(b0, b1) {
matrix(c( -sum(exp(b0 + b1*X)) , -sum(X*exp(b0+b1*X)),
-sum(X*exp(b0+b1*X)) , -sum(X^2*exp(b0+b1*X)) ),
nrow = 2, byrow = T)
}
# evaluating the Hessian at our MLEs
H(llh.estimates$par[1], llh.estimates$par[2])
          [,1]      [,2]
[1,] -833.0266 -551.6966
[2,] -551.6966 -426.0000
# comparing with optim (negated, because optim's hessian is of the negative log-likelihood we minimized)
-llh.estimates$hessian
          [,1]      [,2]
[1,] -833.0269 -551.6968
[2,] -551.6968 -426.0001
Finally, as mentioned previously, \(\hat{\theta} \sim N(\theta, I^{-1}(\theta))\) approximately. So to get our estimated standard errors, we take the square roots of the diagonal of the variance-covariance matrix, that is \(I^{-1}(\hat{\theta}) = (-H)^{-1}\).
# standard errors: square roots of the diagonal of the inverse of -H
# (we will check these against glm's standard errors below)
llh.se <- sqrt(diag(solve(-H(llh.estimates$par[1], llh.estimates$par[2]))))
llh.se
[1] 0.0918456 0.1284349
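With estimates and standard errors in hand, we can also reproduce the z values and p-values that summary(glm) reports below; they are ordinary Wald statistics, each estimate divided by its standard error:
# Wald z statistics and two-sided p-values from our estimates and standard errors
z_vals <- llh.estimates$par / llh.se
p_vals <- 2 * pnorm(-abs(z_vals))
round(z_vals, 2)    # about 10.29 and 15.73, matching the glm output below
signif(p_vals, 3)   # both effectively zero, hence the "<2e-16" below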
Comparing our estimates and standard errors with R
We see that we get very similar values! The tiny differences in the last decimal places come down to the different numerical routines: glm fits the model by iteratively reweighted least squares (the “Fisher Scoring iterations” in the output below), while we used a general-purpose optimizer.
# beta0 and beta1
llh.estimates$par
[1] 0.9454429 2.0202174
# beta0 and beta1 standard errors
llh.se
[1] 0.0918456 0.1284349
fit <- glm(Y ~ X, family=poisson)
summary(fit)
Call:
glm(formula = Y ~ X, family = poisson)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.94538 0.09185 10.29 <2e-16 ***
X 2.02026 0.12844 15.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 339.931 on 99 degrees of freedom
Residual deviance: 67.535 on 98 degrees of freedom
AIC: 447.15
Number of Fisher Scoring iterations: 4
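We can also pull the same quantities directly out of the fitted model object and line them up against our hand-rolled versions (coef and vcov are the standard accessors for the estimates and the variance-covariance matrix):
# side-by-side comparison of glm and our hand-rolled MLE
cbind(glm = coef(fit), by_hand = llh.estimates$par)     # estimates
cbind(glm = sqrt(diag(vcov(fit))), by_hand = llh.se)    # standard errors
# vcov(fit) is glm's variance-covariance matrix; it matches (-H)^{-1} closely
vcov(fit)
solve(-H(coef(fit)[1], coef(fit)[2]))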
Additional Resources
- Great introductory video on generalized linear models: https://www.youtube.com/watch?v=1fZeBafD1nw
- Great video on Logistic Regression: https://www.youtube.com/watch?v=Iju8l2qgaJU
- My blog post on Maximum Likelihood: https://drodriguezchavez.github.io/stat-blog/maximum-likelihood.html
- Nice introduction to Maximum Likelihood: https://medium.com/@dataforyou/maximum-likelihood-estimation-the-poisson-distribution-3ac952f4c36b
- An additional book chapter on Poisson regression and MLE: https://empirical-methods.com/poisson-regression-and-mle.html