Thursday, August 25, 2016

Using weighted least squares


Consider what is perhaps the most useful statistical tool in use today: the linear regression model. The basic linear regression model assumes that the residuals are all distributed \(N(0,\sigma^2)\). This assumption was most probably adopted out of mathematical convenience, but it also makes pooling the residuals appropriate - and pooling gives a more accurate estimate of the residual variance. When the residuals all come from the same error distribution, we call them homoscedastic errors.

Anyone who has ever actually worked with regression models knows that homoscedasticity does not always hold. In introductory courses, students are taught to apply a transformation to the dependent variable, \(Y\), to try to correct for heteroscedastic errors. The problem with this is that, if the relationship between \(X\) and \(Y\) is already well understood, transforming \(Y\) will probably require defining a new model, and it is not always obvious what transformation will work. If we do not know the transformation, then we're screwed, right? Not exactly. We can relax the assumption that all error terms come from the same distribution and let each \(\epsilon_i\) potentially come from a different distribution. Hence \(\epsilon_i \sim IDN(0, \sigma_i^2)\), and the variance structure can now be described by a diagonal variance-covariance matrix, \(\Sigma\).
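To make that concrete, here is a small simulation sketch (the variable names and numbers are my own, chosen to echo the code at the end of this post, not taken from any real dataset) in which each error term gets its own standard deviation. The later sketches in this post reuse these objects.

# toy simulation: each error term has its own standard deviation
set.seed(42)
X <- runif(50, 0, 2)
sigma_i <- 0.5 + 1.5 * X                    # error scale grows with the predictor
Y <- 0.7 + 0.33 * X + rnorm(50, mean = 0, sd = sigma_i)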

The classic textbook approach is to solve the problem assuming that 1) the variances of all the \(Y_i\) are known, 2) the variances are known up to some proportionality constant, or 3) the variances are unknown and must be estimated. Scenario 3) is encountered most often in practice, so we will give it the most love and care, but scenarios 1) and 2) are equally important to discuss. As someone who works in the analytical chemistry realm, I often work with reference materials that have certified reference values and an estimate of variability; therefore I often end up using approaches 1) and 2), and you should as well if you are working in this domain. Before we dive into the mathematics, let's discuss how to determine whether or not the residuals of a linear model fit are homoscedastic.

Determining homoscedasticity of residuals

The most common tool for assessing the constancy of the error variance is to plot the residuals (or some transformation thereof) against either the fitted values \(\hat{Y}\) or one of the predictor variables \(X\). A common deviation from constant variance appears as a megaphone-like shape in these plots. One may also find that the error variance increases or decreases proportionally with a predictor. The following figures show a plot where the error variance is constant and one where the error variance increases with the predictor.

Scale-location plot with constant error variance. This is the scale-location plot produced by R in the function plot.lm(). The red line is a lowess fit (the lowess R function, NOT loess!) of the transformed residuals versus the fitted values. Do not let the structure in the curve fool you - these errors all come from a single normal distribution, \(N(0,1)\).

Scale-location plot for the model \(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\) where \(\epsilon_i \sim IDN(0, \exp(3X_i))\). Note the structure in the residuals.

In the graph with constant error variance, one can see that the general structure of the residuals appears to be independent of the value of \(X\). Do not let the lowess curve fool you - these data were simulated with residuals drawn from a standard normal distribution. In the second plot, I explicitly defined the distribution of the residuals so that the variance is a function of the predictor. If we plot the raw residuals against \(X\), we can see that the error increases rapidly with the value of the predictor.
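If you want to draw this diagnostic yourself, R will produce the scale-location panel directly from a fitted model. A minimal sketch, using the simulated data from above:

fit <- lm(Y ~ X)
plot(fit, which = 3)               # the scale-location panel from plot.lm()
plot(X, abs(residuals(fit)))       # raw absolute residuals against the predictor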



Apart from diagnostic plots, one can also use hypothesis tests to assess the constancy of the error variance. Only under unique circumstances do I take the time to rigorously test for it, since in my work the variance is usually either approximately constant or obviously heteroscedastic. I will therefore not go into details, but instead suggest that the reader look into the Breusch-Pagan test and the Brown-Forsythe test.
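For reference, tests of this kind are available in contributed packages. A minimal sketch, assuming the lmtest and car packages are installed (car's ncvTest() is a score test in the same spirit as Breusch-Pagan, not the Brown-Forsythe test itself):

library(lmtest)
bptest(fit)          # Breusch-Pagan test of H0: constant error variance
library(car)
ncvTest(fit)         # score test for variance depending on the fitted values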

Error variances known


If you are working with reference materials and/or results with reported estimates of standard deviation, then you should treat those values as "truth" and use that information to estimate the regression coefficients. The method is called weighted regression because we use information about the variability of \(Y\) to define a weight for each \(Y_i\). In general, the weights are defined as the inverse of the variance, \(w_i = \frac{1}{\sigma_i^2}\). This is a rational choice because each weight is then inversely related to the variance of \(Y_i\): those \(Y_i\) with large variability receive less weight when fitting the model, since those values are not known with a high degree of certainty.

As it so happens, one can define the problem of solving for the regression coefficients as a least squares problem.

\begin{align}
Q_w = \sum_{i=1}^n w_i\left(Y_i - \sum_{j=0}^{p-1} \beta_jX_{ij}\right)^2
\end{align}

This is kinda nice because minimizing \(Q_w\) with respect to the \(\beta_j\) (set each partial derivative to zero) yields a set of normal equations that can be written elegantly in matrix form.

\begin{align}
(\textbf{X}'\textbf{WX})\textbf{b}_{\textbf{w}} = \textbf{X}'\textbf{WY}
\end{align}

where \(\textbf{b}_{\textbf{w}} = \left(\hat{\beta}_0, \ldots, \hat{\beta}_{p-1}\right)'\) is the vector of estimated regression coefficients, \(\textbf{X}\) is the design matrix of independent variables, \(\textbf{W} = \text{diag}(w_1, w_2, \ldots, w_n)\) is the diagonal matrix of weights describing the error structure of the \(Y_i\), and \(\textbf{Y}\) is the vector of dependent variables. Given the normal equations, solving for the regression coefficients is trivial.

$$\textbf{b}_{\textbf{w}} = (\textbf{X}'\textbf{WX})^{-1}\textbf{X}'\textbf{WY}$$

The variance-covariance matrix of the regression coefficients is \(\text{Var}(\textbf{b}_{\textbf{w}}) = (\textbf{X}'\textbf{WX})^{-1}\), and this variance is "exact" in the sense that we assume \(\textbf{W}\) is known exactly. It turns out (as with OLS) that the least squares estimator coincides with the maximum likelihood estimate under normal errors, so it inherits all of the associated nice properties: the estimated coefficients are consistent and BLUE.
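As a quick sanity check, the matrix solution can be computed by hand and compared with lm(). A minimal sketch, continuing the simulation above and pretending the \(\sigma_i\) are known exactly:

w <- 1 / sigma_i^2                               # weights = inverse variances
Xmat <- cbind(1, X)                              # design matrix with intercept
W <- diag(w)
b_w <- solve(t(Xmat) %*% W %*% Xmat) %*% t(Xmat) %*% W %*% Y
var_b_w <- solve(t(Xmat) %*% W %*% Xmat)         # "exact" because W is treated as known
coef(lm(Y ~ X, weights = w))                     # matches b_w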

Error variances known up to a constant


This is probably the least often encountered of the three scenarios, but there are certain problems where it is of interest. Let's say, for example, that you know that the relative variance of a measurement decreases proportionally with the amount of analyte being measured (often true). Hence, if you make a measurement at one concentration and another at double the original concentration, then the second measurement should have a proportionally smaller relative variance. The question then becomes how to handle that proportionality constant. Let's call the proportionality constant \(k\); we can then define our weights as \(w_i = \frac{k}{\sigma_i^2}\). The nice thing about \(k\) being the same for every \(w_i\) is that we can write the normal equations as...

\begin{align}
k(\textbf{X}'\textbf{WX})\textbf{b}_{\textbf{w}} = k\textbf{X}'\textbf{WY}
\end{align}

and therefore \(k\) cancels from both sides, so the problem can be solved in the same way as scenario 1), apart from a few minor changes. Because we do not know the value of \(k\), it must be estimated from the data, which also means that the actual variances \(\sigma_i^2\) are unknown. It turns out that the best estimate of \(k\) is the weighted mean squared error, \(MSE_w\).

\begin{align}
\hat{k} = MSE_w = \frac{\sum_{i=1}^n w_i\left(Y_i - \hat{Y}_i\right)^2}{n-p}
\end{align}

Finally, we may estimate the variance-covariance matrix of our regression coefficients using \(s^2(\textbf{b}_{\textbf{w}}) = MSE_w(\textbf{X}'\textbf{WX})^{-1}\).
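This is, incidentally, exactly what vcov() reports for a weighted lm() fit, as the code at the end of this post also shows. A minimal sketch, reusing the simulated data and weights from above:

fit_w <- lm(Y ~ X, weights = w)                  # w known only up to a constant
MSE_w <- sum(w * residuals(fit_w)^2) / df.residual(fit_w)
Xmat  <- model.matrix(fit_w)
MSE_w * solve(t(Xmat) %*% diag(w) %*% Xmat)      # agrees with vcov(fit_w)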

Error variances must be estimated


In general, you will not have any information about the variance of \(Y\). In this case we have to estimate the weights, and the most common technique may be surprising: if you have no information, then often the best thing you can do is regress the residuals (or a function of them) on either the predictors \(X\) or the fitted values \(\hat{Y}\). This is a rational thing to do because, very often, the scale-location plot shows that the residuals vary in a predictable way with the fitted values. Furthermore, from the properties of the errors, \(\sigma_i^2 = \text{Var}\left(\epsilon_i\right) = E[\epsilon_i^2] - \left(E[\epsilon_i]\right)^2 = E[\epsilon_i^2]\), which is to say that the squared residual is a reasonable estimate of \(\sigma_i^2\).

So the general technique is the following (a minimal one-pass sketch appears below):
  1. Regress \(Y\) on \(X\).
  2. Inspect the scale-location plot for structure.
  3. If there is structure in 2., plot the residuals against the predictors and \(\hat{Y}\), then regress the absolute (or squared) residuals against the predictor of your choice.
  4. The fitted values from 3. are estimates of the error standard deviation (or variance). Create the matrix of weights from these fitted values, e.g. \(w_i = 1/\hat{s}_i^2\).
  5. Regress \(Y\) on \(X\) using the weight matrix from 4.
This part requires some art as well as mathematical know-how. A common error structure in applied work shows the residuals varying with the predictor in a megaphone-like way; by this I mean that the dispersion of the results increases with increasing values of the predictor. Other common error structures involve exponential-like behavior or inverse proportionality to the predictor.
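Here is a minimal one-pass sketch of steps 1 through 5 on the simulated data from earlier (the full, iterated version appears in the code section below; note this assumes the fitted values from the residual regression stay positive, otherwise the weights would be invalid):

fit0   <- lm(Y ~ X)                              # 1. ordinary least squares
aresid <- abs(residuals(fit0))                   # 3. absolute residuals track sigma_i
sd.fit <- lm(aresid ~ X)                         #    regress |residuals| on the predictor
w.hat  <- 1 / fitted(sd.fit)^2                   # 4. weights = 1 / (estimated sd)^2
fit.w  <- lm(Y ~ X, weights = w.hat)             # 5. weighted fit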

The code

The following code covers how to model different error structures and how to use weighted regression in R. As a bonus, I even throw in a shitty version of iteratively reweighted least squares to deal with severely jacked-up error structure. 

Works Cited

Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. Applied Linear Statistical Models. 5th ed. McGraw-Hill Irwin, 2005.
## Weighted least squares
rm(list = ls())

# scenario 1) weights known
set.seed(91)
X <- rnorm(30, 1, 0.5)
# These are the errors
E <- c(rnorm(9, 0, 0.4), rnorm(10, 0, 0.1), rnorm(11, 0, 1.1))
b0 <- 0.7
b1 <- 0.33
Y <- b0 + b1*X + E
plot(X, Y)

# Let's assume we actually know E. Fit the linear regression model and see what
# the scale-location diagnostic shows
lm.fit <- lm(Y ~ X)
par(mfrow = c(2,2))
plot(lm.fit) # plots of residuals ~ X

# In the OLS model the residuals come from N(0,1)
E_ols <- rnorm(50)
X_ols <- rnorm(50, 2, 0.2)
Y_ols <- b0 + b1*X_ols + E_ols
ols.fit <- lm(Y_ols ~ X_ols)
par(mfrow = c(2,2))
plot(ols.fit)

y.res <- sqrt(abs(as.numeric(scale(ols.fit$residuals))))
x.res <- ols.fit$fitted.values
loess.fit <- lowess(x.res, y.res)
tiff('scale-location_OLS.tif', height = 6, width = 9, units = 'in', res = 200)
par(cex.lab = 1.2, cex.axis = 1.2, cex.main = 1.2)
plot(y.res ~ x.res, xlab = expression(hat(Y)),
     ylab = expression(sqrt(abs(epsilon[standardized]))),
     main = 'Scale-location plot (constant error variance)')
lines(loess.fit$x, loess.fit$y, col = 'red', lwd = 2)
dev.off()

# non-constant error variance
# We assume that the variance of the residuals
# varies as a function of the predictor variable
E_hetero <- rnorm(50, 0, exp(3*X_ols))
Y_hetero <- b0 + b1*X_ols + E_hetero
het.fit <- lm(Y_hetero ~ X_ols)
y.res <- sqrt(abs(as.numeric(scale(het.fit$residuals))))
x.res <- het.fit$fitted.values
loess.fit <- lowess(x.res, y.res)
tiff('scale-location_hetero.tif', height = 6, width = 9, units = 'in', res = 200)
par(cex.lab = 1.2, cex.axis = 1.2, cex.main = 1.2)
plot(y.res ~ x.res, xlab = expression(hat(Y)),
     ylab = expression(sqrt(abs(epsilon[standardized]))),
     main = 'Scale-location plot (increasing error variance)')
lines(loess.fit$x, loess.fit$y, col = 'red', lwd = 2)
dev.off()

# OK, so we have some structure suggesting that our errors
# do not follow a single normal distribution
wts <- c(rep(1/0.4^2, 9), rep(1/0.1^2, 10), rep(1/1.1^2, 11))
wls.fit <- lm(Y ~ X, weights = wts)
par(mfrow = c(2,2))
plot(wls.fit)

# just to prove we know what we're doing, solve the normal equations by hand
W <- diag(wts)
X.mat <- model.matrix(lm.fit)
betas <- solve(t(X.mat) %*% W %*% X.mat) %*% t(X.mat) %*% W %*% Y
coef(wls.fit) # same coefficients as betas
var.betas <- solve(t(X.mat) %*% W %*% X.mat)
vcov(wls.fit) # huh? Not the same as var.betas?
# R assumes that the weights you give are only known up to a
# proportionality constant, i.e. scenario 2), so it reports MSEw * var.betas
MSEw <- as.numeric(wts %*% wls.fit$residuals^2)/(30 - 2)
prop.var.betas <- MSEw * var.betas
vcov(wls.fit) # now this agrees with prop.var.betas

# Now let's do a special case of 3) where the standard deviation of the
# residuals is a linear function of the predictor X
X <- rnorm(30, 1, 0.5)
E_x <- rnorm(30, 0, 6.5 + 24*X)
Y <- b0 + b1*X + E_x
plot(Y ~ X)
lm.fit <- lm(Y ~ X)
par(mfrow = c(2,2))
plot(lm.fit)

# plot the absolute residuals against X and regress them on X
plot(abs(lm.fit$residuals) ~ X)
fix.fit <- lm(abs(lm.fit$residuals) ~ X)
wts <- 1/fix.fit$fitted.values^2
wls.fit <- lm(Y ~ X, weights = wts)
par(mfrow = c(2,2))
plot(wls.fit)

# let's check the difference in the estimated coefficients
coef(lm.fit)
coef(wls.fit)
# Holy cow those are really different! In this case, we should use a
# technique called iteratively reweighted least squares to make sure
# our coefficients settle down
epsilon <- .Machine$double.eps^0.25
# generally good to limit the number of iterations in case we bounce
# between two sets of coefficients
k <- 1
flag <- TRUE
b.upd <- coef(lm.fit)
y.upd <- abs(lm.fit$residuals)
while (flag && k <= 10) {
  # use regression to find the set of weights
  fix.fit <- lm(y.upd ~ X)
  wts <- 1/fix.fit$fitted.values^2
  # fit the weighted model
  wls.fit <- lm(Y ~ X, weights = wts)
  b.wght <- coef(wls.fit)
  # compute the difference in coefficients between the
  # previous model and the new model
  delta <- abs(b.upd - b.wght)
  # if all differences are <= epsilon, then halt
  if (all(delta <= epsilon)) {
    flag <- FALSE
  } else {
    # update the coefficients and the dependent variable and iterate
    b.upd <- coef(wls.fit)
    y.upd <- abs(wls.fit$residuals)
  }
  k <- k + 1
}
# scale-location plot of the weighted model - if it worked then we
# should see a flat error structure - meh, kinda worked
y.res <- sqrt(abs(as.numeric(scale(wls.fit$residuals))))
x.res <- wls.fit$fitted.values
loess.fit <- lowess(x.res, y.res)
plot(x.res, y.res)
lines(loess.fit$x, loess.fit$y, col = 'red')
