Thursday, August 25, 2016

Using weighted least squares


Consider what is perhaps the most useful statistical tool in use today: the linear regression model. The basic linear regression model assumes that the residuals are all distributed \(N(0,\sigma^2)\). This assumption was most likely made out of mathematical convenience, but it also makes pooling the residuals appropriate, and pooling gives a more accurate estimate of the residual variance. When the residuals all come from the same error distribution, we call them homoscedastic.

Anyone who has ever actually worked with regression models knows that homoscedasticity does not always hold. In introductory courses, students are taught to apply a transformation to the dependent variable, \(Y\), to try to correct for heteroscedastic errors. The problem is that, if the relationship between \(X\) and \(Y\) is already well understood, transforming \(Y\) will probably require defining a new model, and it is not always obvious which transformation will work. If we do not know the transformation, then we're screwed, right? Not exactly. We can drop the assumption that all error terms come from the same distribution and let each \(\epsilon_i\) potentially come from a different distribution. Hence, \(\epsilon_i \sim IDN(0, \sigma_i^2)\), and the variance structure can now be written as a diagonal variance-covariance matrix, \(\Sigma\).

The classic textbook approach is to solve the problem assuming that 1) the variances of the \(Y_i\)'s are known, 2) the variances are known up to a proportionality constant, or 3) the variances are unknown and must be estimated. Scenario 3) is encountered most often in practice and will get the most love and care here, but scenarios 1) and 2) are equally important to discuss. As someone who works in the analytical chemistry realm, I often work with reference materials that have certified reference values and an estimate of variability, so I frequently end up using approaches 1) and 2); you should as well if you work in this domain. Before we dive into the mathematics, let's discuss how to determine whether the residuals of a linear model fit are homoscedastic.

Determining homoscedasticity of residuals

The most common tool for determining the constancy of error variance is to plot the residuals (or some transformation thereof) against either the fitted values \(\hat{Y}\) or one of the predictor variables \(X\). A common deviation from constant variance appears as a megaphone-like shape in these plots. One may also find that the error variance increases or decreases proportionally with a predictor. In the following figures, we produce one plot where the error variance is constant and one where the error variance increases with the predictor.

[Figure: Scale-location plot with constant error variance. This is the scale-location plot produced by R's plot.lm(); the red line is a lowess fit (the lowess R function, NOT loess!) of the transformed residuals against the fitted values. Do not let the structure in the curve fool you - these errors all come from a single \(N(0,1)\) distribution.]
[Figure: Scale-location plot for the model \(Y_i = \beta_0 + \beta_1X_i + \epsilon_i\), where \(\epsilon_i \sim N(0, \sigma_i^2)\) with \(\sigma_i = \exp(3X_i)\). Note the structure in the residuals.]

In the plot with constant error variance, the general structure of the residuals appears to be independent of the fitted values. Do not let the lowess curve fool you - these data were simulated with residuals drawn from a standard normal distribution. In the second plot, I explicitly defined the distribution of the residuals so that the variance is a function of the predictor. If we plot the raw residuals against \(X\), we can see that the error increases rapidly with the value of the predictor.
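If you want to reproduce this kind of diagnostic quickly, R's built-in plot.lm() will draw the scale-location panel for you. Here is a minimal sketch; the intercept, slope, and error structure are illustrative choices, and the full simulation I used appears in "The code" section below.

# Quick scale-location check on simulated heteroscedastic data
set.seed(1)
x <- rnorm(50, 2, 0.2)
y <- 0.7 + 0.33 * x + rnorm(50, 0, exp(3 * x))  # error sd grows with x
fit <- lm(y ~ x)
plot(fit, which = 3)                            # which = 3 is the scale-location panel
plot(x, abs(resid(fit)))                        # raw |residuals| against the predictor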



Apart from diagnostic plots, one can also use hypothesis tests to assess the constancy of error variance. Only under unusual circumstances do I take the time to test this rigorously, since the variance is usually either approximately constant or obviously heteroscedastic. I will therefore not go into details, but instead suggest that the reader look into the Breusch-Pagan test and the Brown-Forsythe test.
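If you do want a formal check, the Breusch-Pagan test is a one-liner. This sketch assumes the lmtest package (which provides bptest()) is installed, and the simulated data are just for illustration:

# Breusch-Pagan test for non-constant error variance (assumes package lmtest)
library(lmtest)
set.seed(7)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, 0, exp(x))  # heteroscedastic errors
bptest(lm(y ~ x))                       # small p-value -> evidence against constant variance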

Error variances known


In the event that you are working with reference materials and/or results with reported estimates of standard deviation, you should treat these as "truth" and use this information to estimate the regression coefficients. The approach is called weighted regression because we use information about the variability of \(Y\) to define the weight given to each \(Y_i\). In general, the weights are defined as the inverse of the variance, \(w_i = \frac{1}{\sigma_i^2}\). This is a rational choice because each weight is then inversely related to the variance of \(Y_i\): those \(Y_i\) with large variability receive less weight when fitting the model, since those values are not known with a high degree of certainty.

As it so happens, one can define the problem of solving for the regression coefficients as a least squares problem.

\begin{align}
Q_w = \sum_{i=1}^n w_i\left(Y_i - \sum_{j=0}^{p-1} \beta_jX_{ij}\right)^2
\end{align}

This is kinda nice because then we can write the mathematics elegantly as a set of normal equations.

\begin{align}
(\textbf{X}'\textbf{WX})\textbf{b}_{\textbf{w}} = \textbf{X}'\textbf{WY}
\end{align}

where \(\textbf{b}_{\textbf{w}} = \left(b_0, \ldots, b_{p-1}\right)'\) is the vector of estimated regression coefficients, \(\textbf{X}\) is the design matrix of independent variables, \(\textbf{W} = \text{diag}(w_1, w_2, \ldots, w_n)\) is the diagonal weight matrix describing the error structure of the \(Y_i\), and \(\textbf{Y}\) is the vector of responses. Given the normal equations, solving for the regression coefficients is trivial.

$$\textbf{b}_{\textbf{w}} = (\textbf{X}'\textbf{WX})^{-1}\textbf{X}'\textbf{WY}$$

The variance-covariance matrix of the regression coefficients is \(\text{Var}(\textbf{b}_{\textbf{w}}) = (\textbf{X}'\textbf{WX})^{-1}\), and this variance is "exact" in the sense that \(\textbf{W}\) is assumed to be known exactly. It turns out (as with OLS) that the weighted least squares estimator coincides with the maximum likelihood estimate under normal errors, so it inherits the usual nice properties: the estimated coefficients are consistent and BLUE.
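To make the linear algebra concrete, here is a small sketch that solves the weighted normal equations directly and checks the coefficients against lm() with a weights argument; the variances used to build the weights are invented for the example.

# Solve the weighted normal equations by hand and compare to lm()
set.seed(3)
x  <- rnorm(20, 1, 0.5)
s2 <- runif(20, 0.1, 2)                       # "known" variances (invented here)
y  <- 0.7 + 0.33 * x + rnorm(20, 0, sqrt(s2))
w  <- 1 / s2                                  # weights = inverse variances
X  <- cbind(1, x)                             # design matrix with intercept column
W  <- diag(w)
b.w <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y
cbind(manual = as.numeric(b.w), lm = coef(lm(y ~ x, weights = w)))  # identical estimates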

Error variances known up to a constant


This is probably the least often encountered of the three scenarios, but there are problems where it is of interest. Let's say, for example, that you know the relative variance of a measurement decreases proportionally with the amount of analyte being measured (often true). If you make one measurement at a given concentration and another at double that concentration, the second measurement should have a proportionally smaller relative variance. The question then becomes how to handle the proportionality constant. Let's call it \(k\) and define our weights as \(w_i = \frac{k}{\sigma_i^2}\). The nice thing about \(k\) being the same for all \(w_i\) is that the normal equations become...

\begin{align}
k(\textbf{X}'\textbf{WX})\textbf{b}_{\textbf{w}} = k\textbf{X}'\textbf{WY}
\end{align}

and therefore \(k\) cancels from both sides, so the coefficients can be found exactly as in 1), with a few minor changes elsewhere. Because we do not know the value of \(k\), it must be estimated from the data; this also means the full variance-covariance matrix is known only up to scale. It turns out that the best estimate for \(k\) is the weighted mean squared error, \(MSE_w\).

\begin{align}
\hat{k} = MSE_w = \frac{\sum_{i=1}^n w_i\left(Y_i - \hat{Y}_i\right)^2}{n-p}
\end{align}

Finally, we may estimate the variance-covariance matrix of our regression coefficients using \(s^2(\textbf{b}_{\textbf{w}}) = MSE_w(\textbf{X}'\textbf{WX})^{-1}\).
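A quick numerical check: compute \(MSE_w\) from a weighted fit and scale \((\textbf{X}'\textbf{WX})^{-1}\) by it; the result should match vcov() on the fitted object, because lm() treats user-supplied weights as known only up to a proportionality constant. A sketch with invented weights:

# MSEw * (X'WX)^{-1} reproduces vcov() of a weighted lm() fit
set.seed(5)
x <- rnorm(25, 1, 0.5)
w <- runif(25, 0.5, 5)                       # weights known only up to a constant
y <- 0.7 + 0.33 * x + rnorm(25, 0, 1 / sqrt(w))
fit  <- lm(y ~ x, weights = w)
X    <- model.matrix(fit)
W    <- diag(w)
MSEw <- sum(w * resid(fit)^2) / (length(y) - 2)
MSEw * solve(t(X) %*% W %*% X)               # same matrix as ...
vcov(fit)                                    # ... this one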

Error variances must be estimated


In general, you will not have any information about the variance of \(Y\). In this case, we have to estimate the weights, and the most common technique may be surprising: if you have no other information, often the best thing you can do is fit a regression model to the residuals as a function of either the predictors (\(X\)) or the fitted values (\(\hat{Y}\)). This is rational because, very often, the scale-location plot shows that the residuals vary in a predictable way with the fitted values. Furthermore, from the properties of the errors, \(\sigma_i^2 = \text{Var}\left(\epsilon_i\right) = E[\epsilon_i^2] - \left(E[\epsilon_i]\right)^2 = E[\epsilon_i^2]\), which is to say that \(\epsilon_i^2\) is an unbiased estimator of \(\sigma_i^2\), so the squared (or absolute) residuals can stand in for the unknown variances.

So the general technique is the following:
  1. Regress Y on X.
  2. Investigate the scale-location plot for structure.
  3. If there is structure in 2., plot the residuals against the predictors and \(\hat{Y}\). Regress the absolute (or squared) residuals against the predictor of your choice.
  4. The fitted values from 3. estimate the error standard deviation (or variance). Build the weights as the inverse of the estimated variances, i.e. \(w_i = 1/\hat{\sigma}_i^2\).
  5. Regress Y on X using the weights from 4.
This part requires some art as well as mathematical know-how. A common error structure in applied work shows the residuals varying with the predictor in a megaphone-like way; by this I mean that the dispersion of the results increases with increasing values of the predictor. Other common error structures involve exponential-like behavior or inverse proportionality to the predictor.
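As a compact illustration of steps 1 through 5 (the script in "The code" section below works through a similar case in more detail), here is a sketch that assumes the error standard deviation grows roughly linearly with the predictor:

# Steps 1-5 in miniature: estimate weights by regressing |residuals| on X
set.seed(11)
x <- rnorm(30, 1, 0.5)
y <- 0.7 + 0.33 * x + rnorm(30, 0, 1 + 2 * abs(x))  # error sd roughly linear in x
ols <- lm(y ~ x)                                    # step 1
sd.fit <- lm(abs(resid(ols)) ~ x)                   # step 3: |residuals| ~ predictor
wts <- 1 / pmax(fitted(sd.fit), 1e-6)^2             # step 4: weights = 1 / (estimated sd)^2
wls <- lm(y ~ x, weights = wts)                     # step 5
rbind(ols = coef(ols), wls = coef(wls))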

The code

The following code covers how to model different error structures and how to use weighted regression in R. As a bonus, I even throw in a quick-and-dirty version of iteratively reweighted least squares to deal with a severely misbehaved error structure.

Works Cited

Kutner, Michael H., Nachtsheim, Christopher J., Neter, John, and Li, William. Applied Linear Statistical Models. 5th ed. McGraw-Hill Irwin, 2005.
## Weighted least squares
rm(list = ls())

# scenario 1) weights known
set.seed(91)
X <- rnorm(30, 1, 0.5)
# These are the errors
E <- c(rnorm(9, 0, 0.4), rnorm(10, 0, 0.1), rnorm(11, 0, 1.1))
b0 <- 0.7
b1 <- 0.33
Y <- b0 + b1*X + E
plot(X, Y)

# Let's assume we actually know E. Fit the linear regression model and see what
# the scale-location diagnostic shows
lm.fit <- lm(Y ~ X)
par(mfrow = c(2,2))
plot(lm.fit)

# plots of residuals ~ X
# In the OLS model the residuals come from N(0,1)
E_ols <- rnorm(50)
X_ols <- rnorm(50, 2, 0.2)
Y_ols <- b0 + b1*X_ols + E_ols
ols.fit <- lm(Y_ols ~ X_ols)
par(mfrow = c(2,2))
plot(ols.fit)

y.res <- sqrt(abs(as.numeric(scale(ols.fit$residuals))))
x.res <- ols.fit$fitted.values
loess.fit <- lowess(x.res, y.res)
tiff('scale-location_OLS.tif', height = 6, width = 9, units = 'in', res = 200)
par(cex.lab = 1.2, cex.axis = 1.2, cex.main = 1.2)
plot(y.res ~ x.res, xlab = expression(hat(Y)),
     ylab = expression(sqrt(abs(epsilon[standardized]))),
     main = 'Scale-location plot (constant error variance)')
lines(loess.fit$x, loess.fit$y, col = 'red', lwd = 2)
dev.off()

# non-constant error variance
# We assume that the variance of the residuals
# varies as a function of the predictor variable
E_hetero <- rnorm(50, 0, exp(3*X_ols))
Y_hetero <- b0 + b1*X_ols + E_hetero
het.fit <- lm(Y_hetero ~ X_ols)
y.res <- sqrt(abs(as.numeric(scale(het.fit$residuals))))
x.res <- het.fit$fitted.values
loess.fit <- lowess(x.res, y.res)
tiff('scale-location_hetero.tif', height = 6, width = 9, units = 'in', res = 200)
par(cex.lab = 1.2, cex.axis = 1.2, cex.main = 1.2)
plot(y.res ~ x.res, xlab = 'X',
     ylab = expression(sqrt(abs(epsilon[standardized]))),
     main = 'Scale-location plot (increasing error variance)')
lines(loess.fit$x, loess.fit$y, col = 'red', lwd = 2)
dev.off()

# OK, so we have some structure suggesting that our errors
# do not follow a single normal distribution
wts <- c(rep(1/0.4^2, 9), rep(1/0.1^2, 10), rep(1/1.1^2, 11))
wls.fit <- lm(Y ~ X, weights = wts)
par(mfrow = c(2,2))
plot(wls.fit)

# just to prove we know what we're doing
W <- diag(wts)
X.mat <- model.matrix(lm.fit)
betas <- solve(t(X.mat) %*% W %*% X.mat) %*% t(X.mat) %*% W %*% Y
coef(wls.fit)   # same coefficients as the manual solution above
var.betas <- solve(t(X.mat) %*% W %*% X.mat)
vcov(wls.fit)   # huh? Not the same as var.betas?
# R assumes the weights you supply are only known up to a
# proportionality constant, so it reports scenario 2), i.e. MSEw * var.betas
MSEw <- as.numeric(wts %*% wls.fit$residuals^2)/(30 - 2)
prop.var.betas <- MSEw * var.betas
vcov(wls.fit)   # now these agree

# Now let's do a special case of 3) where the standard deviation of the
# residuals is a linear function of the predictor X
X <- rnorm(30, 1, 0.5)
E_x <- rnorm(30, 0, 6.5 + 24*X)
Y <- b0 + b1*X + E_x
plot(Y ~ X)
lm.fit <- lm(Y ~ X)
par(mfrow = c(2,2))
plot(lm.fit)

# plot absolute residuals against X
plot(abs(lm.fit$residuals) ~ X)
fix.fit <- lm(abs(lm.fit$residuals) ~ X)
wts <- 1/fix.fit$fitted.values^2
wls.fit <- lm(Y ~ X, weights = wts)
par(mfrow = c(2,2))
plot(wls.fit)

# let's check the difference in the estimated coefficients
coef(lm.fit)
coef(wls.fit)
# Holy cow those are really different! In this case, we should
# use a technique called iteratively reweighted least squares to
# make sure our coefficients settle down
epsilon <- .Machine$double.eps^0.25
# generally good to limit the number of iterations in case we bounce
# between two sets of coefficients
k <- 1
flag <- TRUE
b.upd <- coef(lm.fit)
y.upd <- abs(lm.fit$residuals)
while (flag & k <= 10) {
  # use regression to find the set of weights
  fix.fit <- lm(y.upd ~ X)
  wts <- 1/fix.fit$fitted.values^2
  # fit the weighted model
  wls.fit <- lm(Y ~ X, weights = wts)
  b.wght <- coef(wls.fit)
  # compute the difference in coefficients between
  # the previous model and the new model
  delta <- abs(b.upd - b.wght)
  # If the differences between coefficients are all < epsilon, then halt
  if (all(delta <= epsilon)) {
    flag <- FALSE
  } else {
    # update the coefficients and dependent variable and iterate
    b.upd <- coef(wls.fit)
    y.upd <- abs(wls.fit$residuals)
  }
  k <- k + 1
}
# plot the scale-location plot of the weighted model - if it
# worked then we should see a flat error structure - meh, kinda worked
y.res <- sqrt(abs(as.numeric(scale(wls.fit$residuals))))
x.res <- wls.fit$fitted.values
loess.fit <- lowess(x.res, y.res)
plot(x.res, y.res)
lines(loess.fit$x, loess.fit$y, col = 'red')

Tuesday, July 26, 2016

Robust estimation of parameters for a Gaussian random sample


Anyone who has ever worked with data has come across outliers. Outliers, simply put, are extreme observations that do not seem to belong with the rest of the data. Outliers are often caused by data transcription errors (putting one too many decimals in an Excel sheet) or failure to follow directions, but every so often an outlier comes along that cannot be so easily explained away.

There is a plethora of approaches for identifying and dealing with outliers. This post is concerned with robust estimation, i.e., recovering reliable parameter estimates in the presence of outliers. In the great big world of applied statistics, we almost always try to shove everything into the normal distribution box: unless otherwise indicated, the data are assumed to be distributed \(x_1, \ldots, x_n \sim IIDN(\mu, \sigma^2)\).

We are going to focus on an idea called Huberization, which, as you may have guessed, is named after a fellow named Huber. His 1964 paper on the robust estimation of a location parameter is a landmark in the field. He begins with the least squares location estimator, probably first discovered by Gauss:

$$\underset{T}{\operatorname{argmin}} \sum_{i=1}^n \left(x_i - T\right)^2$$

The answer is, of course, \(\bar{x}\). Huber recognized that, if an outlier is present, this estimate of location becomes unreliable, since the outlier comes from a different distribution than the rest of \(x_1, \ldots, x_{n-1}\). So he asked himself: "Can I create estimators such that extreme observations do not cause the estimates of \(\bar{x}\) and \(s^2\) to grow without bound?" The idea he came up with was to replace the objective functions for \(\bar{x}\) and \(s^2\) with functions \(\rho(\cdot)\) such that \(\rho(x) < \infty\) for all \(x \in \mathbb{R}\). Furthermore, Huber wanted these functions to be differentiable so that he could work with them using a maximum likelihood type of approach.

Let's take a brief moment to discuss the classic MLEs for normally distributed data. Anyone who has ever been forced through the Casella & Berger wringer will have derived the following:

\begin{align}
\hat{\mu}_{MLE} &= \bar{x} \\
\hat{\sigma}^2_{MLE} &= \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2
\end{align}

Apart from \(\hat{\sigma}^2_{MLE}\) being biased, these estimators have some very nice properties. It just so happens that these estimators have the smallest possible sampling variance and \(\bar{x}\) is unbiased. Those are damn nice properties! However, due to the definition of \(\bar{x}\), these estimates are also very susceptible to outliers. This is especially clear for the variance where an arbitrarily large \(x_i\) causes the estimate of the variance to explode since we are computing a sum of squares.

So we wish to replace \(\hat{\mu}_{MLE}\) and \(\hat{\sigma}^2_{MLE}\) with estimators that are not unduly affected by outliers but still have a reasonably small sampling variance. Huber's famous Proposal 2 suggests the following. Choose a fixed value for a tuning constant \(k\) that reflects your a priori judgment of how heavily contaminated the data are; observations further than \(k\) standard deviations from the center will be pulled in, and \(k = 1.5\) is a common choice. Once \(k\) is fixed, solve the following system of equations for robust estimates of location (\(T\)) and scale (\(S\)):

\begin{align}
\sum_{i=1}^n \psi\left(k, \frac{x_i-T}{S}\right) &= 0 \\
\sum_{i=1}^n \psi^2\left(k, \frac{x_i-T}{S}\right) - n\lambda &= 0
\end{align}

where \(\psi(x) = \rho'(x)\) and \(\lambda = E_{\Phi}\left[\psi^2(k,x)\right]\). The \(\lambda\) term is the expectation of \(\psi^2(k,x)\) under the standard normal distribution, and it is needed to make the estimator \(S\) consistent at the normal distribution. Recall that an estimator is consistent if it converges in probability to the true parameter. If you want to learn more, I suggest reading the paper, but beware that it is a minefield of functional analysis.
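For the standard Huber \(\psi\), which simply clips the standardized observation at \(\pm k\), \(\lambda\) is easy to evaluate numerically. A quick sketch for \(k = 1.5\); the closed-form expression below is the quantity whose reciprocal square root appears later in the code as the consistency factor for the standard deviation:

# lambda = E[psi(k, Z)^2] under the standard normal, with psi the clipping function
k   <- 1.5
psi <- function(z, k) pmax(-k, pmin(k, z))            # Huber's psi: clip at +/- k
integrate(function(z) psi(z, k)^2 * dnorm(z), -Inf, Inf)$value
theta <- 2 * pnorm(k) - 1
theta + (1 - theta) * k^2 - 2 * k * dnorm(k)          # closed form, same value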

The definition above is, frankly, not very usable for the applied statistician or mathematician. So what we're going to do, and Huber himself recommended it, is turn this into an equivalent winsorization problem. The idea is to replace the sum of squares with a robust function of the data, \(\rho(x)\), and create a set of pseudo-values. These pseudo-values are essentially winsorized versions of the original data, determined using \(k\), \(\mu\), and \(\sigma\):

$$\tilde{x}_i = \begin{cases}
x_i & : |x_i - \mu| \leq k\sigma \\
\mu - k\sigma &: x_i < \mu - k\sigma \\
\mu + k\sigma &: x_i > \mu + k\sigma
\end{cases}$$
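In R the pseudo-values take one line with pmin() and pmax(); a minimal sketch, where mu, sigma, and k stand for the current estimates and the chosen cutoff:

# Clip observations to the interval [mu - k*sigma, mu + k*sigma]
winsorize <- function(x, mu, sigma, k = 1.5) pmin(pmax(x, mu - k * sigma), mu + k * sigma)
winsorize(c(-10, -1, 0, 2, 25), mu = 0, sigma = 1)  # returns -1.5 -1.0 0.0 1.5 1.5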

We will now apply an iterative algorithm that searches for the value of \(\sigma\) that produces the optimal set of pseudo-values. This does not look like Huber's original proposal, but the two are equivalent: the winsorizing step plays the role of \(\psi(x)\), and the iteration drives the location and scale estimates to the values that solve the system above. The algorithm can be summarized in the following steps:

1. \(\hat{\mu}^{(0)} := \text{median}(x_1, \ldots, x_n)\) and \(\hat{\sigma}^{(0)} := \text{MAD}(x_1, \ldots, x_n)\)
2. compute the initial set of pseudo-values \(\tilde{x}^{(0)}\)
3. while \(\left(\left|\hat{\sigma}^{(j)} - \hat{\sigma}^{(j-1)}\right| > \varepsilon\right)\)
    a) \(\delta^{(j)} := k\hat{\sigma}^{(j-1)}\)
    b) compute \(\tilde{x}^{(j)}\) using \(\delta^{(j)}\)
    c) \(\hat{\mu}^{(j)} := \frac{1}{n}\sum_{i=1}^n \tilde{x}_i^{(j)}\)
    d) \(\hat{\sigma}^{(j)} := \lambda\sqrt{\frac{1}{n-1}\sum_{i=1}^n \left(\tilde{x}_i^{(j)} - \hat{\mu}^{(j)}\right)^2}\)

So now that the theory is out of the way, let's talk about using this in applied statistics. I am an R programmer, but this algorithm can easily be adapted to other programming languages. I do my best to make the code as general as possible (avoiding language-specific functionality) so that you can take it and drop it into Python, Matlab, etc. That being said, I will outline the "pre-programmed" functions you will need; they should be available in most modern mathematical programming languages. Note: Excel is not, and never will be, a suitable platform for mathematical and/or scientific computing.

  1. mean (mean in R)
  2. median (median in R)
  3. square root (sqrt in R)
  4. scaled median absolute deviation (mad in R)
  5. normal cumulative distribution function (pnorm in R)
  6. normal probability density function (dnorm in R)

The version of the algorithm I am using is a Frankenstein's monster assembled from ISO 13528, the proceedings of the Analytical Methods Committee of the RSC, and the R metRology library. You will need the numerical functions for the normal distribution to estimate the consistency factor for the standard deviation estimate. I use the terminology given to this algorithm in ISO 13528, which is widely referenced across international standards in chemistry and metrology, so people looking for an explanation of Algorithm A will find it here.
algorithmA <- function(x, k = 1.5, max.iter = 100, rel.tol = .Machine$double.eps^0.25, verbose = FALSE) {
  # Sanity checks to make sure the input makes sense
  stopifnot(max.iter > 0, rel.tol > 0, length(x) > 1)
  if (!is.numeric(x)) {
    stop("algorithmA expects x to be an object of type \"numeric\".")
  }
  if (!(class(max.iter) %in% c("numeric", "integer"))) {
    stop("algorithmA expects max.iter to be an object of type 'numeric' or 'integer'.")
  }
  # Only use non-NA observations
  xClean <- x[!is.na(x)]
  if (length(xClean) < 2) {
    warning("At most 1 entry of x is not NA. Returning NA results.")
    out <- list(mean = NA, sd = NA, iterations = NA, winsorizedX = NA)
  } else {
    # consistency term lambda and other ideas taken from the algA code in the R library metRology
    # initial robust estimate of location
    xs <- median(xClean)
    # initial robust estimate of scale
    tol <- ss <- mad(xClean)
    theta <- 2 * pnorm(abs(k)) - 1
    lambda <- 1/sqrt(theta + (1 - theta) * k^2 - 2 * k * dnorm(k))
    delta <- k * ss
    i <- 1
    while (tol > (rel.tol * ss) & i < max.iter) {
      if (verbose) {
        cat(k, ": ", "mu = ", xs, "; s = ", ss, "\n", sep = "")
      }
      # get a new estimate of delta and winsorize
      ss.old <- ss
      delta <- k * ss.old
      x.tilde <- xClean
      # create the pseudo-values by clipping at xs +/- delta
      x.tilde[x.tilde < xs - delta] <- xs - delta
      x.tilde[x.tilde > xs + delta] <- xs + delta
      # compute the mean and standard deviation of the pseudo-values
      xs <- mean(x.tilde)
      ss <- lambda * sqrt(sum((x.tilde - xs)^2)/(length(x.tilde) - 1))
      # update the stopping criterion and the iterator
      tol <- abs(ss.old - ss)
      i <- i + 1
    }
    if (!exists("x.tilde", inherits = FALSE)) {
      x.tilde <- xClean
    }
    out <- list(mean = xs, sd = ss, iterations = i - 1, winsorizedX = x.tilde)
  }
  return(out)
}
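A usage sketch on a contaminated normal sample of the kind discussed below; the sample size and the 1% contamination fraction are arbitrary choices for illustration.

# Try algorithmA() on a contaminated normal sample
set.seed(42)
n <- 1000
bad <- rbinom(n, 1, 0.01) == 1                          # ~1% contaminated observations
x <- ifelse(bad, rnorm(n, 5, 0.5), rnorm(n, 0, 1))
res <- algorithmA(x, k = 1.5)
c(classical = mean(x), robust = res$mean)               # robust mean stays near 0
c(classical = sd(x), robust = res$sd)
# density of the original vs. winsorized data shows what the algorithm does
plot(density(x)); lines(density(res$winsorizedX), col = 'red')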


We get several interesting things out of this algorithm: an estimate of the mean, an estimate of the standard deviation, and the winsorized data. Making a density plot of the winsorized data along with the original data makes it very clear what the algorithm is actually doing: it forces extreme observations into the range \([\mu - k\sigma, \mu + k\sigma]\). This is related to the simple z-score rule, where observations outside \(\pm 3\sigma\) are considered unlikely to occur by random chance alone. Instead of removing outliers completely, the idea is to re-weight them so that they conform to the "true" distribution. The following picture shows a contaminated normal distribution \((1-\varepsilon)N(0,1) + \varepsilon N(5,0.5^2)\), where \(\varepsilon = 1\%\) of observations come from the contamination distribution. Notice the secondary minor mode centered around 5.


Here \(k = 1.5\); increasing the value of \(k\) will "smooth" out the two winsorized modes at roughly \(\pm 2\), but the estimator becomes more susceptible to outliers as \(k\) grows.

Works Cited

  1. P.J. Huber, Robust Estimation of a Location Parameter, The Annals of Mathematical Statistics 1964, 73-101.
  2. ISO. ISO 13528: Statistical Methods for use in Proficiency Testing by Interlaboratory Comparison, 2015.
  3. M. Koller, Robust Estimation of Linear Mixed Models, PhD Dissertation, 2013.
  4. S.L.R. Ellison, metRology R library, 2015.
  5. Analytical Methods Committee - statistical subcommittee, Robust Statistics - How not to reject outliers: Part 1. Basic Concepts, Analyst, 1989, 1693-1697.