Tuesday, July 26, 2016

Robust estimation of parameters for a Gaussian random sample


Anyone who has ever worked with data has come across outliers. Outliers, simply put, are extreme observations that do not seem to belong with the rest of the data. Often outliers are caused by data transcription errors (one too many decimal places in an Excel sheet) or a failure to follow directions, but every so often an outlier comes along that cannot be so easily explained away.

There is a plethora of different approaches for identifying and dealing with outliers. This post is concerned with the idea of robust estimation, i.e., recovering reliable parameter estimates in the presence of outliers. In the great big world of applied statistics, we almost always try to shove everything into the normal distribution box. Unless otherwise indicated, the data are assumed to be distributed \(x_1, \ldots, x_n \overset{iid}{\sim} N(\mu, \sigma^2)\).

We are going to focus on an idea called Huberization, which as you may have guessed is named after a fellow named Huber. His 1964 paper on the robust estimation of a location parameter is a landmark in the field. He begins with the location estimator probably first discovered by Gauss.

$$\underset{T}{\operatorname{argmin}} \sum_{i=1}^n \left(x_i - T\right)^2$$

The minimizer is of course \(\bar{x}\). Huber recognized that, if an outlier is present, this estimate of location becomes unreliable, since the outlier is an observation from a different distribution than the rest of \(x_1, \ldots, x_{n-1}\). So he asked himself: "Can I create estimators such that extreme observations do not cause the estimates of \(\bar{x}\) and \(s^2\) to grow without bound?" The idea he came up with was to replace the objective functions for \(\bar{x}\) and \(s^2\) with functions \(\rho(\cdot)\) such that \(\rho(x) < \infty\) for all \(x \in \mathbb{R}\). Furthermore, Huber wanted these functions to be differentiable so that he could work with them using a maximum-likelihood type of approach.
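
For concreteness, here is a minimal R sketch of the canonical choice of \(\rho\) (quadratic in the middle, linear in the tails) together with its derivative \(\psi\). The names huberRho and huberPsi are mine, not anything from the paper.

huberRho <- function(x, k = 1.5) {
  # quadratic for |x| <= k, linear beyond, so a single huge observation
  # contributes to the objective at a linear rather than quadratic rate
  ifelse(abs(x) <= k, x^2 / 2, k * abs(x) - k^2 / 2)
}

huberPsi <- function(x, k = 1.5) {
  # derivative of huberRho: the identity in the middle, clipped at +/- k
  pmax(-k, pmin(k, x))
}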

Let's take a brief moment to discuss the classic MLEs for normally distributed data. Anyone who has ever been put through the Casella & Berger wringer will have derived the following:

\begin{align}
\hat{\mu}_{MLE} &= \bar{x} \\
\hat{\sigma}^2_{MLE} &= \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2
\end{align}

Apart from \(\hat{\sigma}^2_{MLE}\) being biased, these estimators have some very nice properties: they achieve the smallest possible sampling variance, and \(\bar{x}\) is unbiased. Those are damn nice properties! However, due to the definition of \(\bar{x}\), these estimates are also very susceptible to outliers. This is especially clear for the variance, where a single arbitrarily large \(x_i\) causes the estimate to explode, since we are computing a sum of squares.
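
A quick sanity check (my own sketch, not from the post) shows how a single bad observation drags \(\bar{x}\) and \(s\) around while the median and MAD barely move:

set.seed(1)
x    <- rnorm(100)           # 100 well-behaved standard normal observations
xBad <- c(x, 50)             # ... plus one transcription error

c(mean(x), mean(xBad))       # the mean is dragged toward the outlier
c(sd(x), sd(xBad))           # the standard deviation blows up
c(median(x), median(xBad))   # the median hardly moves
c(mad(x), mad(xBad))         # the scaled MAD hardly moves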

So we wish to replace \(\hat{\mu}_{MLE}\) and \(\hat{\sigma}^2_{MLE}\) with estimators that are not unduly affected by outliers but still have a reasonably small sampling variance. Huber's famous Proposal 2 suggests the following. Choose a fixed value for a tuning constant \(k\) that reflects how heavily you expect the data to be contaminated; the smaller \(k\), the more aggressively extreme observations are pulled in (\(k = 1.5\) is the default used below). Once \(k\) is fixed, solve the following system for robust estimates of location (\(T\)) and scale (\(S\)):

\begin{align}
\sum_{i=1}^n \psi\left(k, \frac{x_i-T}{S}\right) &= 0 \\
\sum_{i=1}^n \psi^2\left(k, \frac{x_i-T}{S}\right) - n\lambda &= 0
\end{align}

where \(\psi(x) = \rho'(x)\) and \(\lambda = E_{\Phi}\left[\psi^2(k, x)\right]\). The \(\lambda\) term is the expectation of \(\psi^2(k, x)\) taken under the standard normal distribution, and it is needed to make the estimator \(S\) consistent when the data really are normal. Recall that an estimator is consistent if it converges in probability to the true parameter. If you want to learn more, I suggest reading the paper, but beware that it is a minefield of functional analysis.
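
As a quick sanity check (mine, not something from the paper), you can compute \(E_{\Phi}\psi^2(k, x)\) by numerical integration and compare it to the closed-form expression that appears in the code further down, reusing the huberPsi sketch from above:

k     <- 1.5
theta <- 2 * pnorm(k) - 1

# E_Phi[ psi^2(k, x) ]: numerically integrate psi^2 against the standard normal density
lambdaNumeric <- integrate(function(x) huberPsi(x, k)^2 * dnorm(x), -Inf, Inf)$value

# closed form used in the code below: theta + (1 - theta) * k^2 - 2 * k * dnorm(k)
lambdaClosed <- theta + (1 - theta) * k^2 - 2 * k * dnorm(k)

c(lambdaNumeric, lambdaClosed)   # both come out to about 0.778 for k = 1.5

One notational caveat: the \(\lambda\) in Proposal 2 above is \(E_{\Phi}\psi^2 \approx 0.778\), while the \(\lambda\) that multiplies the winsorized standard deviation in the algorithm and code below is \(1/\sqrt{E_{\Phi}\psi^2} \approx 1.134\), the consistency factor quoted in ISO 13528 for \(k = 1.5\).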

So, the definition above is completely useless for the applied statistician / mathematician. What we're gonna do instead, and Huber himself recommended it, is turn this into an equivalent winsorization problem. The idea is to replace the sum of squares with a robust function of the data, \(\rho(x)\), and to create a set of pseudo-values, which are essentially winsorized versions of the original observations. The pseudo-values are determined using \(k\), \(\mu\) and \(\sigma\):

$$\tilde{x}_i = \begin{cases}
x_i & : |x_i - \mu| \leq k\sigma \\
\mu - k\sigma &: x_i < \mu - k\sigma \\
\mu + k\sigma &: x_i > \mu + k\sigma
\end{cases}$$
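
In R the pseudo-values are a single line of vectorized code. This is only a sketch; the values of \(\mu\), \(\sigma\) and the small data vector are made up for illustration:

mu    <- 0       # current estimate of location (placeholder)
sigma <- 1       # current estimate of scale (placeholder)
k     <- 1.5

x       <- c(-0.3, 0.8, -1.1, 0.2, 4.7)   # the last point is an obvious outlier
x.tilde <- pmin(pmax(x, mu - k * sigma), mu + k * sigma)
x.tilde                                   # 4.7 is pulled in to mu + k * sigma = 1.5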

We will now apply an iterative algorithm that searches for the value of \(\sigma\) producing the optimal set of pseudo-values. This does not look like Huber's original proposal, but the two are equivalent: the winsorizing step plays the role of \(\psi(x)\), and \(T\) and \(S\) are updated until they converge to the correct values. The algorithm can be summarized in the following steps:

1. \(\hat{\mu}^{(0)} := \operatorname{median}(x_1, \ldots, x_n)\) and \(\hat{\sigma}^{(0)} := \operatorname{MAD}(x_1, \ldots, x_n)\)
2. compute the initial set of pseudo-values \(\tilde{x}\)
3. while \(\left|\hat{\sigma}^{(j)} - \hat{\sigma}^{(j-1)}\right| > \varepsilon\)
    a) \(\delta^{(j)} := k\hat{\sigma}^{(j-1)}\)
    b) compute \(\tilde{x}^{(j)}\) using \(\delta^{(j)}\)
    c) \(\hat{\mu}^{(j)} := \frac{1}{n}\sum_{i=1}^n \tilde{x}_i^{(j)}\)
    d) \(\hat{\sigma}^{(j)} := \lambda\sqrt{\frac{1}{n-1}\sum_{i=1}^n \left(\tilde{x}_i^{(j)} - \hat{\mu}^{(j)}\right)^2}\)

So now that all the theory is out of the way, let's talk about using this in applied statistics. I am an R programmer, but this algorithm can easily be adapted to other programming languages. I do my best to make the code as general as possible (avoiding language-specific functionality) so that you can take this code and drop it into Python, Matlab, etc. That being said, I will outline the "pre-programmed" functions you will need; they should be available in most modern mathematical programming languages. Note: Excel is not, and never will be, a suitable platform for mathematical and/or scientific computing.

  1. mean (mean in R)
  2. median (median in R)
  3. square root (sqrt in R)
  4. scaled median absolute deviation (mad in R)
  5. normal cumulative distribution function (pnorm in R)
  6. normal probability density function (dnorm in R)

The version of the algorithm I am using is a Frankenstein's monster of algorithms taken from ISO 13528, the Proceedings of the Analytical Methods Committee of the RSC, and the R metRology library. You will need the numerical functions for the normal distribution to estimate the consistency factor for the standard deviation estimate. I use the terminology given to this algorithm in ISO 13528, where it is called Algorithm A. ISO 13528 is widely referenced across international standards in chemistry and metrology, so people looking for an explanation of Algorithm A will find it here.
algorithmA <- function(x, k = 1.5, max.iter = 100, rel.tol = .Machine$double.eps^0.25, verbose = FALSE) {
  # Do stupid checks to make sure the input makes sense
  stopifnot(max.iter > 0, rel.tol > 0, length(x) > 1)
  if (!is.numeric(x)) {
    stop("algorithmA expects x to be an object of type \"numeric\".")
  }
  if (!is.numeric(max.iter)) {
    stop("algorithmA expects max.iter to be an object of type 'numeric' or 'integer'.")
  }

  # Only use non-NA observations
  xClean <- x[!is.na(x)]
  if (length(xClean) < 2) {
    warning("At most 1 entry of x is not NA. Returning NA results.")
    out <- list(mean = NA, sd = NA, iterations = NA, winsorizedX = NA)
  } else {
    # consistency term lambda and other ideas taken from code algA in the R library metRology
    # initial robust estimate of location
    xs <- median(xClean)
    # initial robust estimate of scale
    tol <- ss <- mad(xClean)
    theta  <- 2 * pnorm(abs(k)) - 1
    lambda <- 1 / sqrt(theta + (1 - theta) * k^2 - 2 * k * dnorm(k))
    delta  <- k * ss
    i <- 1
    # initialize the pseudo-values in case the loop is skipped entirely
    x.tilde <- xClean
    while (tol > (rel.tol * ss) & i < max.iter) {
      if (verbose) {
        cat(k, ": ", "mu = ", xs, "; s = ", ss, "\n", sep = "")
      }
      # get new estimate of delta and winsorize
      ss.old <- ss
      delta  <- k * ss.old
      x.tilde <- xClean
      # create the pseudo-values
      # z[z < k] is the vector subset {z: z < k}
      x.tilde[x.tilde < xs - delta] <- xs - delta
      x.tilde[x.tilde > xs + delta] <- xs + delta
      # compute mean and standard deviation of the pseudo-values
      xs <- mean(x.tilde)
      ss <- lambda * sqrt(sum((x.tilde - xs)^2) / (length(x.tilde) - 1))
      # update stopping criterion and iterator
      tol <- abs(ss.old - ss)
      i <- i + 1
    }
    # number of winsorization iterations actually performed
    out <- list(mean = xs, sd = ss, iterations = i - 1, winsorizedX = x.tilde)
  }
  return(out)
}
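
Here is how I would actually call it on the contaminated normal example discussed below; the seed and sample size are arbitrary choices for this sketch:

set.seed(2016)
n   <- 10000
bad <- rbinom(n, 1, 0.01)                       # 1% of observations come from the contamination
x   <- ifelse(bad == 1, rnorm(n, 5, 0.5), rnorm(n, 0, 1))

fit <- algorithmA(x, k = 1.5)

c(classical.mean = mean(x), robust.mean = fit$mean)
c(classical.sd   = sd(x),   robust.sd   = fit$sd)
fit$iterations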


So we get a couple of interesting things out of this algorithm: an estimate of the mean, an estimate of the standard deviation, and the winsorized data. Making a density plot of the winsorized data along with the original data set makes it very clear what the algorithm is actually doing. Essentially it is forcing extreme observations into the range \([\mu - k\sigma, \mu + k\sigma]\). This is related to the simple z-score rule, where observations outside of \(\pm 3\sigma\) are considered unlikely to occur by random chance alone. Hence, instead of removing outliers completely, the idea is to re-weight them so that they conform to the "true" distribution. The following picture is a contaminated normal distribution \((1-\varepsilon)N(0,1) + \varepsilon N(5,0.5^2)\), where \(\varepsilon = 1\%\) of observations come from the contamination distribution. Notice the secondary minor mode centered around 5.
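
Overlaying the two densities takes only a couple of lines of base R graphics, picking up the fit object from the usage sketch above:

plot(density(x), main = "Original vs. winsorized data", xlab = "x", lwd = 2)
lines(density(fit$winsorizedX), col = "red", lwd = 2)
legend("topright", legend = c("original", "winsorized"), col = c("black", "red"), lwd = 2)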


Here \(k = 1.5\), but increasing the value of \(k\) will "smooth" out the two winsorized modes at \(\pm 2\); however, the larger \(k\) becomes, the more susceptible the estimator is to outliers.
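
To see that trade-off numerically, loop over a few values of k and watch the robust estimates drift toward the classical ones (again just a sketch, reusing the simulated x from above):

ks <- c(1.0, 1.5, 2.0, 3.0, 5.0)
sapply(ks, function(kk) {
  fit <- algorithmA(x, k = kk)
  c(k = kk, mean = fit$mean, sd = fit$sd)
})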

Works Cited

  1. P. J. Huber, Robust Estimation of a Location Parameter, The Annals of Mathematical Statistics, 1964, 73-101.
  2. ISO, ISO 13528: Statistical Methods for Use in Proficiency Testing by Interlaboratory Comparison, 2015.
  3. M. Koller, Robust Estimation of Linear Mixed Models, PhD dissertation, 2013.
  4. S. L. R. Ellison, metRology R library, 2015.
  5. Analytical Methods Committee, statistical subcommittee, Robust Statistics - How Not to Reject Outliers: Part 1. Basic Concepts, Analyst, 1989, 1693-1697.
