LMS

Usage     Output     Method     Options     Examples     References

LMS computes least median of squares regression, a highly robust procedure that is useful for outlier detection. It has the highest possible “breakdown” point: up to 50% of the data can be replaced with bad numbers and the estimator will still yield a consistent estimate. Proper standard errors (such as asymptotically normal ones) for LMS coefficients are not known at present.

LMS (ALL, LTS, MOST, PRINT, SILENT, SUBSETS=<value>, TERSE) <dependent variable> <list of independent variables> ;

Usage

To estimate by least median of squares in TSP, use the LMS command just like the OLSQ command. For example,

LMS CONS,C,GNP ;

estimates the consumption function. The PRINT option prints the outliers; alternatively, you can define your own set of outliers by screening on the residuals, which are stored in @RES.

Output

The usual regression output is printed and stored (see OLSQ for further discussion). The number of possible subsets and the best subset found are also printed. The following results are stored:

variable   type     length        description

@LHV       list     1             name of dependent variable
@RNMS      list     #vars         names of independent variables
@YMEAN     scalar   1             mean of dependent variable
@SDEV      scalar   1             standard deviation of dependent variable
@SSR       scalar   1             sum of squared residuals
@S2        scalar   1             standard error squared
@S         scalar   1             standard error of estimate
@RSQ       scalar   1             R-squared
@ARSQ      scalar   1             adjusted R-squared
@LMHET     scalar   1             LM test for heteroskedasticity
%LMHET     scalar   1             p-value for heteroskedasticity test
@DW        scalar   1             Durbin-Watson statistic
@PHI       scalar   1             median of squares for final estimate
@STDDEV    scalar   1             standard deviation of residuals, dropping outliers
@NCOEF     scalar   1             number of coefficients (variables)
@NCID      scalar   1             number of identified coefficients (<= @NCOEF)
@COEF      vector   #vars         estimated coefficients
@SES       vector   #vars         standard errors of the estimated coefficients
@T         vector   #vars         t-statistics for the estimates (null is zero)
%T         vector   #vars         p-values corresponding to the t-statistics
@VCOV      matrix   #vars*#vars   estimated variance-covariance of the estimated coefficients
@RES       series   #obs          residuals (dependent variable minus fitted values)
@FIT       series   #obs          fitted values of dependent variable

Method

The LMS estimator minimizes the square (or, equivalently, the absolute value) of the median residual with respect to the coefficient vector b:

    min_b  med_i (y_i - x_i'b)^2

Clearly this ignores the sizes of the largest residuals in the sample (i.e. those whose absolute values are larger than the median), so it will be robust to the presence of any extreme data points (outliers).

If there are K independent variables (excluding the constant term), LMS considers many different subsets of K observations each. An exact-fit regression line is computed for each subset, and residuals are computed for the remaining observations using these coefficients. The residuals are then sorted to find the median, with a slight adjustment for the constant term. If K or the number of observations is large, the number of possible subsets can be enormous (and sorting time lengthy), so in that case random subsets are usually used instead. This is controlled with the ALL, MOST, and SUBSETS= options; the sketch below illustrates the procedure.
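
The following Python sketch is an illustrative reconstruction of this subset search, not TSP's actual code (which was adapted from PROGRESS); the function name is hypothetical, the slight constant-term adjustment is omitted, and singular subsets are simply skipped.

import numpy as np
from itertools import combinations
from math import comb

def lms_search(y, X, n_subsets=3000, seed=0):
    """Elemental-subset LMS search: y is (n,), X is (n, k) with the
    constant included as a column of ones."""
    n, k = X.shape
    rng = np.random.default_rng(seed)
    if comb(n, k) <= n_subsets:
        subsets = combinations(range(n), k)          # enumerate all subsets
    else:
        subsets = (rng.choice(n, size=k, replace=False)
                   for _ in range(n_subsets))        # random subsets
    best_b, best_med = None, np.inf
    for idx in subsets:
        idx = list(idx)
        try:
            b = np.linalg.solve(X[idx], y[idx])      # exact fit through k points
        except np.linalg.LinAlgError:
            continue                                 # singular subset: skip it
        med = np.median((y - X @ b) ** 2)            # median squared residual
        if med < best_med:
            best_b, best_med = b, med
    return best_b, best_med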

The result is not necessarily the global Least Median of Squares optimum, but a close and feasible approximation to it, with the same properties for outlier detection. Least Trimmed Squares (LTS) has about the same properties. The LMS estimator occasionally produces a non-unique estimate of the coefficient vector b; TSP reports the number of non-unique subsets in this case. It uses the following steps to determine the preferred solution:

1. Try an arithmetic average of the parameter vectors. If this yields the same objective function value, this is the reported solution. (This is equivalent to using the average of the two middle values for the median, when there is an even number of observations).

2. Choose the parameter vector with minimum L1 norm (sum of absolute values of the coefficients). If this yields a tie, use rule 3.

3. Choose the parameter vector with minimum absolute value of the first coefficient. If this yields a tie, pick the first vector of those tied.
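
A minimal Python sketch of these three rules, assuming ties is a list of coefficient vectors that all attain the minimal objective value; break_ties and objective are illustrative names, not TSP internals.

import numpy as np

def break_ties(ties, objective):
    """ties: list of 1-D coefficient arrays attaining the same minimum."""
    # Rule 1: try the arithmetic average of the tied vectors.
    avg = np.mean(ties, axis=0)
    if objective(avg) == objective(ties[0]):
        return avg
    # Rule 2: keep the vectors with the smallest L1 norm.
    norms = np.array([np.abs(b).sum() for b in ties])
    candidates = [b for b, nrm in zip(ties, norms) if nrm == norms.min()]
    if len(candidates) == 1:
        return candidates[0]
    # Rule 3: smallest |first coefficient|; argmin picks the first on a tie.
    return candidates[int(np.argmin([abs(b[0]) for b in candidates]))]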

An extremely rough estimate of the variance-covariance of the estimated coefficients is computed with an OLS-type formula:

    V(b) = s^2 (X'X)^(-1)

where the estimate s^2 of σ squared is computed after deleting the largest residuals. This estimate is not asymptotically normal and is likely to be an underestimate, so it should not be used for serious hypothesis testing; its accuracy depends on how the outliers are generated by the underlying model.
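
A minimal sketch of this formula, assuming a boolean mask keep that flags the non-outlier observations (TSP's own outlier cutoff is not reproduced here):

import numpy as np

def rough_vcov(y, X, b, keep):
    """OLS-type variance estimate using only the kept (non-outlier) residuals."""
    res = y - X @ b
    k = X.shape[1]
    s2 = (res[keep] ** 2).sum() / (keep.sum() - k)   # sigma^2 from kept residuals
    return s2 * np.linalg.inv(X.T @ X)               # s^2 * (X'X)^(-1)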

The code used in TSP was adapted from Rousseeuw's PROGRESS program, which is available from his web page (referenced below).

Options

ALL/NOALL  uses all possible observation subsets (see Method), even if there are over one million of them.

LTS/NOLTS  computes the Least Trimmed Squares estimates, which minimize the sum of the squared residuals from the smallest up to the median (instead of LMS, which minimizes just the median squared residual). Usually the LTS and LMS estimates are fairly close to each other; the sketch after these options contrasts the two objectives.

MOST/NOMOST  uses all possible subsets, unless the number of subsets is one million or more (in which case random subsets are used).

PRINT/NOPRINT  prints better subsets (as progress towards a minimum is made), and the final outliers.

SILENT/NOSILENT  suppresses all output.

SUBSETS=  the number of random subsets to use. The default ranges from 500 to 3000. If the number of possible subsets is less than the number of random subsets, all possible subsets are evaluated systematically.

TERSE/NOTERSE suppresses all printed output except the table of coefficient estimates and the value of the objective function.
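
To make the LMS/LTS distinction concrete, here is a small Python sketch of the two objectives applied to a residual vector r; trimming exactly at the median position follows the LTS description above and is an assumption about the precise coverage.

import numpy as np

def lms_objective(r):
    # LMS: the median squared residual.
    return np.median(r ** 2)

def lts_objective(r):
    # LTS: sum of the squared residuals from the smallest up to the median.
    r2 = np.sort(r ** 2)
    h = (len(r2) + 1) // 2           # position of the median
    return r2[:h].sum()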

Examples

LMS Y C Z1-Z6;                                 ?  uses 3000 random subsets

LMS (MOST) Y C Z1-Z6 ;                   ?  probably all subsets

?  2000 subsets (but not the same set as the default)

LMS (SUBSETS=2000) Y C Z1-Z6 ;

References

Rousseeuw, P. J., “Least Median of Squares Regression,” Journal of the American Statistical Association 79 (1984), pp. 871-880.

Rousseeuw, P. J., and Leroy, A. M., Robust Regression and Outlier Detection, Wiley, 1987.

Rousseeuw, P. J., and Wagner, J., “Robust Regression with a Distributed Intercept Using Least Median of Squares,” Computational Statistics and Data Analysis 17 (1994), pp. 66-68.

http://www.agoras.ua.ac.be         (Rousseeuw / Progress)