Efficient even for data sets

Data analysis — and particularly big-data analysis — is often a matter of fitting data to some sort of mathematical model. The most familiar example of this might be linear regression, which finds a line that approximates a distribution of data points. But fitting data to probability distributions, such as the familiar bell curve, is just as common.

If, however, a data set has just a few corrupted entries — say, outlandishly improbable measurements — standard data-fitting techniques can break down. This problem becomes much more acute with high-dimensional data, or data with many variables, which is ubiquitous in the digital age.

Since the early 1960s, it’s been known that there are algorithms for weeding corruptions out of high-dimensional data, but none of the algorithms proposed in the past 50 years are practical when the variable count gets above, say, 12.

That’s about to change. Earlier this month, at the IEEE Symposium on Foundations of Computer Science, a team of researchers from MIT’s Computer Science and Artificial Intelligence Laboratory, the University of Southern California, and the University of California at San Diego presented a new set of algorithms that can efficiently fit probability distributions to high-dimensional data.

Remarkably, at the same conference, researchers from Georgia Tech presented a very similar algorithm.

The pioneering work on “robust statistics,” or statistical methods that can tolerate corrupted data, was done by statisticians, but both new papers come from groups of computer scientists. That probably reflects a shift of attention within the field, toward the computational efficiency of model-fitting techniques.

“From the vantage point of theoretical computer science, it’s much more apparent how rare it is for a problem to be efficiently solvable,” says Ankur Moitra, the Rockwell International Career Development Assistant Professor of Mathematics at MIT and one of the leaders of the MIT-USC-UCSD project. “If you start off with some hypothetical thing — ‘Man, I wish I could do this. If I could, it would be robust’ — you’re going to have a bad time, because it will be inefficient. You should start off with the things that you know that you can efficiently do, and figure out how to piece them together to get robustness.”

Resisting corruption

To understand the principle behind robust statistics, Moitra explains, consider the normal distribution: the bell curve, or in mathematical parlance, the one-dimensional Gaussian distribution. The one-dimensional Gaussian is completely described by two parameters: the mean, or average, value of the data, and the variance, which is a measure of how widely the data spreads out around the mean.
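For reference, the standard textbook density of a one-dimensional Gaussian shows how those two parameters fix the whole curve. The symbols are not used in the article and are introduced here only for illustration: \mu stands for the mean and \sigma^2 for the variance.

```latex
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

Once \mu and \sigma^2 are known, nothing else about the distribution remains to be estimated, which is why corrupting the estimates of just these two numbers corrupts the whole fit.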

If the data in a data set (say, people’s heights in a given population) is well-described by a Gaussian distribution, then the mean is just the arithmetic average. But suppose you have a data set consisting of height measurements of 100 women, and while most of them cluster around 64 inches, some a little higher, some a little lower, one of them, for some reason, is 1,000 inches. Taking the arithmetic average will peg a woman’s mean height at about 6 feet 1 inch, not 5 feet 4 inches.
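The arithmetic in this example can be checked with a short sketch. It makes a simplifying assumption: the 99 uncorrupted heights are set to exactly 64 inches rather than scattered around that value. The median is included only as a familiar one-dimensional estimator that resists a single outlier; it is not the researchers' algorithm, which targets high-dimensional data.

```python
from statistics import mean, median

# Hypothetical data: 99 heights of exactly 64 inches (a simplification of
# "cluster around 64 inches") plus one corrupted entry of 1,000 inches.
heights = [64.0] * 99 + [1000.0]

avg = mean(heights)    # arithmetic average, pulled upward by the outlier
med = median(heights)  # middle value, unaffected by one extreme entry

# Convert the average from inches to feet and inches for comparison.
feet, inches = divmod(avg, 12)

print(f"mean   = {avg:.2f} in ({int(feet)} ft {inches:.1f} in)")
print(f"median = {med:.2f} in")
```

A single bad measurement out of 100 shifts the average by more than nine inches, while the median stays at 64 inches.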