## Monthly Archives: September 2016

## Efficient even for data sets

Data analysis — and particularly big-data analysis — is often a matter of fitting data to some sort of mathematical model. The most familiar example of this might be linear regression, which finds a line that approximates a distribution of data points. But fitting data to probability distributions, such as the familiar bell curve, is just as common.

If, however, a data set has just a few corrupted entries — say, outlandishly improbable measurements — standard data-fitting techniques can break down. This problem becomes much more acute with high-dimensional data, or data with many variables, which is ubiquitous in the digital age.

Since the early 1960s, it’s been known that there are algorithms for weeding corruptions out of high-dimensional data, but none of the algorithms proposed in the past 50 years are practical when the variable count gets above, say, 12.

That’s about to change. Earlier this month, at the IEEE Symposium on Foundations of Computer Science, a team of researchers from MIT’s Computer Science and Artificial Intelligence Laboratory, the University of Southern California, and the University of California at San Diego presented a new set of algorithms that can efficiently fit probability distributions to high-dimensional data.

Remarkably, at the same conference, researchers from Georgia Tech presented a very similar algorithm.

The pioneering work on “robust statistics,” or statistical methods that can tolerate corrupted data, was done by statisticians, but both new papers come from groups of computer scientists. That probably reflects a shift of attention within the field, toward the computational efficiency of model-fitting techniques.

“From the vantage point of theoretical computer science, it’s much more apparent how rare it is for a problem to be efficiently solvable,” says Ankur Moitra, the Rockwell International Career Development Assistant Professor of Mathematics at MIT and one of the leaders of the MIT-USC-UCSD project. “If you start off with some hypothetical thing — ‘Man, I wish I could do this. If I could, it would be robust’ — you’re going to have a bad time, because it will be inefficient. You should start off with the things that you know that you can efficiently do, and figure out how to piece them together to get robustness.”

### Resisting corruption

To understand the principle behind robust statistics, Moitra explains, consider the normal distribution — the bell curve, or in mathematical parlance, the one-dimensional Gaussian distribution. The one-dimensional Gaussian is completely described by two parameters: the mean, or average, value of the data, and the variance, which is a measure of how quickly the data spreads out around the mean.
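For reference, the two-parameter description above corresponds to the standard textbook density of the one-dimensional Gaussian. The symbols $\mu$ (mean) and $\sigma^2$ (variance) are conventional notation and do not appear in the article itself:

$$
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

Once $\mu$ and $\sigma^2$ are fixed, the whole curve is determined, which is why fitting a Gaussian amounts to estimating those two numbers.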

If the data in a data set (say, people’s heights in a given population) is well-described by a Gaussian distribution, then the mean is just the arithmetic average. But suppose you have a data set consisting of height measurements of 100 women, and while most of them cluster around 64 inches, some a little higher, some a little lower, one of them, for some reason, is 1,000 inches. Taking the arithmetic average will peg a woman’s mean height at roughly 6 feet 1 inch (about 73.4 inches), not 5 feet 4 inches.
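The arithmetic in the height example can be checked with a few lines of Python. This is a minimal sketch under simplifying assumptions: the 99 uncorrupted heights are all set to exactly 64 inches, and the median is shown only as one familiar estimator that a single outlier cannot move. It is not the researchers' algorithm, and `feet_and_inches` is a helper invented for this illustration.

```python
from statistics import mean, median

# Hypothetical data: 99 heights of exactly 64 inches plus one corrupted entry.
heights = [64.0] * 99 + [1000.0]

avg = mean(heights)    # dragged upward by the single 1,000-inch value
mid = median(heights)  # middle value; unaffected by one extreme entry


def feet_and_inches(inches):
    """Split a height in inches into whole feet and leftover inches."""
    feet, rem = divmod(inches, 12)
    return int(feet), rem


print(f"mean   = {avg:.2f} in -> {feet_and_inches(avg)}")
print(f"median = {mid:.2f} in -> {feet_and_inches(mid)}")
```

One corrupted value out of 100 shifts the average by more than 9 inches, while the median stays at 64 inches.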

## The basis for machine learning systems

In recent years, the best-performing systems in artificial-intelligence research have come courtesy of neural networks, which look for patterns in training data that yield useful predictions or classifications. A neural net might, for instance, be trained to recognize certain objects in digital images or to infer the topics of texts.

But neural nets are black boxes. After training, a network may be very good at classifying data, but even its creators will have no idea why. With visual data, it’s sometimes possible to automate experiments that determine which visual features a neural net is responding to. But text-processing systems tend to be more opaque.

At the Association for Computational Linguistics’ Conference on Empirical Methods in Natural Language Processing, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) will present a new way to train neural networks so that they provide not only predictions and classifications but rationales for their decisions.

“In real-world applications, sometimes people really want to know why the model makes the predictions it does,” says Tao Lei, an MIT graduate student in electrical engineering and computer science and first author on the new paper. “One major reason that doctors don’t trust machine-learning methods is that there’s no evidence.”

“It’s not only the medical domain,” adds Regina Barzilay, the Delta Electronics Professor of Electrical Engineering and Computer Science and Lei’s thesis advisor. “It’s in any domain where the cost of making the wrong prediction is very high. You need to justify why you did it.”

“There’s a broader aspect to this work, as well,” says Tommi Jaakkola, an MIT professor of electrical engineering and computer science and the third coauthor on the paper. “You may not want to just verify that the model is making the prediction in the right way; you might also want to exert some influence in terms of the types of predictions that it should make. How does a layperson communicate with a complex model that’s trained with algorithms that they know nothing about? They might be able to tell you about the rationale for a particular prediction. In that sense it opens up a different way of communicating with the model.”

## System accounts for the deflection

MIT researchers have developed a technique for recovering visual information from light that has scattered because of interactions with the environment — such as passing through human tissue.

The technique could lead to medical-imaging systems that use visible light, which carries much more information than X-rays or ultrasound waves, or to computer vision systems that work in fog or drizzle. The difficulty of building such vision systems has been a major obstacle to self-driving cars.

In experiments, the researchers fired a laser beam through a “mask” — a thick sheet of plastic with slits cut through it in a certain configuration, such as the letter A — and then through a 1.5-centimeter “tissue phantom,” a slab of material designed to mimic the optical properties of human tissue for purposes of calibrating imaging systems. Light scattered by the tissue phantom was then collected by a high-speed camera, which could measure the light’s time of arrival.

From that information, the researchers’ algorithms were able to reconstruct an accurate image of the pattern cut into the mask.

“The reason our eyes are sensitive only in this narrow part of the spectrum is because this is where light and matter interact most,” says Guy Satat, a graduate student at the MIT Media Lab and first author on the new paper. “This is why X-ray is able to go inside the body, because there is very little interaction. That’s why it can’t distinguish between different types of tissue, or see bleeding, or see oxygenated or deoxygenated blood.”

The imaging technique’s potential applications in automotive sensing may be even more compelling than those in medical imaging, however. Many experimental algorithms for guiding autonomous vehicles are highly reliable under good illumination, but they fall apart completely in fog or drizzle; computer vision systems misinterpret the scattered light as having reflected off of objects that don’t exist. The new technique could address that problem.

Satat’s coauthors on the new paper, published today in *Scientific Reports,* are three other members of the Media Lab’s Camera Culture group: Ramesh Raskar, the group’s leader, Satat’s thesis advisor, and an associate professor of media arts and sciences; Barmak Heshmat, a research scientist; and Dan Raviv, a postdoc.

### Expanding circles

Like many of the Camera Culture group’s projects, the new system relies on a pulsed laser that emits ultrashort bursts of light, and a high-speed camera that can distinguish the arrival times of different groups of photons, or light particles. When a light burst reaches a scattering medium, such as a tissue phantom, some photons pass through unmolested; some are only slightly deflected from a straight path; and some bounce around inside the medium for a comparatively long time. The first photons to arrive at the sensor have thus undergone the least scattering; the last to arrive have undergone the most.
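The link between arrival time and scattering can be illustrated with a toy simulation. This is only a sketch under made-up assumptions: each photon scatters a random number of times, and each scattering event adds a random delay in picoseconds. The delay model, the numbers, and the `simulate_photon` helper are all hypothetical. None of it comes from the paper, and it does not reproduce the researchers' reconstruction algorithm.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def simulate_photon(base_time_ps=50.0):
    """Return (arrival_time_ps, scatter_count) for one hypothetical photon.

    Assumption for illustration: every scattering event adds a random
    delay of 1 to 5 picoseconds to the unscattered travel time.
    """
    scatters = random.randint(0, 20)
    delay = sum(random.uniform(1.0, 5.0) for _ in range(scatters))
    return base_time_ps + delay, scatters


# Simulate a burst of photons and order them by arrival time,
# as a time-resolving camera would record them.
photons = sorted((simulate_photon() for _ in range(10_000)), key=lambda p: p[0])

# Compare the earliest and latest 10 percent of arrivals.
n = len(photons) // 10
mean_early = sum(s for _, s in photons[:n]) / n
mean_late = sum(s for _, s in photons[-n:]) / n

print(f"mean scatter count, earliest 10%: {mean_early:.2f}")
print(f"mean scatter count, latest 10%:   {mean_late:.2f}")
```

Because the toy model ties delay to scattering by construction, the early photons are the least-scattered ones. The point of the sketch is only that sorting by arrival time separates them from the heavily scattered rest.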