Data scientists could accomplish in days

Last year, MIT researchers presented a system that automated a crucial step in big-data analysis: the selection of a “feature set,” or aspects of the data that are useful for making predictions. The researchers entered the system in several data science contests, where it outperformed most of the human competitors and took only hours instead of months to perform its analyses.

This week, in a pair of papers at the IEEE International Conference on Data Science and Advanced Analytics, the team described an approach to automating most of the rest of the process of big-data analysis: the preparation of the data for analysis and even the specification of problems that the analysis might be able to solve.

The researchers believe that, as with last year's system, the new systems could complete in days tasks that used to take data scientists months.

“The goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in,” says Max Kanter MEng ’15, who is first author on last year’s paper and one of this year’s papers. “[Data scientists want to know], ‘Why don’t you show me the top 10 things that I can do the best, and then I’ll dig down into those?’ So [these methods are] shrinking the time between getting a data set and actually producing value out of it.”

Both papers focus on time-varying data, which reflects observations made over time, and they assume that the goal of analysis is to produce a probabilistic model that will predict future events on the basis of current observations.

Real-world problems

The first paper describes a general framework for analyzing time-varying data. It splits the analytic process into three stages: labeling the data, or categorizing salient data points so they can be fed to a machine-learning system; segmenting the data, or determining which time sequences of data points are relevant to which problems; and “featurizing” the data, the step performed by the system the researchers presented last year.
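
To make the three stages concrete, here is a minimal Python sketch of a label-segment-featurize pipeline on toy time-varying data. It is an illustration only, not the researchers' framework: the `Event` record, the threshold-based labeling rule, the fixed look-back window, and the three summary features are all assumptions chosen for brevity.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Event:
    entity: str   # hypothetical ID, e.g. a customer or a machine
    t: int        # time step of the observation
    value: float  # the observed measurement


def label(events, cutoff, threshold):
    """Label an entity True if any value observed after the cutoff exceeds the threshold."""
    labels = {}
    for e in events:
        labels.setdefault(e.entity, False)
        if e.t > cutoff and e.value > threshold:
            labels[e.entity] = True
    return labels


def segment(events, cutoff, window):
    """Keep, per entity, only the observations with cutoff - window < t <= cutoff."""
    segments = {}
    for e in events:
        if cutoff - window < e.t <= cutoff:
            segments.setdefault(e.entity, []).append(e.value)
    return segments


def featurize(segments):
    """Summarize each segment as (mean, max, last value minus first value)."""
    return {k: (mean(v), max(v), v[-1] - v[0]) for k, v in segments.items()}


def build_training_set(events, cutoff, window, threshold):
    """Run the three stages and pair each feature vector with its label."""
    # Sort so that each segment is in time order before featurizing.
    events = sorted(events, key=lambda e: (e.entity, e.t))
    labels = label(events, cutoff, threshold)
    feats = featurize(segment(events, cutoff, window))
    return [(feats[k], labels[k]) for k in sorted(feats)]


events = [
    Event("a", 1, 1.0), Event("a", 2, 2.0), Event("a", 3, 3.0), Event("a", 4, 9.0),
    Event("b", 1, 5.0), Event("b", 2, 4.0), Event("b", 3, 3.0), Event("b", 4, 1.0),
]
training_set = build_training_set(events, cutoff=3, window=2, threshold=5.0)
```

Note that the features come only from observations at or before the cutoff, while the labels come only from observations after it, which matches the article's framing of predicting future events from current observations. The resulting pairs could then be handed to any standard classifier.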

The second paper describes a new language for describing data-analysis problems and a set of algorithms that automatically recombine data in different ways, to determine what types of prediction problems the data might be useful for solving.
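
The idea of recombining data to surface candidate prediction problems can be sketched as a simple enumeration. The Python below is a hypothetical illustration, not the language or algorithms from the paper: it assumes a problem can be described by a target column, an aggregation, and a prediction horizon, and the column names are invented for the example.

```python
from itertools import product


def enumerate_problems(columns, aggregations, horizons):
    """List every (column, aggregation, horizon) combination as a candidate problem."""
    problems = []
    for col, agg, h in product(columns, aggregations, horizons):
        problems.append({
            "target": col,
            "aggregation": agg,
            "horizon": h,
            "description": f"Predict the {agg} of {col} over the next {h} time steps",
        })
    return problems


# Hypothetical columns from a data set of customer activity.
candidates = enumerate_problems(
    columns=["purchase_amount", "session_length"],
    aggregations=["sum", "max"],
    horizons=[7, 30],
)
for p in candidates:
    print(p["description"])
```

A real system would also need a way to rank or filter such candidates, since exhaustive enumeration grows quickly with the number of columns and operations.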

According to Kalyan Veeramachaneni, a principal research scientist at MIT’s Laboratory for Information and Decision Systems and senior author on all three papers, the work grew out of his team’s experience with real data-analysis problems brought to it by industry researchers.

“Our experience was, when we got the data, the domain experts and data scientists sat around the table for a couple months to define a prediction problem,” he says. “The reason I think that people did that is they knew that the label-segment-featurize process takes six to eight months. So we better define a good prediction problem to even start that process.”