
Performing in days tasks that take data scientists months

Last year, MIT researchers presented a system that automated a crucial step in big-data analysis: the selection of a “feature set,” or aspects of the data that are useful for making predictions. The researchers entered the system in several data science contests, where it outperformed most of the human competitors and took only hours instead of months to perform its analyses.

This week, in a pair of papers at the IEEE International Conference on Data Science and Advanced Analytics, the team described an approach to automating most of the rest of the process of big-data analysis — the preparation of the data for analysis and even the specification of problems that the analysis might be able to solve.

The researchers believe that, again, their systems could perform in days tasks that used to take data scientists months.

“The goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in,” says Max Kanter MEng ’15, who is first author on last year’s paper and one of this year’s papers. “[Data scientists want to know], ‘Why don’t you show me the top 10 things that I can do the best, and then I’ll dig down into those?’ So [these methods are] shrinking the time between getting a data set and actually producing value out of it.”

Both papers focus on time-varying data, which reflects observations made over time, and they assume that the goal of analysis is to produce a probabilistic model that will predict future events on the basis of current observations.

Real-world problems

The first paper describes a general framework for analyzing time-varying data. It splits the analytic process into three stages: labeling the data, or categorizing salient data points so they can be fed to a machine-learning system; segmenting the data, or determining which time sequences of data points are relevant to which problems; and “featurizing” the data, the step performed by the system the researchers presented last year.
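As a rough illustration of how those three stages fit together, here is a minimal sketch in Python. The function names, column names, and the churn-style prediction task are all hypothetical placeholders, not the researchers’ actual system:

# A minimal sketch of a label-segment-featurize pipeline for time-varying data.
# Every name below (columns, functions, the churn task) is a hypothetical
# placeholder, not the researchers' actual code.
import pandas as pd

def label(events):
    # Labeling: mark the salient outcome to predict, e.g. whether a
    # customer went more than 30 days without a purchase.
    events = events.copy()
    events["churned"] = events["days_since_last_purchase"] > 30
    return events

def segment(events, window="30D"):
    # Segmenting: pick the time windows of observations relevant to each
    # prediction (here, fixed 30-day windows per user).
    segments = []
    for _, user_events in events.groupby("user_id"):
        user_events = user_events.set_index("timestamp").sort_index()
        for _, chunk in user_events.resample(window):
            if not chunk.empty:
                segments.append(chunk)
    return segments

def featurize(segments):
    # Featurizing: summarize each segment with numbers a machine-learning
    # model can consume.
    return pd.DataFrame([
        {
            "n_events": len(chunk),
            "mean_purchase": chunk["purchase_amount"].mean(),
            "label": bool(chunk["churned"].iloc[-1]),
        }
        for chunk in segments
    ])

raw_events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-20", "2016-01-05"]),
    "purchase_amount": [20.0, 35.0, 15.0],
    "days_since_last_purchase": [0, 19, 45],
})
print(featurize(segment(label(raw_events))))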

The second paper describes a new language for describing data-analysis problems and a set of algorithms that automatically recombine data in different ways, to determine what types of prediction problems the data might be useful for solving.

According to Kalyan Veeramachaneni, a principal research scientist at MIT’s Laboratory for Information and Decision Systems and senior author on all three papers, the work grew out of his team’s experience with real data-analysis problems brought to it by industry researchers.

“Our experience was, when we got the data, the domain experts and data scientists sat around the table for a couple months to define a prediction problem,” he says. “The reason I think that people did that is they knew that the label-segment-featurize process takes six to eight months. So we better define a good prediction problem to even start that process.”

An updated memory management scheme for multicore chips

A year ago, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory unveiled a fundamentally new way of managing memory on computer chips, one that would use circuit space much more efficiently as chips continue to comprise more and more cores, or processing units. In chips with hundreds of cores, the researchers’ scheme could free up somewhere between 15 and 25 percent of on-chip memory, enabling much more efficient computation.

Their scheme, however, assumed a certain type of computational behavior that most modern chips do not, in fact, enforce. Last week, at the International Conference on Parallel Architectures and Compilation Techniques — the same conference where they first reported their scheme — the researchers presented an updated version that’s more consistent with existing chip designs and has a few additional improvements.

The essential challenge posed by multicore chips is that they execute instructions in parallel, while in a traditional computer program, instructions are written in sequence. Computer scientists are constantly working on ways to make parallelization easier for computer programmers.

The initial version of the MIT researchers’ scheme, called Tardis, enforced a standard called sequential consistency. Suppose that different parts of a program contain the sequences of instructions ABC and XYZ. When the program is parallelized, A, B, and C get assigned to core 1; X, Y, and Z to core 2.

Sequential consistency doesn’t enforce any relationship between the relative execution times of instructions assigned to different cores. It doesn’t guarantee that core 2 will complete its first instruction — X — before core 1 moves onto its second — B. It doesn’t even guarantee that core 2 will begin executing its first instruction — X — before core 1 completes its last one — C. All it guarantees is that, on core 1, A will execute before B and B before C; and on core 2, X will execute before Y and Y before Z.
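The guarantee can be pictured with a small threading sketch (illustrative only; operating-system threads in Python are not how a multicore chip schedules instructions, but the ordering rule being described is the same):

# Illustration of sequential consistency: each core's own instructions run in
# program order, but the interleaving across cores is unconstrained.
# (Python threads stand in for cores here; this is only an analogy.)
import threading

log = []

def core1():
    for step in ["A", "B", "C"]:   # A before B, B before C -- always
        log.append("core1:" + step)

def core2():
    for step in ["X", "Y", "Z"]:   # X before Y, Y before Z -- always
        log.append("core2:" + step)

t1 = threading.Thread(target=core1)
t2 = threading.Thread(target=core2)
t1.start(); t2.start()
t1.join(); t2.join()

# Any interleaving of the two sequences is a legal outcome under sequential
# consistency: A X B Y C Z, A B C X Y Z, X Y Z A B C, and so on.
print(log)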

The first author on the new paper is Xiangyao Yu, a graduate student in electrical engineering and computer science. He is joined by his thesis advisor and co-author on the earlier paper, Srini Devadas, the Edwin Sibley Webster Professor in MIT’s Department of Electrical Engineering and Computer Science, and by Hongzhe Liu of Algonquin Regional High School and Ethan Zou of Lexington High School, who joined the project through MIT’s Program for Research in Mathematics, Engineering and Science (PRIMES).

Featured discussions on cybersecurity

MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) hosted a summit that brought together cybersecurity experts from business, government, and academia to talk about better ways to combat cyber-threats directed at companies and countries.

Co-organized by the Aspen Institute and CNBC, the “Cambridge Cyber Summit” featured discussions with leaders including Admiral Michael Rogers, director of the National Security Agency (NSA); and Andrew McCabe, deputy director of the Federal Bureau of Investigation (FBI).

Taking place in MIT’s Kresge Auditorium, the event included a mix of interviews and demos from top government officials, technologists, and “white hat” security hackers, as well as live coverage throughout the day on CNBC.

The summit focused on critical issues in privacy and security. Rogers spoke of the growing threats to public and private companies, including how hacking has changed over time and the importance of law enforcement being able to access criminals’ information.

“The challenge for us is how to access content in a way that will protect [people’s] rights, but still allow us to generate the answers to protect our citizens,” Rogers said.

Throughout the day there were constant reminders of the summit’s timeliness. During an interview with senior national security official John Carlin, CNBC anchor Andrew Ross Sorkin broke the news of an NSA contractor who had just been arrested for allegedly stealing top-secret information — and asked Carlin to chime in with his thoughts.

“There’s been a shift in approach post-9/11,” said Carlin. “Success is not prosecution after the fact. It’s preventing an attack from occurring in the first place. We need to learn and adjust our defenses to ensure that we can prevent the next one, before it happens.”

During a live demonstration of the dark web and ransomware, CSAIL’s Srini Devadas spoke about the complex nature of anonymity.

“The good side of it is that it protects the everyday web user,” said Devadas, a professor of computer science at MIT. “The other part, in places like the dark web, shows the bad side — when bad people target innocent users.”

White-hat security hacker David Kennedy, who had previously penetrated the healthcare.gov website in just four minutes, demonstrated how easy it is to uncover personal information. He took a volunteer from the audience and, using just the volunteer’s full name and hometown, was able to uncover his Social Security number and address, as well as send him a text message that seemed to come from his wife’s phone.

CSAIL’s Daniel Weitzner, founding director of the MIT Internet Policy Research Initiative, spoke passionately about maintaining user privacy in the face of legitimate national security issues.

“Over the last decade, there have been efforts to introduce technology to make surveillance easier, which has put users at risk,” he said. “There will never be perfectly secure systems, but we will always try to close the gaps.”

The summit comes on the heels of several recent cybersecurity efforts at MIT, including last year’s launch of three new initiatives that span multiple labs and departments. The three efforts — Cybersecurity@CSAIL, the Cybersecurity Policy Initiative (CPI), and MIT Sloan’s Interdisciplinary Consortium for Improving Critical Infrastructure Cybersecurity (IC)3 — aim to provide a cross-disciplinary strategy for tackling the complex problem of cybersecurity.

Improving systems like Gmail and Dropbox

Git is an open-source system with a polarizing reputation among programmers. It’s a powerful tool to help developers track changes to code, but many view it as prohibitively difficult to use.

To make it more user-friendly, a team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed “Gitless,” an interface that fixes many of the system’s core problems without fundamentally changing what it does.

“With Gitless we’ve developed a tool that we think is easier to learn and use, but that still keeps the core elements that make Git popular,” says graduate student Santiago Perez De Rosso, who co-wrote a related paper with MIT Professor Daniel Jackson. “What’s particularly encouraging about this work is that it suggests that the same approach might be used to improve the usability of other software systems, such as Dropbox and Google Inbox.”

Gitless was developed, in part, by looking at nearly 2,400 Git-related questions from the popular programming site Stack Overflow. The team then outlined some of Git’s biggest issues, including its concepts of “staging” and “stashing,” and proposed changes aimed at minimizing those problems.

Because Gitless is implemented on top of Git, users can easily switch between the two without having to migrate code from one to the other. Plus, their collaborators don’t even have to know that they aren’t big fans of Git.

Perez De Rosso will present the paper at next month’s ACM SIGPLAN conference on “Systems, Programming, Languages and Applications: Software for Humanity” in Amsterdam.

How it works

Git is what’s called a “version control system.” It allows multiple programmers to track changes to code, including making “branches” of a file that can be worked on individually.

Users make changes and then save (or “commit”) them so that everyone knows who did what. If you and a colleague are on version 10 of a file, and you want to try something new, you can create a separate “branch” while your friend works on the “master.”

Makes sense, right? But things get confusing quickly. One feature of Gitless is that it eliminates “staging,” which lets you save just certain parts of a file. For example, let’s say you have a file with both finished and unfinished changes, and you’d like to commit the finished changes. “Staging” lets you commit those changes while keeping the others as a work-in-progress.

However, having a file with both a staged and working version creates tricky situations. If you stage a file and make more changes that you then commit, the version that’s committed is the one you staged before, not the one you’re working on now.
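To make the pitfall concrete, here is a toy model of a staging area, written as a sketch rather than anything resembling Git’s actual implementation:

# A toy model of the staging behavior described above. This is not how Git is
# implemented; it only mimics the rule that "commit" records the staged
# snapshot rather than the latest working copy.
class ToyRepo:
    def __init__(self):
        self.working = ""    # what is in your editor right now
        self.staged = None   # snapshot taken when you stage (git add)
        self.commits = []    # committed snapshots

    def edit(self, text):
        self.working = text

    def stage(self):
        self.staged = self.working

    def commit(self):
        self.commits.append(self.staged)

repo = ToyRepo()
repo.edit("finished change")
repo.stage()                                 # snapshot the finished work
repo.edit("finished change + half-done experiment")
repo.commit()
print(repo.commits[-1])   # "finished change": the staged version is committed,
                          # not the version currently in the working copy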

Gitless essentially hides the staging area altogether, which makes the process much clearer and less complex for the user. Instead, there’s a much more flexible “commit” command that still allows you to do things like selecting segments of code to commit.

Another concept that Gitless removes is “stashing.” Imagine that you’re in the middle of a project and have to switch to a different branch of it, but don’t yet want to commit your half-done work. Stashing takes the changes you’ve made and saves them on a stack of unfinished changes that you can restore later. (The key difference between stashing and staging is that, with stashing, changes disappear from the working directory.)
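In the same toy-model spirit, stashing can be pictured as a stack of set-aside working copies (again, only a sketch of the behavior described above, not Git’s internals):

# A toy sketch of stashing: unfinished changes are pushed onto a stack and the
# working copy is cleared, so you can switch branches with a clean slate.
# Illustrative only; not Git's actual data structures.
class ToyStash:
    def __init__(self):
        self.working = ""
        self.stack = []

    def stash(self):
        self.stack.append(self.working)   # set the half-done work aside
        self.working = ""                 # the working directory is now clean

    def stash_pop(self):
        self.working = self.stack.pop()   # restore the most recent stash

repo = ToyStash()
repo.working = "half-done feature"
repo.stash()
print(repo.working)   # "" -- clean, safe to switch branches
repo.stash_pop()
print(repo.working)   # "half-done feature" is back; with several stashes on the
                      # stack, remembering which one goes where is the hard part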

“The problem is that, when switching branches, it can be hard to remember which stash goes where,” says Perez De Rosso. “On top of that, stashing doesn’t help if you are in the middle of an action like a merge that involves conflicting files.”

Gitless solves this issue by making branches completely independent from each other. This makes it much easier and less confusing for developers who have to constantly switch between tasks.

Computer system that could help identify subtle speech and language disorders

For children with speech and language disorders, early-childhood intervention can make a great difference in their later academic and social success. But many such children — one study estimates 60 percent — go undiagnosed until kindergarten or even later.

Researchers at the Computer Science and Artificial Intelligence Laboratory at MIT and Massachusetts General Hospital’s Institute of Health Professions hope to change that, with a computer system that can automatically screen young children for speech and language disorders and, potentially, even provide specific diagnoses.

This week, at the Interspeech conference on speech processing, the researchers reported on an initial set of experiments with their system, which yielded promising results. “We’re nowhere near finished with this work,” says John Guttag, the Dugald C. Jackson Professor in Electrical Engineering and senior author on the new paper. “This is sort of a preliminary study. But I think it’s a pretty convincing feasibility study.”

The system analyzes audio recordings of children’s performances on a standardized storytelling test, in which they are presented with a series of images and an accompanying narrative, and then asked to retell the story in their own words.

“The really exciting idea here is to be able to do screening in a fully automated way using very simplistic tools,” Guttag says. “You could imagine the storytelling task being totally done with a tablet or a phone. I think this opens up the possibility of low-cost screening for large numbers of children, and I think that if we could do that, it would be a great boon to society.”

Subtle signals

The researchers evaluated the system’s performance using a standard measure called area under the curve, which describes the tradeoff between exhaustively identifying members of a population who have a particular disorder, and limiting false positives. (Modifying the system to limit false positives generally results in limiting true positives, too.) In the medical literature, a diagnostic test with an area under the curve of about 0.7 is generally considered accurate enough to be useful; on three distinct clinically useful tasks, the researchers’ system ranged between 0.74 and 0.86.
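For readers unfamiliar with the metric, area under the curve can be computed directly from a model’s scores and the true labels. The following sketch uses scikit-learn with invented numbers, purely to show what the number measures; it is not the study’s data:

# Area under the ROC curve (AUC) for a set of made-up predictions.
# The labels and scores below are invented for illustration, not study data.
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 0, 1, 1, 1, 0, 1]                      # 1 = has the disorder
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.3]    # model's predicted risk

print(round(roc_auc_score(y_true, y_score), 2))   # 0.5 is chance, 1.0 is perfect
# Raising the decision threshold catches fewer true cases but produces fewer
# false positives; the curve summarizes that tradeoff across all thresholds.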

To build the new system, Guttag and Jen Gong, a graduate student in electrical engineering and computer science and first author on the new paper, used machine learning, in which a computer searches large sets of training data for patterns that correspond to particular classifications — in this case, diagnoses of speech and language disorders.
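That pattern-finding step can be pictured as fitting a standard classifier to a feature vector extracted from each recording, paired with a clinical label. The sketch below uses synthetic data and imaginary feature names; it is not the feature set or model from the paper:

# Sketch of the supervised-learning setup: a feature vector per recording,
# paired with a clinical label, used to fit and evaluate a classifier.
# The data are synthetic and the feature names imaginary.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # e.g. pause length, speaking rate, pitch variance
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print("held-out AUC:", round(roc_auc_score(y_test, scores), 2))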