January 20, 2011

The Nuage Project: Big Data Analytics in the Cloud

Filed under: Projects — magda balazinska @ 1:28 am

In our database group, we have several ongoing projects that cover a wide range of data management topics from theory to systems.

The Nuage project (http://nuage.cs.washington.edu/) is one of our big projects right now. In this project, we are developing new data management systems and techniques for handling large volumes of data using cloud-computing environments, with a special emphasis on scientific applications.

There are two reasons why we are focusing on scientific applications. The first reason is that scientists today are able to generate data at an unprecedented scale and rate: lab techniques are becoming high-throughput; remote sensing deployments are more pervasive and use higher-resolution sensors than ever before; and simulations on high-performance computing (HPC) platforms significantly expand the resolution of spatial and temporal events. As a result, science is becoming a data management problem. The second reason is that we have a lot of great scientists on the University of Washington campus who are facing this data deluge first hand and are happy to talk to us about the challenges that they are facing and give us access to their queries and data.

So what problems are we tackling exactly?

In the Nuage project, we are looking at the problem of helping domain experts rather than computer scientists to more easily analyze large-scale datasets. There are several challenges associated with this goal and we started to look at the following two:

First, we find that expressing various analysis tasks on parallel data processing systems such as MapReduce is only half the challenge for a data analyst. Getting high-performance from such systems is another great hurdle. We did the exercise ourselves and converted a clustering algorithm used in astronomy into both Pig Latin (running on top of Hadoop) and DryadLINQ (running on top of Dryad). Our first attempt resulted in a terrible runtime of 20 hours for a 40 GB dataset. With extra work, we got it down to about 1 hour, but concluded that we couldn’t ask users to spend a few weeks tuning their queries each time they wanted to ask a question on their data! In this case, the initial slow performance was due to skew caused by the clustering algorithm. In response to this challenge, we developed a system called SkewReduce that automatically partitions a dataset based on user-provided cost functions to avoid skew problems. SkewReduce automatically achieves the fast 1-hour runtime! For more details, we invite you to read our SSDBM 2010 paper and our SOCC 2010 paper on this topic.

In the context of helping users efficiently execute analysis tasks, we have also developed HaLoop, a system that efficiently runs iterative applications on top of Hadoop. The system modifies Hadoop’s scheduler and adds a variety of caches to Hadoop that together minimize the data shuffled between machines. The savings come from avoiding shuffling any data that remains invariant between consecutive iterations, easily cutting runtimes in half. This work appeared in VLDB 2010 (PVLDB vol 3).

Second, we want to help users better understand the performance they are getting from these systems. For this, as a first step, we have developed a time-remaining progress indicator for analysis tasks in the form of MapReduce DAGs. Our indicator is significantly more accurate than previously developed indicators. More details are available in our ICDE 2010 paper and our SIGMOD 2010 paper.

Overall, the area of big data analytics, scientific data management, and cloud computing is full of exciting challenges that we tackle in this project. Please visit our website regularly for updates.


Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: