January 20, 2011

The Nuage Project: Big Data Analytics in the Cloud

Filed under: Projects — magda balazinska @ 1:28 am

In our database group, we have several ongoing projects that cover a wide range of data management topics from theory to systems.

The Nuage project (http://nuage.cs.washington.edu/) is one of our big projects right now. In this project, we are developing new data management systems and techniques for handling large volumes of data using cloud-computing environments, with a special emphasis on scientific applications.

There are two reasons why we are focusing on scientific applications. The first reason is that scientists today are able to generate data at an unprecedented scale and rate: lab techniques are becoming high-throughput; remote sensing deployments are more pervasive and use higher-resolution sensors than ever before; and simulations on high-performance computing (HPC) platforms significantly expand the resolution of spatial and temporal events. As a result, science is becoming a data management problem. The second reason is that we have a lot of great scientists on the University of Washington campus who are experiencing this data deluge firsthand and are happy to talk to us about the challenges they face and to give us access to their queries and data.

So what problems are we tackling exactly?

In the Nuage project, we are looking at how to help domain experts, rather than computer scientists, analyze large-scale datasets more easily. This goal raises several challenges, and we have started with the following two:

First, we find that expressing various analysis tasks on parallel data processing systems such as MapReduce is only half the challenge for a data analyst. Getting high performance from such systems is another great hurdle. We did the exercise ourselves and translated a clustering algorithm used in astronomy into both Pig Latin (running on top of Hadoop) and DryadLINQ (running on top of Dryad). Our first attempt resulted in a terrible runtime of 20 hours for a 40 GB dataset. With extra work, we got it down to about 1 hour, but concluded that we couldn’t ask users to spend a few weeks tuning their queries each time they wanted to ask a question of their data! In this case, the initial slow performance was due to skew caused by the clustering algorithm. In response to this challenge, we developed a system called SkewReduce that automatically partitions a dataset based on user-provided cost functions to avoid skew problems. SkewReduce automatically achieves the fast 1-hour runtime! For more details, we invite you to read our SSDBM 2010 paper and our SoCC 2010 paper on this topic.
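
To give a feel for what partitioning with a user-provided cost function can look like, here is a minimal Python sketch. It is an illustration only and not SkewReduce's actual algorithm: the names `quadratic_cost`, `cost_based_partition`, and `equal_width_partition`, the one-dimensional data, and the quadratic cost model are all assumptions made for this example. The sketch repeatedly splits whichever partition the cost function rates as most expensive, and compares the result with naive equal-width partitioning on skewed data.

```python
import heapq
from typing import Callable, List, Sequence


def quadratic_cost(points: Sequence[float]) -> float:
    """Hypothetical cost model: work grows with the square of the point count,
    as it might for a clustering step that compares nearby points pairwise."""
    return float(len(points) ** 2)


def cost_based_partition(
    points: Sequence[float],
    cost_fn: Callable[[Sequence[float]], float],
    num_partitions: int,
) -> List[List[float]]:
    """Split the costliest partition at its median until num_partitions exist."""
    ordered = sorted(points)
    # heapq is a min-heap, so costs are negated to pop the most expensive
    # partition first. The counter breaks ties without comparing lists.
    heap = [(-cost_fn(ordered), 0, ordered)]
    counter = 1
    while len(heap) < num_partitions:
        neg_cost, tie, part = heapq.heappop(heap)
        if len(part) < 2:
            # The costliest partition cannot be split further; stop early.
            heapq.heappush(heap, (neg_cost, tie, part))
            break
        mid = len(part) // 2
        for half in (part[:mid], part[mid:]):
            heapq.heappush(heap, (-cost_fn(half), counter, half))
            counter += 1
    return [part for _, _, part in heap]


def equal_width_partition(points: Sequence[float], num_partitions: int) -> List[List[float]]:
    """Naive baseline: cut the value range into equally wide intervals."""
    lo, hi = min(points), max(points)
    width = (hi - lo) / num_partitions or 1.0
    parts: List[List[float]] = [[] for _ in range(num_partitions)]
    for p in points:
        index = min(int((p - lo) / width), num_partitions - 1)
        parts[index].append(p)
    return parts


# Skewed toy data: 90 points packed into [0, 0.9) and 10 spread over [1, 9.1].
points = [i / 100 for i in range(90)] + [1 + 0.9 * i for i in range(10)]
naive = equal_width_partition(points, 4)
balanced = cost_based_partition(points, quadratic_cost, 4)
print("naive max cost:   ", max(quadratic_cost(p) for p in naive))
print("balanced max cost:", max(quadratic_cost(p) for p in balanced))
```

In a parallel run, the slowest partition determines the overall runtime, so the maximum per-partition cost is the number to watch. The equal-width baseline leaves almost all of the dense region in a single partition, while the cost-driven splits spread it out.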

In the context of helping users efficiently execute analysis tasks, we have also developed HaLoop, a system that efficiently runs iterative applications on top of Hadoop. The system modifies Hadoop’s scheduler and adds a variety of caches to Hadoop that together minimize the data shuffled between machines. The savings come from avoiding shuffling any data that remains invariant between consecutive iterations, easily cutting runtimes in half. This work appeared in VLDB 2010 (PVLDB vol 3).
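
The effect of caching loop-invariant data can be illustrated with a small, single-process Python sketch. This is a toy model under stated assumptions, not HaLoop's API or implementation: the functions `partition_edges` and `reachable`, the `cache_invariant` flag, and the record counter are all hypothetical. The example computes the vertices reachable from a source by repeatedly joining a changing frontier with an edge list that never changes, and counts how many records would be shuffled with and without a cache for the edges.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def partition_edges(edges: List[Edge], num_partitions: int) -> List[Dict[int, List[int]]]:
    """Toy stand-in for a shuffle: group edges by source vertex into partitions."""
    parts: List[Dict[int, List[int]]] = [defaultdict(list) for _ in range(num_partitions)]
    for src, dst in edges:
        parts[src % num_partitions][src].append(dst)
    return parts


def reachable(
    edges: List[Edge],
    source: int,
    num_partitions: int = 4,
    cache_invariant: bool = True,
) -> Tuple[Set[int], int]:
    """Return the vertices reachable from source and the number of records shuffled."""
    shuffled = 0
    edge_parts = None
    visited = {source}
    frontier = {source}
    while frontier:
        if edge_parts is None or not cache_invariant:
            # The edge list never changes, so with caching it is shuffled once.
            edge_parts = partition_edges(edges, num_partitions)
            shuffled += len(edges)
        # The frontier changes every iteration, so it is always shuffled.
        shuffled += len(frontier)
        next_frontier = set()
        for v in frontier:
            for dst in edge_parts[v % num_partitions].get(v, []):
                if dst not in visited:
                    visited.add(dst)
                    next_frontier.add(dst)
        frontier = next_frontier
    return visited, shuffled


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6)]
with_cache = reachable(edges, 0, cache_invariant=True)
without_cache = reachable(edges, 0, cache_invariant=False)
print("shuffled with cache:   ", with_cache[1])
print("shuffled without cache:", without_cache[1])
```

Both runs return the same set of vertices; only the amount of data moved differs. The gap widens with the number of iterations and with the size of the invariant input relative to the data that changes.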

Second, we want to help users better understand the performance they are getting from these systems. As a first step, we have developed a time-remaining progress indicator for analysis tasks that run as DAGs of MapReduce jobs. Our indicator is significantly more accurate than previously developed indicators. More details are available in our ICDE 2010 paper and our SIGMOD 2010 paper.

Overall, the areas of big data analytics, scientific data management, and cloud computing are full of exciting challenges that we tackle in this project. Please visit our website regularly for updates.


January 7, 2011

UWDB Website

Filed under: Uncategorized — magda balazinska @ 1:13 pm

To learn more about our group, please visit our website: http://db.cs.washington.edu/

UWDB Blog is Finally Born

Filed under: Uncategorized — uwdb @ 12:16 pm

Welcome to the UWDB Blog!

This is the blog of the University of Washington Database Group.

Follow our blog to stay in touch with our recent work, events, and more.

