August 12, 2013

Analysis of Hadoop Workloads

Filed under: Uncategorized — magda balazinska @ 10:45 pm

Hadoop, the open-source implementation of Google’s MapReduce, has
become a commonly used tool for Big Data analytics. Due to Hadoop’s
popularity, it is natural to ask the question: How well does Hadoop
actually work? Many papers partly answer this question either by
performing direct comparisons to alternate tools or by carrying out
measurement studies of production clusters.  Prior studies, however,
analyze primarily the performance of individual queries and the
efficiency of entire Hadoop clusters, focusing on performance metrics
at the cluster or application levels such as job resource utilization.

In the following paper that will appear at VLDB 2013 in August this year:

Hadoop’s Adolescence: An analysis of Hadoop usage in scientific workloads
Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe

we provide a complementary answer to the question of how well Hadoop
works. We study what users actually do with their Hadoop system: we
study behaviors and application patterns from a user-centric
perspective.  The users that we focus on are called “data
scientists” in the sense that they have large datasets and need to
process them to extract information. The goal of the study is to
assess how well Hadoop works for data scientists in terms of what they
need to do to write Hadoop applications, execute them, tune them, and
use them to extract knowledge from their data.

Our analysis is based on Hadoop workloads collected over periods of
five to twenty months in three different clusters. Our traces comprise a
total of more than 100,000 Hadoop jobs. The clusters that we study
come from academic institutions.  Our data scientists are 113 domain
experts from various disciplines including computational astrophysics,
computational biology, computational neurolinguistics, information
retrieval and information classification, machine learning from the
contents of the web, natural language processing, image and video
analysis, and others. The traces were collected between 2009 and 2012.

We study three aspects of how users leverage their Hadoop clusters:
* Application Workload: We first analyze how users author Hadoop
applications, the resulting structure of these applications, and how
users interact with their Hadoop jobs over time.

* Tunings: We then study when and how users change configuration
parameters and otherwise tune their Hadoop jobs.

* Resource Usage and Sharing: Finally, we analyze the diverse, aggregate
resource-consumption profiles of users.

The following are some interesting findings that emerged from this study:

1) Application Workload

* In the 2009 to 2012 time-period, the majority of MapReduce
applications in the three clusters were written directly on top of the
raw MapReduce API rather than through more advanced tools such as Pig
Latin or Scoobi. The use of streaming, which allows users to specify
map and reduce functions in any programming language, was also common.
There are several reasons for the lack of uptake of more advanced
tools. First, these tools are still evolving even today. We talked to
some of the users recently, and several considered the more advanced tools
cumbersome or hard to use. Others complained that high-level
tools that fail in the middle of a complex graph of jobs make it
difficult to restart and resume from the location where the failure
occurred. Second, legacy code (e.g., Unix commands and machine
learning packages such as libsvm) and language familiarity (e.g.,
python, perl, and ruby) also play an important role as shown by the
prevalence of Streaming. Third, users indicated that certain uses of
Hadoop do not focus on the MapReduce model but rather use Hadoop
simply as a task scheduler.
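To make the Streaming contract concrete: a streaming job is just a pair of programs that read lines on stdin and write tab-separated key/value lines on stdout, with Hadoop sorting the map output by key between the two phases. The sketch below simulates that contract in Python for word count; it is a hypothetical illustration of the interface, not code from the study.

```python
# A minimal sketch of Hadoop Streaming-style map and reduce scripts for
# word count. Hadoop runs each script as a separate process, feeding it
# lines on stdin; between the phases it sorts the mapper output by key.
import sys
from itertools import groupby


def run_mapper(lines):
    """Emit one "word\t1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def run_reducer(sorted_records):
    """Sum counts per word from key-sorted "word\tcount" records."""
    keyed = (rec.split("\t") for rec in sorted_records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Simulate the shuffle that Hadoop performs between map and reduce.
    mapped = sorted(run_mapper(["big data big", "data tools"]))
    for record in run_reducer(mapped):
        print(record)
```

In a real deployment, Hadoop would launch the two scripts via the streaming jar, one process per map or reduce task, which is why legacy Unix tools and scripts in Python, Perl, or Ruby slot in so easily.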

* The majority of applications take the form of a single
MapReduce job (46% to 78%). However, a sizeable number of
applications are chains of jobs, directed acyclic graphs
of jobs (for example, applications that join two datasets or that
run multi-stage machine learning algorithms), or even iterative
applications (such as k-means clustering).
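Iterative applications like k-means fit this pattern as a driver program that launches one MapReduce job per iteration until a fixpoint is reached. A minimal in-memory sketch of that structure (1-D points for brevity; illustrative only, not the users' actual code):

```python
# Illustrative sketch: k-means expressed as an iterative chain of
# map/reduce rounds, the pattern behind the "iterative applications"
# observed in the traces. Each loop iteration is one logical MapReduce job.

def kmeans(points, centroids, max_rounds=10):
    for _ in range(max_rounds):
        # Map: assign each point to its nearest centroid (1-D for brevity).
        assignments = {}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p - centroids[i]) ** 2)
            assignments.setdefault(nearest, []).append(p)
        # Reduce: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for i in range(len(centroids)):
            ps = assignments.get(i, [])
            new_centroids.append(sum(ps) / len(ps) if ps else centroids[i])
        if new_centroids == centroids:   # fixpoint: stop launching jobs
            break
        centroids = new_centroids
    return centroids
```

Because plain Hadoop restarts each of these jobs from scratch and reshuffles the (invariant) points every round, such workloads are exactly where iterative extensions pay off.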

* We find that 50% to 89% of all distinct applications were only
executed with one input dataset. Those applications are one-off ad-hoc
queries (or one-off ad-hoc data transformations). Interestingly, we
find that a large fraction of these ad-hoc queries are written using
the streaming interface: 58% to 91%. At the same time, repetitive
applications used to process multiple datasets form the bulk
of all jobs executed in the clusters (40% to 90%).

* We see evidence that users leverage Hadoop both for batch-oriented
data processing and interactive data processing where an ensemble of
MapReduce jobs produces a result and the next ensemble is launched by
the same user 10 seconds to 15 minutes later.

2) Tunings

* We find that more than 30% of jobs experience imbalance in the
amount of work assigned to each task either in the map or in the
reduce phase.

* We find that a noticeable fraction of users customize their jobs for
performance (mostly using combiners) and correctness (such as changing
the input/output formats).  However, most users are reluctant to use
the manual tuning features provided by Hadoop to solve load imbalance problems.
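For readers unfamiliar with combiners: a combiner applies the reducer's aggregation locally to each mapper's output before the shuffle, shrinking the data sent over the network, which is why it is the most accessible performance knob. A small in-memory sketch of the effect (illustrative, not from the study):

```python
# Sketch of what a combiner buys in word count: local pre-aggregation of
# one mapper's output cuts the number of records that must be shuffled
# across the network before the reduce phase.
from collections import Counter


def map_phase(lines):
    """Raw map output: one (word, 1) pair per occurrence."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Local pre-aggregation on a single mapper's output; same summing
    logic as the reducer, but applied before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


if __name__ == "__main__":
    local = map_phase(["big data big data", "big big big"])
    combined = combine(local)
    print(len(local), "records without combiner;",
          len(combined), "records with combiner")
```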

* For Hadoop configuration parameters, users mostly keep the default values,
except for the parameters related to failures. Many users explicitly tuned these options
in response to poor failure behavior of their jobs, such as “out of memory” errors.
In contrast, users rarely tune parameters related to performance, either
because their performance requirements were generally being met
or because these parameters are more difficult to understand and manipulate.

3) Resource Usage and Sharing

* Users do leverage Hadoop to execute long-duration jobs that process
massive-scale datasets (hundreds of GB to a few TB). However, users also
try to use their Hadoop hammer to process short jobs over small
datasets. In fact, short jobs over small datasets dominate the
workload, even though Hadoop is known to be a suboptimal tool for such
workloads. Interestingly, some users indicated that they often break
their large and long-running jobs into smaller ones as a way to
checkpoint intermediate results and ensure that less work is lost when
an entire job fails.

* In aggregate, users process significant amounts of data each month
in spite of the fact that most of their jobs are short-lived. We find
that the median user processes hundreds of GB of data to nearly 1 TB
of data in a month. Users, however, have very different resource
consumption profiles. For example, the differences between the top
users and the bottom users can vary from 1,000X to 100,000X across
months and clusters.

* We found that only 1% of datasets are shared across users.

* Caching can effectively reduce the disk I/O requests wasted on
accesses to small files. For the OpenCloud and M45 workloads, 69% of data
read accesses would hit the cache even when using only 512 MB memory
at each node.
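An estimate like this can be obtained by replaying the file-access trace through a simulated per-node cache. The sketch below assumes an LRU eviction policy purely for illustration; the paper's exact simulation setup may differ.

```python
# Illustrative LRU cache simulation of the kind used to estimate how many
# read accesses to small files a per-node cache could absorb. The policy
# and capacity here are assumptions for the sketch, not the paper's setup.
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses served from an LRU cache of `capacity` files."""
    cache = OrderedDict()
    hits = 0
    for f in accesses:
        if f in cache:
            hits += 1
            cache.move_to_end(f)           # mark as most recently used
        else:
            cache[f] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)
```

Sweeping `capacity` over a real trace yields curves like the 69%-at-512-MB figure quoted above.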

Based on our findings, we identify important requirements for the next generation of
Big Data systems that want to go beyond what Hadoop offers today:

* First and most important, we find that Big Data analysis comes in a large variety of forms: A good fraction of applications follow the batch processing form for which MapReduce was designed. The needs, however, extend significantly beyond this pattern:
many jobs are iterative with short-duration inner loops, some applications require complex combinations of jobs to produce a desired result, and still others are embarrassingly parallel and can be spread almost arbitrarily across cluster resources. The next-generation Big Data systems must thus cater to this extensive variety,
rather than optimize for a single type of application, as some systems do today.

* Debugging and tuning Big-Data analysis applications is a great challenge.
While important research exists, from automatic configuration tuning to
correctness and performance debugging, these tools still need to make their way into widespread use. An interesting observation is that users are willing to work with the engine to overcome performance challenges and avoid failures, but the complexity of the settings is overwhelming. Hadoop users urgently need informative monitoring tools and novel interactive debugging tools.

For more details, we invite you to read our paper, available here:

August 2, 2013

Myria: Big Data as a Service

Filed under: Uncategorized — uwdb @ 3:06 pm

Magda’s post, Myria: Big Data as a Service, about our group’s Myria Project, is now up on the Intel Science and Technology Center for Big Data Blog!

July 17, 2013

PODS 2013

Filed under: Uncategorized — paris koutris @ 9:49 am

It has been a long time since we last posted on the UWDB blog. However, coming back from SIGMOD/PODS 2013 in New York is a good opportunity to start posting again.

Our group participated at PODS with the paper “Communication Steps for Parallel Query Processing” by Paul Beame, Paris Koutris and Dan Suciu. Here is the link to the PODS paper and here is the link to the full version.

In this paper, we study the problem of computing a relational query q on a large input database of size n, using a large number p of servers. In our model, the computation is performed in synchronous rounds, where each round consists of some local computation followed by global communication among all servers. Further, we restrict the amount of data each server can receive during the computation: each server can receive only O(n/p^(1-ε)) bits, where ε is a parameter in [0,1) that controls replication and that we call the space exponent. For example, if ε were 1, one could send all the data to a single server and compute the result locally in one round. At the other extreme, if ε=0, each server receives O(n/p) data, so the data is evenly distributed among the servers. The restriction on communication/replication creates a trade-off between the number of rounds (r) and the space exponent (ε), and computing this trade-off for conjunctive queries is the main subject of our work.

For a single round, r=1, we show matching lower and upper bounds. We prove that a conjunctive query q cannot be computed with space exponent ε < 1-1/τ*(q), where τ*(q) is the fractional vertex cover number of the hypergraph of query q. Our lower bounds are particularly strong, since we make no assumption about the way information is exchanged or about the computational capabilities of the servers. Additionally, we present a one-round algorithm that matches the lower bound, i.e., works with space exponent ε >= 1-1/τ*(q), for a specific class of skew-free databases. As an application of our one-round result, consider the “triangle” query Q(x,y,z) = R(x,y), S(y,z), T(z,x). The hypergraph of Q has a fractional vertex cover of size 3/2, so the triangle query needs space exponent at least ε=1/3 to be computed. For multiple rounds of communication, we present lower bounds in a weaker model where the routing decisions for data are tuple-based. The lower bounds for multiple rounds are, to the best of our knowledge, the first of their kind. Interestingly, we also obtain as a corollary that the transitive closure of a graph needs a non-constant number of rounds to be computed, for any space exponent.
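To make the one-round idea concrete for the triangle query: with p = k^3 servers arranged in a k x k x k cube, each relation is hashed on its two attributes and replicated along the remaining dimension, so every tuple reaches k servers, matching the ε = 1/3 replication bound. The Python sketch below simulates this routing on skew-free data; it is an illustrative simplification, not the paper's formal algorithm.

```python
# Sketch of a one-round cube algorithm for the triangle query
# Q(x,y,z) = R(x,y), S(y,z), T(z,x): arrange p = k^3 servers in a
# k x k x k cube; R(x,y) goes to servers (h(x), h(y), *),
# S(y,z) to (*, h(y), h(z)), and T(z,x) to (h(x), *, h(z)).
# Every triangle then meets at exactly one server and is found locally.
from itertools import product


def triangles_one_round(R, S, T, k):
    """Return all (x, y, z) with R(x,y), S(y,z), T(z,x), using k**3 servers."""
    h = lambda v: hash(v) % k
    servers = {c: ([], [], []) for c in product(range(k), repeat=3)}
    for x, y in R:                       # replicate R along the z dimension
        for zc in range(k):
            servers[(h(x), h(y), zc)][0].append((x, y))
    for y, z in S:                       # replicate S along the x dimension
        for xc in range(k):
            servers[(xc, h(y), h(z))][1].append((y, z))
    for z, x in T:                       # replicate T along the y dimension
        for yc in range(k):
            servers[(h(x), yc, h(z))][2].append((z, x))
    result = set()
    for r_frag, s_frag, t_frag in servers.values():
        t_set = set(t_frag)              # purely local join at each server
        for x, y in r_frag:
            for y2, z in s_frag:
                if y2 == y and (z, x) in t_set:
                    result.add((x, y, z))
    return result
```

Each tuple is copied to k = p^(1/3) servers, which is exactly the replication that the ε = 1/3 lower bound says is unavoidable in one round.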

There are many exciting questions that remain open. Can we show lower bounds for multiple rounds when there are no assumptions about the kind of communication? What happens in the presence of data skew? And what can we say about the space exponent for problems other than relational queries?

September 7, 2011

VLDB Luck Continues

Filed under: Uncategorized — magda balazinska @ 12:02 am

We just got back from VLDB 2011. It was not a long trip since the conference was in downtown Seattle this year.

Our UWDB group had a lot of luck at the conference this year. In addition to Jayant’s and Phil’s 10-year best paper award, our group had 2 regular research papers at the conference and 3 of the 10 papers on the new and fun “Challenges and Visions” track. There were 41 submissions, so it is very nice that our group managed to get 3 papers accepted.

Additionally, one of our papers, “Data Markets in the Cloud: An Opportunity for the Database Community” by Balazinska, Howe, and Suciu, won 2nd place in the Best Paper Competition on this track! This vision paper discusses interesting research questions related to building organized data markets where buyers and sellers come together to buy and sell structured data. Microsoft recently started such a market, called the Windows Azure Marketplace DataMarket.

Congrats to the whole group!




April 14, 2011

VLDB 10-year Best Paper Award!

Filed under: Uncategorized — magda balazinska @ 12:32 pm

I’m delighted to share the news that Jayant Madhavan (a UW CSE Ph.D. alum currently at Google) and Phil Bernstein (a researcher at MSR and also affiliate faculty and a member of our DB group), along with their co-author Erhard Rahm, received the VLDB 2011 10-year Best Paper Award for their VLDB 2001 paper “Generic Schema Matching with Cupid”! (This award is given to the paper from the conference 10 years earlier that had the most impact.)

April 12, 2011

PODS 2011

Filed under: Publication News — paris koutris @ 12:34 am

Apart from SIGMOD 2011, our group will also participate in PODS 2011 with the following paper:

Parallel Evaluation of Conjunctive Queries (Paraschos Koutris and Dan Suciu)

The availability of large data centers with tens of thousands of servers has led to the popular adoption of massive parallelism for data analysis on large datasets. Several query languages exist for running queries on massively parallel architectures, some based on the MapReduce infrastructure, others using proprietary implementations. Motivated by this trend, we analyze the parallel complexity of conjunctive queries. We propose a very simple model of parallel computation that captures these architectures, in which the complexity parameter is the number of parallel steps requiring synchronization of all servers. We study the complexity of conjunctive queries and give a complete characterization of the queries which can be computed in one parallel step. These form a strict subset of hierarchical queries, and include flat queries like R(x,y),S(x,z),T(x,v),U(x,w), tall queries like R(x),S(x,y),T(x,y,z),U(x,y,z,w), and combinations thereof, which we call tall-flat queries. We describe an algorithm for computing in parallel any tall-flat query, and prove that any query that is not tall-flat cannot be computed in one step in this model. Finally, we present extensions of our results to queries that are not tall-flat.
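The intuition for why flat queries are computable in one parallel step: all atoms share the root variable x, so hash-partitioning every relation on x sends all joinable tuples to the same server, after which the evaluation is purely local. A small simulation for the two-atom flat query Q(x,y,z) = R(x,y), S(x,z) (an illustrative sketch, not the paper's formal algorithm):

```python
# Sketch of one-step parallel evaluation of the flat query
# Q(x,y,z) = R(x,y), S(x,z): one communication step hash-partitions
# every relation on the shared variable x; afterwards each server can
# join its fragments with no further synchronization.

def one_step_join(R, S, p):
    """Evaluate Q(x,y,z) = R(x,y), S(x,z) on p servers in one step."""
    frag_R = {i: [] for i in range(p)}
    frag_S = {i: [] for i in range(p)}
    for x, y in R:                   # communication step: route to server h(x)
        frag_R[hash(x) % p].append((x, y))
    for x, z in S:
        frag_S[hash(x) % p].append((x, z))
    result = set()
    for i in range(p):               # purely local joins; no more rounds
        for x, y in frag_R[i]:
            for x2, z in frag_S[i]:
                if x == x2:
                    result.add((x, y, z))
    return result
```

A query that is not tall-flat has no single variable shared by all atoms in this way, which is the obstacle the paper's one-step impossibility result formalizes.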

January 20, 2011

The Nuage Project: Big Data Analytics in the Cloud

Filed under: Projects — magda balazinska @ 1:28 am

In our database group, we have several ongoing projects that cover a wide range of data management topics from theory to systems.

The Nuage project (http://nuage.cs.washington.edu/) is one of our big projects right now. In this project, we are developing new data management systems and techniques for handling large volumes of data using cloud-computing environments, with a special emphasis on scientific applications.

There are two reasons why we are focusing on scientific applications. The first reason is that scientists today are able to generate data at an unprecedented scale and rate: lab techniques are becoming high-throughput; remote sensing deployments are more pervasive and use higher-resolution sensors than ever before; and simulations on high-performance computing (HPC) platforms significantly expand the resolution of spatial and temporal events. As a result, science is becoming a data management problem. The second reason is that we have a lot of great scientists on the University of Washington campus who are facing this data deluge first hand and are happy to talk to us about the challenges that they are facing and give us access to their queries and data.

So what problems are we tackling exactly?

In the Nuage project, we are looking at the problem of helping domain experts rather than computer scientists to more easily analyze large-scale datasets. There are several challenges associated with this goal and we started to look at the following two:

First, we find that expressing various analysis tasks on parallel data processing systems such as MapReduce is only half the challenge for a data analyst. Getting high-performance from such systems is another great hurdle. We did the exercise ourselves and converted a clustering algorithm used in astronomy into both Pig Latin (running on top of Hadoop) and DryadLINQ (running on top of Dryad). Our first attempt resulted in a terrible runtime of 20 hours for a 40 GB dataset. With extra work, we got it down to about 1 hour, but concluded that we couldn’t ask users to spend a few weeks tuning their queries each time they wanted to ask a question on their data! In this case, the initial slow performance was due to skew caused by the clustering algorithm. In response to this challenge, we developed a system called SkewReduce that automatically partitions a dataset based on user-provided cost functions to avoid skew problems. SkewReduce automatically achieves the fast 1-hour runtime! For more details, we invite you to read our SSDBM 2010 paper and our SOCC 2010 paper on this topic.
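The core idea behind SkewReduce can be conveyed with a toy stand-in: instead of giving every partition the same number of records, use the user-supplied cost function to balance the estimated cost per partition. The real system partitions the data spatially using a sample; the greedy sketch below is only an illustration of the principle.

```python
# Illustrative sketch of cost-based partitioning in the spirit of
# SkewReduce: balance the user-estimated *cost* per partition rather than
# the record count. This greedy bin-packing stand-in is not the actual
# system, which partitions the data spatially using a sample.

def cost_balanced_partitions(items, cost_fn, num_partitions):
    """Greedily assign items (costliest first) to the lightest partition."""
    partitions = [[] for _ in range(num_partitions)]
    loads = [0.0] * num_partitions
    for item in sorted(items, key=cost_fn, reverse=True):
        lightest = loads.index(min(loads))
        partitions[lightest].append(item)
        loads[lightest] += cost_fn(item)
    return partitions, loads
```

With a skewed cost function, equal-size partitions can differ wildly in runtime; balancing estimated cost instead is what removed the 20x slowdown we observed.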

In the context of helping users efficiently execute analysis tasks, we have also developed HaLoop, a system that efficiently runs iterative applications on top of Hadoop. The system modifies Hadoop’s scheduler and adds a variety of caches to Hadoop that together minimize the data shuffled between machines. The savings come from avoiding shuffling any data that remains invariant between consecutive iterations, easily cutting runtimes in half. This work appeared in VLDB 2010 (PVLDB vol 3).

Second, we want to help users better understand the performance they are getting from these systems. For this, as a first step, we have developed a time-remaining progress indicator for analysis tasks in the form of MapReduce DAGs. Our indicator is significantly more accurate than previously developed indicators. More details are available in our ICDE 2010 paper and our SIGMOD 2010 paper.

Overall, the area of big data analytics, scientific data management, and cloud computing is full of exciting challenges that we tackle in this project. Please visit our website regularly for updates.

January 7, 2011

UWDB Website

Filed under: Uncategorized — magda balazinska @ 1:13 pm

To learn more about our group, please visit our website: http://db.cs.washington.edu/

UWDB Blog is Finally Born

Filed under: Uncategorized — uwdb @ 12:16 pm

Welcome to the UWDB Blog!

This is the blog of the University of Washington Database Group.

Follow our blog to stay in touch with our recent work, events, and more.

