uwdb

August 12, 2013

Analysis of Hadoop Workloads

Filed under: Uncategorized — magda balazinska @ 10:45 pm

Hadoop, the open-source implementation of Google’s MapReduce, has
become a commonly used tool for Big Data analytics. Due to Hadoop’s
popularity, it is natural to ask the question: How well does Hadoop
actually work? Many papers partly answer this question either by
performing direct comparisons to alternate tools or by carrying out
measurement studies of production clusters.  Prior studies, however,
analyze primarily the performance of individual queries and the
efficiency of entire Hadoop clusters, focusing on performance metrics
at the cluster or application levels such as job resource utilization.

In the following paper that will appear at VLDB 2013 in August this year:

Hadoop’s Adolescence: An analysis of Hadoop usage in scientific workloads
Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe

we provide a complementary answer to the question of how well Hadoop
works. We study what users actually do with their Hadoop system: we
study behaviors and application patterns from a user-centric
perspective.  The users that we focus on are called “data
scientists” in the sense that they have large datasets and need to
process them to extract information. The goal of the study is to
assess how well Hadoop works for data scientists in terms of what they
need to do to write Hadoop applications, execute them, tune them, and
use them to extract knowledge from their data.

Our analysis is based on Hadoop workloads collected over periods of
five to 20 months in three different clusters. Our traces comprise a
total of more than 100,000 Hadoop jobs. The clusters that we study
come from academic institutions.  Our data scientists are 113 domain
experts from various disciplines including computational astrophysics,
computational biology, computational neurolinguistics, information
retrieval and information classification, machine learning from the
contents of the web, natural language processing, image and video
analysis, and others. The traces were collected between 2009 and 2012.

We study three aspects of how users leverage their Hadoop clusters:
* Application Workload: We first analyze how users author Hadoop
applications, the resulting structure of these applications, and how
users interact with their Hadoop jobs over time.

* Tunings: We then study when and how users change configuration
  parameters and otherwise tune their Hadoop jobs.

* Resource Usage and Sharing: Finally, we analyze the diverse, aggregate
 resource-consumption profiles of users.

The following are some interesting findings that emerged from this
study:

1) Application Workload

* In the 2009 to 2012 time-period, the majority of MapReduce
applications in the three clusters were written directly on top of the
raw MapReduce API rather than through more advanced tools such as Pig
Latin or Scoobi. The use of streaming, which allows users to specify
map and reduce functions in any programming language was also common.
There are several reasons to the lack of uptake of more advanced
tools. First, these tools are still evolving even today.  We talked to
some of the users recently and several considered more advanced tools
cumbersome to use or hard to use. Others complain that high-level
tools that fail in the middle of a complex graph of jobs make it
difficult to restart and resume from the location where the failure
occurred. Second, legacy code (e.g., unix commands and machine
learning packages such as libsvm) and language familiarity (e.g.,
python, perl, and ruby) also play an important role as shown by the
prevalence of Streaming. Third, users indicated that certain uses of
Hadoop do not focus on the MapReduce model but rather use Hadoop
simply as a task scheduler.

* The majority of applications take the form of a single
MapReduce job (46% to 78%). However, a sizeable number of
applications are either chains of jobs, directed acyclic graphs
of jobs (for example applications that join two datasets or that
run multi-stage machine learning algorithms), and even iterative
applications (such as k-means clustering).

* We find that 50% to 89% of all distinct applications were only
executed with one input dataset. Those applications are one-off ad-hoc
queries (or one-off ad-hoc data transformations). Interestingly, we
find that a large fraction of these ad-hoc queries are written using
the streaming interface: 58% to 91%. At the same time, repetitive
applications used to process multiple datasets form the bulk
of all jobs executed in the clusters (40% to 90%).

* We see evidence that users leverage Hadoop both for batch-oriented
data processing and interactive data processing where an ensemble of
MapReduce jobs produce a result and the next ensemble is launched by
the same user 10 seconds to 15 minutes later.

2) Tunings

* We find that more than 30% of jobs experience imbalance in the
amount of work assigned to each task either in the map or in the
reduce phase.

* We find that a visible fraction of users customize their jobs for
performance (mostly using combiners) and correctness (such as changing
the input/output formats).  However, most users are reluctant to use
the manual tuning features provided by Hadoop to solve load imbalance problems.

* For Hadoop configuration parameters, users mostly use default parameters
except those related to failures. Many users explicitly tuned these options
in response to poor failure behaviors of their jobs such as “out of memory” error.
In contrast, users rarely tune parameters related to performance,
because their performance requirements were generally being met,
or because these parameters are more difficult to understand and manipulate.

3) Resource Usage and Sharing

* Users do leverage Hadoop to execute long-duration jobs that process
massive-scale datasets (100’s of GB to a few TB). However, users also
try to use their Hadoop hammer to process short jobs over small
datasets. In fact, short jobs over small datasets dominate the
workload even when Hadoop is known to be a suboptimal tool for such
workloads. Interestingly, some users indicated that they often break
their large and long-running jobs into smaller ones as a way to
checkpoint intermediate results and ensure that less work is lost when
an entire job fails.

* In aggregate, users process significant amounts of data each month
in spite of the fact that most of their jobs are short-lived. We find
that the median user processes hundreds of GB of data to nearly 1 TB
of data in a month. Users, however, have very different resource
consumption profiles. For example, the differences between the top
users and the bottom users can very from 1,000X to 100,000X across
months and clusters.

* We found that only 1% datasets are shared across users.

* Caching can effectively reduce disk I/O requests wasted by
accesses to small files.  For OpenCloud and M45 workloads, 69% of data
read accesses would hit the cache even when using only 512 MB memory
at each node.

Based on our findings, we identify important requirements for the next generation
Big Data systems that want to go beyond what Hadoop offers today:

* First and most important, we find that Big Data analysis comes in a large variety of forms: A good fraction of applications follow the batch processing form for which MapReduce was designed. The needs, however, extend significantly beyond this pattern:
many jobs are iterative with short-duration inner-loops, some applications require complex combinations of jobs to produce a desired result, others yet are embarrassingly parallel and can be spread almost arbitrarily across cluster resources. The next-generation Big-Data systems must thus cater to this extensive variety,
rather than optimize for a single type of applications as some systems do today

* Debugging and tuning Big-Data analysis applications is a great challenge.
While there exist important research from automatic configuration tuning to
correctness and performance debugging, these tools still need to make their way into widespread use. An interesting observation is that users are willing to work with the engine to overcome performance challenges and avoid failures, but the complexity of the settings is overwhelming. Informative monitoring tools and novel interactive debugging tools are urgently needed by Hadoop users.

For more details, we invited you to read our paper available here:
http://homes.cs.washington.edu/~magda/papers/ren-vldb13.pdf

Advertisements

August 2, 2013

Myria: Big Data as a Service

Filed under: Uncategorized — uwdb @ 3:06 pm
Tags: ,

Myria: Big Data as a Service

Magda’s post, Myria: Big Data as a Service, about our group’s Myria Project, is now up on the Intel Science and Technology Center for Big Data Blog!

July 17, 2013

PODS 2013

Filed under: Uncategorized — paris koutris @ 9:49 am
Tags:

There has been a long time since we last posted at the UWDB blog. However, coming back from SIGMOD/PODS 2013 in New York is a good opportunity to start posting again.

Our group participated at PODS with the paper “Communication Steps for Parallel Query Processing” by Paul Beame, Paris Koutris and Dan Suciu. Here is the link for the PODS paper and here is the link for full version.

In this paper, we study the problem of computing a relational query q on a large input database of size n, using a large number p of servers. In our model, the computation is performed in synchronous rounds, where each round includes some local computation followed by global communication between all servers. Further, we restrict the amount of data each server can receive during some computation: each server can receive only O(n/p^(1-ε)) bits, where ε is a parameter in [0,1) that controls replication and we call the space exponent. For example, if ε=1, one could send all data to one server and compute the result locally in one round. On the other end, if ε=0, each server receives O(n/p) data and so the data is equally distributed between the servers.  The restriction on communication/replication creates a trade-off between the number of rounds (r) and space exponent (ε), and computing this trade-off for Conjunctive Queries is the main subject of our work.

For a single round, r=1,  we show lower bounds and upper bounds that are matching. We prove that a conjunctive query q cannot be computed with space exponent ε < 1-1/τ*(q), where τ*(q) is the fractional vertex cover number for the hypergraph of query q. Our lower bounds are particularly strong, since we make no assumption about the way information is exchanged or about the computational capabilities of the servers. Additionally, we present an algorithm for one round that matches the lower bound, i.e. works with space exponent ε >= 1-1/τ*(q), for a specific class of skew-free databases. As an application of our 1-round result, consider the “triangle” query Q(x,y,z) = R(x,y), S(y,z), T(z,x). The hypegraph of Q has a fractional vertex cover of size 3/2, so the triangle query needs space exponent at least ε=1/3 to be computed. For multiple rounds of communication, we present lower bounds in a weaker model where the routing decisions for data are tuple-based. The lower bounds for multiple rounds are, to the best of our knowledge, the first of their kind. Interestingly, we also obtain as a corollary that transitive closure of a graph needs a non-constant number of rounds to be computed, for any space exponent.

There are many exciting questions that remain open. Can we show lower bounds for multiple rounds in the case where there is no assumptions on the kind of communication? What happens in the presence of data skew? Also, what can we say about the space exponent for other problems apart from relational queries?

September 7, 2011

VLDB Luck Continues

Filed under: Uncategorized — magda balazinska @ 12:02 am

We just got back from VLDB 2011. It was not a long trip since the conference was in downdown Seattle this year.

Our UWDB group had a lot of luck at the conference this year. In addition to Jayant’s and Phil’s 10-year best paper award, our group had 2 regular research papers at the conference and 3 out 10 papers on a new and fun “Challenges and Visions Track”. There were 41 submissions, so it is very nice to see that our group managed to get 3 papers accepted.

Additionally, one of our papers, “Data Markets in the Cloud: An Opportunity for the Database Community” by Balazinska, Howe, and Suciu won 2nd place in the Best Paper Competition on this track! This vision paper discusses interesting research questions related to building organized data markets where buyers and sellers come together to sell structured data. Microsoft recently started such a market. They call it the Windows Azure Marketplace DataMarket.

Congrats to the whole group!

 

 

 

April 14, 2011

VLDB 10-year Best Paper Award!

Filed under: Uncategorized — magda balazinska @ 12:32 pm

I’m delighted to share the news that Jayant Madhavan (a UW CSE Ph.D alum currently at Google) and Phil Bernstein (researcher at MSR and also affiliate faculty and member of our DB group), along with their co-author Erhard Rahm received the VLDB 2011 10-year Best Paper Award for their VLDB 2001 paper “Generic Schema Matching with Cupid”! (This award is given to the paper from the conference 10 years ago that had the most impact).

January 7, 2011

UWDB Website

Filed under: Uncategorized — magda balazinska @ 1:13 pm

To learn more about our group, please visit our website: http://db.cs.washington.edu/

UWDB Blog is Finally Born

Filed under: Uncategorized — uwdb @ 12:16 pm

Welcome to the UWDB Blog!

This is the blog of the University of Washington Database Group.

Follow our blog to stay in touch with our recent work, events, and more.

 

Blog at WordPress.com.