Hadoop, the open-source implementation of Google’s MapReduce, has
become a commonly used tool for Big Data analytics. Due to Hadoop’s
popularity, it is natural to ask the question: How well does Hadoop
actually work? Many papers partly answer this question either by
performing direct comparisons to alternate tools or by carrying out
measurement studies of production clusters. Prior studies, however,
analyze primarily the performance of individual queries and the
efficiency of entire Hadoop clusters, focusing on performance metrics
at the cluster or application levels such as job resource utilization.
In the following paper that will appear at VLDB 2013 in August this year:
Hadoop’s Adolescence: An analysis of Hadoop usage in scientific workloads
Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe
we provide a complementary answer to the question of how well Hadoop
works. We study what users actually do with their Hadoop system: we
study behaviors and application patterns from a user-centric
perspective. The users that we focus on are called “data
scientists” in the sense that they have large datasets and need to
process them to extract information. The goal of the study is to
assess how well Hadoop works for data scientists in terms of what they
need to do to write Hadoop applications, execute them, tune them, and
use them to extract knowledge from their data.
Our analysis is based on Hadoop workloads collected over periods of
five to 20 months in three different clusters. Our traces comprise a
total of more than 100,000 Hadoop jobs. The clusters that we study
come from academic institutions. Our data scientists are 113 domain
experts from various disciplines including computational astrophysics,
computational biology, computational neurolinguistics, information
retrieval and information classification, machine learning from the
contents of the web, natural language processing, image and video
analysis, and others. The traces were collected between 2009 and 2012.
We study three aspects of how users leverage their Hadoop clusters:
* Application Workload: We first analyze how users author Hadoop
applications, the resulting structure of these applications, and how
users interact with their Hadoop jobs over time.
* Tunings: We then study when and how users change configuration
parameters and otherwise tune their Hadoop jobs.
* Resource Usage and Sharing: Finally, we analyze the diverse, aggregate
resource-consumption profiles of users.
The following are some interesting findings that emerged from this
1) Application Workload
* In the 2009 to 2012 time-period, the majority of MapReduce
applications in the three clusters were written directly on top of the
raw MapReduce API rather than through more advanced tools such as Pig
Latin or Scoobi. The use of streaming, which allows users to specify
map and reduce functions in any programming language was also common.
There are several reasons to the lack of uptake of more advanced
tools. First, these tools are still evolving even today. We talked to
some of the users recently and several considered more advanced tools
cumbersome to use or hard to use. Others complain that high-level
tools that fail in the middle of a complex graph of jobs make it
difficult to restart and resume from the location where the failure
occurred. Second, legacy code (e.g., unix commands and machine
learning packages such as libsvm) and language familiarity (e.g.,
python, perl, and ruby) also play an important role as shown by the
prevalence of Streaming. Third, users indicated that certain uses of
Hadoop do not focus on the MapReduce model but rather use Hadoop
simply as a task scheduler.
* The majority of applications take the form of a single
MapReduce job (46% to 78%). However, a sizeable number of
applications are either chains of jobs, directed acyclic graphs
of jobs (for example applications that join two datasets or that
run multi-stage machine learning algorithms), and even iterative
applications (such as k-means clustering).
* We find that 50% to 89% of all distinct applications were only
executed with one input dataset. Those applications are one-off ad-hoc
queries (or one-off ad-hoc data transformations). Interestingly, we
find that a large fraction of these ad-hoc queries are written using
the streaming interface: 58% to 91%. At the same time, repetitive
applications used to process multiple datasets form the bulk
of all jobs executed in the clusters (40% to 90%).
* We see evidence that users leverage Hadoop both for batch-oriented
data processing and interactive data processing where an ensemble of
MapReduce jobs produce a result and the next ensemble is launched by
the same user 10 seconds to 15 minutes later.
* We find that more than 30% of jobs experience imbalance in the
amount of work assigned to each task either in the map or in the
* We find that a visible fraction of users customize their jobs for
performance (mostly using combiners) and correctness (such as changing
the input/output formats). However, most users are reluctant to use
the manual tuning features provided by Hadoop to solve load imbalance problems.
* For Hadoop configuration parameters, users mostly use default parameters
except those related to failures. Many users explicitly tuned these options
in response to poor failure behaviors of their jobs such as “out of memory” error.
In contrast, users rarely tune parameters related to performance,
because their performance requirements were generally being met,
or because these parameters are more difficult to understand and manipulate.
3) Resource Usage and Sharing
* Users do leverage Hadoop to execute long-duration jobs that process
massive-scale datasets (100’s of GB to a few TB). However, users also
try to use their Hadoop hammer to process short jobs over small
datasets. In fact, short jobs over small datasets dominate the
workload even when Hadoop is known to be a suboptimal tool for such
workloads. Interestingly, some users indicated that they often break
their large and long-running jobs into smaller ones as a way to
checkpoint intermediate results and ensure that less work is lost when
an entire job fails.
* In aggregate, users process significant amounts of data each month
in spite of the fact that most of their jobs are short-lived. We find
that the median user processes hundreds of GB of data to nearly 1 TB
of data in a month. Users, however, have very different resource
consumption profiles. For example, the differences between the top
users and the bottom users can very from 1,000X to 100,000X across
months and clusters.
* We found that only 1% datasets are shared across users.
* Caching can effectively reduce disk I/O requests wasted by
accesses to small files. For OpenCloud and M45 workloads, 69% of data
read accesses would hit the cache even when using only 512 MB memory
at each node.
Based on our findings, we identify important requirements for the next generation
Big Data systems that want to go beyond what Hadoop offers today:
* First and most important, we find that Big Data analysis comes in a large variety of forms: A good fraction of applications follow the batch processing form for which MapReduce was designed. The needs, however, extend significantly beyond this pattern:
many jobs are iterative with short-duration inner-loops, some applications require complex combinations of jobs to produce a desired result, others yet are embarrassingly parallel and can be spread almost arbitrarily across cluster resources. The next-generation Big-Data systems must thus cater to this extensive variety,
rather than optimize for a single type of applications as some systems do today
* Debugging and tuning Big-Data analysis applications is a great challenge.
While there exist important research from automatic configuration tuning to
correctness and performance debugging, these tools still need to make their way into widespread use. An interesting observation is that users are willing to work with the engine to overcome performance challenges and avoid failures, but the complexity of the settings is overwhelming. Informative monitoring tools and novel interactive debugging tools are urgently needed by Hadoop users.
For more details, we invited you to read our paper available here: