SEO optimization
RC25281 Revised (AUS1303-002) March 7, 2013
Computer Science
IBM Research Report
Understanding System s and Architecture for Big Data
William M. Buros 1 , Guan Cheng Chen 2 , Mei-Mei Fu 3 , Anne E. Gattiker 4 ,
Fadi H. Gebara 4 , Ahmed Gheith 4 , H. Peter Hofstee 4 ,
Damir A. Jamsek 4 , Thomas G. Lendacky 1 , Jian Li 4 , Yan Li 2 , John S. Poelman 3 ,
Steven Pratt 1 , Ju Wei Shi 2 , Evan Speight 4 , Peter W. Wong 1
11501 Burnet Road
Austin, TX 78758
2 IBM Research Division
China Research Laboratory
Building 19, Zhouguancun Software Park
8 Dongbeiwang West Road, Haidian District
Beijing, 100193
555 Bailey Avenue
San Jose, CA 95141-1003
4 IBM Research Division
Austin Research Laboratory
11501 Burnet Road
Austin, TX 78758
Research Division
Almaden - Austin - Beijing - Cambridge - Haifa - India - T. J. Watson - Tokyo - Zurich
Understanding Systems and Architectures for Big Data
William M. Buros, Guan Cheng Chen, Mei-Mei Fu, Anne E. Gattiker, Fadi H. Gebara,
Ahmed Gheith, H. Peter Hofstee, Damir A. Jamsek, Thomas G. Lendacky, Jian Li, Yan
Li, John S. Poelman, Steven Pratt, Ju Wei Shi, Evan Speight, Peter W. Wong
International Business Machines Corp.
be structured in many different ways, and techniques are
The use of Big Data underpins critical activities in all sec-
needed to understand and process such variation. Veracity
tors of our society. Achieving the full transformative po-
refers to the quality of the information in the face of data un-
tential of Big Data in this increasingly digital world re-
certainty from many different places: the data itself, sensor
quires both new data analysis algorithms and a new class
precision, model approximation, or inherent process uncer-
of systems to handle the dramatic data growth, the de-
tainty. Emerging social media also brings in new types of
mand to integrate structured and unstructured data ana-
data uncertainty such as rumors, lies, falsehoods and wishful
lytics, and the increasing computing needs of massive-scale
thinking. We believe the 4 Vs presently capture the main
analytics. In this paper, we discuss several Big Data re-
characteristics of Big Data.
search activities at IBM Research: (1) Big Data bench-
The importance of Big Data in today’s society can not
marking and methodology; (2) workload optimized systems
be underestimated, and proper understanding and use of
for Big Data; and (3) a case study of Big Data workloads
this important resource will require both new data analysis
on IBM Power systems. Our case study shows that pre-
algorithms and new systems to handle the dramatic data
liminary infrastructure tuning results in sorting 1TB data
growth, the demand to integrate structured and unstruc-
tured data analytics, and the increasing computing needs of
in 8 minutes on 10 PowerLinux
7R2 with POWER7+
massive-scale analytics. As a result, there exists much re-
systems [5] running IBM InfoSphere BigInsights
. This
search in critical aspects of emerging analytics systems for
translates to sorting 12.8GB/node/minute for the IO inten-
Big Data. Examples of current research in this area include:
sive sort benchmark. We also show that 4 PowerLinux 7R2
processor, memory, and system architectures for data analy-
with POWER7+ nodes can sort 1TB input with around 21
sis; benchmarks, metrics, and workload characterization for
minutes. Further improvements are expected as we continue
analytics and data-intensive computing; debugging and per-
full-stack optimizations on both IBM software and hard-
formance analysis tools for analytics and data-intensive com-
puting; accelerators for analytics and data-intensive com-
puting; implications of data analytics to mobile and em-
bedded systems; energy efficiency and energy-efficient de-
The term “Big Data” refers to the continuing massive ex-
signs for analytics; availability, fault tolerance and recovery
pansion in the data volume and variety as well as the velocity
issues; scalable system and network designs for high con-
and veracity of data processing [12]. Volume refers to the
currency or high bandwidth data streaming; data manage-
scale of the data and processing needs. Whether for data
ment and analytics for vast amounts of unstructured data;
at rest or in motion, i.e., being a repository of information
evaluation tools, methodologies and workload synthesis; OS,
or a stream, the desire for high speed is constant, hence
distributed systems and system management support; and
the notion of velocity. Variety indicates that Big Data may
MapReduce and other processing paradigms and algorithms
for analytics.
All performance data contained in this publication was ob-
This short paper briefly discusses three topics that IBM
tained in the specific operating environment and under the
conditions described below and is presented as an illustra-
researchers and colleagues from other IBM divisions have
tion. Performance obtained in other operating environments
been working on:
may vary and customers should conduct their own testing.
  • Big Data benchmarking and methodology
  • Workload optimized systems for Big Data
  • A case study of a Big Data workload on IBM Power
  • systems
    We highlight the research directions that we are pursuing
    Contact Author: Jian Li, IBM Research in Austin, 11501 Burnet Road,
    in this paper. More technical details and progress updates
    Austin, TX 78758; Phone: (512)286-7228; Email:
    will be covered in several papers in an upcoming issue of the
    IBM Research Technical Report. July 10, 2012.
    Copyright 2012 .
    IBM Journal of Research and Development.
    Massive Scale Analytics is representative of a new class of
    While industry has made substantial investments in ex-
    workloads that justifies a re-thinking of how computing sys-
    tending its software capabilities in analytics and Big Data,
    tems should be optimized. We start by tackling the problem
    thus far these new workloads are being executed on systems
    of the absence of a set of benchmarks that system hardware
    that were designed and configured in response to more tra-
    designers can use to measure the quality of their designs
    ditional workload demands.
    and that customers can use to evaluate competing hardware
    In this paper we present two key improvements to tradi-
    offerings in this fast-growing and still rapidly-changing mar-
    tional system design. The first is the addition of reconfig-
    ket. Existing benchmarks, such as HiBench [11], fall short in
    urable acceleration. While reconfigurable acceleration has
    terms of both scale and relevance. We conceive a method-
    been used successfully in commercial systems and appliances
    before (e.g., DataPower
    [6] and Netezza
    ology for peta-scale data-size benchmark creation that in-
    [4]), we have
    cludes representative Big Data workloads and can be used
    demonstrated via a prototype system that such technology
    as a driver of total system performance, with demands bal-
    can benefit the processing of unstructured data.
    anced across storage, network and computation. Creating
    The second innovation we discuss is a new modular and
    such a benchmark requires meeting unique challenges asso-
    dense design that also leverages acceleration. Achieving the
    ciated with the data size and its unstructured nature. To
    computational and storage densities that this design pro-
    be useful, the benchmark also needs to be generic enough
    vides requires an increase in processing efficiency that is
    to be accepted by the community at large. We also ob-
    achieved by a combination of power-efficient processing cores
    serve unique challenges associated with massive scale ana-
    and offloading of performance-intensive functions to the re-
    lytics benchmark creation, along with a specific benchmark
    configurable logic.
    we have created to meet them.
    With an eye towards how analytics workloads are likely to
    The first consequence of the massive scale of the data is
    evolve and executing such workloads efficiently, we conceive
    that the benchmark must be descriptive, rather than pre-
    of a system that leverages the accelerated dense scale-out
    scriptive. In other words, our proposed benchmark is pro-
    design in combination with powerful global server nodes that
    vided as instructions for acquiring the required data and pro-
    orchestrate and manage computation. This system can also
    cessing it, rather than providing benchmark code to run on
    be used to perform deeper in-memory analytics on selected
    supplied data. We propose the use of existing large datasets,
    such as the 25TB ClueWeb09 dataset [9] and the over 200TB
    Stanford WebBase repository [10]. Challenges of using such
    real-world large datasets include physical data delivery (e.g.,
    via shipped disk drives), and data formatting/“cleaning” of
    Apache Hadoop [1] has been widely deployed on clusters
    the data to allow robust processing.
    of relatively large numbers of moderately sized, commodity
    We propose compute- and data-intensive processing tasks
    servers. However, it has not been widely used on large,
    that are representative of key massive-scale analytics work-
    multi-core, heavily threaded machines even though smaller
    loads to be applied to this unstructured data. These tasks
    systems have increasingly large core and hardware thread
    include major Big Data application areas, text analytics,
    counts. We describe an initial performance evaluation and
    graph analytics and machine-learning. Specifically our bench-
    tuning of Hadoop on a large multi-core cluster with only a
    mark efforts focused on document categorization based on
    handful of machines. The evaluation environment comprises
    dictionary-matching, document and page ranking, and topic
    IBM InfoSphere BigInsights [3] on a 10-machine cluster of
    determination via non-negative matrix factorization. The
    IBM PowerLinux
    7R2 with POWER7+ systems.
    first of the three, in particular, required innovation in bench-
    mark creation, as there is no “golden reference” to establish
    4.1 Evaluation Environment
    correct document categorization. Existing datasets typically
    Table 1 shows the evaluation environment, which com-
    used as references for text-categorization assessments, such
    prises IBM InfoSphere BigInsights [3] on a 10-machine clus-
    as the Enron corpus [2], are orders of magnitude smaller than
    ter of IBM PowerLinux 7R2 with POWER7+ servers. IBM
    what we required. Our approach for overcoming this chal-
    InfoSphere BigInsights provides the power of Apache Hadoop
    lenge included utilizing publicly-accessible documents coded
    in an enterprise-ready package. BigInsights enhances Hadoop
    by subject, such as US Patent Office patents and applica-
    by adding robust administration, workflow orchestration,
    tions, to create subject-specific dictionaries against which
    provisioning and security, along with best-in-class analyt-
    to match documents. Unique challenges of ensuring “real-
    ics tools from IBM Research. This version of BigInsights,
    world”relevance includes covering non-word terms of impor-
    version 1.3, uses Hadoop 0.20.2 and its built-in HDFS file
    tance, such as band names that include regular expression
    characters, and a “wisdom of crowds” approach that helps
    Each PowerLinux 7R2 includes 16 POWER7+
    cores @
    us meet those challenges.
    4.228GHz that can scale to 4.4GHz in a “Dynamic Power
    It is our intention to make our benchmark public. The
    Saving - Favor Performance”mode, up to 64 hardware threads,
    benchmark is complementary to existing prescriptive bench-
    128 GB of memory, 6 × 600 GB internal SAS drives, and
    marks, such as Terasort and its variations that have been
    24 × 600 GB SAS drives in an external Direct Attached Stor-
    widely exercised in the Big Data community. In this paper,
    age (DAS) drawer. We used software RAID0 over LVM for
    we use Terasort as a case study in Section 4.
    the 29 drives to each machine. One internal SAS drive is
    dedicated as the boot disk. The machines are connected
    by 10Gb Ethernet network. Each machine has two 10Gbe
    connections to the top of the rack switch. We used RedHat
    Linux (RHEL6.2). All 10 PowerLinux 7R2 with POWER7+
  • Preliminary intermediate control of map and reduce
  • Table 1: Evaluation Environment
    stages to better utilize available memory capacity;
  • Reconfiguration of storage subsystem to remove fail-
  • Cluster
    10 PowerLinux 7R2 with POWER7+
    over support of storage adapters for effective band-
    width improvement since Hadoop handles storage fail-
    16 processor cores per server (160 total)
    ures by replication at software level;
    128GB per server (1280GB total)
  • JVM, Jitting, and GC tuning that better fit the POWER
  • Internal
    6 600GB internal SAS drives
    per server (36TB total)
    24 600GB SAS drives in IBM EXP24S SFF
  • Architecture featuers, like hardware prefetching, NUMA
  • Expansion
    Gen2-bay Drawer, per server (144TB total)
    Currently, we are able to achieve an execution time of less
    2 10Gbe connections per server
    than 8 minutes to sort 1TB input data from disk and writ-
    BNT BLADE RackSwitch G8264
    ing back 1TB output data to disk storage. This translates
    to sorting 12.8GB/node/minute for the IO intensive sort
    Red Hat Enterprise Linux 6.2
    benchmark. Table 2 lists some of the BigInsights (Hadoop)
    IBM Java Version 7 SR1
    job configuration parameters used in this Terasort run that
    Version 1.3 (Hadoop v0.20.2)
    completed in 7 minutes and 50 seconds.
    10 P7R2 with POWER7+
    Figure 1 shows the CPU utilization of one of the 10 nodes
    as DataNodes/TaskTrackers
    in the cluster during the Terasort run. All nodes have sim-
    One of them as NameNode/JobTracker
    ilar CPU utilization. Figure 1 shows that CPU utilization
    is very high during the map and map-reduce overlapping
    stages. As expected, it is only when all mappers finish
    and only reducers (380 as shown in Table 2 out of the 640
    machines function as Hadoop data nodes and task trackers.
    hardware threads supported by the 10 PowerLinux 7R2 with
    Note that we use one of the P7R2 with POWER7+ as both
    POWER7+ servers) are writing output back to disks that
    maseter node (Name Node and Job Tracker) and a slave
    CPU utilization drops.
    node (Data Node and Task Tracker).
    While we have large quantities of other system profile
    data to help us understand the potential for further perfor-
    4.2 Results and Analysis
    mance improvement, one thing is clear: high CPU utiliza-
    We have some early experiences with measuring and tun-
    tion does not necessarily translate into high performance. In
    ing standard Hadoop programs, including some of the ones
    other words, the CPU may be busy doing inefficient comput-
    used in the HiBench [11] benchmark suite, and some from
    ing (during the Hadoop and Java framework, for instance),
    real-world customer engagements. In this paper, we use
    or performing inefficient algorithms that artificially inflate
    Terasort, a widely used sort program included in the Hadoop
    CPU utilization without improving performance. This leads
    distribution, as a case study. While Terasort in Hadoop can
    us to examine full-stack optimization during the next phase
    be configured to sort differing amounts of input data, we
    of our research.
    only present results for sorting 1TB of input data. In addi-
    tion we compress neither input nor output data.
    Our initial trial is done with the default Hadoop map-
    reduce configuration, e.g., limited map and reduce task slots,
    which does not utilize the 16 processor cores in a single Pow-
    erLinux 7R2 with POWER7+ system. As expected, the test
    takes hours to finish. After initial adjustment of the num-
    ber of mappers and reducers to fit to the parallel processing
    power of 16 cores in a PowerLinux 7R2 with POWER7+
    system, the execution time drastically decreases.
    We then apply the following public-domain tuning meth-
    ods and reference the best practices from the PowerLinux
    Community [8] to gradually improve the sort performance:
    Figure 1: CPU utilization on one PowerLinux 7R2 with
    POWER7+ node from a Terasort run of 7 minutes 50
  • Four-way Simultaneous Multithreading (SMT4) to fur-
  • seconds.
    ther increase computation parallelism and stress data
    parallelism via large L3 caches and high memory band-
    As indicated above, we have only applied infrastructure
    width on POWER7+
    tuning in this stage of the study. In the next stage, we
    plan to incorporate the performance enhancement features
  • Aggressive read ahead setting and deadline disk IO
  • in IBM InfoSphere BigInsights for further improvement. We
    scheduling at OS level;
    are also working on patches for better intermediate control
    in MapReduce. Newer Hadoop versions than 0.20.2 are ex-
  • Large block size and buffer sizing in Hadoop;
  • pected to continue to deliver performance improvements.
    Furthermore, we plan to apply reconfigurable acceleration
  • Publicly available LZO compression [7] for intermedi-
  • technology as indicated in Section 3. In the meantime, scal-
    ate data compression;
    ability studies are also important to understand the optimal
    Table 2: Sample BigInsights (Hadoop) Job Configuration
    ’-server -Xlp -Xnoclassgc -Xgcpolicy:gencon -Xms890m -Xmx890m
    -Xjit:optLevel=hot -Xjit:disableProfiling -Xgcthreads4 -XlockReservation’
    approach to (A) strong scaling of the BigInsights cluster in
    In this paper, we have presented our initial study on Big
    terms of scaling the number of nodes with constant input
    Data benchmarking and methodology as well as workload
    and (B) weak scaling of the BigInsights cluster that scales
    optimized systems for Big Data. We have also discussed our
    up the number of nodes with corresponding increases in in-
    initial experience of sorting 1TB data on a 10-node Pow-
    put size .
    erLinux 7R2 with POWER7+ cluster. As of this writing,
    As a preliminary proof of concept, our experiment shows
    it takes less than 8 minutes to complete the sort, which
    that BigInsights v1.3 with only 4 PowerLinux 7R2 with
    translates to sorting 12.8GB/node/minute for the IO inten-
    POWER7+ nodes can sort 1TB input in around 20 minutes
    sive sort benchmark. We expect additional improvements
    30 seconds. Thus, the Terasort benchmark exhibits nearly
    in the near future. We also show that 4 PowerLinux 7R2
    linear scaling with strong scaling from 4 up through the 10
    with POWER7+ nodes can sort 1TB input with around 21
    nodes in our cluster, leading to a lower system cost and
    minutes. Further improvement is expected as we continue
    footprint for situations where a 20 minutes Terasort time is
    full-stack optimization on both software and hardware.
    While we have not observed that the network is a bot-
    tleneck in our 10-node cluster, this may change when the
    We would like to thank the IBM InfoSphere BigInsights
    system scales up and the workload changes. Similar obser-
    team for their support. Particularly, we owe a debt of grat-
    vations apply to our disk subsystem, which currently has 30
    itude to Berni Schiefer, Stewart Tate and Hui Liao for their
    disks in total per PowerLinux 7R2 with POWER7+ server.
    guidance and insights. Susan Proietti Conti, Gina King, An-
    Finally, we expect people to utilize performance analysis
    gela S Perez, Raguraman Tumaati-Krishnan, Anitra Powell,
    similarly and judiciously size their systems for their partic-
    Demetrice Browder and Chuck Bryan have been supporting
    ular workloads and needs. This can be useful either when
    us to streamline the process along the way. Last but not
    they make purchasing decisions or when they reconfigure
    least, we could not have reached this stage without the gen-
    their systems in production.
    erous support and advice from Dan P. Dumarot, Richard W.
    Bishop, Pat Buckland, Clark Anderson, Ken Blake, Geoffrey
    Cagle and John Williams, among others.
    We borrow the two common notions of scalability, strong
    vs. weak scaling, from the High Performance Computing
    [1] Apache hadoop. http://hadoop.apache.hadoop.
    community, which we think fit well to Big Data analytics
    [2] Enron Email Dataset.
    and data-intensive computing. enron/.
    [3] IBM InfoSphere BigInsights. http://www-
    [4] IBM Netezza Data Warehouse Appliances.
    [5] IBM PowerLinux 7R2 Server. http://www-
    [6] IBM WebSphere DataPower SOA Appliances.
    [7] LZO: A Portable Lossless Data Compression Library.
    [8] PowerLinux Community.
    [9] The ClueWeb09 Dataset.
    [10] The Stanford WebBase Project. testbed/doc2/WebBase/.
    [11] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang.
    The HiBench benchmark suite: Characterization of
    the MapReduce-based data analysis. In IEEE
    International Conference on Data Engineering
    Workshops (ICDEW) , Long Beach, CA, USA, 2010.
    [12] T. Morgan. IBM Global Technology Outlook 2012. In
    Technology Innovation Exchange , IBM Warwick, 2012.