An Architecture for Fast and General Data Processing on Large Clusters
83 pages
English

Description

The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to clusters. Today, myriad data sources, from the Internet to business operations to scientific instruments, produce large and valuable data streams. However, the processing capabilities of single machines have not kept up with the size of data. As a result, organizations increasingly need to scale out their computations over clusters.


At the same time, the speed and sophistication required of data processing have grown. In addition to simple queries, complex algorithms like machine learning and graph analysis are becoming common. And in addition to batch processing, streaming analysis of real-time data is required to let organizations take timely action. Future computing platforms will need to not only scale out traditional workloads, but support these new applications too.


This book, a revised version of the dissertation that won the 2014 ACM Doctoral Dissertation Award, proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping MapReduce's scalability and fault tolerance. And whereas most deployed systems only support simple one-pass computations (e.g., SQL queries), ours also extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing.
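To make that combination concrete, here is a minimal sketch, in Scala against Spark's public DStream API (the Discretized Streams model covered in Chapter 4), that joins each micro-batch of a live stream with a static batch dataset. The socket source on localhost:9999 and the (id, name) pairs are hypothetical placeholders, not examples from the book.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamBatchJoin {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("StreamBatchJoin").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // A static "batch" dataset of hypothetical (id, name) pairs.
    val users = ssc.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))

    // A live stream of user-id lines from a (hypothetical) socket source.
    val clicks = ssc.socketTextStream("localhost", 9999)
                    .map(line => (line.trim.toInt, 1))

    // transform() runs arbitrary RDD code on each micro-batch, here a join
    // against the static dataset: streaming and batch code intermixed.
    clicks.transform(rdd => rdd.join(users)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Because each micro-batch is an ordinary RDD, the join reuses the same fault-tolerance and scheduling machinery as any batch job.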


We achieve these results through a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs). We show that this is enough to capture a wide range of workloads. We implement RDDs in the open source Spark system, which we evaluate using synthetic and real workloads. Spark matches or exceeds the performance of specialized systems in many domains, while offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine the generality of RDDs from both a theoretical modeling perspective and a systems perspective.
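For a flavor of the programming interface, here is a minimal sketch using Spark's public RDD API, along the lines of the log-mining example that motivates RDDs in Chapter 2; the file name events.log and the search strings are illustrative assumptions.

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LogMining").setMaster("local[*]"))

    // Transformations are lazy; Spark records the lineage of each RDD
    // and uses it to recompute lost partitions after a failure.
    val errors = sc.textFile("events.log").filter(_.contains("ERROR"))

    // persist() keeps the dataset in memory: the data-sharing primitive
    // that makes multi-pass computation efficient.
    errors.persist()

    // Two passes over the same in-memory data, without rereading the file.
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())

    sc.stop()
  }
}

The cached dataset is protected by lineage rather than replication: if a partition of errors is lost, Spark recomputes it by rerunning the filter over the corresponding part of the input.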


This version of the dissertation makes corrections throughout the text and adds a new section on the evolution of Apache Spark in industry since 2014. In addition, the text has been edited and reformatted, and links have been added for the references.


Table of Contents: Preface / 1. Introduction / 2. Resilient Distributed Datasets / 3. Models Built over RDDs / 4. Discretized Streams / 5. Generality of RDDs / 6. Conclusion / References / Author's Biography

Subjects

Information

Published by
Publication date: 01 May 2016
EAN13: 9781970001587
Language: English

Legal information: rental price per page €0.2250. This information is provided for guidance only, in accordance with the legislation in force.

Excerpt

An Architecture for Fast and General Data Processing on Large Clusters
ACM Books
Editor in Chief
M. Tamer Özsu, University of Waterloo
ACM Books is a new series of high-quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.
An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia, Massachusetts Institute of Technology
2016
Reactive Internet Programming: State Chart XML in Action
Franck Barbier, University of Pau, France
2016
Verified Functional Programming in Agda
Aaron Stump, The University of Iowa
2016
The VR Book: Human-Centered Design for Virtual Reality
Jason Jerald, NextGen Interactions
2016
Ada's Legacy
Robin Hammerman, Stevens Institute of Technology; Andrew L. Russell, Stevens Institute of Technology
2016
Edmund Berkeley and the Social Responsibility of Computer Professionals
Bernadette Longo, New Jersey Institute of Technology
2015
Candidate Multilinear Maps
Sanjam Garg, University of California, Berkeley
2015
Smarter than Their Machines: Oral Histories of Pioneers in Interactive Computing
John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University
2015
A Framework for Scientific Discovery through Video Games
Seth Cooper, University of Washington
2014
Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers
Bryan Jeffrey Parno, Microsoft Research
2014
Embracing Interference in Wireless Systems
Shyamnath Gollakota, University of Washington
2014
An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia
Massachusetts Institute of Technology
ACM Books 11
Copyright © 2016 by the Association for Computing Machinery and Morgan & Claypool Publishers
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in printed reviews, without the prior permission of the publisher.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
An Architecture for Fast and General Data Processing on Large Clusters
Matei Zaharia
books.acm.org
www.morganclaypool.com
ISBN: 978-1-97000-159-4 hardcover
ISBN: 978-1-97000-156-3 paperback
ISBN: 978-1-97000-157-0 ebook
ISBN: 978-1-97000-158-7 ePub
Series ISSN: 2374-6769 (print), 2374-6777 (electronic)
DOIs: 10.1145/2886107 (Book)
10.1145/2886107.2886108 (Preface)
10.1145/2886107.2886109 (Chapter 1)
10.1145/2886107.2886110 (Chapter 2)
10.1145/2886107.2886111 (Chapter 3)
10.1145/2886107.2886112 (Chapter 4)
10.1145/2886107.2886113 (Chapter 5)
10.1145/2886107.2886114 (Chapter 6)
10.1145/2886107.2886115 (References)
A publication in the ACM Books series, 11
Editor in Chief: M. Tamer Özsu, University of Waterloo
First Edition
10 9 8 7 6 5 4 3 2 1
To my family
Contents
Preface
Chapter 1 Introduction
1.1 Problems with Specialized Systems
1.2 Resilient Distributed Datasets (RDDs)
1.3 Models Implemented over RDDs
1.4 Summary of Results
1.5 Book Overview
Chapter 2 Resilient Distributed Datasets
2.1 Introduction
2.2 RDD Abstraction
2.3 Spark Programming Interface
2.4 Representing RDDs
2.5 Implementation
2.6 Evaluation
2.7 Discussion
2.8 Related Work
2.9 Summary
Chapter 3 Models Built over RDDs
3.1 Introduction
3.2 Techniques for Implementing Other Models on RDDs
3.3 Shark: SQL on RDDs
3.4 Implementation
3.5 Performance
3.6 Combining SQL with Complex Analytics
3.7 Summary
Chapter 4 Discretized Streams
4.1 Introduction
4.2 Goals and Background
4.3 Discretized Streams (D-Streams)
4.4 System Architecture
4.5 Fault and Straggler Recovery
4.6 Evaluation
4.7 Discussion
4.8 Related Work
4.9 Summary
Chapter 5 Generality of RDDs
5.1 Introduction
5.2 Expressiveness Perspective
5.3 Systems Perspective
5.4 Limitations and Extensions
5.5 Related Work
5.6 Summary
Chapter 6 Conclusion
6.1 Lessons Learned
6.2 Evolution of Spark in Industry
6.3 Future Work
References
Author's Biography
Preface
I am thankful to my advisors, Scott Shenker and Ion Stoica, who guided me tirelessly throughout my Ph.D. They are both brilliant researchers who were always able to push ideas one level further, provide the input needed to finish a task, and share their experience in research. I was especially fortunate to work with both of them at the same time and benefit from both their perspectives.
The work in this book is the result of collaboration with many other people. Chapter 2 was joint work with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Mike Franklin, Scott Shenker, and Ion Stoica [Zaharia et al. 2011]. Chapter 3 highlights parts of the Shark project developed with Reynold Xin, Josh Rosen, Mike Franklin, Scott Shenker, and Ion Stoica [Xin et al. 2013b]. Chapter 4 was joint work with Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica [Zaharia et al. 2013]. More broadly, numerous people in the AMPLab and in Spark-related projects like GraphX [Xin et al. 2013a] and MLI [Sparks et al. 2013] contributed to the development and refinement of the ideas in this book.
Beyond direct collaborators on the projects here, many other people contributed to my graduate work and made Berkeley an unforgettable experience. Over numerous cups of chai and Ceylon gold, Ali Ghodsi provided fantastic advice and ideas on both the research and open source sides. Ben Hindman, Andy Konwinski, and Kurtis Heimerl were great fun to hang out with and great collaborators on some crazy ideas. Taylor Sittler got me, and then much of the AMPLab, very excited about biology, which led to one of the most fun groups I've participated in under AMP-X, with Bill Bolosky, Ravi Pandya, Kristal Curtis, Dave Patterson, and others. In other projects I was also fortunate to work with Vern Paxson, Dawn Song, Anthony Joseph, Randy Katz, and Armando Fox, and to learn from their insights. Finally, the AMPLab and RAD Lab were amazing communities, both among the researchers at Berkeley and the industrial contacts who gave us constant feedback.
I was also very fortunate to work early on with the open-source big data community. Dhruba Borthakur and Joydeep Sen Sarma got me started contributing to Hadoop at Facebook, while Eric Baldeschwieler, Owen O'Malley, Arun Murthy, Sanjay Radia, and Eli Collins participated in many discussions to make our research ideas real. Once we started the Spark project, I was and continue to be overwhelmed by the talent and enthusiasm of the individuals who contribute to it. The Spark and Shark contributors, of which there are now over 1000, deserve as much credit as anyone for making these projects real. The users, of course, have also contributed tremendously, by continuing to give great feedback and relentlessly pushing these systems in new directions. Of these, I am especially thankful to Spark's earliest users, including Lester Mackey, Tim Hunter, Dilip Joseph, Jibin Zhan, Erich Nachbar, and Karthik Thiyagarajan.
Last but not least, I want to thank my family and friends for their unwavering support throughout my Ph.D. and the writing of this book.
1 Introduction
The past few years have seen a major change in computing systems, as growing data volumes require more and more applications to scale out to large clusters. In both commercial and scientific fields, new data sources and instruments (e.g., gene sequencers, RFID, and the Web) are producing rapidly increasing amounts of information. Unfortunately, the processing and I/O capabilities of single machines have not kept up with this growth. As a result, more and more organizations have to scale out their computations across clusters.
The cluster environment comes with several challenges for programmability. The first one is parallelism: this setting requires rewriting applications in a parallel fashion, with programming models that can capture a wide range of computations. However, unlike in other parallel platforms, a second challenge in clusters is faults: both outright node failures and stragglers (slow nodes) become common at scale, and can greatly impact the performance of an application. Finally, clusters are typically shared between multiple users, requiring runtimes that can dynamically scale computations up and down and exacerbating the possibility of interference.
As a result, a wide range of new programming models have been designed for clusters. At first, Google's MapReduce [Dean and Ghemawat 2004] presented a simple and general model for batch processing that automatically handles faults. However, MapReduce was found to be poorly suited to other types of workloads, leading to a wide range of specialized models that differed significantly from MapReduce. For example, at Google, Pregel [Malewicz et al. 2010] offers a bulk-synchronous parallel (BSP) model for iterative graph algorithms; F1 [Shute et al. 2013] runs fast, but non-fault-tolerant, SQL queries; and MillWheel [Akidau et al. 2013] supports continuous stream processing. Outside Google, systems like Storm [Apache Storm 2016], Impala [Cloudera Impala 2016], Piccolo [Power and Li 2010], and GraphLab [Low et al. 2012] offer similar models. With new models continuing to be implemented every year…
