Table of Contents

Foreword xv
Preface xvii
Acknowledgments xxv
About the Author xxvii

Part I: Directives in the Big Data Era 1

Chapter 1: Four Rules for Data Success 3

When Data Became a BIG Deal 3
Data and the Single Server 4
The Big Data Trade-Off 5
Anatomy of a Big Data Pipeline 9
The Ultimate Database 10
Summary 10

Part II: Collecting and Sharing a Lot of Data 11

Chapter 2: Hosting and Sharing Terabytes of Raw Data 13

Suffering from Files 14
Storage: Infrastructure as a Service 15
Choosing the Right Data Format 16
Character Encoding 19
Data in Motion: Data Serialization Formats 21
Summary 23

Chapter 3: Building a NoSQL-Based Web App to Collect Crowd-Sourced Data 25

Relational Databases: Command and Control 25
Relational Databases versus the Internet 28
Nonrelational Database Models 31
Leaning toward Write Performance: Redis 35
Sharding across Many Redis Instances 38
NewSQL: The Return of Codd 41
Summary 42

Chapter 4: Strategies for Dealing with Data Silos 43

A Warehouse Full of Jargon 43
Hadoop: The Elephant in the Warehouse 48
Data Silos Can Be Good 49
Convergence: The End of the Data Silo 51
Summary 53

Part III: Asking Questions about Your Data 55

Chapter 5: Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets 57

What Is a Data Warehouse? 57
Apache Hive: Interactive Querying for Hadoop 60
Shark: Queries at the Speed of RAM 65
Data Warehousing in the Cloud 66
Summary 67

Chapter 6: Building a Data Dashboard with Google BigQuery 69

Analytical Databases 69
Dremel: Spreading the Wealth 71
BigQuery: Data Analytics as a Service 73
Building a Custom Big Data Dashboard 75
The Future of Analytical Query Engines 82
Summary 83

Chapter 7: Visualization Strategies for Exploring Large Datasets 85

Cautionary Tales: Translating Data into Narrative 86
Human Scale versus Machine Scale 89
Building Applications for Data Interactivity 90
Summary 96

Part IV: Building Data Pipelines 97

Chapter 8: Putting It Together: MapReduce Data Pipelines 99

What Is a Data Pipeline? 99
Data Pipelines with Hadoop Streaming 101
A One-Step MapReduce Transformation 105
Managing Complexity: Python MapReduce Frameworks for Hadoop 110
Summary 114

Chapter 9: Building Data Transformation Workflows with Pig and Cascading 117

Large-Scale Data Workflows in Practice 118
It’s Complicated: Multistep MapReduce Transformations 118
Cascading: Building Robust Data-Workflow Applications 122
When to Choose Pig versus Cascading 128
Summary 128

Part V: Machine Learning for Large Datasets 129

Chapter 10: Building a Data Classification System with Mahout 131

Can Machines Predict the Future? 132
Challenges of Machine Learning 132
Apache Mahout: Scalable Machine Learning 136
MLBase: Distributed Machine Learning Framework 139
Summary 140

Part VI: Statistical Analysis for Massive Datasets 143

Chapter 11: Using R with Large Datasets 145

Why Statistics Are Sexy 146
Strategies for Dealing with Large Datasets 149
Summary 155

Chapter 12: Building Analytics Workflows Using Python and Pandas 157

The Snakes Are Loose in the Data Zoo 157
Python Libraries for Data Processing 160
Building More Complex Workflows 167
iPython: Completing the Scientific Computing Tool Chain 170
Summary 174

Part VII: Looking Ahead 177

Chapter 13: When to Build, When to Buy, When to Outsource 179

Overlapping Solutions 179
Understanding Your Data Problem 181
A Playbook for the Build versus Buy Problem 182
My Own Private Data Center 184
Understand the Costs of Open-Source 186
Everything as a Service 187
Summary 187

Chapter 14: The Future: Trends in Data Technology 189

Hadoop: The Disruptor and the Disrupted 190
Everything in the Cloud 191
The Rise and Fall of the Data Scientist 193
Convergence: The Ultimate Database 195
Convergence of Cultures 196
Summary 197

Index 199