Pro Apache Hadoop

Second Edition

Sameer Wadkar

Madhu Siddalingaiah

Contents

About the Authors xix

About the Technical Reviewer xxi

Acknowledgments xxiii

Introduction xxv

Chapter 1: Motivation for Big Data 1

What Is Big Data? 1

Key Idea Behind Big Data Techniques 2

Data Is Distributed Across Several Nodes 2

Applications Are Moved to the Data 3

Data Is Processed Local to a Node 3

Sequential Reads Preferred Over Random Reads 3

An Example 4

Big Data Programming Models 4

Massively Parallel Processing (MPP) Database Systems 4

In-Memory Database Systems 5

MapReduce Systems 5

Bulk Synchronous Parallel (BSP) Systems 6

Big Data and Transactional Systems 7

How Much Can We Scale? 8

A Compute-Intensive Example 8

Amdahl's Law 9

Business Use-Cases for Big Data 9

Summary 10

Chapter 2: Hadoop Concepts 11

Introducing Hadoop 11

Introducing the MapReduce Model 12

Components of Hadoop 16

Hadoop Distributed File System (HDFS) 17

Secondary NameNode 22

TaskTracker 23

JobTracker 23

Hadoop 2.0 24

Components of YARN 26

HDFS High Availability 29

Summary 30

Chapter 3: Getting Started with the Hadoop Framework 31

Types of Installation 31

Stand-Alone Mode 31

Pseudo-Distributed Cluster 32

Multinode Cluster Installation 32

Preinstalled Using Amazon Elastic MapReduce 32

Setting up a Development Environment with a Cloudera Virtual Machine 33

Components of a MapReduce program 34

Your First Hadoop Program 34

Prerequisites to Run Programs in Local Mode 35

WordCount Using the Old API 36

Building the Application 38

Running WordCount in Cluster Mode 39

WordCount Using the New API 39

Building the Application 41

Running WordCount in Cluster Mode 41

Third-Party Libraries in Hadoop Jobs 41

Summary 46

Chapter 4: Hadoop Administration 47

Hadoop Configuration Files 47

Configuring Hadoop Daemons 48

Precedence of Hadoop Configuration Files 49

Diving into Hadoop Configuration Files 49

core-site.xml 50

hdfs-*.xml 51

mapred-site.xml 52

yarn-site.xml 54

Memory Allocations in YARN 55

Scheduler 56

Capacity Scheduler 57

Fair Scheduler 59

Fair Scheduler Configuration 60

yarn-site.xml Configurations 61

Allocation File Format and Configurations 62

Determine Dominant Resource Share in drf Policy 63

Slaves File 64

Rack Awareness 64

Providing Hadoop with Network Topology 64

Cluster Administration Utilities 65

Check the HDFS 66

Command-Line HDFS Administration 68

Rebalancing HDFS Data 70

Copying Large Amounts of Data from the HDFS 71

Summary 72

Chapter 5: Basics of MapReduce Development 73

Hadoop and Data Processing 73

Reviewing the Airline Dataset 73

Preparing the Development Environment 75

Preparing the Hadoop System 75

MapReduce Programming Patterns 76

Map-Only Jobs (SELECT and WHERE Queries) 76

Problem Definition: SELECT Clause 76

Problem Definition: WHERE Clause 84

Map and Reduce Jobs (Aggregation Queries) 87

Problem Definition: GROUP BY and SUM Clauses 88

Improving Aggregation Performance Using the Combiner 94

Problem Definition: Optimized Aggregators 95

Role of the Partitioner 100

Problem Definition: Split Airline Data by Month 100

Bringing it All Together 103

Summary 106

Chapter 6: Advanced MapReduce Development 107

MapReduce Programming Patterns 107

Introduction to Hadoop I/O 107

Problem Definition: Sorting 109

Problem Definition: Analyzing Consecutive Records 124

Problem Definition: Join Using MapReduce 134

Problem Definition: Join Using Map-Only jobs 140

Writing to Multiple Output Files in a Single MR Job 145

Collecting Statistics Using Counters 147

Summary 150

Chapter 7: Hadoop Input/Output 151

Compression Schemes 151

What Can Be Compressed? 152

Compression Schemes 152

Enabling Compression 153

Inside the Hadoop I/O processes 154

InputFormat 155

OutputFormat 156

Custom OutputFormat: Conversion from Text to XML 157

Custom InputFormat: Consuming a Custom XML file 161

Hadoop Files 170

SequenceFile 171

MapFiles 176

Avro Files 177

Summary 183

Chapter 8: Testing Hadoop Programs 185

Revisiting the Word Counter 185

Introducing MRUnit 187

Installing MRUnit 187

MRUnit Core Classes 187

Writing an MRUnit Test Case 188

Testing Counters 190

Features of MRUnit 193

Limitations of MRUnit 194

Testing with LocalJobRunner 194

Limitations of LocalJobRunner 197

Testing with MiniMRCluster 197

Setting up the Development Environment 197

Example for MiniMRCluster 199

Limitations of MiniMRCluster 201

Testing MR Jobs with Access Network Resources 201

Summary 202

Chapter 9: Monitoring Hadoop 203

Writing Log Messages in Hadoop MapReduce Jobs 203

Viewing Log Messages in Hadoop MapReduce Jobs 206

User Log Management in Hadoop 2.x 209

Log Storage in Hadoop 2.x 209

Log Management Improvements 211

Viewing Logs Using Web-Based UI 211

Command-Line Interface 211

Log Retention 212

Hadoop Cluster Performance Monitoring 212

Using YARN REST APIs 213

Managing the Hadoop Cluster Using Vendor Tools 213

Ambari Architecture 214

Summary 215

Chapter 10: Data Warehousing Using Hadoop 217

Apache Hive 217

Installing Hive 218

Hive Architecture 218

Metastore 219

Compiler Basics 219

Hive Concepts 219

HiveQL Compiler Details 223

Data Definition Language 227

Data Manipulation Language 228

External Interfaces 229

Hive Scripts 231

Performance 232

MapReduce Integration 232

Creating Partitions 233

User-Defined Functions 234

Impala 236

Impala Architecture 237

Impala Features 237

Impala Limitations 237

Shark 238

Shark/Spark Architecture 238

Summary 239

Chapter 11: Data Processing Using Pig 241

An Introduction to Pig 241

Running Pig 243

Executing in the Grunt Shell 244

Executing a Pig Script 244

Embedded Java Program 245

Pig Latin 246

Comments in a Pig Script 246

Execution of Pig Statements 247

Pig Commands 247

User-Defined Functions 252

Eval Functions Invoked in the Mapper 253

Eval Functions Invoked in the Reducer 253

Writing and Using a Custom FilterFunc 260

Comparison of Pig versus Hive 262

Crunch API 263

How Crunch Differs from Pig 263

Sample Crunch Pipeline 264

Summary 269

Chapter 12: HCatalog and Hadoop in the Enterprise 271

HCatalog and Enterprise Data Warehouse Users 271

HCatalog: A Brief Technical Background 272

HCatalog Command-Line Interface 274

WebHCat 274

HCatalog Interface for MapReduce 275

HCatalog Interface for Pig 278

HCatalog Notification Interface 279

Security and Authorization in HCatalog 279

Bringing It All Together 280

Summary 281

Chapter 13: Log Analysis Using Hadoop 283

Log File Analysis Applications 283

Web Analytics 283

Security Compliance and Forensics 284

Monitoring and Alerts 284

Internet of Things 285

Analysis Steps 286

Load 286

Refine 286

Visualize 287

Apache Flume 287

Core Concepts 288

Netflix Suro 290

Cloud Solutions 291

Summary 291

Chapter 14: Building Real-Time Systems Using HBase 293

What Is HBase? 293

Typical HBase Use-Case Scenarios 294

HBase Data Model 295

HBase Logical or Client-Side View 295

Differences Between HBase and RDBMSs 296

HBase Tables 297

HBase Cells 297

HBase Column Family 297

HBase Commands and APIs 298

Getting a Command List: help Command 299

Creating a Table: create Command 300

Adding Rows to a Table: put Command 300

Retrieving Rows from the Table: get Command 300

Reading Multiple Rows: scan Command 300

Counting the Rows in the Table: count Command 301

Deleting Rows: delete Command 301

Truncating a Table: truncate Command 301

Dropping a Table: drop Command 302

Altering a Table: alter Command 302

HBase Architecture 302

HBase Components 303

Compaction and Splits in HBase 309

Compaction 310

HBase Configuration: An Overview 311

hbase-default.xml and hbase-site.xml 311

HBase Application Design 312

Tall vs. Wide vs. Narrow Table Design 312

Row Key Design 313

HBase Operations Using Java API 314

HBase Treats Everything as Bytes 314

Create an HBase Table 315

Administrative Functions Using HBaseAdmin 315

Accessing Data Using the Java API 316

HBase MapReduce Integration 320

A MapReduce Job to Read an HBase Table 320

HBase and MapReduce Clusters 323

Scenario I: Frequent MapReduce Jobs Against HBase Tables 323

Scenario II: HBase and MapReduce have Independent SLAs 323

Summary 323

Chapter 15: Data Science with Hadoop 325

Hadoop Data Science Methods 325

Apache Hama 326

Bulk Synchronous Parallel Model 326

Hama Hello World! 327

Monte Carlo Methods 329

K-Means Clustering 333

Apache Spark 336

Resilient Distributed Datasets (RDDs) 336

Monte Carlo with Spark 337

KMeans with Spark 339

RHadoop 341

Summary 342

Chapter 16: Hadoop in the Cloud 343

Economics 343

Self-Hosted Cluster 343

Cloud-Hosted Cluster 344

Elasticity 344

On Demand 344

Bid Pricing 345

Hybrid Cloud 345

Logistics 345

Ingress/Egress 345

Data Retention 345

Security 346

Cloud Usage Models 346

Cloud Providers 347

Amazon Web Services 347

Google Cloud Platform 349

Microsoft Azure 350

Choosing a Cloud Vendor 350

Case Study: Amazon Web Services 351

Elastic MapReduce 351

Elastic Compute Cloud 354

Summary 356

Chapter 17: Building a YARN Application 357

YARN: A General-Purpose Distributed System 357

YARN: A Quick Review 359

Creating a YARN Application 361

POM Configuration 362

DownloadService.java Class 362

Client.java 365

Steps to Launch the Application Master from the Client 365

ApplicationMaster.java 373

Communication Protocol between Application Master and Resource Manager: Application Master Protocol 373

Node Manager Communication Protocol: Container Management Protocol 373

Steps to Launch the Worker Tasks 373

Executing the Application Master 378

Launch the Application in Un-Managed Mode 379

Launch the Application in Managed Mode 379

Summary 379

Appendix A: Installing Hadoop 381

Installing Hadoop 2.2.0 on Windows 381

Preparing the Installation Environment 381

Building Hadoop 2.2.0 for Windows 383

Installing Hadoop 2.2.0 for Windows 383

Configuring Hadoop 2.2.0 383

Preparing the Hadoop Cluster 386

Starting HDFS 387

Starting MapReduce (YARN) 387

Verifying that the Cluster Is Running 387

Testing the Cluster 387

Installing Hadoop 2.2.0 on Linux 388

Appendix B: Using Maven with Eclipse 391

A Quick Introduction to Maven 391

Creating a Maven Project 391

Using Maven with Eclipse 393

Installing the m2e Maven Eclipse Plug-in 393

Creating a Maven Project from Eclipse 393

Building a Maven Project from Eclipse 396

Appendix C: Apache Ambari 399

Hadoop Components Supported by Apache Ambari 399

Installing Apache Ambari 401

Trying the Ambari Sandbox on Your OS 401

Index 403
