Hadoop for System Administrators

Justin Miller, Senior Systems Engineer/DevOps at iHealth Technologies
Weston Bassler, Systems Engineer at Verizon Wireless
Hadoop for System Administrators – Ohio Linux Fest 2014

Upload: weston-bassler

Post on 02-Jul-2015


DESCRIPTION

Hadoop for System Administrators by: Justin Miller & Weston Bassler

TRANSCRIPT

Page 1: Hadoop for System Administrators

Justin Miller, Senior Systems Engineer/DevOps at iHealth Technologies

Weston Bassler, Systems Engineer at Verizon Wireless

Hadoop for System Administrators – Ohio Linux Fest 2014

Page 2: Hadoop for System Administrators

What we will be covering:

Intro
    Why Hadoop?
    How Hadoop Works

Architecture
    Planning Hardware/Storage/Network
    Processing and Storage
    HDFS Components
    YARN Components

Operations
    Job scheduling
    Job alerts

Monitoring
    Core Services
    Job scheduler and SLA
    Hardware

High Availability
    YARN
    HDFS
    Oozie

Security
    Security Issues
    Authentication
    Authorization
    Encryption

Backup and Recovery
    What to plan for?
    How to combat

Hadoop Vendors/Distros
    Cloudera
    Hortonworks
    MapR

Page 3: Hadoop for System Administrators

Why Hadoop?


Page 4: Hadoop for System Administrators

Why Hadoop? Cont...

Sort through TB, even PB worth of data in a matter of minutes

Easily sift through LOGS (patterns, data mining) → switch logs, application logs

Batch Processing

History → Inspired by two Google papers, on MapReduce and the Google File System (GFS)

Implemented by Yahoo!


Page 5: Hadoop for System Administrators

Who's using it?


Page 6: Hadoop for System Administrators

How Hadoop?

Processing

• MapReduce (MRv1)
    What is MapReduce?
    Nobody likes it

• YARN (MRv2)
    Yet Another Resource Negotiator
    Newer, better, more versatile
    2 new roles → ResourceManager and ApplicationMaster
    Spark → New Hotness

• Bringing Processing and Storage together
    Data locality → avoid network!
    “MO NODES MO BETTA”
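
To make "What is MapReduce?" concrete, here is a minimal single-process sketch of the map → shuffle → reduce flow, counting words in a few log lines. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are made up for illustration. On a real cluster the map tasks would run on the nodes that already hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical log lines, only here to show the call.
    print(word_count(["ERROR disk full", "WARN disk slow", "ERROR link down"]))
```

Each phase only sees its own slice of the work, which is why the same job can be spread over many nodes.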


Page 7: Hadoop for System Administrators

YARN in Action


Page 8: Hadoop for System Administrators

Storage

• HDFS
    What is HDFS?
    Why HDFS?

• Components of HDFS
    NameNode
        Metadata → fsimage + edits (edit log)
        ZooKeeper → HA management
        Quorum-based journaling
        3 JournalNodes
        Active/Passive NameNode
    DataNodes – what do they do?
        Blocks in relation to NameNode metadata
        Block storage
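
A rough sketch of how a file maps onto blocks may help with the NameNode/DataNode split. The NameNode tracks which blocks make up a file, and the DataNodes store the block bytes. The 128 MB block size below is the Hadoop 2 default for `dfs.blocksize` and the replication factor of 3 is the usual default, but both are assumptions about your cluster. The function names are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes of the blocks a file of this size would occupy."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block only takes up as much space as it needs.
        sizes.append(remainder)
    return sizes


def raw_bytes_stored(file_size_bytes, replication=3):
    """Bytes written across all DataNodes once every block is replicated."""
    return file_size_bytes * replication
```

A 300 MB file, for example, becomes two full 128 MB blocks plus one 44 MB block, and takes 900 MB of raw disk at replication 3.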


Page 9: Hadoop for System Administrators

HDFS Write Path


Page 10: Hadoop for System Administrators

Benefits and Limitations of HDFS

Benefits
    Low cost per byte → commodity storage
    High bandwidth, scales effectively → “Mo nodes Mo speed”
    Rock solid data reliability
    Supports distributed computing I/O patterns
    OPEN SOURCE!!!!!


Page 11: Hadoop for System Administrators

Benefits and Limitations of HDFS (Continued...)

Limitations
    Updates → data is immutable (can't be updated, only appended)
    Write once
    Optimized for sequential reads → not for real-time data processing
    Challenging import/export → requires additional tooling


Page 12: Hadoop for System Administrators

Architecture

• Planning your Hardware/Storage
    Cheap disks
    Distributed disk approach → replication factor of 3 for HA
    NO LVM, NO RAID, and NO swap
    Mount options: noatime, nodiratime

• Network considerations
    Rack awareness affects data distribution
    Prefer a faster network when available → 10 Gb Ethernet if possible
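
For hardware planning, a back-of-the-envelope capacity estimate can be sketched as below. The replication factor of 3 comes from the slide. The 25% reserve for temporary job data and general headroom is an assumption chosen for illustration, not a Hadoop setting, and `usable_capacity_tb` is a hypothetical helper.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb,
                       replication=3, reserve_fraction=0.25):
    """Estimate how many TB of user data fit on the cluster.

    reserve_fraction is an assumed share of raw disk kept free for
    temporary job output and headroom; tune it to your own workload.
    """
    if replication < 1 or not 0 <= reserve_fraction < 1:
        raise ValueError("replication must be >= 1 and reserve in [0, 1)")
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserve_fraction) / replication
```

With 10 nodes of 12 × 4 TB disks, 480 TB of raw disk works out to roughly 120 TB of usable space under these assumptions.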


Page 13: Hadoop for System Administrators

Hadoop Operations

• Jobs
    What is a job?
    Scheduling jobs with Oozie
    Alerts on jobs
    Oozie SLAs → start time, end time & duration
    File-driven job configuration
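
The three SLA dimensions above (start time, end time, duration) can be illustrated with a small check like the one below. This is not the Oozie API or its configuration format. The `Sla` class, its field names, and the labels returned are all hypothetical and only show the logic behind an SLA alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sla:
    should_start_by: datetime   # latest acceptable start time
    should_end_by: datetime     # latest acceptable end time
    max_duration: timedelta     # longest acceptable run time


def missed_slas(sla, started, ended):
    """Return which of the three SLA checks a finished job run missed."""
    missed = []
    if started > sla.should_start_by:
        missed.append("start")
    if ended > sla.should_end_by:
        missed.append("end")
    if ended - started > sla.max_duration:
        missed.append("duration")
    return missed
```

An empty list means the run met all three checks. Anything else is what you would alert on.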


Page 14: Hadoop for System Administrators

Example of a Job:

Example of a coordinator:


Page 15: Hadoop for System Administrators

Troubleshooting

• Application → Debug Code


Page 16: Hadoop for System Administrators

• Job → Debug Execution


Page 17: Hadoop for System Administrators

• Service → Debug Linux Process (/var/log/hadoop-*)

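Services won't start → port conflicts (nmap, netstat, lsof)

# Neither an application nor a job problem? Search the service logs.
# ($problem is a placeholder variable, used here only for illustration.)
if [ "$problem" != "application" ] && [ "$problem" != "job" ]; then
    grep -r ERROR /var/log/hadoop-*
fi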
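
When a service won't start, a quick scripted check for a port conflict can complement nmap, netstat, and lsof. The sketch below only tests whether something already accepts TCP connections on a port. The `port_in_use` helper is hypothetical, and the port numbers listed are Hadoop 2 defaults as commonly shipped, so check your own configuration before trusting them.

```python
import socket

# Common Hadoop 2 default ports (assumed; your distribution may differ).
DEFAULT_PORTS = {
    "NameNode RPC": 8020,
    "NameNode web UI": 50070,
    "DataNode data transfer": 50010,
    "ResourceManager web UI": 8088,
}


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def conflicts(ports=DEFAULT_PORTS, host="127.0.0.1"):
    """List the named ports that are already taken on this host."""
    return [name for name, port in ports.items() if port_in_use(port, host)]
```

Run `conflicts()` before starting a daemon. Any name it returns points at a port that another process already holds.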


Page 18: Hadoop for System Administrators

Monitoring

• Core Services
    HDFS
    YARN
    JMX → JVM monitoring
    Cloudera Manager

• Performance
    Ganglia (Hortonworks)
    Cloudera Manager

• Hardware → to each his own (traditional monitoring)
    SNMP
    Nagios
    Zenoss
    Cloudera Manager
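
For the JMX item, Hadoop daemons also publish their JMX beans as JSON at the `/jmx` path of their web UI, which is easy to feed into whatever monitoring you already run. The sketch below only parses such a payload. The bean name and the `NumDeadDataNodes` attribute are as recalled from a Hadoop 2 NameNode and should be verified against your own cluster's output. The helper names are hypothetical.

```python
import json

# Bean name as seen on a Hadoop 2 NameNode (assumption: verify on your cluster).
FS_STATE_BEAN = "Hadoop:service=NameNode,name=FSNamesystemState"


def find_bean(jmx_json_text, bean_name):
    """Return the bean dict with the given name, or None if it is absent."""
    payload = json.loads(jmx_json_text)
    for bean in payload.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


def dead_datanodes(jmx_json_text):
    """Return the dead DataNode count reported by the NameNode, if present."""
    bean = find_bean(jmx_json_text, FS_STATE_BEAN)
    return None if bean is None else bean.get("NumDeadDataNodes")
```

Fetching the JSON is left out on purpose. Any HTTP client pointed at the NameNode web UI will do.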


Page 19: Hadoop for System Administrators

High Availability

• HDFS
    ZooKeeper → quorum-based journaling

• YARN
    ZooKeeper
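
"Quorum based" means a strict majority of the members has to agree, which is why these ensembles come in odd sizes such as 3. The arithmetic is sketched below with hypothetical helper names. It applies equally to a ZooKeeper ensemble and to a set of JournalNodes.

```python
def quorum_size(members):
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("need at least one member")
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority still remains."""
    return members - quorum_size(members)


def has_quorum(members, alive):
    """True if the surviving members can still make progress."""
    return alive >= quorum_size(members)
```

Three nodes tolerate one failure. A fourth node adds cost without adding tolerance, and five nodes tolerate two.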


Page 20: Hadoop for System Administrators

• Oozie HA


Page 21: Hadoop for System Administrators

Security (Because people are evil)


Page 22: Hadoop for System Administrators

Security Continued....

• Known issues – Stupid/Lazy People
    Hadoop can be very secure

• Authentication – Kerberos
    Principal (user)
    Realm (group of principals)
    Keytab file

• Authorization
    LDAP
    Active Directory
    Role based

• Encryption – For your eyes only!
    Kerberos 1st
    SSL Certificates
    **** SSL must be enabled for all core Hadoop services
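
To show how principal and realm fit together: a Kerberos principal is written `primary/instance@REALM`, where the instance part is optional and, for service principals, is usually the host name. The parser below is a simplified sketch that ignores escaped characters. The `Principal` type, the `parse_principal` helper, and the example names are hypothetical.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str             # user or service name, e.g. "hdfs"
    instance: Optional[str]  # usually the host for service principals
    realm: str               # the Kerberos realm, e.g. "EXAMPLE.COM"


def parse_principal(text):
    """Split 'primary[/instance]@REALM' into its parts (no escape handling)."""
    name, at_sign, realm = text.rpartition("@")
    if not at_sign or not name or not realm:
        raise ValueError(f"not a full principal: {text!r}")
    primary, _, instance = name.partition("/")
    if not primary:
        raise ValueError(f"missing primary in: {text!r}")
    return Principal(primary, instance or None, realm)
```

For example, `parse_principal("hdfs/nn1.example.com@EXAMPLE.COM")` yields the service `hdfs` on host `nn1.example.com` in realm `EXAMPLE.COM`.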


Page 23: Hadoop for System Administrators

Backup and Recovery – When things go wrong (And they will)

What can go wrong? What to plan for?
    Data corruption
    Node crashes
    Disk crashes

Ways to combat when things do go wrong

• Data corruption
    Checksums of metadata fail → NameNode replaces with a fresh copy
    HDFS → hdfs fsck tool

• Node crashes/Disk crashes
    HDFS saves the day!
    NameNode HA
    First 2 replicas of data on different hosts
    Heartbeat detection
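
Heartbeat detection boils down to "no heartbeat for too long means the node is treated as dead". The sketch below shows that rule with a hypothetical `dead_nodes` helper. The 630-second threshold matches the HDFS default of 10 minutes 30 seconds (twice the 5-minute recheck interval plus ten 3-second heartbeats), assuming nobody has changed those settings on your cluster.

```python
# 2 * 300 s recheck interval + 10 * 3 s heartbeat interval (HDFS defaults).
DEAD_AFTER_SECONDS = 2 * 300 + 10 * 3


def dead_nodes(last_heartbeat, now, dead_after=DEAD_AFTER_SECONDS):
    """Return the nodes whose last heartbeat is older than the threshold.

    last_heartbeat maps a node name to the time (in seconds) it was last seen.
    """
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > dead_after
    )
```

Once a DataNode lands on that list, the NameNode stops counting its replicas and schedules new copies elsewhere, which is the "HDFS saves the day" part.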


Page 24: Hadoop for System Administrators

Hadoop Wars - Vendors and Distributions

• Cloudera
    Specializes in Enterprise tools
    Auditing
    Access Control
    Cluster Management (Cloudera Manager)

• Hortonworks
    Specializes in Engineering
    Also Open Source
    Top new cool things

• MapR
    Lead developers behind Mahout


Page 25: Hadoop for System Administrators

Hopefully you enjoyed!

Slide Share Link: http://www.slideshare.net/mageru/hadoop-for-sysadmin

If interested:

Quick Ways to get started Learning Hadoop

• Free Stuff – Who doesn't like free?
    Big Data University – Hadoop fundamentals, Pig, Oozie, lots more
    Udacity – Intro to Hadoop and MapReduce
    MapR, Cloudera, Hortonworks – Training videos
