hadoop for sys_admin
A presentation on Hadoop for system administrators, given at Ohio LinuxFest.
Justin Miller, Senior Systems Engineer/DevOps at iHealth Technologies
Weston Bassler, Systems Engineer at Verizon Wireless
Hadoop for System Administrators – Ohio Linux Fest 2014
What we will be covering:
Intro
  Why Hadoop?
  How Hadoop Works
Architecture
  Planning Hardware/Storage/Network
  Processing and Storage
  HDFS Components
  YARN Components
Operations
  Job scheduling
  Job alerts
Monitoring
  Core Services
  Job scheduler and SLA
  Hardware
High Availability
  YARN
  HDFS
  Oozie
Security
  Security Issues
  Authentication
  Authorization
  Encryption
Backup and Recovery
  What to plan for?
  How to combat
Hadoop Vendors/Distros
  Cloudera
  HortonWorks
  MapR
Why Hadoop?
Why Hadoop? Cont...
Sort through TB, even PB worth of data in a matter of minutes
Easily sift through LOGS (patterns, data mining) → switch logs, application logs
Batch Processing
History → Inspired by two Google papers, on MapReduce and the Google File System (GFS)
Implemented By Yahoo!
Who's using it?
How Hadoop?
Processing
• MapReduce (MRv1)
  What is MapReduce?
  Nobody likes it
• YARN (MRv2)
  Yet Another Resource Negotiator
  Newer, better, more versatile
  2 new roles → ResourceManager and ApplicationMaster
  Spark → new hotness
• Bringing processing and storage together
  Data locality → avoid the network!
  “MO NODES MO BETTA”
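For anyone who has not seen MapReduce before, the model can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce steps, not Hadoop's API; the function names and the sample log lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical log lines, standing in for data stored in HDFS.
    print(word_count(["ERROR disk full", "INFO ok", "ERROR disk failed"]))
```

On a real cluster the map and reduce steps run in parallel on the nodes that hold the data, which is where the data locality point comes from.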
YARN in Action
Storage
• HDFS
  What is HDFS?
  Why HDFS?
• Components of HDFS
  NameNode
    Metadata → fsimage + edits (edit log)
    ZooKeeper → HA management
    Quorum-based journaling
    3 JournalNodes
    Active/Passive NameNode
  DataNodes – what do they do?
    Blocks in relation to NameNode metadata
    Block storage
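To make the relationship between NameNode metadata and DataNode blocks more concrete, here is a rough sketch of how a file turns into a block list. It assumes a 128 MiB block size and a replication factor of 3; the round-robin placement and the DataNode names are simplifications invented for the example, not the real HDFS placement policy.

```python
MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB   # assumed block size for this sketch
REPLICATION = 3          # replication factor from the slides


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_hosts) tuples.

    This mimics the kind of mapping the NameNode keeps as metadata:
    which blocks make up a file and which DataNodes hold each block.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = -(-file_size // block_size)  # integer ceiling division
    plan = []
    for i in range(n_blocks):
        length = min(block_size, file_size - i * block_size)
        # Simplified round-robin placement (NOT the real HDFS policy).
        hosts = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, length, hosts))
    return plan


if __name__ == "__main__":
    # Hypothetical 300 MiB file on a four-node cluster.
    for block in plan_blocks(300 * MIB, ["dn1", "dn2", "dn3", "dn4"]):
        print(block)
```

The NameNode only tracks this mapping; the block contents themselves live on the DataNodes.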
HDFS Write Path
Benefits and Limitations of HDFS
Benefits
  Low cost per byte → commodity storage
  High bandwidth, scales effectively → “Mo nodes, mo speed”
  Rock-solid data reliability
  Supports distributed computing I/O patterns
  OPEN SOURCE!!!!!
Benefits and Limitations of HDFS (Continued...)
Limitations
  Updates → data is immutable (can't be updated, only appended)
  Write once
  Optimized for sequential reads → not for real-time data processing
  Challenging import/export → requires additional tooling
Architecture
• Planning your Hardware/Storage
  Cheap disks
  Distributed disk approach → replication factor of 3 for HA
  NO LVM, NO RAID and NO swap
  Mount options: noatime, nodiratime
• Network considerations
  Rack awareness affects data distribution
  Prefer a faster network when available → 10 Gb Ethernet if possible
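The replication factor matters when sizing disks, because every block is stored three times. A back-of-the-envelope sketch follows; the `reserved_fraction` parameter is a hypothetical headroom allowance for temporary and intermediate data, not a Hadoop setting, and the cluster numbers are invented.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb,
                       replication=3, reserved_fraction=0.25):
    """Estimate usable HDFS capacity in TB for a planned cluster.

    raw capacity    = nodes * disks_per_node * disk_tb
    usable capacity = raw * (1 - reserved_fraction) / replication
    """
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserved_fraction) / replication


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes, 12 disks each, 4 TB per disk.
    print(usable_capacity_tb(10, 12, 4))
```

With these made-up numbers, 480 TB of raw disk leaves roughly 120 TB for data, which is why cheap disks are the usual choice.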
Hadoop Operations
• Jobs
  What is a job?
  Scheduling jobs with Oozie
  Alerts on jobs
  Oozie SLAs → start time, end time & duration
  File-driven job configuration
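The three SLA dimensions listed above (start time, end time and duration) can be illustrated with a small check. This is a plain-Python sketch of the idea only; it does not use Oozie's API, and the function name, labels and timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta


def sla_misses(expected_start, expected_end, max_duration,
               actual_start, actual_end):
    """Return which of the three SLA dimensions a finished job missed."""
    misses = []
    if actual_start > expected_start:
        misses.append("start")      # job started later than allowed
    if actual_end > expected_end:
        misses.append("end")        # job finished later than allowed
    if actual_end - actual_start > max_duration:
        misses.append("duration")   # job ran longer than allowed
    return misses


if __name__ == "__main__":
    nominal = datetime(2014, 10, 25, 2, 0)  # hypothetical nightly job
    print(sla_misses(
        expected_start=nominal + timedelta(minutes=5),
        expected_end=nominal + timedelta(minutes=60),
        max_duration=timedelta(minutes=30),
        actual_start=nominal + timedelta(minutes=3),
        actual_end=nominal + timedelta(minutes=40),
    ))
```

A job can start and end on time and still miss its duration SLA, which is why the three are tracked separately.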
Example of a Job:
Example of a coordinator:
Troubleshooting
• Application → Debug Code
• Job → Debug Execution
• Service → Debug Linux Process (/var/log/hadoop-*)
Services won't start → port conflicts (nmap, netstat, lsof)
# Neither an application nor a job problem? Search the service logs.
grep -r ERROR /var/log/hadoop-*
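When a service refuses to start, a quick way to confirm a port conflict is to see whether something is already listening. A minimal sketch using only the standard library is below; the port numbers in the usage example are common Hadoop 2 defaults (NameNode RPC, NameNode web UI, ResourceManager web UI), so treat them as assumptions and check your own configuration.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumed default ports; adjust to match your cluster's config files.
    for port in (8020, 50070, 8088):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

This only says that a port is taken; lsof or netstat is still needed to find out which process holds it.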
Monitoring
• Core Services
  HDFS
  YARN
  JMX → JVM monitoring
  Cloudera Manager
• Performance
  Ganglia (HortonWorks)
  Cloudera Manager
• Hardware → to each his own (traditional monitoring)
  SNMP
  Nagios
  Zenoss
  Cloudera Manager
High Availability
• HDFS
  ZooKeeper failover + quorum-based journaling (JournalNodes)
• YARN
  ZooKeeper
• Oozie HA
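The quorum-based journaling listed under HDFS works on a majority rule, which is why JournalNodes (and ZooKeeper servers) are usually deployed in odd numbers. The arithmetic is sketched below; the function names are made up for illustration.

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "nodes: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failure(s)")
```

Three nodes survive one failure and so do four, so the fourth node adds cost without adding resilience.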
Security (Because people are evil)
Security Continued....
• Known issues – stupid/lazy people
  Hadoop can be very secure
• Authentication – Kerberos
  Principal (user)
  Realm (group of principals)
  Keytab file
• Authorization
  LDAP
  Active Directory
  Role based
• Encryption – for your eyes only!
  Kerberos 1st
  SSL certificates
  **** SSL must be enabled for all core Hadoop services
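Kerberos principals for services generally follow the form primary/instance@REALM, with the instance part being optional for ordinary users. A small parser makes the pieces visible; this is a simplified sketch that ignores escaped characters, and the host and realm in the example are hypothetical.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str              # user or service name, e.g. "hdfs"
    instance: Optional[str]   # usually the host for service principals
    realm: str                # the Kerberos realm


def parse_principal(text):
    """Split 'primary[/instance]@REALM' into its three parts."""
    name, sep, realm = text.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"not a full principal name: {text!r}")
    primary, _, instance = name.partition("/")
    if not primary:
        raise ValueError(f"missing primary in: {text!r}")
    return Principal(primary, instance or None, realm)


if __name__ == "__main__":
    # Hypothetical service principal for a NameNode host.
    print(parse_principal("hdfs/nn1.example.com@EXAMPLE.COM"))
```

The keytab file holds the keys for principals like this one, so a daemon can authenticate without anyone typing a password.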
Backup and Recovery – When things go wrong (And they will)
What can go wrong? What to plan for?
  Data corruption
  Node crashes
  Disk crashes
How to combat it when things do go wrong
• Data corruption
  Checksum verification fails → NameNode replaces the bad replica with a fresh copy
  HDFS → hdfs fsck tool
• Node crashes/Disk crashes
  HDFS saves the day!
  NameNode HA
  First 2 replicas of data on different hosts
  Heartbeat detection
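Heartbeat detection boils down to remembering when each node last checked in and flagging the ones that have been silent for too long. A toy version is sketched below; the class name is invented and the 30-second timeout is an arbitrary illustration, not a Hadoop default.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per node and report nodes gone silent."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now` (in seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=30)  # arbitrary timeout
    monitor.heartbeat("dn1", now=0)
    monitor.heartbeat("dn2", now=0)
    monitor.heartbeat("dn1", now=25)          # dn2 stays silent
    print(monitor.dead_nodes(now=40))
```

Once a DataNode is declared dead, its blocks are under-replicated, and new copies get made from the surviving replicas.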
Hadoop Wars - Vendors and Distributions
• Cloudera
  Specializes in enterprise tools
  Auditing
  Access control
  Cluster management (Cloudera Manager)
• HortonWorks
  Specializes in engineering
  Also open source
  Top new cool things
• MapR
  Lead developers behind Mahout
Hopefully you enjoyed!
If interested:
Quick ways to get started learning Hadoop
• Free stuff – who doesn't like free?
  Big Data University – Hadoop fundamentals, Pig, Oozie, lots more
  Udacity – Intro to Hadoop and MapReduce
  MapR, Cloudera, HortonWorks – training videos