Big Data and Hadoop in Cloud - Leveraging Amazon EMR

1. Big Data and Hadoop in Cloud - Vijay Rayapati, @amnigos

2. Follow Barcamp Rules!

3. What is Big Data?
   Datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing - Wikipedia
   High volume of data (storage) + speed of data (scale) + variety of data (diff types) - Gartner

4. World is ON = Content + Interactions = More Data (Social and Mobile)

5. Tons of data is generated by each one of us! (We moved from GB to ZB and from Millions to Zillions)

6. Big Data - Intelligence

7. Big Data - Usefulness

8. Big Data - There is so much more you can do!

9. Everybody has this problem. Not just Amazon, Google, Facebook and Twitter!

10. How can we work with Big Data?

11. Why Cloud and Big Data?
   Cloud has democratized access to large scale infrastructure for masses!
   You can store, process and manage big data sets without worrying about IT!
   ** http://wiki.apache.org/hadoop/PoweredBy

12. Hadoop - The data elephant

13. Hadoop makes it easier to store, process and analyze a lot of data on commodity hardware!

14. Who uses Hadoop and How?
   Everybody (from A to Z) to solve complex problems
   ** http://wiki.apache.org/hadoop/PoweredBy

15. Big Data and Hadoop - It's Fun

16. [Diagram: Hadoop architecture. Master Node: Job Tracker and Name Node. Map Reduce (processing) layer: Task Trackers. HDFS layer (storage): Data Nodes.]

17. Map Reduce Paradigm

18. Map Reduce - Explained

19.
Hadoop - Getting Started
   - Download the latest stable version - http://hadoop.apache.org/common/releases.html
   - Install Java ( > 1.6.0_20 ) and set your JAVA_HOME
   - Install rsync and ssh
   - Follow instructions - http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
   - Hadoop Modes - Local, Pseudo-distributed and Fully distributed
   - Run in pseudo-distributed mode for your testing and development
   - Assign a decent JVM heap size through mapred.child.java.opts if you notice task errors, GC overhead or OOM
   - Play with samples - WordCount, TeraSort etc.
   - Good for learning - http://www.cloudera.com/hadoop-training-virtual-machine

20. Why Amazon EMR?
   I am interested in using Hadoop to solve problems and not in building and managing Hadoop infrastructure!

21. Amazon EMR Setup
   - Install Ruby 1.8.X and use the EMR Ruby CLI for managing EMR.
   - Just create a credentials.json file in your EMR Ruby CLI installation directory and provide your access key & private key.
   - Bootstrapping is a great way to install required components or perform custom actions in your EMR cluster.
   - A default bootstrap action is available to control the configuration of Hadoop and MapReduce.
   - Bootstrapping with Ganglia during your development and tuning phase provides monitoring metrics across your cluster.
   - There are minor bugs in the EMR Ruby CLI, but it is pretty cool for your needs.

22.
Amazon EMR Setup
   Launching a 500 node and fully configured cluster is as simple as firing one command:

    elastic-mapreduce --create --alive --plain-output \
      --master-instance-type m1.xlarge --slave-instance-type m2.2xlarge \
      --num-instances 500 --name "Site Analytics Cluster" \
      --bootstrap-action s3://com.bcb11.emr/scripts/bootstrap-custom.sh \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "--mapred-config-file,s3://com.bcb11.emr/conf/custom-mapred-site.xml"

    elastic-mapreduce -j ${jobflow} --stream --step-name "Profile Analyzer" \
      --jobconf mapred.task.timeout=0 \
      --mapper s3://com.bcb11.emr/code/mapper.rb \
      --reducer s3://com.bcb11.emr/bin/reducer.rb \
      --cache s3://com.bcb11.emr/cache/customdata.dat#data.txt \
      --input s3://com.bcb11.emr/input/ \
      --output s3://com.bcb11.emr/output

23. Amazon EMR - Service Architecture

24. EMR CLI - What you need to know?

    elastic-mapreduce -j ${jobflow} --describe
    elastic-mapreduce --list --active
    elastic-mapreduce -j ${jobflow} --terminate
    elastic-mapreduce --jobflow ${jobflow} --ssh

   Look into your logs directory in S3 if you need any other information on cluster setup, Hadoop logs, job step logs, task attempt logs etc.

25. EMR Map Reduce Jobs
   - Amazon EMR supports streaming, custom jar, cascading, pig and hive, so you can write jobs in a way you want without worrying about managing the underlying infrastructure, including Hadoop.
   - Streaming - Write Map Reduce jobs in any scripting language.
   - Custom Jar - Write using Java; good for speed/control.
   - Cascading, Hive and Pig - Higher level of abstraction.
   - Use a good S3 explorer, FoxyProxy and ElasticFox.
   - Leverage the AWS EMR forum if you need help.

26. EMR Debugging and Performance Tuning

27. Hadoop Debugging and Profiling
   - Run Hadoop in local mode for debugging so mapper and reducer tasks run in a single JVM instead of separate JVMs.
   - Configure HADOOP_OPTS to enable debugging.
    export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"

   - Configure the fs.default.name value in core-site.xml to file:/// from hdfs://
   - Configure the mapred.job.tracker value in mapred-site.xml to local
   - Create a debug configuration for Eclipse and set the port to 8008.
   - Run your Hadoop job and launch Eclipse with your Java code so you can start debugging.
   - Use your favorite profiler to understand code level hotspots.

28. EMR - Good, Bad and Ugly
   - Great for bootstrapping large clusters and very cost-effective if you need once-in-a-while infrastructure to run your Hadoop jobs.
   - Don't need to worry about underlying Hadoop cluster setup and management. Most patches are applied and Amazon creates new AMIs with improvements.
   - Doesn't have a fallback (secondary name node): only one master node.
   - Intermittent network issues: sometimes these can cause serious degradation of performance.
   - Network IO is variable and streaming jobs will be much more sluggish on EMR compared to a dedicated setup.
   - Disk IO is terrible across instance families and types. Please fix it.

29. Hadoop - High Level Tuning
   - Small files problem: avoid too many small files and tune your block size.
   - Tune your settings: JVM Reuse, Sort Buffer, Sort Factor, Map/Reduce Tasks, Parallel Copies, MapRed Output Compression etc.
   - Good thing is that you can use a small cluster and sample input size for tuning.
   - Know what is limiting you at a node level: CPU, Memory, Disk IO or Network IN/OUT.

30. Hadoop - What affects your job's performance?
   - GC overhead: give tasks more memory and reduce the JVM reuse tasks.
   - Increase the dfs block size (default 128MB in EMR) for large files.
   - Avoid read contention at S3: have at least as many files in S3 as available mappers.
   - Use mapred output compression to save storage, processing time and bandwidth costs.
   - Set the mapred task timeout to 0 if you have long running jobs (> 10 mins); you can also disable speculative execution.
   - Increase the sort buffer and sort factor based on map tasks output.

31. Understand EMR Cluster Metrics

32.
Understand EMR Cluster Metrics

33. Common Bottlenecks - Monitor Matters

34. Hadoop and EMR - What I have learned?
   - Code is god: if you have severe performance issues then look at your code 100 times, understand the third party libraries used and rewrite in Java if required.
   - Streaming jobs are slow compared to Custom Jar jobs because of overhead; scripting is good for ad hoc analysis.
   - Disk IO and Network IO affect your processing time.
   - Be ready to face variable performance in Cloud.
   - Monitor everything once in a while and keep benchmarking with data points.
   - Default settings are seldom optimal in EMR unless you run simple jobs.
   - Focus on optimization as it's the only way to save Cost and Time.

35. Hadoop and EMR - Performance Tuning Example
   - Streaming: Map reduce jobs were written using Ruby. Input dataset was 150 GB and output was around 4000 GB. Complex processing, highly CPU bound and Disk IO.
   - Time taken to complete job processing: 180 minutes on 4000 m1.xlarge nodes.
   - Rewrote the code in Java: job processing time was reduced to 70 minutes on just 400 m1.xlarge nodes.
   - Tuning EMR configuration has further reduced it to 32 minutes.
   - Focus on code first and then focus on configuration.

36. Q&A

37. Like what we do? Connect with me: Kuliza.com | vijay.rayapati@kuliza.com | @kuliza | @amnigos
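
Supplementary sketch for slide 25 (Streaming): the slides say streaming jobs can be written in any scripting language, and the talk's own jobs were in Ruby. The following is only a minimal word count in Python, assuming the usual streaming convention of one tab-separated key/value record per line. The names mapper, reducer and run_local are hypothetical, and run_local merely imitates the framework's sort between the two phases so the logic can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per key. Input must already be sorted by key,
    which is what the framework provides to a streaming reducer."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Local stand-in for the cluster (an assumption for illustration):
    map, sort by key to imitate the shuffle, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

Calling run_local(["Hadoop on EMR", "hadoop streaming"]) returns the records emr 1, hadoop 2, on 1 and streaming 1, each tab-separated. On a real cluster the two functions would live in separate scripts that read stdin and write stdout, passed through --mapper and --reducer as on slide 22.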
