
Page 1: Optimizing Hadoop Map Reduce job performance

Optimizing Hadoop Map Reduce job performance

Constantin Ciureanu, Software Architect ([email protected])
Project “Hadoop Data Transformation”

Page 2: Optimizing Hadoop Map Reduce job performance

Generic Goals - 1

• Correctly split the input data to maximize mapper performance – the split size directly determines the number of mappers, so choose it so that each map task runs for a reasonable amount of time (see the sizing sketch after this list)

• Correctly calculate the number of reducers – too few limits parallelism, too many produces small output files and scheduling overhead

• Skewed data distribution is a huge problem: it concentrates the load on a few mappers or reducers while the rest sit idle – think of a complex Tetris brick (L-shaped or cross) that cannot be packed evenly

• Try to spill to disk only once per map task (a single read / process / write cycle) – every extra spill costs additional disk I/O and merge passes
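To make the split-size, reducer-count and spill ideas above concrete, here is a minimal Java sketch against the pre-YARN MapReduce API. The 256 MB split cap, the one-reducer-per-GB heuristic and the 256 MB sort buffer are illustrative assumptions, not values from this deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class JobSizing {
    public static Job configure(Configuration conf, Path input) throws Exception {
        Job job = new Job(conf, "sized-job");

        // A larger max split size means fewer, longer-running mappers;
        // a smaller one adds map-side parallelism. 256 MB is an assumed start.
        FileInputFormat.addInputPath(job, input);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        // Derive the reducer count from the total input size
        // (one reducer per ~1 GB is an illustrative heuristic, not a rule).
        long totalBytes = FileSystem.get(conf).getContentSummary(input).getLength();
        int reducers = Math.max(1, (int) (totalBytes / (1024L * 1024 * 1024)));
        job.setNumReduceTasks(reducers);

        // Give the map-side sort buffer enough room so records spill to disk
        // only once (io.sort.mb is the pre-YARN name of this buffer).
        job.getConfiguration().setInt("io.sort.mb", 256);
        return job;
    }
}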

Page 3: Optimizing Hadoop Map Reduce job performance

Generic Goals - 2

• As a rule of thumb, no task should take more than 5 minutes to complete

• Of course that is a hard goal to reach, but at the very least every task should report its status before the 10-minute task timeout (mapred.task.timeout, 600,000 ms by default) expires, so that it does not get killed and rerun, wasting up to 10 minutes of processing time

• Aim to have map tasks running 1-3 minutes

• < 1 minute - wasted resources to start tasks

• > 3 minutes - not enough parallelism, hard to share fleet resources

• Bottleneck – by default no reducer can apply the reduce function until all mappers have finished, although shuffling can begin earlier (fix = use reducer slow start so reducers start copying after a percentage of mappers has finished – see the sketch below)
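A small sketch of both mitigations from this slide, using the Hadoop 1.x property names the deck assumes; the 0.5 slow-start fraction is an illustrative choice:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProgressAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    public static void tune(Configuration conf) {
        // Let reducers start shuffling once 50% of mappers are done instead
        // of waiting for all of them (the Hadoop 1.x default is 0.05).
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);
        // Tasks silent longer than this are killed and rerun;
        // 600,000 ms (10 minutes) is the default.
        conf.setLong("mapred.task.timeout", 600000L);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... expensive per-record work ...
        // Tell the framework we are alive so one slow record does not
        // push this task past mapred.task.timeout.
        context.progress();
    }
}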

Page 4: Optimizing Hadoop Map Reduce job performance

Performance tuning

• At the very least, carefully tune and optimize the code executed for every row (e.g. in mappers / reducers), because that specific code potentially runs millions of times

• Adding more machines instantly increases the Hadoop fleet's power; adding more developers / optimizing code for performance helps too, but usually a bit less :)

• Give each task enough memory (e.g. mapred.reduce.child.java.opts = -Xmx500m; the default of 200m is not enough for large inputs) – this parameter can even be calculated on the fly from the input data size, as sketched below
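A hedged sketch of computing that heap size on the fly at job submission; the bytes-to-heap ratio and the 2 GB cap are assumed heuristics, not Hadoop defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TaskHeapSizing {
    public static void setReducerHeap(Configuration conf, Path input) throws Exception {
        long inputBytes = FileSystem.get(conf).getContentSummary(input).getLength();
        // Scale the heap with the input size, clamped between the
        // 200m default and an assumed 2 GB ceiling.
        long heapMb = Math.min(2048, Math.max(200, inputBytes / (5L * 1024 * 1024)));
        conf.set("mapred.reduce.child.java.opts", "-Xmx" + heapMb + "m");
    }
}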

Page 5: Optimizing Hadoop Map Reduce job performance

Data related optimizations - 1

• Reduce the quantity of data read from and written to disk!

• For example, shorter rows help: do not repeat long strings; use numeric constants / compact types instead!

• Use compression for intermediate map output (mapred.compress.map.output = true – see the sketch at the end of this list)

• Advantage: reduced data to write on disk / transfer over network

• Disadvantage: Higher CPU load

• Use compression for output files (GZip / Deflate etc.)

• Adding more machines always helps! Each new machine brings new processing power but also new local storage!
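A sketch wiring up both compression settings from this list (pre-YARN property names); choosing Snappy for the intermediate data is an assumption – the slide only requires that map-output compression be enabled:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
    public static void enable(Job job) {
        Configuration conf = job.getConfiguration();
        // Compress the intermediate map output that is spilled to disk and
        // shuffled over the network (the CPU vs. I/O trade-off above).
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                SnappyCodec.class, CompressionCodec.class);
        // Compress the final job output with Gzip, as the slide suggests.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}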

Page 6: Optimizing Hadoop Map Reduce job performance

Data related optimizations - 2

• Definitely get rid of NAS storage and use local dedicated hard drives!

• Use combiners (between mappers and reducers) to minimize the data flowing over the network – the Cascading framework has no explicit combiner, but at least part of an Aggregator block may still execute distributed in the map phase (see the sketch after this list)

• Whenever possible, use in-memory data structures for faster joins or lookups
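A minimal combiner sketch on the classic word-count shape: because summing is commutative and associative, the same reducer class can double as the combiner and shrink the data shuffled over the network:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }

    public static void wire(Job job) {
        // The combiner pre-aggregates map output locally, before the shuffle;
        // it is safe here because partial sums can be summed again.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
    }
}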

Page 7: Optimizing Hadoop Map Reduce job performance

Monitoring resources

• Idle resources are the biggest problem!

• Use monitoring software; even simple utilities help a lot:

– Cloudera Manager

– Ganglia

– top

– iostat -dmx 5

– …

• Monitor all resources (CPU, memory, disk I/O, network bandwidth and latency)

• Follow the performance tuning cycle: run, analyze / benchmark, identify the bottleneck (usually a different one than in the previous iteration :) ), tune to improve resource usage (e.g. insufficient CPU utilization), then run again. Also test for scalability

• Use some sort of profiler tool (easiest case: add -Xprof to mapred.child.java.opts, then check the tasks' stdout logs – see the sketch below)
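A sketch of that profiling setup, plus Hadoop's built-in per-task profiling switch (mapred.task.profile and friends are the pre-YARN property names); profiling only the first couple of task attempts is an illustrative choice:

import org.apache.hadoop.conf.Configuration;

public class ProfilingConfig {
    public static void enable(Configuration conf) {
        // Cheapest option: let the JVM print per-thread profiles to stdout,
        // then read them in the task's stdout logs via the JobTracker UI.
        conf.set("mapred.child.java.opts", "-Xmx500m -Xprof");

        // Alternative: Hadoop's own profiling hook, which profiles only a
        // small range of task attempts instead of the whole job.
        conf.setBoolean("mapred.task.profile", true);
        conf.set("mapred.task.profile.maps", "0-1");
        conf.set("mapred.task.profile.reduces", "0-1");
    }
}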

Page 8: Optimizing Hadoop Map Reduce job performance

Conclusions & final words

• With appropriate tuning, some MR jobs can finish up to 2 times faster!

• Optimize for future data too – plan the sizing of all resources (storage, CPU, memory, network bandwidth), because Big Data grows continuously, both in accumulated volume and in daily incoming data

• Move gradually toward YARN (Hadoop 2.0)