hadoop tuning
TRANSCRIPT
Nokia Internal Use Only
Nokia Institute of Technology
Your natural partner to develop innovative solutions
Nokia Internal Use Only
AgendaAgendaAgendaAgenda
• MapReduce Summarization Patterns• MapReduce Coding Best Practices• Ctrending MR Performance Evaluation• Ctrending MR Execution Summary• Code Profiling• Profiling Results
• Code Tuning• Hbase Configuration Tuning• Tuning Results• Refactoring Proposal
Nokia Internal Use Only
MapReduce Summarization MapReduce Summarization PatternsPatternsMapReduce Summarization MapReduce Summarization PatternsPatterns
• Numerical Summarizations• General counting of data set records• Groups records by a custom key, calculating numerical values
per group• Known Uses• Word count, record count, min/max count,
avg/median/standard deviation
Nokia Internal Use Only
MapReduce Summarization MapReduce Summarization PatternsPatternsMapReduce Summarization MapReduce Summarization PatternsPatterns
• Inverted Index• Indexes large data set into keywords• Mapper emits keywords/ids values and the framework handles
most of the work• May use IdentityReducer• Should benefit from Partitioner for load balance
Nokia Internal Use Only
MapReduce Summarization MapReduce Summarization PatternsPatternsMapReduce Summarization MapReduce Summarization PatternsPatterns
• Counting with Counters• Leverages MapReduce framework’s counters.• Counters are all stored in-memory locally on each Mapper, then
aggregated by the framework.• Better performance, however may not exceed tens of counters
definition.• Known Uses
• Count number of records, count small number of groups, summations
Nokia Internal Use Only
MapReduce Coding Best MapReduce Coding Best PracticesPracticesMapReduce Coding Best MapReduce Coding Best PracticesPractices• Define Output Values
• Create custom Writable extending classes to be used as output from Mappers;
• Provides cleaner Mapper code and avoids String parsing on Reducer code side;
• Avoid Local Object Creation• Map and Reduce methods are invoked on very large loops;• Creating local objects inside map or reduce leads to huge
number of objects being attached to Eden space of Young Generation JVM’s Heap;
• Reuse Global instances to decrease Young GC Activity;• Use Combiners on Counting Summarizations
• Combiners reduce bandwidth consuption, as it applies aggregations locally to mappers node, before mapper output is sent to shuffle and sort phase, then made available for reducers
Nokia Internal Use Only
Ctrending MR Performance Ctrending MR Performance EvaluationEvaluationCtrending MR Performance Ctrending MR Performance EvaluationEvaluation
• Ctrending MR Execution Summary• Total MR Jobs Running: 8• Avg of processed tweets: 2.2 Million• Tweets identified as Music related: 10.5%• Total Execution Time: 2 hours and 20 minutes• Slowest MapReduces:• Tweets Counter: 46 minutes• Nokia Entity Id Join: 1 hour and 10 minutes
Nokia Internal Use Only
Ctrending MR Code ProfilingCtrending MR Code ProfilingCtrending MR Code ProfilingCtrending MR Code Profiling• Mainly applied to Nokia Id Join Mapper• Added usage of MapReduce framework’s Counters to collect
execution time metrics• Also used Counters to sum total of entities id being found in
Nokia Id Join mapper• Needed to create Static fields in search strategy implementations
to collect execution time metric
Nokia Internal Use Only
Ctrending MR Profiling ResultsCtrending MR Profiling ResultsCtrending MR Profiling ResultsCtrending MR Profiling ResultsTOTAL_ARTISTS_NMS_FOUND 77
TOTAL_ARTISTS_NOT_IN_CACHE 1,904
TOTAL_CANDIDATES_FORMATTING_TIME 67,873
TOTAL_HBASE_GET_TIME 262,647
TOTAL_NORMALIZATION_TIME 22,452
TOTAL_SEARCH_ARTIST_TIME 611,066
TOTAL_SEARCH_CALCULATION_TIME 5,605
TOTAL_SEARCH_NMS_TIME 3,740,552
TOTAL_SEARCH_TIME 4,098,270
TOTAL_SEARCH_TRACK_TIME 3,486,978
TOTAL_TRACKS_NMS_FOUND 145
TOTAL_TRACKS_NOT_IN_CACHE 4,635
Nokia Internal Use Only
Ctrending MR Code TuningCtrending MR Code TuningCtrending MR Code TuningCtrending MR Code Tuning• Tuning Tweets Count MapReduce• Applied IntSumReducer as combiner.• Ajusted Hbase Scan to fetch and copy records on blocks of
thousands, in order to optimize network usage between nodes.• Also set blockCache to false, as this table will always be read
sequentially at once.
Nokia Internal Use Only
Ctrending MR Code TuningCtrending MR Code TuningCtrending MR Code TuningCtrending MR Code Tuning• Tuning Entity Id Search MapReduce
• Removed unnecessary split/indexof calls• Removed redundant object creation from map method
Nokia Internal Use Only
Ctrending MR Code TuningCtrending MR Code TuningCtrending MR Code TuningCtrending MR Code Tuning• Tuning Entity Id Search MapReduce• Profiling results shows that NMS Search is the bottleneck• It costs more than 90% of all MapReduce execution time
• It also shows that NMS Search is not adding enough value• It founds only 4% of Artists Ids not in cache• It founds only 3% of Tracks Ids not in cache
• This drove the decision to remove NMS search by simply referencing CustomCache ISearchStrategy implementation on Mapper setup method
Nokia Internal Use Only
Hbase Configuration TuningHbase Configuration TuningHbase Configuration TuningHbase Configuration Tuning
• Artists and Tracks Cache is an inverted indexes structure stored on Hbase tables.
• These tables present high level of random access to it’s records (Get operations), while Entity Id Search MapReduce performs searches on the cache.
• This could have performance optimized if Cache table blocks were made available in RegionServer’s memory.
• Hbase provides Table level configuration property that increases blocks priority to be stored on RegionServer’s memory
Nokia Internal Use Only
Hbase Configuration TuningHbase Configuration TuningHbase Configuration TuningHbase Configuration Tuning
• Additional configuration is required on Hbase RegionServer, so that block cache is possible most part of the time.• hbase.regionserver.global.memstore.upperLimit -> defines
maximum % of Heap available for writing in memstores, before put operations are actually written to disk files.
• hbase.regionserver.global.memstore.lowerLimit -> defines minimum % of Heap available for writing in memstores. Flush operations will free memstore until this limit is reached.
• hfile.block.cache.size -> % of Heap to be used to store blocks in-memory
Nokia Internal Use Only
Hbase Configuration TuningHbase Configuration TuningHbase Configuration TuningHbase Configuration Tuning
• Most Ctrending Hbase put operations are done in batch jobs (Twitter Crawler).
• Music entities cache requires many Get operations, while EntityIdSearchMR is executing.
• Simply setting cache tables to be maintained in-memory does not work, if there is not enough memory available.
• More memory can be made available to cache tables blocks on RegionServers by decreasing % of Heap reserved to memstore and increasing it for block cache.
Nokia Internal Use Only
• TweetsCountMR• Total Execution Time Prior Tuning: 46 minutes (average)• Total Execution Time After Tuning: 20 minutes (average)
• EntityIdSearchMR• Total Execution Time Prior Tuning: 1 hour and 10 minutes
(average)• Total Execution Time Adter Tuning: 6 minutes (average)
• CONCLUSION: Do not ever perform HTTP Requests on MapReduces again!!!
Ctrending MR Tuning ResultsCtrending MR Tuning ResultsCtrending MR Tuning ResultsCtrending MR Tuning Results
Nokia Internal Use Only
• Write batch process to read generated rankings and perform requests to NMS for music entities which ID was not found.
• Better implement this as a Java multi-thread standalone process, instead of MapReduce• As input file is small (the filtered rank), Hadoop default
InputFormat implementations will not split it in many Map tasks.
• Unless a custom InputFormat be implemented, develop a MapReduce for this will probably take long time to execute, as it will end up with a single Map task to request NMS for all unknown Ids
• Optimize Heap usage on other MRs by avoiding Object creation on Map methods.
• Enhance code quality (and even performance), by defining OutputValues for Trending MRs
RefactoringRefactoringRefactoringRefactoring
Nokia Internal Use Only
• HBase, The Definitive Guide, Lars George, O'Reilly• MapReduce Design Patterns, Donald Miner, Adam Shook• Hadoop Official WebSite
•http://hadoop.apache.org/
ReferencesReferencesReferencesReferences