Compaction and Splitting in Apache Accumulo
Billie [email protected] 24, 2012
© Hortonworks Inc. 2012
What are compaction and splitting?
•Accumulo tables are divided into non-overlapping key ranges called tablets
•Compaction selects a set of sorted files for a single tablet and rewrites them into one file
•Splitting divides a tablet into two tablets
Tablet Overview
•When memory fills, new sorted files are created by flushing
•Sorted files are combined together into fewer sorted files
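As a rough illustration of the two steps above, the sketch below flushes an in-memory map to a sorted list (standing in for a sorted file) and merges several such lists into one. The names flush and compact, and the use of Python lists, are assumptions made for illustration only. Accumulo's real file format, timestamps, and versioning rules are not modeled; here the value from the newest file simply wins when a key appears more than once.

```python
import heapq


def flush(memtable):
    """Write the in-memory map out as one sorted run of (key, value) pairs."""
    return sorted(memtable.items())


def compact(sorted_files):
    """Merge sorted runs into one run; sorted_files is ordered oldest to newest."""
    # Tag each entry with the negated file age so that, for equal keys,
    # the entry from the newest file is yielded first by the merge.
    tagged = [
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(sorted_files)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first entry seen per key is the newest
    return merged


older = flush({"b": 1, "a": 2})
newer = flush({"b": 9, "c": 3})
print(compact([older, newer]))  # [('a', 2), ('b', 9), ('c', 3)]
```

Because every input is already sorted, the merge reads each file once and writes the output once, which is why the cost of a compaction is proportional to the total size of the files it rewrites.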
How much data are you writing?
•If you never compact – O(N)
•If you always compact – O(N²)
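One way to see where these two bounds come from, under two assumptions that are not stated on the slide: every flush writes one unit of data, and "always compact" means rewriting all data into a single file after each flush. With N flushes:

```latex
W_{\text{never}}(N) = \sum_{i=1}^{N} 1 = N = O(N)

W_{\text{always}}(N) = \underbrace{N}_{\text{flushes}}
  + \underbrace{\sum_{i=2}^{N} i}_{\text{compactions}}
  = N + \frac{N(N+1)}{2} - 1 = O(N^2)
```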
Accumulo Compaction Algorithm
•Compact a set of files when:
    compaction ratio × size of the largest file ≤ sum of the sizes of files
•The compaction ratio is set by table.compaction.major.ratio
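The condition above can be sketched as a small selection function. This is a minimal sketch, not Accumulo's code: the function name select_files is hypothetical, file sizes are plain numbers, and the step of dropping the largest file and retrying when the full set does not qualify is an assumption that the slide itself does not state.

```python
def select_files(sizes, ratio=3):
    """Return the file sizes to compact together, or [] if no set qualifies."""
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        # The slide's condition: ratio x largest file <= sum of all files in the set.
        if ratio * candidates[0] <= sum(candidates):
            return candidates
        # Assumed fallback: ignore the largest file and test the remaining ones.
        candidates = candidates[1:]
    return []


print(select_files([1, 1, 1]))     # [1, 1, 1]
print(select_files([3, 1, 1, 1]))  # [1, 1, 1]
print(select_files([3, 3]))        # []
```

A larger ratio makes the condition harder to satisfy, so files accumulate longer before being rewritten; a smaller ratio compacts more eagerly.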
In Action (r = 3)
•N = 1, W = 1
•N = 2, W = 2
•N = 3, W = 3
•N = 3, W = 6
•N = 4, W = 7
•N = 5, W = 8
•N = 6, W = 9
•N = 6, W = 12
•N = 7, W = 13
•N = 8, W = 14
•N = 9, W = 15
•N = 9, W = 24
•N = 27, W = 90*
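The (N, W) sequence above can be reproduced with a short simulation. This is a sketch under stated assumptions: each flush writes one unit-sized file, a compaction writes the sum of the files it merges, and when the full set of files does not satisfy the ratio condition the largest file is dropped from consideration and the rest are retried. The function names are hypothetical.

```python
def select_files(sizes, ratio):
    """Return the file sizes to compact together, or [] if no set qualifies."""
    candidates = sorted(sizes, reverse=True)
    while len(candidates) > 1:
        if ratio * candidates[0] <= sum(candidates):
            return candidates
        candidates = candidates[1:]  # assumed fallback: skip the largest file
    return []


def simulate(flushes, ratio=3):
    """Return (N, W) after every flush and every compaction."""
    files, written, history = [], 0, []
    for n in range(1, flushes + 1):
        files.append(1)  # flush one unit of data to a new file
        written += 1
        history.append((n, written))
        chosen = select_files(files, ratio)
        while chosen:
            for size in chosen:
                files.remove(size)
            merged = sum(chosen)
            files.append(merged)  # compaction rewrites the chosen files as one
            written += merged
            history.append((n, written))
            chosen = select_files(files, ratio)
    return history


print(simulate(9))
# [(1, 1), (2, 2), (3, 3), (3, 6), (4, 7), (5, 8), (6, 9), (6, 12),
#  (7, 13), (8, 14), (9, 15), (9, 24)]
print(simulate(27)[-1])  # (27, 90)
```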
Amount of data written
•W(r^k) = (k+1)r^k – (k-1)r^(k-1)
•Thus, W(N) ≈ O(N log N)
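The step from the closed form to the N log N bound can be spelled out by substituting N = r^k, so that k = log_r N. The check at the end uses the r = 3 case, where 27 units ingested gave 90 units written:

```latex
W(N) = (k+1)\,N - (k-1)\,\frac{N}{r}
     = N\left[\left(1 - \frac{1}{r}\right)\log_r N + 1 + \frac{1}{r}\right]
     = O(N \log N)

\text{Check } (r = 3,\ N = 27,\ k = 3):\quad
W(27) = 4 \cdot 27 - 2 \cdot 9 = 90
```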
HBase Compaction Algorithm
•Compact a set of files when:
    size of the largest file ≤ compaction ratio × sum of the sizes of smaller files
•The compaction ratio is set by hbase.hstore.compaction.ratio
•In terms of the Accumulo ratio:
    HBase ratio = 1 / (Accumulo ratio – 1)
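The relation between the two ratios follows from splitting the Accumulo sum into the largest file plus the smaller files: ratio × largest ≤ largest + smaller is the same as largest ≤ smaller / (ratio – 1). The sketch below checks this on a few file sets. It models only the simplified conditions shown on these slides, not the full selection logic of either system; the function names are hypothetical, and exact fractions are used to avoid rounding at the boundary.

```python
from fractions import Fraction


def accumulo_condition(sizes, accumulo_ratio):
    """ratio x largest file <= sum of all files."""
    return accumulo_ratio * max(sizes) <= sum(sizes)


def hbase_condition(sizes, hbase_ratio):
    """largest file <= ratio x sum of the smaller files."""
    largest = max(sizes)
    return largest <= hbase_ratio * (sum(sizes) - largest)


def hbase_ratio_from_accumulo(accumulo_ratio):
    """Equivalent HBase ratio; requires an Accumulo ratio greater than 1."""
    return 1 / Fraction(accumulo_ratio - 1)


for ratio in (2, 3, 4):
    for sizes in ([1, 1, 1], [3, 1, 1, 1], [3, 3, 1, 1, 1], [9, 3, 3], [5, 4]):
        assert accumulo_condition(sizes, ratio) == hbase_condition(
            sizes, hbase_ratio_from_accumulo(ratio)
        )

print(hbase_ratio_from_accumulo(3))  # 1/2
```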
© Hortonworks Inc. 2012
Other Compaction-related Properties
•Accumulo
–table.file.max
–tserver.compaction.major.thread.files.open.max
–tserver.compaction.major.delay
–table.compaction.major.everything.idle
•HBase
–hbase.hstore.compactionThreshold
–hbase.hstore.blockingStoreFiles
–hbase.hstore.blockingWaitTime
–hbase.hstore.compaction.min
–hbase.hstore.compaction.max
–hbase.hstore.compaction.min.size
–hbase.hstore.compaction.max.size
Accumulo Splitting
•Always check to see if a split is needed before compacting
•If it is needed, split first
•File names stored in metadata table
[Figure: split threshold]
Accumulo Splitting Process
•Tablet closed, no new writes
•Three writes to the metadata table
–tablet made smaller & marked as splitting
–new tablet added
–original tablet's splitting marks removed
•Tablet server swaps new tablets for old tablet in its online tablet list
•Master informed
Accumulo Splitting Recovery
•Whenever a tablet is brought online, the tablet server checks to see if it has split marks.
•If so, it assumes the splitting process was interrupted and finishes making changes to the metadata table.
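The three metadata writes and the recovery check can be illustrated with a toy model. Everything here is an assumption made for illustration: the metadata table is a plain dict keyed by a tablet's end row, the field names prev_row and splitting are hypothetical, the original tablet is assumed to keep the upper half of the range, and crash_after only simulates an interrupted split. It is not Accumulo's actual metadata schema or code.

```python
def split_tablet(metadata, end_row, split_row, crash_after=None):
    """Apply the three metadata writes; stop early to simulate a crash."""
    entry = metadata[end_row]
    old_prev = entry["prev_row"]

    # Write 1: shrink the original tablet to (split_row, end_row] and mark it.
    entry["prev_row"] = split_row
    entry["splitting"] = {"old_prev_row": old_prev}
    if crash_after == 1:
        return

    # Write 2: add the new tablet covering (old_prev, split_row].
    metadata[split_row] = {"prev_row": old_prev, "splitting": None}
    if crash_after == 2:
        return

    # Write 3: remove the split marks from the original tablet.
    entry["splitting"] = None


def finish_split_on_load(metadata, end_row):
    """Run when a tablet is brought online; returns True if a split was finished."""
    entry = metadata[end_row]
    marks = entry["splitting"]
    if marks is None:
        return False  # no split marks, nothing was interrupted
    split_row = entry["prev_row"]
    if split_row not in metadata:
        # The crash happened before write 2: add the missing new tablet.
        metadata[split_row] = {"prev_row": marks["old_prev_row"], "splitting": None}
    entry["splitting"] = None  # write 3
    return True


table = {"z": {"prev_row": None, "splitting": None}}
split_tablet(table, "z", "m", crash_after=1)
print(finish_split_on_load(table, "z"))  # True
print(sorted(table))                     # ['m', 'z']
```

Because the first write records enough information to redo the remaining two, recovery gives the same final state whether the interruption came after the first or the second write.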
Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth
• Reduce risks and cost of adoption
• Lower the total cost to administer and provision
• Integrate with your existing ecosystem
Hortonworks Training
The expert source for Apache Hadoop training & certification
Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team.
– The “right” course, with the most extensive and realistic hands-on materials
– Provide an immersive experience into real-world Hadoop scenarios
– Public and Private courses available
Comprehensive Apache Hadoop Certification
– Become a trusted and valuable Apache Hadoop expert
Next Steps?
1 Download Hortonworks Data Platform
hortonworks.com/download
2 Use the getting started guide
hortonworks.com/get-started
3 Learn more… get support
• Expert role based training
• Course for admins, developers and operators
• Certification program
• Custom onsite options
hortonworks.com/training
Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
hortonworks.com/support