webinar: understanding storage for performance and data safety
DESCRIPTION
In this deep dive, we'll look under the hood at how the MongoDB storage engine works to give you greater insight into both performance and data safety. You'll learn about storage layout, indexes, memory mapping, journaling, and fragmentation. This is a session intended for those who already have a basic understanding of MongoDB.TRANSCRIPT
![Page 1: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/1.jpg)
Solutions Architect, 10gen
Antoine Girbal
#antoinegirbal
Understanding Storage for performance and data safety
![Page 2: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/2.jpg)
Why pop the hood?• Understanding data safety
• Estimating RAM / disk requirements
• Optimizing performance
![Page 3: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/3.jpg)
Storage Layout
![Page 4: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/4.jpg)
drwxr-xr-x 4 antoine wheel 136 Nov 19 10:12 journal-rw------- 1 antoine wheel 16777216 Oct 25 14:58 test.0-rw------- 1 antoine wheel 134217728 Mar 13 2012 test.1-rw------- 1 antoine wheel 268435456 Mar 13 2012 test.2-rw------- 1 antoine wheel 536870912 May 11 2012 test.3-rw------- 1 antoine wheel 1073741824 May 11 2012 test.4-rw------- 1 antoine wheel 2146435072 Nov 19 10:14 test.5-rw------- 1 antoine wheel 16777216 Nov 19 10:13 test.ns
Directory Layout
![Page 5: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/5.jpg)
Directory Layout
• Each database has one or more data files, all in same folder (e.g. test.0, test.1, …)
• Aggressive preallocation (always 1 spare file)
• Those files get larger and larger, up to 2GB
• There is one namespace file per db which can hold 24000 entries per default. A namespace is a collection or an index.
• The journal folder contains the journal files
![Page 6: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/6.jpg)
Tuning with options
• Use --directoryperdb to separate dbs into own folders which allows to use different volumes (isolation, performance)
• Use --nopreallocate to prevent preallocation
• Use --smallfiles to keep data files smaller
• If using many databases, use –nopreallocate and --smallfiles to reduce storage size
• If using thousands of collections & indexes, increase namespace capacity with --nssize
![Page 7: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/7.jpg)
Internal Structure
![Page 8: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/8.jpg)
Internal File Format
• Files on disk are broken into extents which contain the documents
• A collection has 1 to many extents
• Extent grow exponentially up to 2GB
• Namespace entries in the ns file point to the first extent for that collection
![Page 9: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/9.jpg)
test.0 test.1 test.2
Internal File Formattest.ns Namespaces
Extents
Data Files
![Page 10: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/10.jpg)
Extent Structure
Extentlength
xNext
xPrev
firstRecord
lastRecord
Extentlength
xNext
xPrev
firstRecord
lastRecord
![Page 11: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/11.jpg)
Extents and Records
Extentlength
xNext
xPrev
firstRecord
lastRecord
Data Recordlength
rNext
rPrev
Document
{ _id: “foo”, ... }
Data Recordlength
rNext
rPrev
Document
{ _id: “bar”, ... }
![Page 12: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/12.jpg)
What about indices?
![Page 13: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/13.jpg)
Indexes
• Indexes are BTree structures serialized to disk
• They are stored in the same files as data but using own extents
![Page 14: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/14.jpg)
Index Extents
Extentlength
xNext
xPrev
firstRecord
lastRecord
Index Record
Bucketparent
numKeys
K
length
rNext
rPrev K KK
Index Record
Bucketparent
numKeys
K
length
rNext
rPrev K KK
{ Document }
4 9
1 3 5 6 8 A B
![Page 15: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/15.jpg)
> db.stats(){
"db" : "test","collections" : 22,"objects" : 17000383, ## number of documents"avgObjSize" : 44.33690276272011,"dataSize" : 753744328, ## size of data"storageSize" : 1159569408, ## size of all
containing extents"numExtents" : 81,"indexes" : 85,"indexSize" : 624204896, ## separate index
storage size"fileSize" : 4176478208, ## size of data files on
disk"nsSizeMB" : 16,"ok" : 1
}
the db stats
![Page 16: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/16.jpg)
> db.large.stats(){
"ns" : "test.large","count" : 5000000, ## number of documents"size" : 280000024, ## size of data"avgObjSize" : 56.0000048,"storageSize" : 409206784, ## size of all
containing extents"numExtents" : 18,"nindexes" : 1,"lastExtentSize" : 74846208,"paddingFactor" : 1, ## amount of padding"systemFlags" : 0,"userFlags" : 0,"totalIndexSize" : 162228192, ## separate index
storage size"indexSizes" : {
"_id_" : 162228192},"ok" : 1
}
the collection stats
![Page 17: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/17.jpg)
What’s memory mapping?
![Page 18: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/18.jpg)
Memory Mapped Files
• All data files are memory mapped to RAM by the OS
• Mongo just reads / writes to RAM in the filesystem cache
• OS takes care of the rest!
• Mongo calls fsync every 60 seconds to flush changes to disk
• Virtual process size = total files size + overhead (connections, heap)
• If journal is on, the virtual size will be roughly doubled
![Page 19: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/19.jpg)
Virtual Address Space
32-bit System
232 = 4GB
- 1GB kernel
- .5GB binaries, stack, etc.
= 2.5GB for data
BAD
64-bit System
264 = 1.7 x 1010 GB (16EB)?
0xF0 – 0xFF Kernel
0x00 – 0x7F User
247 = 128TB for data
GOOD
![Page 20: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/20.jpg)
Virtual Address Space
Kernel
STACK…
LIBS
…
test.ns
test.0
test.1
…
…HEAP
MONGOD
NULL
0x7fffffffffff
0x0
{ … }
Disk
DocumentProcess Virtual
Memory
![Page 21: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/21.jpg)
Memory map, love it or hate it• Pros:
– No complex memory / disk code in MongoDB, huge win!
– The OS is very good at caching for any type of storage
– Pure Least Recently Used behavior– Cache stays warm between Mongo restart
• Cons:– RAM is affected by disk fragmentation– RAM is affected by high read-ahead– LRU behavior does not prioritize things (like
indices)
![Page 22: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/22.jpg)
How much data is in RAM?• Resident memory the best indicator of
how much data in RAM
• Resident is: process overhead (connections, heap) + FS pages in RAM that were accessed
• Means that it resets to 0 upon restart even though data is still in RAM due to FS cache
• Use free command to check on FS cache size
• Can be affected by fragmentation and read-ahead
![Page 23: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/23.jpg)
Journaling
![Page 24: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/24.jpg)
The problem• A single insert/update involves writing to many
places (the record, indexes, ns details..)
• What if the electricity goes out? Corruption…
![Page 25: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/25.jpg)
Solution – use a journal
• Data gets written to a journal before making it to the data files
• Operations written to a journal buffer in RAM that gets flushed every 100ms or 100MB
• Once journal written to disk, data safe unless hardware entirely fails
• Journal prevents corruption and allows durability
• Can be turned off, but don’t!
![Page 26: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/26.jpg)
• Section contains single group commit
• Applied all-or-nothing
Journal FormatJHeader
JSectHeader [LSN 3]
DurOp
DurOp
DurOp
JSectFooter
JSectHeader [LSN 7]
DurOp
DurOp
DurOp
JSectFooter
…
Op_DbContext
lengthoffsetfileNodata[length]
lengthoffsetfileNodata[length]
lengthoffsetfileNodata[length]
Write Operation
Set database context for subsequent operations
![Page 27: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/27.jpg)
Can I lose data on hard crash?• Maximum data loss is 100ms (journal flush). This
can be reduced with –journalCommitInterval
• For durability (data is on disk when ack’ed) use the JOURNAL_SAFE write concern (“j” option).
• Note that replication can reduce the data loss further. Use the REPLICAS_SAFE write concern (“w” option).
• As write guarantees increase, latency increases. To maintain performance, use more connections!
![Page 28: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/28.jpg)
What is cost of journal?
• On read-heavy systems, no impact
• Write performance is reduced by 5-30%
• If using separate drive for journal, as low as 3%
• For apps that are write-heavy (1000+ writes per server) there can be slowdown due to mix of journal and data flushes. Use a separate drive!
![Page 29: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/29.jpg)
Fragmentation
![Page 30: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/30.jpg)
Fragmentation
• Files can get fragmented over time if remove() and update() are issued.
• It gets worse if documents have varied sizes
• Fragmentation wastes disk space and RAM
• Also makes writes scattered and slower
• Fragmentation can be checked by comparing size to storageSize in the collection’s stats.
![Page 31: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/31.jpg)
How it looks like
EXTENT
Doc Doc Doc X Doc X
Doc X Doc Doc X
…
BOTH ON DISK AND IN RAM!
![Page 32: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/32.jpg)
How to combat fragmentation?• compact command (maintenance op)
• Normalize schema more (documents don’t grow)
• Pre-pad documents (documents don’t grow)
• Use separate collections over time, then use collection.drop() instead of collection.remove(query)
• --usePowerOf2sizes option makes disk buckets more reusable
![Page 33: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/33.jpg)
Conclusion
• Understand disk layout and footprint
• See how much data is actually in RAM
• Memory mapping is cool
• Answer how much data is ok to lose
• Check on fragmentation and avoid it
![Page 34: Webinar: Understanding Storage for Performance and Data Safety](https://reader037.vdocuments.us/reader037/viewer/2022110115/5490b381b47959025d8b4572/html5/thumbnails/34.jpg)
Solutions Architect, 10gen
Antoine Girbal
#antoinegirbal
Questions?