MongoDB Memory Management Demystified
DESCRIPTION
Understanding how memory is managed with MongoDB is instrumental in maximizing database performance and hardware utilisation. This talk covers the workings of low-level operating system components like the page cache and memory-mapped files. We will examine the differences between RAM, SSDs and hard disk drives to help you choose the right hardware configuration. Finally, we will learn how to monitor and analyze memory and disk usage using the MongoDB Management Service, Linux administration commands and MongoDB commands.
TRANSCRIPT
1
Hello everyone, my name is Alon Horev. I'm based in Israel and I work at Intucell, which was acquired by Cisco. I'm a Python developer and I lead Intucell's data team. About two years ago we migrated our product off MySQL and started working with MongoDB. I want to start off by introducing our use case for MongoDB: we've built a system that optimizes cellular networks automatically. Optimizing cellular networks is about making your data connection faster and improving the quality of your calls.
2
The way we do this is pretty simple: we collect a lot of statistics about what goes on in the network, like how many calls are taking place or how many users are connected to an antenna. We then analyze this information to identify things like which antennas are overloaded. Once we know what the problems in the network are, we act: we change parameters in the network. For example, we would force your phone to use a different antenna so you get better service. Now, as you can see, this process is cyclic: we'll collect more statistics to make further changes and make sure we improved the network. This happens all the time, even here right now, with AT&T. In the process of working with MongoDB we learned a lot about database and server performance. I personally spent a lot of time monitoring and optimizing storage and memory usage, which brings me to this lecture.
3
Today I'm going to try to give you an understanding of how MongoDB manages memory. So, first, what is 'memory management' when it comes to MongoDB? Well, memory is a fast but limited and expensive resource; memory management is about deciding what data to keep in memory.
4
Why should you care about memory management? Memory management has a huge impact on performance and costs. This matters both to developers and DBAs: as a developer you can optimize the schema and queries for better memory usage, and as a DBA you can monitor and predict performance issues related to memory usage. I'm pretty sure every MongoDB administrator has asked himself at least once: how much memory do I really need? Before we dive in I want to tell you a little secret: MongoDB doesn't actually manage memory. It leaves that responsibility to the operating system.
5
Within the operating system there's a stack of components that MongoDB depends on to manage memory. Each component relies on the component below it. (!) This talk is structured around this stack of components. We'll start from the low-level components, which are storage devices: disks and RAM. We'll continue with the page cache and memory-mapped files, which are part of the operating system's kernel. And we'll finish off with MongoDB's usage of these mechanisms. (!) Let's talk about storage.
6
There are different types of storage devices with different characteristics; we'll review hard disk drives, solid state drives and RAM. Let's start by breaking these into categories: (!) HDDs and SSDs are persistent and RAM isn't, but RAM is really fast. That's why every computer has both types of storage: one persistent (an HDD or an SSD) and one volatile (RAM).
7
Now let's compare throughput. As I said before, RAM is fast; it can go as fast as 6400 MB/s for reads and writes. SSDs are 10 times slower than RAM: modern SSDs can reach a read rate of 650 MB/s and a little less for writes. HDDs are much slower, ranging from 1 MB to 160 MB per second for reads and writes. The reason there's such variance in HDD speed is that throughput is highly affected by access patterns. Specifically with HDDs, random access is much slower than sequential access, because an HDD contains a mechanical arm that needs to move on almost every random access. Sadly for us, databases do a lot of random I/O, which means that if you're running a query on data that's not in memory and therefore has to be read from disk, you're seeing a penalty of about two orders of magnitude on response times. The next characteristic is price. (!) To make the comparison easier we'll compare the price per GB. It's not surprising that there's a correlation between price and throughput: the more you pay for each GB, the better the throughput you get. So hard drives are really cheap at 5 cents per GB, SSDs are 10 times more expensive and RAM is 100 times more expensive.
8
Is this information sufficient to choose the optimal hardware configuration? I think not; your application's requirements are also part of the equation. For example, if your application is an archive that saves huge amounts of data that is rarely accessed, you can go for a large HDD and save a lot of money. Later on we'll see how you can take measurements of things like RAM and capacity, and then you'll be able to determine what kind of hardware configuration you need.
9
Now let's zoom out of storage and move up to the next layer, which is the page cache.
10
The page cache is part of the operating system's kernel, and whenever a program does file I/O like reads and writes, it always goes through the page cache. The page cache makes reads faster by keeping popular chunks of data in memory, and makes writes faster by letting the application write to memory instead of to disk. So we can say the page cache was invented to combine the disk's persistence with the memory's speed. It's about having the best of both worlds.
11
So... it's called the page cache, but what is a page? A page is a 4K chunk of data. Each file is broken into pages. The number of pages belonging to a file is simply the file's size divided by 4K. (!) Looking at the example, you can see a file spanning 3 pages because it's 10 kilobytes in size; the grey area is the unused part of the last page, since the file's size isn't a multiple of 4 kilobytes. The page cache's job is to determine which pages to keep in memory.
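This size-to-pages arithmetic can be sketched in a couple of lines of Python (assuming the common 4K page size):

```python
import math

PAGE_SIZE = 4096  # 4K, the page size on most Linux systems

def pages_in_file(file_size_bytes):
    """Number of pages a file occupies: size divided by 4K, rounded up."""
    return math.ceil(file_size_bytes / PAGE_SIZE)

# A 10 KB file spans 3 pages; the last ~2 KB of the third page is unused.
print(pages_in_file(10 * 1024))  # 3
```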
12
Let's dive a little deeper and see what happens behind the scenes when we read from a file. (!) We have a process running in user space and it's reading 100 bytes from a file. (!) Through a system call we get to the kernel, where the page cache handles the read request. (!) First, the page cache translates the position and count of bytes to read into a list of pages. If we read 100 bytes from the beginning of the file, the result of this step would be the first page. (!) The next thing the page cache does is check if the page exists in the cache; (!) if it doesn't, the data has to be read from disk and then stored in the cache. Once the page is in the cache we reach the last step, (!) which is to copy the data to the user space application. So that's how a read works.
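The first step, translating a (position, byte count) request into a list of pages, is just integer division; here is a small illustrative sketch:

```python
PAGE_SIZE = 4096  # 4K pages

def byte_range_to_pages(offset, count):
    """Translate a (position, byte count) read request into the
    list of page indices the page cache must consult."""
    if count <= 0:
        return []
    first = offset // PAGE_SIZE
    last = (offset + count - 1) // PAGE_SIZE
    return list(range(first, last + 1))

# Reading 100 bytes from the start of the file touches only page 0.
print(byte_range_to_pages(0, 100))     # [0]
# A 100-byte read straddling the 4K boundary touches pages 0 and 1.
print(byte_range_to_pages(4050, 100))  # [0, 1]
```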
13
The page cache also handles writes. (!) This time our process is calling the write system call. (!) The page cache copies the data from the process to the relevant pages and marks them as dirty. That's all it does: change data in memory. It gives the impression the data has been written, when in fact it has been written only to memory and not to disk. If an application reads from the file it will get the latest data from memory, because dirty pages must stay in the cache. Having dirty pages is somewhat dangerous for two reasons: first, they will be lost if the operating system crashes. Second, if there's a lack of memory they can't be freed. The solution to these problems is to flush the dirty pages to disk. (!) There's a thread in the kernel that flushes pages after they stay in the cache for some time or when memory needs to be freed. If a process wants to make sure the data is flushed to disk it can call the fsync system call, which triggers a flush for a specific file or even the entire file system. MongoDB calls that every 30 seconds to make sure data is backed by disk.
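A minimal sketch of this write-then-flush pattern, using a throwaway temp file; until `os.fsync` returns, the written bytes may exist only as dirty pages in the page cache:

```python
import os
import tempfile

# Write some bytes, then force the dirty pages to disk with fsync,
# as MongoDB does periodically for its data files.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"some durable data")
    # Until fsync returns, the data may live only in dirty pages in
    # the page cache; a crash before the flush could lose it.
    os.fsync(fd)
    with open(path, "rb") as f:
        print(f.read())  # b'some durable data'
finally:
    os.close(fd)
    os.remove(path)
```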
14
I mentioned how the page cache frees pages when memory is running low; this procedure is called page reclamation. There are different page reclamation policies. A page reclamation policy is an algorithm that answers a simple question: "what's the next page that can be freed?" In Linux, the simple answer is: "the one that was least recently used". It turns out page reclamation happens all the time, even on healthy systems; it doesn't mean you're out of memory. That's because the page cache is greedy and will try to use all the free memory on your machine to cache the file system. To understand how much memory is used by the page cache you can use the free command.
15
free is a Linux program that displays memory usage statistics. Let's try to interpret its output. When run with -g it prints units in GB. The first line reveals the total amount of memory, which is 64GB; of these, 61GB are used and 3GB are free. Then, out of the 61GB that are used, 55GB are cached data. These are pages in the page cache. The second line counts the cached data as free, so suddenly we have only 5GB of used memory. This is memory directly allocated by programs. The reason cached memory can be considered free is that even though the memory is used, it will be freed if programs need it. As soon as programs allocate memory and the free memory runs out, the page cache shrinks and frees pages.
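The arithmetic behind free's two lines can be reproduced directly (figures in GB from the example above; the 1 GB buffers figure is an assumption, since the slide only quotes the cached column):

```python
# Figures from the example free -g output above (GB).
total, used, free_mem = 64, 61, 3
buffers, cached = 1, 55  # buffers value assumed for illustration

# The second line ("-/+ buffers/cache") treats cache as reclaimable:
used_by_programs = used - buffers - cached   # memory programs allocated
effectively_free = free_mem + buffers + cached

print(used_by_programs)  # 5
print(effectively_free)  # 59
```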
16
The next component up the stack is memory mapped files.
17
Memory mapping of files is an alternative mechanism for reading from and writing to files. Instead of calling the read() and write() system calls, a process can map part of a file into memory, and every access the process makes to that memory translates to a file read or write. On the left you can see a process with a memory region that is mapped to a segment of a file. So memory addresses 100 to 200 are mapped to a file segment that starts at 400 and ends at 500. A write to memory address 100 is translated to a write to the file at offset 400. Mapping a file into memory doesn't necessarily load its data into memory; if a process reads from a page that is not in memory, the infamous page fault is triggered. The code in the kernel that handles page faults tells the page cache to load the required pieces of data from disk and then serves the read. So memory mapping has several advantages over regular file I/O: first, it's fast; there's no system call involved and no copying of memory. Reads and writes access memory that is allocated in the page cache. Second, it takes the responsibility for memory management away from the user. As we've seen earlier, the page cache determines what's actually stored in memory.
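Python exposes this mechanism through the standard mmap module; a small sketch showing a plain memory write becoming a file write:

```python
import mmap
import os
import tempfile

# Map a small file into memory and modify it through the mapping;
# the write to memory becomes a write to the file, served by the
# page cache without a read()/write() system call per access.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")
    with mmap.mmap(fd, 0) as mapped:
        mapped[0:5] = b"HELLO"  # a plain in-memory slice assignment
    with open(path, "rb") as f:
        print(f.read())  # b'HELLO world'
finally:
    os.close(fd)
    os.remove(path)
```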
18
In this example two processes map the same region of a file into memory. Only one copy of this data will occupy memory, or even less if it's not accessed. Historically this mechanism was invented to reduce the memory usage of processes. Whenever you execute a program, the program's code and its shared libraries are mapped into memory. So if you open 10 instances of Chrome, its code still appears only once in memory.
19
Now let's see how MongoDB uses this stack of components.
20
(!) MongoDB maps all its data into memory. This includes the documents, the indexes and the journal. (!) When running top you can actually see how much memory is mapped and how much is used. (!) The left column, called VIRT, stands for virtual memory; once a process maps files into memory, they're accounted under virtual memory. When journaling is enabled, MongoDB actually maps the data files twice, so this figure is twice the amount on disk, which is about 273GB. RES stands for resident memory and is the amount of the virtual memory that's actually located in RAM. SHR stands for shared resident memory. So out of the 24GB of resident memory, 23GB is data from memory-mapped files, which is sharable.
21
It turns out this very cool strategy for managing memory also has problems. The biggest problem is that MongoDB (!) has no control over what is kept in memory. You can't tell MongoDB: promise me this document or collection is stored in memory, thereby ensuring fast access. Why is this a problem? I'll give you some examples:
1. (!) The first example is warm-up: after restarting your server, none of the data is stored in memory, so for every page that is accessed for the first time, a page fault will be triggered and the query will take longer.
2. (!) The second example is what I call expensive queries. Expensive queries are queries that aren't indexed well or that request data that is hardly ever accessed. When these things happen, documents are loaded into memory at the cost of freeing other documents that are more important. Why does this happen? As we've seen before, the page cache frees the least recently used pages first.
There are things you can do to mitigate this problem.
22
What we did is (!) protect MongoDB with an API. The API enforces index usage so MongoDB reads fewer documents into memory. Another thing the API does is pass a query timeout to make sure costly queries are cancelled. The API doesn't have to be complicated; it could be a simple module sitting on top of the MongoDB driver. Let's look at an example: (!) this is (!) a Python function called find_samples, and it's used whenever we want to run a find query on the collection named samples. The function accepts two parameters that define a date range: start_time and end_time. By forcing the user to pass a date range we make sure the query is indexed. You could add further validations to make sure the range isn't too big or doesn't go too far back in history.
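A hedged sketch of what such a wrapper might look like. The function name find_samples and the date-range idea come from the talk; the field name "timestamp", the maximum range, and the 5-second timeout are illustrative assumptions. It is written against any pymongo-style collection object (pymongo's Collection.find accepts a max_time_ms keyword that makes the server abort long-running queries):

```python
from datetime import datetime, timedelta

MAX_RANGE = timedelta(days=31)  # assumed limit on the query window
QUERY_TIMEOUT_MS = 5000         # assumed per-query timeout

def find_samples(collection, start_time, end_time, extra_filter=None):
    """Query the samples collection, forcing an indexed date range
    and a timeout so costly queries get cancelled."""
    if end_time <= start_time:
        raise ValueError("end_time must be after start_time")
    if end_time - start_time > MAX_RANGE:
        raise ValueError("date range too large")
    query = {"timestamp": {"$gte": start_time, "$lt": end_time}}
    if extra_filter:
        query.update(extra_filter)
    # max_time_ms tells the server to abort the query if it runs too long.
    return collection.find(query, max_time_ms=QUERY_TIMEOUT_MS)
```

Callers can never issue an unbounded scan through this module, which is the whole point of hiding the driver behind it.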
23
Another challenge worth mentioning is (!) the lack of prioritization between processes. When processes allocate a lot of memory the page cache shrinks automatically, and since MongoDB relies on the page cache, you could say MongoDB's memory shrinks automatically. In other words, MongoDB has a lower priority than other processes when it comes to memory. Since MongoDB just becomes slower when it doesn't have enough memory, you need to be careful with other processes running on the same server. You can mitigate this phenomenon by isolating MongoDB: (!) don't run it on the same server as memory- or disk-intensive applications. The last challenge I'd like to tackle is (!) estimating how much memory is required, also known as the size of the working set.
24
So what is the working set? This is the data that your application reads regularly and that should be returned in a timely manner; therefore it should fit in memory. The working set contains (!) more than documents: it also includes indexes and some padding. To emphasize the padding issue, let's look at an example memory page. (!) As I mentioned before, a page's size is 4K. This page includes 3 documents, and between the documents there's some padding. This padding accounts for expansion of existing documents or insertion of new ones. Out of the three documents, only document number 2 is accessed regularly. So even though only a small part of this page is actually used, the whole page is kept in memory; the page cache can't keep half pages in memory. This brings us to the conclusion that it's really hard to measure the size of the working set by simply looking at the count or size of the documents being queried. Still, there are several tools to help you estimate how much memory a collection should require.
25
The tools fall into two categories: planning and monitoring.
26
Planning is about predicting how much memory each collection is going to need. Let's take a real-world example. In one of our collections we save a month's worth of history. Out of that month, we know our application often queries the last two weeks and sometimes the week before that. The last two weeks are considered "hot" data because they have to be stored in memory; the week before that is considered warm: it doesn't have to be in memory, but we should still take it into account so it won't push out the hot data. If we add some spare to compensate for padding and such, it's safe to assume 3 out of the 4 weeks should fit in memory. (!) You can use the collection stats command to get important metrics like the size of the indexes and the size of the data, and roughly calculate how much memory the collection is going to require. Once you have a running database you can use several monitoring tools to analyze the working set.
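One way to turn the collStats figures into a rough number, sketched below. The `size` (data bytes) and `totalIndexSize` fields are what collStats actually reports; the hot fraction (3 of 4 weeks, as in the example above) and the headroom factor for padding are illustrative assumptions:

```python
def estimate_memory_gb(coll_stats, hot_fraction=0.75, headroom=1.2):
    """Rough memory estimate for one collection from its collStats.

    coll_stats: output of db.command('collstats', name); we use the
    'size' (data bytes) and 'totalIndexSize' fields.
    hot_fraction: share of the data that is regularly queried.
    headroom: fudge factor for padding and page granularity.
    """
    hot_data = coll_stats["size"] * hot_fraction
    # Budget for all the indexes: they tend to be accessed throughout.
    needed = (hot_data + coll_stats["totalIndexSize"]) * headroom
    return needed / (1024 ** 3)

# Example: 40 GB of data, 6 GB of indexes.
sample = {"size": 40 * 1024 ** 3, "totalIndexSize": 6 * 1024 ** 3}
print(round(estimate_memory_gb(sample), 1))  # 43.2
```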
27
When I think about monitoring tools, they generally fall into two categories:
1. (!) One is online monitoring, which is basically seeing what's going on at the moment. This category includes running Linux commands like top and iostat, or mongo commands like currentOp, mongostat and mongomem.
2. (!) The second category is offline monitoring, which is more about collecting and aggregating historical data. One example would be the profiling collection that collects slow queries over time. Another example is MMS or other graphing tools like Graphite that collect different metrics over time. These are used for identifying trends and correlations and for predicting growth.
Let's start with the online tools.
28
mongomem is a great tool for memory-use analysis. It's written in Python by the people at a company called Wish, so you'll have to install it manually; it doesn't come packaged with MongoDB. mongomem won't tell you how much memory you need, but it will tell you how much memory each collection is using at the moment. Here's an example output: (!) each line shows how many megabytes of the collection are in memory. The top collection in this example is the oplog, with more than 11GB of data in memory out of almost 50GB of data. So about 22% of the collection is in memory. The last line shows the total amount of memory used by MongoDB out of the total data size; in this example we have 16GB of data in memory out of 280GB of total data. Since I've got 16GB of memory on this machine, we can see all the memory is being used. But what does this say about the working set? Is it larger than memory? In other words, do we have enough memory? Well, we can't say, because it's possible there's data in memory that is hardly ever accessed; the page cache just hasn't had to reclaim those pages.
29
What you can do in order to test how much RAM MongoDB actually uses is the following procedure:
1. First, stop the database.
2. Then, clear the page cache; the following command invokes some code in the kernel that drops all pages from memory.
3. Next, start the database.
4. After that, invoke the queries that should cover your working set: queries that access all the data you expect to have in memory.
5. At this point, running mongomem will give you a more accurate picture of how much memory is required.
30
Before looking at additional tools I want to answer a simple question: how do we know when something is wrong? What do we need to monitor? And since we're talking about memory, how do we know we don't have enough of it? Well, the phenomenon of not having enough memory is called thrashing. When the OS is thrashing, it's because an application is constantly accessing pages that are not in memory, and the OS is busy handling the page faults, reading the pages from disk. So the first thing to monitor is page faults (!), and since it's hard to tell how many page faults are too many, you should also look at disk utilization; if the disk is utilized 100% of the time, you're in trouble. There are a lot of other things that go wrong, like (!) a lot of queries being queued and high locking ratios, but these are just symptoms.
31
I usually use iostat for looking at disk utilization. Here's an example output of the command: the rightmost column shows the disk utilization and reveals a disk that is busy 100% of the time. The second column shows the disk serving 570 reads per second, and the third column shows the number of writes per second, which is zero. If this is happening constantly, the working set does not fit in memory. Along with iostat, I frequently use mongostat.
32
mongostat comes packaged with MongoDB and uses the underlying (!) serverStatus command. It displays a bunch of interesting metrics like (!) the number of page faults and queued reads. It's pretty hard to say how many page faults are too many, but more than one or two hundred page faults per second is an indication of a lot of data being read from disk. If this happens over long periods of time it could be an indication that the working set does not fit in RAM. If the number of queued reads is larger than a hundred over long periods of time, it could also be an indication that the working set doesn't fit in RAM. It's often important to look at these parameters over time in order to determine whether there's a sudden spike or a repeating problem. This brings me to offline monitoring.
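mongostat derives its per-second figures by diffing consecutive serverStatus snapshots. A small sketch of that arithmetic, using the cumulative extra_info.page_faults counter that serverStatus reports (the snapshot values below are made up):

```python
def faults_per_second(prev, curr, interval_s=1.0):
    """Per-second page-fault rate from two serverStatus snapshots.

    serverStatus reports a cumulative counter under
    extra_info.page_faults; mongostat shows the delta per interval.
    """
    delta = (curr["extra_info"]["page_faults"]
             - prev["extra_info"]["page_faults"])
    return delta / interval_s

# Two snapshots taken one second apart (illustrative numbers).
prev = {"extra_info": {"page_faults": 123400}}
curr = {"extra_info": {"page_faults": 123650}}
print(faults_per_second(prev, curr))  # 250.0
```

A sustained rate like this, rather than a brief spike, is what suggests the working set no longer fits in RAM.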
33
Tools like (!) MMS or Graphite can show you these important metrics over time. Using one of these tools is (!) mandatory for a production system; I cannot tell you how useful they are. Whenever we get a ticket about a performance problem we put our Sherlock hats on and start an investigation. We look at metrics related to our application, but also at a lot of metrics related to MongoDB and how they change over time: we look at the number of queries, the number of documents in collections and tens of other metrics. I'd like to show you an example workflow for a ticket. Try to picture this: it was a quiet evening, I was about to go to sleep, when I got an automated email that one of our shards was misbehaving. What were the symptoms? It had more than 300 queries just waiting in queue. What do I do next?
34
I immediately open Graphite. This is a screenshot of the number of page faults in green and the number of queued readers in blue. By looking at the history you can spot two trends: 1. First, there's a spike of high load every hour. This is actually normal, since we're doing hourly aggregations of our data. 2. The second trend is a massive rise in page faults and queued queries at exactly 20:00. At this point there's an impact on users, as a lot of queries take a very long time. Why is this happening? Has the working set outgrown memory?
35
Let's look at another screenshot of the same time frame. This time we look at other metrics: in blue is the number of queries, in green is the number of updates and in red is the disk utilization. Remember that disk utilization is measured as a percentage, so even though the graph is lower than the others, we can still see that at 20:00 the disk was constantly utilized at 100%. Looking at the updates vs. the queries, it's obvious that a huge amount of updates was hurting query performance. We were busy writing to disk. In this case an application change was the root cause of the problem: the application simply started updating a lot more documents. So using Graphite, we were able to trace the problem to a specific change in our application, and later on we modified our schema to reduce the document size and the load on the disk. This brings me to the next topic, which is optimization.
36
When optimizing memory usage, the main target is to reduce the amount of memory your application requires. (!) The smaller the collections and documents are, the faster the queries will be, and not just in terms of memory but also disk: if documents are smaller, less disk access is required to read them. There are several optimizations you can do when it comes to schema: (!)
1. First, shorten the keys. We started with long names like firstName, then shortened them to a single word or acronym, and finally used one or two letters, since it had a huge impact on the size of our data. By shortening the keys we reduced the size of our data by more than 50%. There is a big downside to doing this, because it obscures the data, but fortunately we have an API that hides this ugly implementation detail, so it doesn't have an impact on our users.
2. Another thing to consider is the tradeoff between the number of documents and their size; in many use cases it's more efficient to store a smaller number of large documents vs. a large number of small ones.
We've previously seen how padding occupies memory; by changing the padding factor and running repair every once in a while you can reduce the padding overhead. The next thing you can optimize is indexes.
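Key shortening can be hidden behind a tiny encode/decode layer in the API, sketched below. The mapping itself is hypothetical; MongoDB stores every key in every document, which is why this pays off:

```python
# Hypothetical mapping from readable field names to the one- or
# two-letter keys actually stored in MongoDB.
KEY_MAP = {"firstName": "fn", "lastName": "ln", "creationTime": "ct"}
REVERSE_MAP = {short: long for long, short in KEY_MAP.items()}

def encode(doc):
    """Shorten keys before inserting a document."""
    return {KEY_MAP.get(k, k): v for k, v in doc.items()}

def decode(doc):
    """Restore readable keys when returning documents to users."""
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

doc = {"firstName": "Alon", "lastName": "Horev"}
stored = encode(doc)
print(stored)                  # {'fn': 'Alon', 'ln': 'Horev'}
print(decode(stored) == doc)   # True
```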
37
The first thing you should know is that unused indexes are still accessed whenever documents are inserted, updated or deleted. Try to identify those and remove them. (!) Use sparse indexes when only some of the documents will have the indexed attribute, as they use less space. (!) The last thing I want to talk about is how much of an index is located in memory. The answer is: it depends. If the entire index is accessed by queries, then the entire index should be located in memory. If only a single part of the index is used, only that part has to fit in memory. Let's look at a few examples to emphasize the difference. You can imagine an index (!) as a segment of memory; the red marks are locations frequently accessed by queries. (!) The first example is an index on a date field called creation_time. Each inserted document inserts the largest value of all previous ones, so the rightmost part of the index is updated. In many such indexes only the recent history is frequently accessed, so only the rightmost part of the index will be located in memory. (!) The second example is an index on a person's name; the index accesses will probably distribute evenly across the entire index, so most of it will be located in memory.
38
So let's summarize what we've learned: 1. We've seen how memory management works: we started from the disk and RAM, went up the stack to the page cache, whose sole purpose is to improve read and write performance by using memory. We continued to memory-mapped files, which translate memory accesses like reads and writes to file reads and writes. And we finished with MongoDB's usage of these mechanisms. 2. We've talked about the challenges this strategy presents, like predicting and measuring the size of the working set. 3. We then talked about monitoring, which is something you have to do if you have a DB running in production. 4. We finished with schema and index optimizations, which are crucial for cutting costs and improving performance.
39
And that’s it! I hope you enjoyed my talk and thanks for having me.
40