If you’ve been a DBA for a while, you’ve heard people talking about PLE. PLE stands for Page Life Expectancy; most DBAs (and really anyone who works with SQL Server for more than 5 seconds) call it PLE. I think this is for two reasons: (1) "Expectancy" is hard for a lot of people to say, and (2) we’re lazy, er, efficient.
PLE is reported by SQL Server as a number, somewhere between 0 and ~2B. That number is a count of seconds: the amount of time SQL Server estimates it will be able to keep a page in memory before that page is swapped out. So if a page of data was loaded into memory right now, and PLE was 100 seconds, then in ~100 seconds we could expect that page to be removed from the buffer pool in order to make room for another page.
I want to be really clear here: 100 is not a target number (and neither is 300). It’s just a number that I am using for my explanation.
When you look at PLE on a larger server, you may see a few numbers. That’s because each NUMA node has its own PLE value, and those numbers are combined to get the server’s PLE. The calculation is usually described in a few steps rather than as a plain average, although with these steps the result works out to the same thing as the mean of the node values. You take the PLE value of the node and multiply it by 1000. You do this for all the NUMA nodes. Then you sum up these values, divide the total by the number of NUMA nodes, and then divide by 1000.
This means that PLE for one NUMA node can drop, while PLE for another NUMA node can be stable, and the server-wide PLE number may not change (or change much).
The reason that each NUMA node has its own PLE value is that SQL Server manages each NUMA node independently. Understanding why PLE for each node can vary requires an understanding of how SQL Server works. So, let’s dive a little deeper. (For this article we’re going to assume a two-CPU server, which has two NUMA nodes. It keeps things easier, but all this applies if you have four or eight nodes as well.)
So let’s calculate PLE for our sample server. If both NUMA nodes have a PLE of 100, then the math looks like this.
((100 * 1000) + (100 * 1000)) / 2 / 1000 = 100
If one node has a PLE of 100 and the other has a PLE of 60, then the math looks like this.
((100 * 1000) + (60 * 1000)) / 2 / 1000 = 80
If you have a four-node box, then the value gets even odder looking. Let’s assume we have three NUMA nodes with a PLE of 100 and one node with a PLE of 60 and see how the math looks.
((100 * 1000) + (100 * 1000) + (100 * 1000) + (60 * 1000)) / 4 / 1000 = 90
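The arithmetic above can be sketched in a few lines of Python. The function name server_ple is made up for this illustration; it simply follows the steps described earlier (multiply each node's PLE by 1000, sum, divide by the node count, divide by 1000) and isn't taken from SQL Server itself.

```python
def server_ple(node_ples):
    """Combine per-node PLE values using the steps described in this post.

    node_ples: list of PLE values in seconds, one per NUMA node.
    This illustrates the arithmetic only; it is not SQL Server code.
    """
    if not node_ples:
        raise ValueError("need at least one NUMA node PLE value")
    # Multiply each node's PLE by 1000, then sum the results.
    scaled_total = sum(ple * 1000 for ple in node_ples)
    # Divide by the number of nodes, then by 1000.
    return scaled_total / len(node_ples) / 1000


print(server_ple([100, 100]))            # two nodes, both at 100
print(server_ple([100, 60]))             # one node has dropped to 60
print(server_ple([100, 100, 100, 60]))   # four nodes, one at 60
```

Running this against the three scenarios above prints 100.0, 80.0, and 90.0, which matches the hand calculations.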
When a table is loaded into memory, that table is assigned to a NUMA node, and that’s the only NUMA node that the table will be assigned to. The reason for this is that SQL Server only wants to assign CPUs from that NUMA node to work on that table. Because of this, if there’s a report against that table (or a bad query that causes a scan of the table), that’ll impact the PLE of that NUMA node, because the table needs to be loaded into memory so that the CPUs can process it. (All sorts of things can cause PLE issues to crop up; this is just one example.)
So if we’ve got one really large table in our database and lots of small tables (and everything in SQL Server and the OS is set correctly), we should expect to see two very different numbers for PLE: one for the NUMA node that is working on the large table, and one for the NUMA node that’s hosting all the smaller tables.
You may be asking yourself, "How can I assign tables to a specific NUMA node?" The answer is that you can’t. Even if you could, there’s pretty much no reason to. SQL Server does a really good job of assigning tables to NUMA nodes all by itself. If you think you’re having a problem with SQL Server because it’s assigning too many tables to a NUMA node, you’ve probably got other problems that need to be dealt with (hint: we can help you with that).
There are lots of PLE recommendations out there. Most of them are built on the number 300, which you’ll find in Books Online. Any advice about PLE that has a specific number as a target is worth the amount of money you paid to read the blog post (if you did pay to read that number, I’m sorry you got screwed). The target PLE is very system-specific, and it should be treated as such. Some systems should have a really low PLE because they churn data so often. Other systems need to have a really high PLE so that they never have to hit the disk.
One of our clients has a PLE of somewhere around 35,000. If it gets below 28,800 (8 hours), we start to worry. When we were designing this system, we decided that when customers ran their end-of-day reports for their office (the customers are only open for 8 hours a day), we didn’t want to hit the hard drive. That means that all the data from the last 8 hours needs to be loaded into memory, and it needs to stay there until the end of the day. Is this a guarantee that no queries will hit the disk at the end of the client’s day? No, of course not, but we’ve done everything we can to make this a reality for the client. And given that this client has thousands of customers and tens of thousands of unique users per day (if not more), and the system has been stable for years, we’ve done a pretty good job of getting to that point.
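As a rough sketch of how a floor like that could be derived and checked, here’s some Python. The names ple_floor_seconds and ple_needs_attention are hypothetical, and the 8-hour window is just this one client’s number, not a general recommendation.

```python
def ple_floor_seconds(hours_of_data_to_keep):
    """Convert a 'keep this many hours of data in memory' goal into seconds."""
    return hours_of_data_to_keep * 60 * 60


def ple_needs_attention(current_ple, hours_of_data_to_keep):
    """Return True when PLE has fallen below the floor chosen for this system."""
    return current_ple < ple_floor_seconds(hours_of_data_to_keep)


print(ple_floor_seconds(8))             # 8 hours expressed in seconds
print(ple_needs_attention(35000, 8))    # the client's typical PLE
print(ple_needs_attention(20000, 8))    # a made-up value below the floor
```

This prints 28800, False, and True. The point is that the floor comes from what the business needs to keep in memory, not from a number someone published.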
So what number should you shoot for when looking at PLE? As high a value as you can afford to get. Increasing PLE is a matter of index tuning and buying more RAM. If you can afford to get more RAM, get it. The more data that SQL Server can cache, the higher PLE is going to go, and the happier SQL Server will be.
While you probably don’t need to know what PLE is in the long run, it’s one of those things that you’ll want to keep an eye on, because if it drops you want to know why. PLE dropping isn’t a problem or even a symptom of a problem; unless it is. It’s a potential warning that some investigation on the server needs to be done, and from that investigation you might need to do more, or you might not.
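One way to "keep an eye on it" is to compare the newest reading against what’s been normal for that server lately. The sketch below is only an illustration: ple_dropped and its drop_fraction knob are made up, the readings are invented, and how you collect the samples is up to you. A True result is a prompt to go look, nothing more.

```python
from statistics import median


def ple_dropped(samples, drop_fraction=0.5):
    """Flag a drop when the newest sample is well below the recent baseline.

    samples: PLE readings in seconds, oldest first.
    drop_fraction: hypothetical knob; 0.5 means the newest reading
    fell to less than half of the baseline.
    """
    if len(samples) < 2:
        return False  # nothing to compare against yet
    baseline = median(samples[:-1])   # typical value before the newest reading
    latest = samples[-1]
    return latest < baseline * (1 - drop_fraction)


steady = [35000, 35200, 34900, 35100]      # invented readings, no real change
after_scan = [35000, 35200, 34900, 9000]   # invented readings with a sharp fall

print(ple_dropped(steady))
print(ple_dropped(after_scan))
```

The baseline is taken from the server’s own history rather than a fixed target, which fits the idea that the right PLE is system-specific.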