PostgreSQL 9 High Availability Cookbook
上QQ阅读APP看书,第一时间看更新

Applying bonus kernel tweaks

Most operating system kernels are optimized for generalized use. While this does not preclude operation as a server, we have to change a few settings to fully utilize our available hardware. This isn't simply a series of configuration modifications meant to increase performance but critical kernel-related tweaks meant to prevent outages.

Though, while we're on the subject, there's no reason to not include purely performance-enhancing changes. Getting the most out of our hardware prevents unnecessary operating strain on existing resources. A server running too close to its limits cannot be considered highly available; an unexpected increase in demand can render a server unusable under the right circumstances.

Getting ready

While the following settings are based on Linux servers, some of the concepts are universal. We'll try to provide enough information to illustrate this. However, keep that in mind for this recipe. Otherwise, look for a directory named /etc/sysctl.d. Any system with this directory can be easily configured by adding a file that contains extra settings here. Otherwise, we need to find a file named /etc/sysctl.conf, which servers a similar purpose but requires direct modification.

The settings we are going to change include the following:

kernel.sched_migration_cost = 5000000
kernel.sched_autogroup_enabled = 0
vm.dirty_background_ratio = 1
vm.dirty_ratio = 5
vm.zone_reclaim_mode = 0
vm.swappiness = 0

How to do it...

If there's a /etc/sysctl.d directory, follow these steps to activate:

  1. Create a file named 30-postgresql.conf in the /etc/sysctl.d directory with the settings we mentioned earlier.
  2. Execute this command as a root-capable user to activate:
    sudo sysctl -p /etc/sysctl.d/30-postgresql.conf
    

Otherwise, follow these steps:

  1. Place the settings in /etc/sysctl.conf.
  2. Execute this command as a root-capable user to activate:
    sudo sysctl -p
    

How it works...

In this case, it's all about the settings. Each of our two illustrated steps simply ensures that the settings are in a location where they become permanent parts of the server. Any future reboot will automatically apply these newly selected values instead of the defaults. The sysctl command activates them immediately, so we don't need to reboot to modify system behavior.

The sched_migration_cost setting is the total time the scheduler will consider a migrated process cache hot and, thus, less likely to be remigrated. By default, this is 0.5 ms (500000 ns). As the size of the process table increases, the complexity inherited by the process scheduler eventually results in high CPU overhead, merely to assign processors to PostgreSQL tasks.

Depending on the count of database clients, we have observed overhead as high as 70 percent, greatly reducing database performance. Our suggested setting of 5 ms gives PostgreSQL enough time to process one or more queries before the task is eligible for migration and prevents the CPU task scheduler from being overworked.

The sched_autogroup_enabled setting causes the operating system to group tasks by origin to improve perceived responsiveness. On server systems, large daemons such as PostgreSQL are launched from the same system task. As they're all in the same large group, they can be effectively choked out of CPU cycles in favor of less important tasks. The default setting is 1 (enabled) on some platforms. By setting this to 0 (disabled), PostgreSQL query performance can be improved by up to 30 percent on databases with hundreds of user connections.

We modify zone_reclaim_mode to completely disable its operation by setting it to 0. According to the Linux kernel documentation, it may be beneficial to switch off zone reclaim when memory should be used for caching files from disk. Without this, the kernel aggressively balances memory between zones, causing excess overhead and reducing available memory for caching disk data.

The dirty_background_ratio setting is a percentage, which we've set to 1. This is the amount of memory that can be marked as modified before the operating system begins writing data to disk in the background. It is closely tied to dirty_ratio, which is the percentage of memory where the operating system blocks all other write activities and aggressively writes dirty memory until everything has been flushed. This kind of occurrence effectively stops all database activity until the flush is complete.

By setting the background ratio to such a low value, the constant background writes make it much less likely that we will reach that trigger point. A highly available server can not afford long unplanned periods of stopped query handling. The constant writing actually slightly reduces performance, which is a risk we have to weigh against the stability of the server.

Lastly, we set swappiness to 0; this disables memory swapping. When Linux runs low on memory, it normally starts moving idle processes to disk to free up RAM. We don't want to risk any of our PostgreSQL clients getting this treatment, so we tell Linux to only swap if there is no other option. This is common to dedicated servers such as a critical PostgreSQL system.

There's more...

Some kernel settings have different names with different versions. For instance, sched_migration_cost is renamed sched_migration_cost_ns in the newer kernel releases. In addition, dirty_background_bytes and dirty_bytes have been added to newer systems due to the amount of memory available on new servers.

Imagine a server with 512 GB of RAM. In such a case, up to 5 GB of memory could be dirty before the operating system writes anything to disk. In the event of an emergency flush, the disk subsystem may not be capable of handling such a large amount. The new settings allow us to use the same logic as before, but with bytes instead of percentages. In systems with more than 64 GB of RAM, these settings should be used instead of dirty_ratio and dirty_background_ratio.

A good place to start for setting dirty_background_bytes is up to double the size of the RAID or disk controller cache. This ensures that there is never more memory waiting to be written than the controller can handle. Similarly, we can set dirty_bytes to eight to ten times the size of the controller cache. This prevents long flushing delays if the background writer ever falls behind.

As always, your mileage may vary. Some PostgreSQL servers may experience slightly faster writes with larger amounts of dirty memory buffers. However, the goal of this book is to reduce the overall risk, even if that's at the cost of some performance. Long periods of database timeouts due to an overwhelmed disk subsystem do not fit this model.