"Amigos y nadie más. El resto, la selva"
-- Jorge Guillén

How to squeeze the most performance from a Linux system?

We have a very specific application that we want to tune so that its response times move from the millisecond range into the microsecond or even nanosecond range. How do we achieve that? The only way is to tune Linux for your specific hardware and your specific load. When you install a distribution on your server, it comes preconfigured for "generic" tasks. In other words, no real optimization has been done, because there is no way to know what your system will end up doing. Are you using multiple CPUs? More than one network card? Are you doing a lot of disk I/O, or is your application running mostly from memory? Each of these cases is handled differently. This article attempts to make those settings more understandable so that you can pick and choose the ones that apply to your workload.

Optimizing a multi-homed system with no I/O load

Say that we have a multi-core, multi-CPU system with two (or more) network interfaces, and that our application runs mostly from RAM. Say also that we measure a time-to-first-byte (TTFB) of roughly 200ms after the carrier signal goes from off to on (the network cable was unplugged and plugged back in), and that we would like to bring this down to a more reasonable 10 to 30ms. This is a very common problem, as most sysadmins are trying to squeeze as much performance as they can from their existing datacenter infrastructure.

Handling interrupts

Our first step is to determine how the kernel handles interrupts.

cat /proc/interrupts
          CPU0       CPU1       
58:     951107          0         PCI-MSI  eth0
66:     643786          0         PCI-MSI  eth1

As you can see here, CPU0 is handling all the interrupts for the first and second network interfaces (eth0 and eth1 respectively). You can change the affinity of an interrupt so that the kernel delivers it to a particular CPU. For example, to have each CPU handle one interface:
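On a machine with many interrupt sources the full table gets long. As a rough sketch, the helper below filters /proc/interrupts-style text down to one device and prints its IRQ number followed by the per-CPU counts. The function name irq_counts is made up for illustration, and it assumes the device name is the last field on the line, as in the listing above; some drivers append queue suffixes to the name, in which case the match would need adjusting.

```shell
# irq_counts DEVICE: read /proc/interrupts-style text on stdin and print
# "IRQ count-on-CPU0 count-on-CPU1 ..." for the row whose last field is DEVICE.
# (Hypothetical helper name; assumes the device name is the last column.)
irq_counts() {
    awk -v dev="$1" '
        NR == 1 { ncpu = NF; next }      # header row: one column per CPU
        $NF == dev {
            sub(/:$/, "", $1)            # strip the colon after the IRQ number
            line = $1
            for (i = 2; i <= ncpu + 1; i++)
                line = line " " $i       # append each per-CPU counter
            print line
        }'
}

# Typical use on a live system:
#   irq_counts eth0 < /proc/interrupts
```

Run before and after changing the affinity, this makes it easy to see which CPU's counter is still growing.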

echo 1 > /proc/irq/58/smp_affinity
echo 2 > /proc/irq/66/smp_affinity

This translates roughly to: CPU0 will handle all IRQ 58 interrupts and CPU1 will handle all IRQ 66 interrupts. The number we are passing is a bitmask, written in hexadecimal, in which bit N stands for CPU N (1 is CPU0, 2 is CPU1, 4 is CPU2, and 3 is both CPU0 and CPU1). It works much like umask and netmasks do for file creation and networks respectively. So you are saying "limit this IRQ to this particular set of CPUs". Note that isolating tasks this way can cause performance issues on your system. Refer to the kernel-parameter documentation (see Additional Information), in particular isolcpus.
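Working out the mask by hand is error-prone once you go past a couple of CPUs. A minimal sketch of the conversion, assuming a single CPU index between 0 and 31 (larger systems write the mask as comma-separated 32-bit groups, which this does not handle); cpu_to_mask is an illustrative name, not an existing tool:

```shell
# cpu_to_mask N: print the hexadecimal affinity mask that selects only CPU N.
# Sketch only: valid for N in 0..31.
cpu_to_mask() {
    printf '%x\n' "$((1 << $1))"
}

# Example (as root), pinning IRQ 66 to CPU1:
#   echo "$(cpu_to_mask 1)" > /proc/irq/66/smp_affinity
```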

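If you need to change the CPU that handles a process that is already running, you can use taskset -cp CPU-LIST PID. Without -c, taskset expects a hexadecimal bitmask instead of a CPU list, so a mask of 0 selects no CPU at all and is rejected. For example, to have the process with ID 4200 handled exclusively by CPU0 (equivalent to taskset -p 0x1 4200):
taskset -cp 0 4200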
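To confirm which CPUs a process is currently allowed to run on, you can read the Cpus_allowed_list field that the kernel publishes in /proc/PID/status. The sketch below wraps that lookup; allowed_cpus is a hypothetical helper name, and it prints "unknown" when the process or the field is not available.

```shell
# allowed_cpus PID: print the CPU list the kernel allows for PID (e.g. "0-3"),
# or "unknown" if /proc/PID/status cannot be read or lacks the field.
allowed_cpus() {
    list=$(awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    echo "${list:-unknown}"
}

# Example: check the current shell, then a specific process
#   allowed_cpus $$
#   allowed_cpus 4200
```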

Note that these settings are made on a running system and will be reset on reboot. To make the changes permanent, add anything that lives under /proc/sys to /etc/sysctl.conf, and anything else (such as the /proc/irq commands above) to /etc/rc.local, using the same commands you typed on a terminal. Here is a sysctl.conf example that changes the coredump pattern to /tmp/core.%p.%e.%u (%p is the process ID, %e is the executable name, %u is the user ID):

kernel.core_pattern = /tmp/core.%p.%e.%u

Essentially, if you are doing echo 1 > /proc/sys/foo/bar, you can drop /proc/sys/ from the name, replace each / with . (dot), and follow it with = and the value:
foo.bar = 1
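The renaming rule is mechanical, so it can be sketched as a small function. sysctl_line is an illustrative name; it only prints the line and leaves appending it to /etc/sysctl.conf to you.

```shell
# sysctl_line PATH VALUE: turn a /proc/sys path and a value into the
# matching sysctl.conf line. Sketch only: it prints and does not write anything.
sysctl_line() {
    key=${1#/proc/sys/}                        # drop the /proc/sys/ prefix
    key=$(printf '%s' "$key" | tr '/' '.')     # replace each / with .
    printf '%s = %s\n' "$key" "$2"
}

# Example (as root):
#   sysctl_line /proc/sys/kernel/core_pattern '/tmp/core.%p.%e.%u' >> /etc/sysctl.conf
```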

Improving time calls like gettimeofday()

Most applications make many calls to gettimeofday() (or similar time functions). Each call that has to enter the kernel adds latency, because the process must switch from user space to kernel space and back just to fetch the time. Some systems are tuned so that these time calls can be served entirely in user space (Red Hat Enterprise Linux 5 and above, for instance). Search for VDSO in the documentation for your particular distribution's kernel and glibc. On kernels that expose this setting, you turn the behavior on with:

echo 1 > /proc/sys/kernel/vsyscall64

And to make this change permanent across reboots:
echo "kernel.vsyscall64 = 1" >> /etc/sysctl.conf

Note that real-time patched glibc implementations allow further optimizations for these calls. They can, for instance, "know" that the last call was made very recently and skip requesting a new time if the call falls within the same clock tick (1/HZ, on the order of a millisecond). To use this behavior instead, write 2 to the same kernel parameter (vsyscall64).
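Before changing anything, it can help to check whether a process has the vDSO mapped at all. The sketch below looks for the [vdso] entry in /proc/PID/maps-style text; has_vdso is a made-up helper name. Finding the mapping only tells you the kernel offers the user-space path, not that every time call on your clock source actually stays out of the kernel.

```shell
# has_vdso: succeed if the maps text on stdin contains a [vdso] mapping.
has_vdso() {
    grep -q '\[vdso\]'
}

# Example: check the current shell
#   has_vdso < /proc/$$/maps && echo "vDSO is mapped"
#
# Only write the tunable if this kernel actually provides it (as root):
#   [ -w /proc/sys/kernel/vsyscall64 ] && echo 1 > /proc/sys/kernel/vsyscall64
```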

I/O load optimizations

One of the simplest ways to get more performance out of your local disks is to tell your system not to do unnecessary timestamp updates, such as updating the access time (see man 2 stat and read about atime) on your files. Granted, unnecessary here does not mean the information is useless or not required on other types of systems; you need to make this determination for your own data. Also, keep in mind that the noatime mount option turns off access time updates for all files under the mounted tree. If you control the source of the application, you might instead change the flags it uses to open files so that they include O_NOATIME (see man 2 open).

cat /etc/fstab
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
# /dev/mapper/bootdisk-root 
UUID=uuid_goes_here /srv               ext4    noatime 0       2

If you can't use noatime for some valid reason, look into relatime (man mount). It updates the access time only when the previous access time is earlier than or equal to the modify or change time, which keeps applications that depend on this information working properly. You should also give your application its own mount point so that the impact on other applications on the system is minimal.
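After editing /etc/fstab it is worth checking which access-time mode a mount point is really using. A small sketch that reads /proc/mounts-style text and reports it; atime_mode is an illustrative name, and "default" here simply means none of the three options appeared in the option list.

```shell
# atime_mode MOUNTPOINT: read /proc/mounts-style text on stdin and print
# noatime, relatime or strictatime for MOUNTPOINT, "default" if none of
# them is listed, or "not mounted" if the mount point does not appear.
atime_mode() {
    awk -v mp="$1" '
        $2 == mp {
            found = "default"
            n = split($4, opts, ",")           # 4th field: mount options
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^(noatime|relatime|strictatime)$/)
                    found = opts[i]
        }
        END { print (found == "" ? "not mounted" : found) }'
}

# Example:
#   atime_mode /srv < /proc/mounts
# Apply the option without rebooting (as root):
#   mount -o remount,noatime /srv
```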

These settings are specific to filesystems, so your particular filesystem might not have a proper implementation of access time (or a way to turn it off). However, most modern filesystems do. This leads to the next best way to optimize your disk I/O: switch to a modern filesystem.

This, however, means spending many hours moving your data to the new filesystem, so plan ahead. A quick search online will turn up results to guide you toward the filesystem best suited to your workload. Typically the plethora of options can be summarized like this:

  • reiserfs: good for workloads that require read access to many small files in a single directory or in deeply nested directories
  • ext(1-4): a family of filesystems that is good for general-purpose use, but they fail miserably when saving large files and when your load requires deleting (unlink) many files
  • XFS: (my personal favorite) an overall solid filesystem that works well for large data and fairly well for small files (like reiserfs). Operations such as checking for file corruption and growing the filesystem while it is live (mounted) are also possible with XFS, unlike some ext* filesystems (anything older than version 3)

You should rethink your approach to journaling as well. Do you really need your filesystem to keep track of things before they are done? Can you move the journal to another disk so you can save those spinning cycles on your main disk, i.e. where your load takes place?

There are other tricks that you can use to improve your I/O performance, like spreading the workload over an array of disks. This is commonly known as a RAID 0 array, but there are other ways to do it. RAID comes with its own set of limitations and pitfalls, so you need to plan accordingly and keep current (tested!) backups both locally and in a remote location, which in turn increases your cost. One very common oversight about RAID 0 arrays is that as the number of disks in the array increases, so does the probability of losing all your data. All disks fail at some point, RAID 0 has no redundancy, and the failure of any single disk takes the whole array with it, so you need to be able to tolerate those failures gracefully. Using tricks like RAID 1+0 can help, but at a higher cost. Add backup costs to the mix, and at that point you might consider investing in a commercial storage solution.
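To put a number on that risk, here is a back-of-the-envelope sketch. It assumes every disk fails independently with the same probability p over the period you care about, which is a simplification (disks from the same batch tend to fail together). Under that assumption the chance of losing a RAID 0 array of n disks is 1 - (1 - p)^n. The function name and the 5% figure are illustrative only.

```shell
# raid0_loss_prob P N: probability that at least one of N disks fails,
# given each fails independently with probability P (assumed, not measured).
raid0_loss_prob() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}

# Example with an assumed 5% per-disk failure probability:
#   raid0_loss_prob 0.05 1    # one disk
#   raid0_loss_prob 0.05 4    # four-disk stripe
```

With these assumed inputs, a four-disk stripe is roughly 3.7 times as likely to be lost as a single disk.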

Some commercial vendors, like 3PAR, use a smarter approach than plain RAID to improve the I/O load on your system. They allow the data chunks in the filesystem to be spread, with redundancy, over a wider mixture of solid state disks (SSDs), used for caching data, and spinning disks, used for cheap storage. This provides both performance and cost efficiency.

When disks start to fail, they degrade your ability to read and write data gradually. You might or might not catch these errors as they are reported by the kernel in dmesg or in the system logs. The best way to know is to tap into the disks' built-in feature called SMART. SMART can predict possible failures so that you have enough time to replace disks before they fail.

Install the smartmontools package, which provides the smartctl command. You can then use it to tell disks to run a self-check, and you can verify the health status with:

sudo smartctl -H /dev/sdd
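For unattended checks, the health verdict can be turned into an exit status. The sketch below reads smartctl -H output on stdin and succeeds only if it sees one of the two "healthy" lines that smartctl prints for ATA and SCSI disks respectively. smart_ok is a hypothetical helper name, and the alerting around it is left to you.

```shell
# smart_ok: succeed if smartctl -H output on stdin reports a healthy disk.
# Matches the ATA wording ("test result: PASSED") and the SCSI wording
# ("SMART Health Status: OK").
smart_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Example:
#   sudo smartctl -H /dev/sdd | smart_ok || echo "check /dev/sdd"
# Start a short self-test first if you want fresh results:
#   sudo smartctl -t short /dev/sdd
```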


In conclusion, optimizing your Linux servers to squeeze the most performance out of them is both a science and an art. It takes patience, research, thinking outside of the norm, and possibly spending a bit more money, but in the end the results pay for themselves as your total cost of ownership (TCO) flattens over time.

Additional Information

  1. Documentation/kernel-parameters.txt
  2. stat
  3. RAID
  4. CPU affinity

