Friday, July 15, 2016

Major Linux I/O Bug? -- UPDATED 2016-07-19

Update 19 Jul 2016: I still believe there's an issue here, but I've got reason to believe my tests were not as apples-to-apples as I'd originally thought. So I'm re-engineering the tests and trying again to get updated numbers.  I'm also checking to see if I can replicate results with Informix completely removed from the mix, doing some testing with simple dd.

It is also possible that there's no problem at all, and I may have to cue Emily Litella.
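For the Informix-free check mentioned above, here's a minimal sketch of the kind of dd-style sequential write test I have in mind. The function name, block size, and block count are placeholders of mine, not the settings from any actual run; it is roughly comparable to `dd if=/dev/zero of=testfile bs=1M count=64 conv=fsync`.

```python
import os
import tempfile
import time


def timed_write(path, block_size=1024 * 1024, count=64):
    """Write `count` blocks of zeros to `path`, fsync, and return MB/s.

    Hypothetical helper for illustration only; sizes are arbitrary defaults.
    """
    block = b"\0" * block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        # Force the data out to storage so the page cache doesn't hide the cost.
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return (block_size * count) / (1024 * 1024) / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        rate = timed_write(os.path.join(d, "testfile"))
        print(f"{rate:.1f} MB/s")
```

To be meaningful, the target file has to live on the storage under test, and the same sizes have to be used on every platform being compared.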

Further Update: Not only were apples not apples, they were also not oranges or Buicks. It turns out a big part of the reason I saw the huge discrepancies was that I had SSD on one system and spinning disk on another. Needless to say, the numbers from the original post are meaningless because of that mistake. But there's still the mystery of the performance degradation in production that coincides with the installation of updates to CentOS.

As a sanity check, I re-ran the same test on various platforms and compared those results to one another and to results I got back in September when I ran a similar battery.  In every case, the results were better today than in September EXCEPT on RHEL/CentOS, where they were considerably worse than they had been in September. AIX on old hardware (with several storage configurations) smoked all combinations on Linux. So I'm thoroughly confused, and still researching.
End Update

I believe I've stumbled upon a major bug in the Linux kernel I/O subsystem.  I've been working on testing newly-acquired hardware to see how it compares against the hardware we're replacing, and in my initial tests, the results I was getting were terrible.  Read performance looked pretty good, but write performance was abysmal.

At first, I was blaming it on the new DAS, thinking that there must be some kind of setting causing it to heavily prioritize reads over writes.  I'd run the same test on AIX machines, and those tests lost on read performance but won easily on write performance.  Then, as a sanity check, I decided to do the same test on a VM, and got the same abysmal results.  I'd run the test on an identically-configured VM before, several months ago, and got much different results, so that tripped my inner, "Hey! Wait a minute!"

This reminded me of a lingering performance issue that we've got on a production Linux VM that seems to have started back in mid-May. So I went onto the VM in question and ran "yum history." That shows that I had done a "yum update" back on May 18, right about the time performance issues started to be reported. "Aha!" I thought.  I've got a culprit now.  In setting up the new hosts, I was doing a "yum update" almost immediately.  So what happens if I re-install the system and don't install updates?

You can guess where this is going: performance looks fantastic.

I'm still trying to narrow it down.  My hope is to figure out which particular package update causes the problem, and I'll report back if/when I find it.  So far, here's what I can tell you:

  • The problem replicates on both RHEL 7 and CentOS 7.
  • If you install cleanly from the latest media (as of about 2 weeks ago), I/O performance looks great. (Kernel release: 3.10.0-327.el7.x86_64)
  • If you run "yum update" then after the updates have installed, I/O performance will go down the crapper: read performance drops by a little more than 10%; write performance drops by nearly 80%! (Kernel releases 3.10.0-327.18.2.el7.x86_64 and 3.10.0-327.22.2.el7.x86_64)
  • On systems and VMs where the updates have been installed, disabling kernel asynchronous I/O (KAIOOFF=1) seems to help, bringing read performance back to what I'd expect, and getting back about half the difference on write performance.
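Since the comparison hinges on which kernel release is running and whether KAIOOFF is set, it helps to record both with every result. A minimal sketch follows; `tag_result` is a hypothetical helper, and it assumes the test script runs in the same environment the database server was started from, which may not be true on your setup.

```python
import os
import platform


def tag_result(label, rows_per_second):
    """Attach the running kernel release and the KAIOOFF setting to a result.

    Hypothetical helper; the dictionary layout is an assumption for illustration.
    """
    return {
        "label": label,
        "rows_per_second": rows_per_second,
        # Same value that `uname -r` reports on Linux.
        "kernel_release": platform.release(),
        # Only reflects this process's environment, not necessarily the server's.
        "kaiooff": os.environ.get("KAIOOFF", "unset"),
    }


if __name__ == "__main__":
    print(tag_result("full purge", 438.46))
```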

Note: the test I've been running is a large, single-threaded purge, starting from the same baseline and purging exactly the same records.  To isolate read performance, I have a version of the purge that scans the table to find the records to be purged but doesn't do the actual purging.  Here are the results of those tests:

Without updates, read-only: 18182.94 rows/second
With updates, read-only: 16170.75 rows/second
With updates, read-only, KAIOOFF=1: 18389.99 rows/second

Without updates, full purge: 2010.49 rows/second
With updates, full purge: 438.46 rows/second
With updates, full purge, KAIOOFF=1: 1125.58 rows/second

[Still to test: Without updates, KAIOOFF=1]
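The percentages quoted in the list above come straight from these figures. As a quick check of the arithmetic (the helper names are mine, purely for illustration): the read drop works out to roughly 11%, the write drop to roughly 78%, and KAIOOFF=1 recovers a bit under half (about 44%) of the lost write throughput.

```python
def percent_drop(before, after):
    """Percentage decrease going from `before` to `after`."""
    return (before - after) / before * 100.0


def fraction_recovered(baseline, degraded, mitigated):
    """Share of the lost throughput that the mitigation wins back."""
    return (mitigated - degraded) / (baseline - degraded)


# Rows/second figures copied from the results above.
read_drop = percent_drop(18182.94, 16170.75)
write_drop = percent_drop(2010.49, 438.46)
write_recovered = fraction_recovered(2010.49, 438.46, 1125.58)

if __name__ == "__main__":
    print(f"read drop: {read_drop:.1f}%")
    print(f"write drop: {write_drop:.1f}%")
    print(f"write recovered with KAIOOFF=1: {write_recovered:.0%}")
```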

