Slow RAID performance with our new Linux storage

Over the last few months we periodically ran into performance problems with our storage system. Investigating the cause was difficult because we had no direct shell access and could only rely on the limited information the web GUI exposed. Yesterday my colleagues migrated the storage system from the proprietary operating system to Fedora 22.

After some problems with LVM and directory permissions for Samba, the storage went back online this morning. We quickly noticed that our consistently slow storage had turned into a “sometimes fast, sometimes really slow” machine. In particular, copying ISOs to or from a Samba share resulted in really bad I/O performance on every VM that uses mounted iSCSI disks. For example, during a copy over SMB our internal JIRA and Confluence were no longer usable because the proxy timed out. Both VMs (JIRA/Confluence and the proxy) were stored on the iSCSI disks provided by the storage.

We ruled out the Samba daemon and the operating system as root causes for this issue. We tested the performance with dd and compared our results with the reference values from Thomas Krenn. Our eyes popped when we saw that the performance of our RAID was an order of magnitude lower than the reference values. Even a software RAID was four times (!) faster than our hardware RAID. For direct reads and writes we got a constantly slow throughput of 40 MByte/s. WTF? We thought about this and came to the conclusion that it had to be related to the LSI 9261-8i RAID controller of the storage. A defect in the controller itself seemed unlikely. But then we realized that the Backup Battery Unit (BBU) of the RAID controller was defective. Could a faulty battery really have such an impact? And indeed, Thomas Krenn supported this theory: a defective or disabled BBU causes the RAID cache to be disabled, and performance drops along with it.
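For anyone who wants to reproduce this kind of measurement without dd, here is a minimal Python sketch of a sequential write throughput test. It is not the exact test we ran: the function name `measure_write_throughput`, the 64 MiB test size and the 1 MiB block size are assumptions chosen for illustration. It writes zeros in fixed-size blocks, forces them to disk with fsync, and reports MByte/s. A small test file may still be absorbed by caches, so treat the result as a rough indicator only.

```python
import os
import tempfile
import time


def measure_write_throughput(path, total_bytes=64 * 1024 * 1024,
                             block_size=1024 * 1024):
    """Write total_bytes of zeros to path and return the rate in MByte/s.

    Illustrative helper (not part of any library). The data is flushed
    with fsync before the timer stops, so the page cache alone cannot
    make the result look better than the device really is.
    """
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    # buffering=0 gives an unbuffered raw file, so each write() is a syscall
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            written += f.write(chunk)
        os.fsync(f.fileno())
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / (1000 * 1000) / elapsed


if __name__ == "__main__":
    # Assumption: the temp directory lives on the storage you want to test.
    # Pass a path on the RAID volume instead if it does not.
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "throughput.bin")
        print(f"{measure_write_throughput(target):.1f} MByte/s")
```

Running it against a path on the RAID volume and comparing the number with a local disk gives a quick sanity check before digging into controller settings.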

A replacement BBU has been ordered, and I am optimistic that it will fix the performance issue. I’ll update this blog post as soon as we have the new battery installed.

Update 2015-09-14: The new BBU has been installed. The RAID performance is fine now.