BlueArc Storage System
The BlueArc Titan III NAS (network-attached storage) system that serves the Hoffman2 cluster is architected for uninterrupted service and high throughput. Three high-performance NAS servers form a cluster, and each node has full access to over a third of a petabyte of usable storage on almost 800 physical drives. Each node has dual 10 Gbps Ethernet intra-cluster connectivity, as well as dual 10 Gbps Ethernet links to the ATS high-speed network. The servers can hot-spare for a failed node with minimal downtime (measured in seconds) and no data loss. Behind the server nodes are eight discrete pools of disk storage, based on multiple technologies and drive sizes, from 7,200 RPM enterprise SATA disks to the latest 15,000 RPM SAS disks.
The highest throughput filesystems, such as /u/scratch, are located on the highest performance disks and distributed over many spindles (disks) to increase aggregate performance, especially with large numbers of compute-node clients simultaneously accessing the storage.
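As a rough illustration of why spreading a filesystem over many spindles raises aggregate performance, sequential throughput scales roughly with the number of disks the data is striped across. The figures below are hypothetical, illustrative numbers, not measurements of the Hoffman2 storage:

```shell
#!/bin/sh
# Back-of-the-envelope estimate of aggregate striped throughput.
# Both numbers are assumed for illustration only.
DISKS=48            # spindles backing a hypothetical scratch pool
PER_DISK_MBS=150    # assumed sequential MB/s per 15k RPM SAS disk

echo "Aggregate: $((DISKS * PER_DISK_MBS)) MB/s across $DISKS spindles"
# -> Aggregate: 7200 MB/s across 48 spindles
```

In practice, many compute-node clients reading and writing at once is what makes this aggregate figure matter: each client's I/O lands on different spindles rather than queuing behind one disk.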
Other filesystems, such as /u/home, actually consist of multiple underlying pools of storage, which allows for parallelization, higher throughput, and expandability.
A side effect of this is that users may see different performance on the home directory /u/home/joe than on /u/home/fred, because the two directories are located on different pools of storage.
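A user can see which underlying filesystem serves a given directory with df; two home directories backed by different pools will show different mount sources. This is a minimal sketch using standard tools (the /u/home paths above are specific to Hoffman2; substitute your own home directory):

```shell
#!/bin/sh
# Print the mount source (i.e., which NAS filesystem) backing a directory,
# so two users can compare which storage pool their homes live on.
pool_of() {
    # df -P gives portable one-line-per-filesystem output; field 1 is the source.
    df -P "$1" | awk 'NR==2 { print $1 }'
}

# Example usage (substitute /u/home/joe, /u/home/fred, etc. as appropriate):
pool_of "${HOME}"
```

If the printed sources differ for two directories, they sit on different pools, and some performance difference between them is expected.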
The recent performance degradation on some of our storage pools has several causes. The vendor of the storage subsystems has identified issues with the high-performance disk controllers used in several of our disk subsystems, and we are awaiting replacement parts. In addition, our NAS vendor introduced a bug in a previous firmware version that, rather than keeping reads and writes balanced across discrete blocks of disk storage, began to write data asymmetrically (data were not written in equal amounts across all of the available disks). This imbalance caused severe contention for I/O resources and self-reinforcing congestion that severely crippled performance.
We have already upgraded to the latest vendor firmware, but the bug has left structural imbalances in the distribution of data that will continue to produce degraded performance until they are corrected.
To restore performance to expected and acceptable levels, we will be moving data from the affected storage to new storage, and then back again to rebalance the load. To carry out this transfer, we will schedule two cluster outages: one to switch over to the temporary storage, and one to switch back to the permanent storage.
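The copy-out/copy-back pattern can be sketched in miniature as follows. This is only an illustration of the two-step transfer, not the actual migration procedure or tooling; the paths are hypothetical placeholders under /tmp:

```shell
#!/bin/sh
# Minimal sketch of the move-out / move-back rebalancing pattern:
# copy data to temporary storage, rebuild the permanent pool, copy back.
PERM=/tmp/permanent_pool   # placeholder for the affected storage
TEMP=/tmp/temporary_pool   # placeholder for the temporary storage
mkdir -p "$PERM" "$TEMP"
echo "user data" > "$PERM/file.txt"

cp -a "$PERM/." "$TEMP/"           # outage 1: switch over to temporary storage
rm -rf "$PERM" && mkdir "$PERM"    # permanent pool is rebuilt/rebalanced
cp -a "$TEMP/." "$PERM/"           # outage 2: switch back to permanent storage

cat "$PERM/file.txt"               # -> user data
```

The -a flag preserves ownership, permissions, and timestamps, which is the essential requirement when relocating user data between pools.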