[Video] Building my lab: Progress so far - 10GbE performance issues and disk failures

Exploring disk failures and poor network performance.

Uh oh! A disk failed. It was always going to be a bit of an experiment whether the WD Blue SSDs would be up to the task of running VMs, but I didn't expect to see problems quite so soon. I woke up yesterday to find FreeNAS was a touch unhappy and one of the arrays was degraded.

It looks as though the disk has totally failed:

Dashboard view, showing that the pool is degraded
FreeNAS shows the disk as unavailable

FreeNAS thinks the affected disk is unavailable. This is behaviour I half expected, because the HP Smart Array utility won't present the logical drive to the OS if there is any kind of issue. I also rebooted the box to see whether it was something weird, like the drive needing to be pulled and reseated, but that doesn't seem to be the case.

iLO is unhappy with the failed drive

Even iLO seems to suggest that the disk has totally failed. There is no way this disk got anywhere near its TBW (terabytes written) rating. I originally selected it as a reasonable balance between cost (factoring in the possibility of failures) and performance. That could turn out to be a mistake; we'll see whether any more failures happen on the array. The failed disk should be covered under warranty, so it'll be interesting to see what WD have to say. I'm expecting to be able to RMA it and have a replacement sent out.

In other news, I also ran into a bunch of throughput issues on the FreeNAS box, measured with iperf. I could get (near as damn it) full line speed when receiving on the FreeNAS box, but a paltry 480Mbps when transmitting. At first I thought the problem was buffer settings in FreeNAS itself, but after tweaking and tuning them nothing seemed to help. In a last-ditch effort to sort it before I threw my toys out of the pram, switched to CentOS 7 and built the iSCSI targets by hand, I swapped out the Chelsio T420-CR for a spare card I had from the original build of this environment.
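For anyone wanting to reproduce this kind of asymmetric test, something like the following works; this is a sketch assuming iperf3 (the older iperf2 uses different flags), and the IP address is a placeholder:

```shell
# On the FreeNAS box, run the server:
iperf3 -s

# From another host, send traffic TO FreeNAS (the direction that was fine):
iperf3 -c 10.0.10.10 -t 30

# Reverse mode makes FreeNAS transmit instead - the direction
# where the ~480Mbps figure showed up:
iperf3 -c 10.0.10.10 -t 30 -R
```

Testing both directions separately like this is what makes a one-sided problem (driver, firmware, offload) stand out from a general link issue.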

As it turned out, the replacement card had an older firmware version installed: 1.19 vs 1.24. I later discovered that FreeBSD ships the Chelsio driver as part of the kernel, and part of what the driver does is load an appropriate firmware version onto the card. Only this time it didn't. I had broken my own cardinal rule and updated the firmware on the cards myself, based on experiences I had a while ago with the Mellanox ConnectX-3 40GbE cards I run in production. Updating the firmware on the FreeNAS box turned out to be a bad idea, and it caused the performance issues I saw.
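If you want to check what firmware the cxgbe(4) driver actually sees on a Chelsio card, FreeBSD exposes it via sysctl; the device index below is an assumption and may differ on your box:

```shell
# Firmware version the kernel driver reports for the first T4/T5 adapter:
sysctl dev.t4nex.0.firmware_version

# The boot log also notes when the driver flashes firmware onto the card:
dmesg | grep -i t4nex
```

Comparing this against the version the driver bundles is a quick way to spot the mismatch described above.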

An annoying problem to have, but in the end I managed to get the hardware offloading working correctly too (it was just a case of setting TOE on the interface).
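For reference, enabling TOE on a Chelsio interface under FreeBSD looks roughly like this; the interface name is a placeholder, and on FreeNAS you would set the loader value as a tunable in the UI rather than editing loader.conf directly:

```shell
# The t4_tom kernel module provides the TCP offload engine support:
kldload t4_tom

# Flag the interface so established TCP connections are offloaded:
ifconfig cxl0 toe

# Load the module at boot so the setting survives a restart:
echo 't4_tom_load="YES"' >> /boot/loader.conf
```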

This seems to have had a positive effect on CPU usage. Nothing scientific, though: offloaded traffic no longer shows up in the network stats netdata collects, so I will need to figure out another way of measuring it. If you want to read a bit more about configuring hardware offload for the Chelsio cards, there is a thread on the FreeNAS forums. I haven't had much luck yet with iSCSI offloading, but that doesn't seem to be a limiting factor at all in this setup.

Some time later: So I've been running this configuration for a couple of months now, and I have to say it is running quite nicely! I've even gone so far as setting up iSCSI multipathing to take advantage of having multiple 10GbE links. I've also done a FreeNAS update in this time, and I was quite surprised I didn't need to change anything to keep taking advantage of the TOE offloading!
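The post doesn't say what the initiator side looks like, but as a hedged sketch, multipathing to a FreeNAS target from a Linux initiator would go something like this; the portal IPs and the target name after the default FreeNAS IQN base are placeholders:

```shell
# Discover and log in to the same target over both 10GbE portals:
iscsiadm -m discovery -t sendtargets -p 10.0.10.10
iscsiadm -m discovery -t sendtargets -p 10.0.20.10
iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:vmstore -p 10.0.10.10 --login
iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:vmstore -p 10.0.20.10 --login

# dm-multipath then coalesces the two resulting block devices into one
# multipath device with two active paths:
mpathconf --enable
multipath -ll
```

The same idea applies on other initiators (e.g. ESXi's path selection policies); the key point is logging in to the target once per 10GbE link.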

Over the last month or so I have added three shelves of disks to the FreeNAS box, bringing it to around 94 disks, and I'm moving my workload off the SSD array onto 10k spinning rust until I can do a bit more testing. Without any statistics, purely going by gut feeling, the spinning rust array seems more performant with my VMs, but I would expect that, since I've gone from 20 SATA SSDs to 70-odd 10k disks. I got the shelves from eBay for a pretty decent price (£300 IIRC), which included 24 600GB 10k disks per shelf, which is plenty for me.

Anyway - I hope this was helpful!