The guide to enterprise storage: disks

24 Sep, 2013 · Submitted 2013-09-24T08:01:22Z · Read in about 14 min · (2826 Words)
tags: · tech ·

Storage is one of my favorite things. I love everything about it: NAS, SAN, networks, SCSI, SSDs! However, most home users and amateurs never get the chance to do much with storage besides format it with Windows or maybe toss an SSD into their system. There’s a lot to learn with storage, and it’s been around long enough to have plenty of legacy cruft hanging around. Rather than trying to cover everything at once, I want to talk about storage in pieces. I am going to start with disks (spinning, hard disk drives) and work my way up the stack to SANs, and I’ll probably talk about SSDs separately.

The pursuit of 4K

While 4K televisions are the new hotness, 4K has been around in the hard drive world for a few years. When you go buy your 2TB hard drive from Western Digital on NewEgg, it’s probably divided into a ton of 4KB pieces called sectors. Drives with 4K sectors are said to be Advanced Format drives, while the old drives are just Old Drives. Prior to 2010 or so, drives were cut up into 512-byte sectors. If you do your math, you’ll see that 8x 512-byte sectors add up to 1x 4KB sector, so there are 8x fewer sectors on a newer drive. That means less space is lost to per-sector overhead, and since the average write has grown, the larger sectors don’t waste much space.

When a hard drive goes to read some data, it reads it sector by sector. If you have 12KB of data to read, your hard drive has to go find the three different 4KB sectors that the data is stored in. The sector is the smallest unit of space that your hard drive can read or write at a time, and how fast it can find and read a sector is critical to determining the drive’s performance.
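Here is a minimal Python sketch of that sector arithmetic. The function name is made up for illustration, and it assumes the data starts exactly on a sector boundary.

```python
KB = 1024

def sectors_needed(data_bytes: int, sector_bytes: int) -> int:
    """Whole sectors required to hold data_bytes (a drive cannot read part of a sector)."""
    return -(-data_bytes // sector_bytes)  # integer ceiling division

print(sectors_needed(12 * KB, 4 * KB))  # 3  (the 12KB example on a 4K-sector drive)
print(sectors_needed(12 * KB, 512))     # 24 (same data on a 512-byte-sector drive)
print(sectors_needed(1, 4 * KB))        # 1  (even one byte costs a full sector)
```

The 8x ratio between the first two results is the same 8x reduction in sector count described earlier.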

Alignment woes

One of the fading issues from the storage world is alignment, or rather, misalignment. The problem stems from older operating systems (Windows XP and Server 2003) only being programmed for 512-byte sector sizes. When Windows XP is formatting a drive, it blocks out the first 63 sectors in a row for the partition offset. Unfortunately, 63 is not divisible by 8 (63 x 512 bytes is 31.5 KB instead of 32 KB), so when Windows XP or Server 2003 starts laying out the partition, everything is 512 bytes short of a 4K boundary.
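A quick way to see the problem is to check whether the partition's starting byte lands on a 4K boundary. This is only a sketch: the helper names are hypothetical, and the 64-sector start is an assumed aligned value used for comparison.

```python
LEGACY_SECTOR = 512     # bytes, the sector size XP/2003 assume
PHYSICAL_SECTOR = 4096  # bytes, an Advanced Format sector

def partition_offset_bytes(start_sector: int) -> int:
    """Byte offset of a partition that starts at the given 512-byte sector."""
    return start_sector * LEGACY_SECTOR

def is_aligned(offset_bytes: int) -> bool:
    """True when the offset falls exactly on a physical 4K sector boundary."""
    return offset_bytes % PHYSICAL_SECTOR == 0

xp_offset = partition_offset_bytes(63)       # the XP/2003 offset
aligned_offset = partition_offset_bytes(64)  # hypothetical aligned offset

print(xp_offset, is_aligned(xp_offset))            # 32256 False
print(aligned_offset, is_aligned(aligned_offset))  # 32768 True
```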

Now, when Windows tries to read a single NTFS block, that block straddles two disk sectors, so the hard drive has to read both, doing twice the work. If you’re doing a large read or write operation, the penalty is N+1 (one extra read or write), while if you’re doing a lot of small operations, the penalty can grow to 2N (double the number of reads/writes required). Usually the penalty falls somewhere in the middle. I have typically observed a 30-40% drop in performance on a misaligned system, and as high as the worst case (2N) with some workloads.
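The N+1 and 2N penalties can be counted directly. The sketch below assumes 4KB NTFS blocks on 4K physical sectors and compares the 63-sector offset against a hypothetical aligned 64-sector offset. The function names are illustrative.

```python
PHYSICAL_SECTOR = 4096  # bytes per Advanced Format sector

def sectors_touched(offset_bytes, length_bytes, sector=PHYSICAL_SECTOR):
    """Count the physical sectors a single request overlaps."""
    first = offset_bytes // sector
    last = (offset_bytes + length_bytes - 1) // sector
    return last - first + 1

def total_sectors(partition_offset, block_bytes, block_count, one_request):
    """Physical sectors read for block_count blocks at the start of the partition."""
    if one_request:
        # One large sequential request covering every block.
        return sectors_touched(partition_offset, block_bytes * block_count)
    # Many small requests, each one issued (and paid for) separately.
    return sum(
        sectors_touched(partition_offset + i * block_bytes, block_bytes)
        for i in range(block_count)
    )

MISALIGNED = 63 * 512  # the XP/2003 offset
ALIGNED = 64 * 512     # hypothetical aligned offset for comparison
N = 8

print(total_sectors(ALIGNED, 4096, N, one_request=True))      # 8  (no penalty)
print(total_sectors(MISALIGNED, 4096, N, one_request=True))   # 9  (N+1)
print(total_sectors(MISALIGNED, 4096, N, one_request=False))  # 16 (2N)
```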

Misalignment is mostly a thing of the past. Nobody is installing Windows XP or 2003 from scratch, and anything after (and including) Vista or Server 2008 has migrated to a 1MB partition offset (which is easily divided into 4K sectors). If you’re not dealing with legacy operating systems, you’re probably fine. But, it’s still important to know for companies that are migrating from 2003, especially with VMs. To fix your VM, there are tools that can modify the VMDK or VHD file to stuff in some extra zeroes and ones to the front of the file to get everything back into alignment.

The quintessential input/output operation

You’ve probably heard the term “IOPS” at least once. It means input/output operations per second, but what’s an input/output operation anyway? The typical I/O request is a 4K read or write. When you’re navigating your OS menus, browsing the internet, or working on a file, your computer is doing a ton of 4K read/write operations. Checking permissions, writing cookies and registry values, remembering where you left the position of a particular window. Because 4K is the smallest operation possible, it tends to happen a lot.

How fast can your hard drive perform a single 4K operation? That’s a good question. Hard drive performance is tied very closely to the seek time of that particular drive. The seek time describes how fast your hard drive can find a random spot on the hard drive (any arbitrary 4K block). Let’s say you have a reasonable seek time (plus read latency) of 18 milliseconds (ms). That’s typical for a Western Digital Black: consumer-grade, but designed for enthusiasts. How many times can your hard drive seek in 1 full second? Well, 1000ms/(18ms/seek) is about 55.6 seeks/second. So, if you tell your hard drive to go find 55 random 4K blocks, it will take about a second to pull that off. If you do the math, 55 blocks at 4KB per block is only 220 KB per second, which is just slightly faster than a T1 connection. That’s pretty slow when it comes to hard drive speeds! How is this possible?
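The same back-of-the-envelope math, sketched in Python. The helper names are mine, and the T1 figure uses the standard 1.544 Mbit/s line rate.

```python
def random_iops(latency_ms: float) -> float:
    """Random operations per second when each one costs latency_ms."""
    return 1000.0 / latency_ms

def throughput_kb_per_s(iops: float, block_kb: float = 4) -> float:
    """Data moved per second at a given IOPS rate and block size."""
    return iops * block_kb

iops = random_iops(18)
print(round(iops, 1))           # 55.6
print(throughput_kb_per_s(55))  # 220 (KB/s, using 55 whole blocks)

t1_kb_per_s = 1_544_000 / 8 / 1000  # T1 line rate converted to KB/s
print(t1_kb_per_s)                  # 193.0
```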

Now, that’s assuming those blocks are distributed randomly. If all of the blocks are lined up in a row (by “in a row”, that means physically in a row on the actual platter), the hard drive can read those much faster. Much, much faster. Your average WD Black drive can put out at least 100 MB/s, or even higher at full blast. That’s equivalent to 25,000 of those 4K blocks per second, or roughly 450 times faster than our last example. Because the seek time is reduced to near-zero, you can get great performance out of your drive. If you’re just telling the hard drive “go read these 1,000 blocks in a row”, that’s what’s called a sequential operation. If you tell a hard drive “go read these 1,000 blocks in random spots on the drive”, that’s a random operation.

Your hard drive will have very different performance characteristics when it comes down to sequential vs random transfers. That’s why it’s so important to know your workload’s transfer characteristics when sizing your storage. You’re never going to see 100% sequential or 100% random operations, so figuring out your mix is crucial. Applications like SQL are notorious for their focus on small block size random IO, while other applications like Exchange (2010 and above) are designed specifically for sequential ops.

The operating system can request data in larger than 4K sizes. For example, it could ask for a 4MB chunk, which is equivalent to about a thousand 4K blocks. Your hard drive doesn’t need to seek to each block individually; it can grab the whole 4MB in one sweep (if there’s no fragmentation). I/O requests can range from 4K (smallest) to 8MB (largest). If you want to max your IOPS, you go 4K. If you want to max your bandwidth, you go 8MB. 32K is often seen as the “sweet spot” since most files (Word documents, pictures, etc) are somewhere around that size.
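To see why small requests favor IOPS and large requests favor bandwidth, here is a deliberately simple model: every request pays one positioning delay and then streams its data at the sequential rate. The 18ms and 100 MB/s figures are the example numbers from earlier. The model itself is my simplification, not a measurement.

```python
def request_time_ms(size_kb, latency_ms=18.0, throughput_mb_s=100.0):
    """One positioning delay plus the time to stream size_kb off the platter."""
    transfer_ms = size_kb / throughput_mb_s  # 1 MB/s equals 1 KB per ms
    return latency_ms + transfer_ms

def iops_and_bandwidth(size_kb):
    """Requests per second and MB/s for back-to-back random requests of size_kb."""
    iops = 1000.0 / request_time_ms(size_kb)
    return iops, iops * size_kb / 1000.0

for size_kb in (4, 32, 4000, 8000):
    iops, mb_s = iops_and_bandwidth(size_kb)
    print(f"{size_kb:>5} KB requests: {iops:6.1f} IOPS, {mb_s:6.2f} MB/s")
```

With these assumed numbers, 4K requests give about 55 IOPS but well under 1 MB/s, while 8MB requests give about 10 IOPS but roughly 80 MB/s.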

When someone is talking about IOPS, they’re almost always talking 4K performance, but it could be either random or sequential performance. 4K random is the worst case scenario, so you’ll see relatively small numbers, but they’re accurate as the minimum performance you’ll see from the drive. I like dealing with 4K random IOPS numbers, because it ensures everyone is on the same page and that I’m looking at the worst possible performance. You want to size things for the worst case, not the best case.

How many IOPS?

Obviously, every hard drive is different, and the tech is advancing quickly. But here’s my cheat sheet for IOPS calculations:

SATA drive / 7200 RPM (nearly all consumer drives) — 50-75 IOPS (13-20ms)
SAS drive / high-end SATA (Cheetah) / SFF drives 10000 RPM — 100-125 IOPS (8-12ms)
Enterprise SAS drives / 15000 RPM — 150-200 IOPS (5-6ms)
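Part of the reason RPM tracks latency so closely is rotational delay: on average the platter has to turn half a revolution before the right sector passes under the head. The sketch below computes that and adds an average seek time to estimate IOPS. The seek times passed in are assumed round numbers for illustration, not vendor specs.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average wait for the sector to rotate under the head (half a revolution)."""
    return 0.5 * 60_000 / rpm

def estimated_iops(rpm: int, avg_seek_ms: float) -> float:
    """Rough random IOPS: one seek plus one rotational delay per operation."""
    return 1000.0 / (avg_seek_ms + avg_rotational_latency_ms(rpm))

for rpm in (7200, 10000, 15000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 2))  # 4.17, 3.0, 2.0

# Assumed 3ms average seek on a 15k drive (hypothetical figure).
print(estimated_iops(15000, 3.0))  # 200.0
```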

As a note, look at a SAN full of SATA drives. When your minimum disk latency is 13-20ms, and you tack on some network latency, you will never get back to the requesting system in under 20-25ms. That’s considered pretty poor performance. Remember that latency plays a part!

5ms = hearing the crunch of something you stepped on, speed of GOOD local disk
10ms = very quick response for a SAN, not much slower than local disk
15ms = acceptable
20ms = borderline, DBs (SQL) will start underperforming
25ms = unacceptable, VMware will get mad
30ms+ = users will start to notice their file shares are slow, etc

Humans can easily notice latency differences of 5ms and lower, so the difference between 10ms and 15ms is recognizable, and the difference between 5ms and 10ms is monumental.

How many IOPS does something use? Your average desktop user (VDI or physical) is somewhere between 10 and 20 IOPS. I have run setups of ~100 VM servers on about 2,000 IOPS during the day (20 IOPS/server). Obviously this depends on the workload: a single server can push 2,000 IOPS if it is busy enough (but not commonly).
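Turning a workload estimate into a drive count is simple division, rounded up. This sketch uses the low end of each range from the cheat sheet (worst case) and ignores any RAID or caching effects, so treat the results as a floor rather than a design.

```python
def drives_needed(required_iops: int, iops_per_drive: int) -> int:
    """Smallest number of drives whose combined IOPS meets the requirement."""
    return -(-required_iops // iops_per_drive)  # integer ceiling division

required = 100 * 20  # ~100 VM servers at 20 IOPS each

print(drives_needed(required, 150))  # 14 (15k RPM drives)
print(drives_needed(required, 100))  # 20 (10k RPM drives)
print(drives_needed(required, 50))   # 40 (7200 RPM drives)
```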

$$$ MONEY $$$

If 15k RPM SAS drives are so great, why don’t we use them for everything? Because they’re really expensive.

Based on some LIST NUMBERS that are at least a year old now (you should be able to get 60%+ off list EVERY TIME), let me break down some stats for you. The best IOPS/$ (performance per dollar) is a SAS disk, at around 70 IOPS/$1,000 being a common stat. You should expect around 3,500 IOPS out of a 15k RPM system that is priced at $50k. On the low end, SATA drives only offer about 15 IOPS/$1,000. That same $50k system would only furnish 750 IOPS, which is very low. Note that the difference is somewhere around 5X, that is, you get 5X the performance per dollar with faster disk.

Now, when it comes to GB/$ (storage per dollar), the roles are reversed. A system full of 2TB or 3TB drives will give you about 700 GB/$1,000, or 35TB raw for about $50k. A system full of 450 GB SAS drives will yield only 200 GB/$1,000, or only 10TB raw. The slower drives offer about a 3.5X improvement in storage per dollar. Obviously you need to know what you’re buying storage for.
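Both per-dollar stats can be run through the same small calculation. The rates below are the list-price figures quoted above. The function and dictionary names are just for illustration.

```python
def buys(budget_dollars: float, per_thousand_dollars: float) -> float:
    """How much of a resource a budget buys at a given rate per $1,000."""
    return budget_dollars / 1000 * per_thousand_dollars

BUDGET = 50_000
TIERS = {
    "15k SAS":   {"iops": 70, "gb": 200},
    "7200 SATA": {"iops": 15, "gb": 700},
}

for name, rate in TIERS.items():
    iops = buys(BUDGET, rate["iops"])
    tb_raw = buys(BUDGET, rate["gb"]) / 1000
    print(f"{name}: {iops:.0f} IOPS, {tb_raw:.0f} TB raw")
# 15k SAS: 3500 IOPS, 10 TB raw
# 7200 SATA: 750 IOPS, 35 TB raw
```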

However, there is an often overlooked stat, which is IOPS/GB, which determines how many IOPS you get for every GB of storage. Having 500 TB of storage is pointless if you don’t have the IOPS to back up your workload. You need to know the rough IOPS/GB to make sure you’re buying the right disk speed AND size. When you have 1-4TB SATA drives available, for example, you need to know which one meets your performance needs AND your storage needs. Buying 10 of the 4TB drives will yield one quarter of the performance of 40 of the 1TB drives, but the same amount of storage.
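Here is the 4TB versus 1TB comparison as a sketch. It assumes every 7200 RPM drive delivers the same 50 IOPS regardless of capacity (the low end of the cheat sheet), which is the whole point: IOPS scale with spindle count, not with size.

```python
def pool_stats(drive_count: int, tb_per_drive: int, iops_per_drive: int = 50):
    """Return (capacity in GB, total IOPS, IOPS per GB) for a pool of identical drives."""
    capacity_gb = drive_count * tb_per_drive * 1000
    total_iops = drive_count * iops_per_drive
    return capacity_gb, total_iops, total_iops / capacity_gb

big = pool_stats(10, 4)    # ten 4TB drives
small = pool_stats(40, 1)  # forty 1TB drives

print(big)    # (40000, 500, 0.0125)
print(small)  # (40000, 2000, 0.05)
```

Same 40TB either way, but the pool of smaller drives has four times the IOPS and four times the IOPS/GB.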

As a quick aside, don’t buy the bullshit about “NL-SAS” drives. They’re either 7200 RPM or they’re not. A 7200 RPM drive, even in a nice NL-SAS wrapper, is still only capable of so many IOPS based on the physical limitations. Definitely don’t pay 10k RPM prices for them!

10k RPM drives tend to be the SFF (small form factor, 2.5” instead of 3.5”) drives. These are usually only required if they fit the IOPS/GB requirement nicely, or if there’s a particular need for the extra density (storage per rack unit) they provide. They also tend to use less power, which could be important in certain environments. You’re usually better off just buying more fast drives to reach your storage target or more big drives to reach your performance target, though.

AHCI and NCQ and…

While I could talk about desktop storage at length, let’s keep this focused on enterprise storage. It’s par for the course to assume every drive you use will be communicated with via the SCSI protocol with all of the basics like NCQ baked in. These are optimizations that can make the drive performance much faster. NCQ (usually called TCQ with SCSI) allows the drive to queue up a few I/O requests to find the best path across the disks to get all of the blocks, so that it takes the most efficient route instead of zooming about randomly due to a FIFO input queue. Let’s assume that everyone is sourcing their disks from the same places (they are) and have the same tech baked in (they do) and that per-disk performance across vendors is roughly equal (it is). Now, there are a lot of things you can do with those disks that are very cool, particularly in regards to free space allocation and fragmentation, but that’s a file system discussion.
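To get a feel for what command queuing buys you, here is a toy comparison of FIFO order against a simple elevator-style sweep. This is not how any real drive firmware works (real queuing also accounts for rotational position). The track numbers are made-up illustration values, and head travel stands in for seek cost.

```python
def head_travel(start: int, order: list) -> int:
    """Total distance the head moves when serving requests in the given order."""
    total, position = 0, start
    for target in order:
        total += abs(target - position)
        position = target
    return total

def elevator_order(start: int, pending: list) -> list:
    """Sweep upward from the head position first, then come back down for the rest."""
    up = sorted(t for t in pending if t >= start)
    down = sorted((t for t in pending if t < start), reverse=True)
    return up + down

queue = [98, 183, 37, 122, 14, 124, 65, 67]  # hypothetical track numbers
head = 53

print(head_travel(head, queue))                        # 640 (FIFO)
print(head_travel(head, elevator_order(head, queue)))  # 299 (reordered)
```

Same eight requests, less than half the head movement, which is the kind of win a deeper queue makes possible.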

SAS vs SATA, and controllers

What’s the difference between a SATA drive and a SAS drive? What are SATA and SAS anyway? Where does SCSI play in?

SCSI is a protocol for a computer to talk to a hard drive. SCSI is old, but also very good. You want your computer (or SAN) to be able to talk to your disks via SCSI, to get the best performance and most options. SATA is another protocol that does the same thing, but not as well. In general, SCSI>SATA. SAS is the newest form of SCSI. It stands for serial attached SCSI, which can use serial-style physical connectors (SATA connectors, NOT protocol) to connect to SCSI/SAS drives.

You can plug a SATA drive into a SAS card, but you can’t plug a SAS drive into a SATA card. Only the very lowest quality storage solutions will still be using SATA controllers or true SATA drives. Every solution worth buying will be using SAS, and when we talk about “SATA drives”, those are usually slower 7200 RPM drives that have a SAS controller tacked on (NL-SAS). So, they’re capable of talking SAS (thanks to the tacked-on controller) but still have the slower physical characteristics of a SATA drive. Everything is talking SAS/SCSI, typically.

When we’re talking about a “SAS drive”, what that means is that the drive itself is capable of talking the SAS/SCSI protocol. The big advantage of SCSI as a protocol is that the endpoint (drive) has intelligence built-in and can help offload and optimize some of the storage requests. That’s why enterprise storage drives are more expensive: they’re capable of greater physical speeds (twice the RPM), they have built-in controllers and software (SCSI), and they need to be rated for extremely high durability (MTBF).

Up until a couple of years ago, the SAS and SATA protocols were both capable of moving up to 3Gbps of data per port. Often, SATA controllers would share that 3Gbps across multiple drives, while SAS controllers could typically provide full bandwidth to each hard drive. Newer versions of both protocols (SAS 2 and SATA 3) can increase that limit to 6Gbps. While only SSDs are capable of even coming close to capping out either of those limits (spinning hard drives still can’t cap 3Gbps), these speeds do play an important part in your storage configuration.

It is common for a shelf of storage (say 24 disks) to be attached to your SAN via a SAS cable. Those 24 disks are now all sharing that same 3Gbps/6Gbps connection to the controller, which means you could be limiting your throughput. What’s the point in having 10Gbps or more in network bandwidth to your SAN when your controller can’t even access the disk that fast? The solution is to use multiple SAS ports to aggregate your bandwidth, and to use SSDs and other controller caching options to allow for additional data to be served without hitting the disk.
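A rough sketch of that bottleneck, following the single shared link described above. It assumes 8b/10b line encoding (10 bits on the wire per byte of data, as used at 3Gbps and 6Gbps) and the 100 MB/s sequential figure per disk from earlier. The function names are hypothetical.

```python
def usable_mb_s(link_gbps: float, wire_bits_per_byte: int = 10) -> float:
    """Payload bandwidth of a link after 8b/10b encoding overhead."""
    return link_gbps * 1000 / wire_bits_per_byte

def per_disk_share_mb_s(link_gbps: float, disk_count: int) -> float:
    """Bandwidth each disk gets if every disk streams at once."""
    return usable_mb_s(link_gbps) / disk_count

def disks_to_saturate(link_gbps: float, disk_mb_s: float = 100) -> float:
    """How many disks streaming sequentially it takes to fill the link."""
    return usable_mb_s(link_gbps) / disk_mb_s

for gbps in (3, 6):
    print(gbps, usable_mb_s(gbps), per_disk_share_mb_s(gbps, 24), disks_to_saturate(gbps))
# 3 300.0 12.5 3.0
# 6 600.0 25.0 6.0
```

Under these assumptions, a handful of drives streaming sequentially is enough to fill the link, which is why aggregating ports and caching matter.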

NetApp, for example, sells the DS4243 shelf and the DS4246 shelf. The shelves are identical in almost every way (both hold 24x 3.5” drives), but the DS4243 is only connected at 3Gbps while the DS4246 is connected at 6Gbps. The first digit (4) refers to the number of rack units (U) the shelf consumes, the second and third digits (24) refer to the number of hard drives it can hold, and the final digit is the speed at which it is connected. So the DS2246 is a 2U shelf that holds 24 disks (2.5” SFF) and is connected at 6Gbps.
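The naming scheme is regular enough to decode mechanically. This sketch only encodes the convention as described above for these three models. I am not claiming it holds for every shelf NetApp has ever sold, and the function name is made up.

```python
import re

def decode_shelf(model: str) -> dict:
    """Split a DSxyyz shelf name into rack units, drive bays, and link speed."""
    match = re.fullmatch(r"DS(\d)(\d{2})(\d)", model)
    if match is None:
        raise ValueError(f"unrecognized shelf model: {model!r}")
    return {
        "rack_units": int(match.group(1)),
        "drive_bays": int(match.group(2)),
        "link_gbps": int(match.group(3)),
    }

for model in ("DS4243", "DS4246", "DS2246"):
    print(model, decode_shelf(model))
# DS4243 {'rack_units': 4, 'drive_bays': 24, 'link_gbps': 3}
# DS4246 {'rack_units': 4, 'drive_bays': 24, 'link_gbps': 6}
# DS2246 {'rack_units': 2, 'drive_bays': 24, 'link_gbps': 6}
```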

There also used to be FC drives, which were attached to your controller via Fibre Channel. These are pretty much deprecated now. It was still SCSI commands back and forth, but using FC as a transport. FC could reach speeds of 1Gbps, 2Gbps, or 4Gbps, but had some other limitations. SAS has all but replaced FC-connected disk in the enterprise. FC for targets, however, is still alive and well.

Questions?

Let me know if you have any questions or comments on disks, or if there’s anything you think I should add. The next article will probably be on file systems/caching/writing, with protocols to follow.