Over the years, end users have expressed the general concern that data management isn't worth their time. In fairness, this misconception is understandable given the low and continually decreasing cost of consumer-grade disk drives. Firms should nevertheless work to change that mindset, because the true cost of storage is far greater in an enterprise production environment. To identify the associated costs, I have prepared the following analysis of raw storage consumption, performance impacts, and the resources needed to store data in a locally hosted Microsoft Windows server environment.
Raw Storage Needs for a 1GB File in a Sample Production Environment
Referencing Figure 1, a 1GB file not only consumes raw storage across multiple storage platforms (e.g., local storage, backup volumes), but in greater quantity than its original size. This is a direct result of the high availability and fault tolerance achieved by using a Redundant Array of Independent Disks, better known as RAID.
One requirement of this essential feature with RAID levels 5 and 6, as depicted in Figure 1, is the additional raw storage capacity needed to maintain parity across the drives. RAID parity is redundancy data, computed mathematically (typically with XOR) from the data stripes, that is striped (i.e., spread) across the drives in the array. Parity allows a RAID volume to operate continuously and unimpeded if a single drive fails (or two drives with RAID 6), and it protects against unrecoverable sector read errors. It's worth noting there are other RAID levels designed to improve fault tolerance and/or performance, such as RAID 1, 10, 50, and 60. Each level has its own distinct advantages; however, no matter which RAID type is chosen, additional raw capacity is needed to support it.
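To make the parity overhead concrete, here is a minimal Python sketch. The function name, drive count, and drive size are illustrative, not taken from any vendor tool; the only assumption is the rule above that RAID 5 dedicates one drive's worth of raw capacity to parity and RAID 6 dedicates two.

```python
def raid_usable_capacity(drive_count, drive_size_gb, raid_level):
    """Estimate usable capacity of an array: RAID 5 loses one drive's
    worth of raw storage to parity, RAID 6 loses two."""
    parity_drives = {5: 1, 6: 2}[raid_level]
    if drive_count <= parity_drives:
        raise ValueError("not enough drives for this RAID level")
    return (drive_count - parity_drives) * drive_size_gb

# A hypothetical 8-drive array of 4000 GB drives (32000 GB raw):
print(raid_usable_capacity(8, 4000, 5))  # 28000 GB usable, 4000 GB to parity
print(raid_usable_capacity(8, 4000, 6))  # 24000 GB usable, 8000 GB to parity
```

In other words, every gigabyte written to such an array also reserves a proportional slice of raw capacity for parity, before backups or other copies are even considered.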
In order to create and manage the RAID structure, a hardware controller or software utility is needed. Many different factors determine whether a software or hardware solution is appropriate for a firm’s RAID needs, but in either case, managing an array of disks requires resources. For a RAID hardware controller, these resources come in the form of a physical controller, with its own dedicated CPU, RAM, and specialized firmware designed to manage the disk array. Similarly, a software-based RAID needs the same resources, but instead places this burden on the server’s CPU and RAM. With either solution, computational resources are needed to operate these systems – the cost of each is directly driven by the size and number of disk drives in the array, which in turn is determined by the quantity of data stored.
The cost of storing a single 1GB file is further compounded by the price of the enterprise-class drives used in servers and storage arrays. These drives, such as server-grade SATA or NL-SAS drives, are designed for 24/7 operation in a production environment and cost approximately 4 times more per GB than consumer-grade drives. SATA or NL-SAS drives are generally used when capacity matters more than performance; however, firms that require the highest level of I/O performance must employ enterprise-class SSD or SAS drives, which come at a substantially higher price per GB.
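As a rough illustration of how that premium compounds, the sketch below applies the approximate 4x per-GB multiplier mentioned above. The $0.03/GB consumer price is a hypothetical placeholder, not a quoted figure, and the function name is my own:

```python
def enterprise_storage_cost(raw_gb, consumer_price_per_gb=0.03,
                            enterprise_multiplier=4):
    """Rough cost of raw enterprise-class capacity, using the ~4x
    per-GB premium over consumer-grade drives described above.
    Both the base price and multiplier are assumptions."""
    return raw_gb * consumer_price_per_gb * enterprise_multiplier

# 1 TB (1000 GB) of raw enterprise capacity under these assumptions:
print(enterprise_storage_cost(1000))  # 120.0
```

Plug in the raw capacity from the RAID overhead discussed earlier, plus backup copies, and the per-file cost climbs well beyond the file's nominal size.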
One might think, "Well, these costs just apply to firms hosting large files; small files don't matter." This is another misconception many end users hold. Although small files impact raw storage in a different manner (explained later), their biggest cost comes in the form of processing individual file records. New Technology File System, or simply NTFS, is the file system Windows operating systems use on servers and workstations to store data on disk drives. NTFS relies on a Master File Table (MFT), which is the heart of the NTFS volume structure. The MFT contains a record (the file's metadata) for every single file stored on the disk, consuming roughly an additional 1KB of raw storage for each, and defines the file's name, attributes, security descriptor, object ID, and so on. Whenever a file operation is performed on an NTFS volume, the relevant records must be processed. You have likely seen the impact this has on performance when executing various operations, such as copying a single 1GB file vs. many smaller files that consume the same quantity of storage.
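As a back-of-the-envelope illustration of that metadata cost, assuming the ~1KB-per-record figure above (the helper name is hypothetical):

```python
def mft_overhead_kb(file_count, record_size_kb=1):
    """Approximate MFT metadata footprint: each file on an NTFS volume
    carries a ~1 KB MFT record, regardless of the file's own size."""
    return file_count * record_size_kb

# A million small files cost roughly a gigabyte of metadata before
# a single byte of their actual content is stored:
print(mft_overhead_kb(1_000_000))  # 1000000 KB, i.e. about 1 GB
```

And since each of those million records may need to be touched during copies, scans, and backups, the processing cost scales with file count, not just total size.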
Small files also increase data fragmentation, which can reduce read/write speed due to additional seek times; this is a greater factor with mechanical drives than with solid state. Nevertheless, storing unnecessary small files can noticeably degrade performance when carrying out even the most basic operations.
Another way small files affect performance and drive cost is in how they're stored within the file system. When a disk drive is formatted, raw storage is broken into chunks called clusters, each representing a fixed number of bytes. When a file is stored on a disk drive, it is allocated as many clusters as it needs. To maintain optimal disk drive performance, cluster size is typically increased as drive capacity increases – fewer clusters to manage means less overhead. One consequence of a larger cluster size is its impact when storing small files. For example, many drives today use a 4KB cluster size due to their high capacity, so a 4KB file needs exactly one cluster to store its data. A 1-byte file, however, also needs 4KB, because a cluster is the smallest logical unit available for storage, leaving the remainder of the cluster unused. This may seem negligible, but on a storage volume with millions of files smaller than the cluster size, it consumes precious storage space – and don't forget the performance impact related to the MFT!
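The allocation rule can be sketched in a few lines of Python. The helper names are my own, and the sketch follows the simplified model above (NTFS can also store very small files resident in the MFT record itself, which this ignores):

```python
import math

def allocated_bytes(file_size, cluster_size=4096):
    """On-disk footprint of a file when whole clusters are allocated:
    the file's size rounded up to the next cluster boundary."""
    if file_size == 0:
        return 0
    return math.ceil(file_size / cluster_size) * cluster_size

def slack_bytes(file_size, cluster_size=4096):
    """Unused tail of the last cluster, often called slack space."""
    return allocated_bytes(file_size, cluster_size) - file_size

print(allocated_bytes(1))  # 4096: a 1-byte file occupies a full 4 KB cluster
print(slack_bytes(1))      # 4095 bytes of the cluster sit unused
print(slack_bytes(4096))   # 0: a 4 KB file fills its cluster exactly
```

Multiply that worst-case 4095 bytes of slack by millions of tiny files and the waste becomes a measurable fraction of the volume.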
The direct cost of large and small files on a disk drive is worth noting, but there are indirect costs as well: the labor and computational resources consumed by managing, indexing and searching, backing up, and otherwise processing data, all of which place an unnecessary burden on equipment and personnel.
In conclusion, the cost of storing unnecessary files, small or large, can't be overstated. Firms that make a concerted effort to appropriately manage their data inevitably reduce their IT expenditures and associated management costs. End users may find it difficult to see the value of efficiently managing the data they create, but when the costs are aggregated across a firm's entire IT systems, it's apparent that data management really does matter.