Mass File Content Reader
Recently, I was analyzing a slow backup job. There are a few basic things to check when a backup job is slow, such as verifying that compression works all the way through to the tape drive (HP StorageWorks Library and Tape Tools tests this) and trying to identify a hardware bottleneck.
Tape writes can be split into two sections of analysis:
- Section A: HDD > System > Backup process > network
- Section B: System > adapter > cable > tape drive > tape
“System” (which, for this purpose, includes “HDD”) can be broken down into the big four: RAM, CPU, network, and HDD. None of the first three appeared to be introducing a bottleneck, but getting a good handle on HDD performance proved difficult. I had no baseline for HDD read and write operations, either from a benchmark or from Performance Monitor logs. What I did have was an identical server with an identical file set (thanks to DFS-R), which caused no slowness in the backup job that included it.
Thinking about what can cause abnormal slowness on the NTFS file system, two things primarily came to mind: a poorly configured disk subsystem and fragmented files.
The disk subsystem on the server consisted of a single RAID-5 array spanning four 10k RPM, 3 Gbps SAS drives. This array was presented to the OS as a single logical drive, which was then split into two NTFS volumes: one containing the system and one containing the data files. This is far from optimal, since the page file, the system files, the DFS-R Install and Staging folders, and the heavily used SMB shares all compete for read and write access to the same spindles. Changing this configuration would clearly reduce read and write times for end-user applications, and likely for the backup software as well.
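To put a rough number on that contention, the usual rule-of-thumb estimate of a RAID-5 set's random I/O capacity can be sketched as below. This is a back-of-the-envelope model, not a measurement from this server: it assumes the common RAID-5 write penalty of four back-end I/Os per small random write, ignores controller cache, and the per-drive figure passed in (for example, about 140 IOPS for a 10k RPM SAS drive) is an assumed rule-of-thumb value. The function name `raid5_functional_iops` is hypothetical.

```python
# Rule-of-thumb RAID-5 write penalty: each small random write costs four
# back-end I/Os (read data, read parity, write data, write parity).
RAID5_WRITE_PENALTY = 4


def raid5_functional_iops(disks, iops_per_disk, read_fraction):
    """Estimate front-end random IOPS for a RAID-5 set (controller cache ignored).

    disks          -- number of drives in the array
    iops_per_disk  -- assumed random IOPS per drive (a rule-of-thumb input)
    read_fraction  -- share of I/Os that are reads, between 0.0 and 1.0
    """
    raw = disks * iops_per_disk
    reads = raw * read_fraction
    # Writes are divided by the penalty because each one consumes four back-end I/Os.
    writes = raw * (1 - read_fraction) / RAID5_WRITE_PENALTY
    return reads + writes
```

With four drives at an assumed 140 IOPS each and a 70/30 read/write mix, `raid5_functional_iops(4, 140, 0.7)` works out to roughly 434 IOPS, and that single pool is shared by the page file, the system, DFS-R, the SMB shares, and the backup job.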
Additionally, because one of the NTFS volumes saw a large amount of I/O activity, fragmentation was a serious concern. After an initial analysis, it became clear that the data was heavily fragmented across the RAID array. So the volume was defragmented over a weekend, and defragmentation was then scheduled to run daily outside production hours and on weekends, in both cases at times when the backup job was not configured or expected to be running.
To quantify the difference in speed, which was possibly due to fragmentation, I wrote a program that reads every byte of every file in a directory and its subdirectories.
Instead of tediously working with a graphing API, I decided to dump interval data to a CSV file for later analysis.
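As one way to do that later analysis, a short script can reduce each log to a few summary numbers, so the slow server and its DFS-R twin can be compared side by side. This is a minimal sketch that assumes a hypothetical log layout: a header row, then one row per interval with elapsed seconds in the first column and average MB/sec in the second. The function name `summarize_log` is also hypothetical.

```python
import csv
import statistics


def summarize_log(csv_path):
    """Return (mean, min, max) of the MB/sec column in an interval log.

    Assumes a header row followed by rows of: elapsed_sec, mb_per_sec.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        rates = [float(row[1]) for row in reader if len(row) > 1]
    return statistics.mean(rates), min(rates), max(rates)
```

Running it once per server's log gives a direct numeric comparison without any charting.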
The program is simple:
- Give it a directory
- Do you wish to read all the files in this directory and, recursively, all the files in its subdirectories?
- What chunk size would you like to read the data in?
- How frequently should I flush the average read speed (in MB/sec) to the file?
- After how many chunks should I check whether the flush-to-file interval has elapsed?
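The options above map onto a handful of parameters. Below is a minimal Python sketch of such a reader, not the original program (its source and language aren't shown here); the names `benchmark_reads`, `chunk_size`, `flush_seconds`, and `check_every_chunks` are hypothetical, and the CSV columns `elapsed_sec` and `mb_per_sec` are an assumed layout.

```python
import csv
import os
import time

MB = 1024 * 1024  # this sketch treats 1 MB as 1,048,576 bytes


def iter_files(root, recursive):
    """Yield file paths under root, descending into subdirectories only if recursive."""
    if recursive:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                yield os.path.join(dirpath, name)
    else:
        with os.scandir(root) as entries:
            for entry in entries:
                if entry.is_file():
                    yield entry.path


def write_rate(writer, start, interval_start, now, interval_bytes):
    """Append one CSV row: elapsed seconds and average MB/sec for the interval."""
    seconds = now - interval_start
    rate = interval_bytes / MB / seconds if seconds > 0 else 0.0
    writer.writerow([f"{now - start:.3f}", f"{rate:.2f}"])


def benchmark_reads(root, csv_path, recursive=True, chunk_size=MB,
                    flush_seconds=5.0, check_every_chunks=64):
    """Read every byte of every file under root; return the total bytes read."""
    total_bytes = interval_bytes = chunks_since_check = 0
    start = interval_start = time.perf_counter()
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["elapsed_sec", "mb_per_sec"])
        for path in iter_files(root, recursive):
            try:
                src = open(path, "rb", buffering=0)  # unbuffered: one read call per chunk
            except OSError:
                continue  # skip files that are locked or not readable
            with src:
                while chunk := src.read(chunk_size):
                    total_bytes += len(chunk)
                    interval_bytes += len(chunk)
                    chunks_since_check += 1
                    # Consult the clock only every check_every_chunks chunks.
                    if chunks_since_check < check_every_chunks:
                        continue
                    chunks_since_check = 0
                    now = time.perf_counter()
                    if now - interval_start >= flush_seconds:
                        write_rate(writer, start, interval_start, now, interval_bytes)
                        out.flush()  # make the row visible while the run continues
                        interval_start, interval_bytes = now, 0
        if interval_bytes:  # record the final, partial interval
            write_rate(writer, start, interval_start, time.perf_counter(), interval_bytes)
    return total_bytes
```

The CSV file should live outside the directory being read, or the benchmark will read its own log. Note also that a second pass over the same data may be served from the OS file cache rather than the disks, so repeated runs on a small data set can overstate disk read speed.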
The program benchmarks read speed on the actual data that the backup software reads and, as an added bonus, on the data used by clients, DFS-R, and any other I/O-heavy programs.
Write speed is of lesser concern here, as it depends more closely on the latency introduced by the disk subsystem.
A nice addition would be the ability to import the output of an SQL query against Backup Exec’s database, so that the contents of a Selection List could be tested more easily.
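The query against Backup Exec’s database isn’t shown here, and its schema is not something to guess at. Assuming the query result has been exported to a CSV file with one full file path per row in the first column (a hypothetical format), reading exactly those files could be sketched like this; `benchmark_path_list` is a hypothetical name.

```python
import csv
import os
import time


def benchmark_path_list(list_path, chunk_size=1024 * 1024):
    """Read every file named in column 1 of a CSV list; return (bytes, seconds).

    Assumes a hypothetical export: one full file path per row, no header.
    """
    with open(list_path, newline="") as f:
        paths = [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]
    total = 0
    start = time.perf_counter()
    for path in paths:
        if not os.path.isfile(path):
            continue  # skip directories and entries that no longer exist
        with open(path, "rb") as src:
            while chunk := src.read(chunk_size):
                total += len(chunk)
    return total, time.perf_counter() - start
```

Dividing the returned byte count by the elapsed seconds gives an overall read rate for the Selection List’s files.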
Keep in mind what this program does and does not analyze: it does not measure NDMP or SMB read speed, and it shouldn’t be used for that without complementary benchmarking tools. It should be used only to read data locally.