Troubleshooting cache data corruption
© Tony Lawrence, aplawrence.com
This goes back a few years, but it's a condition that is always possible. The symptoms were occasional data corruption, but only in frequently used files. All the usual suspects were hauled out and examined: network problems, application bugs, user error, bad hard disk sectors. Everything passed muster.
I was called in, and after listening to the sad tale, I wrote a little shell script very similar to this:
#!/bin/bash
cd /yourdatadir   # change this to wherever your files are
echo "Flushing cache"
ls -lR /
echo "Testing.."
sum * > /tmp/firstread
while :
do
    sum * > /tmp/a
    sleep 300
    sum * > /tmp/b
    diff /tmp/a /tmp/b || break
done
echo "Corruption detected!"
echo "a vs. firstread"
diff /tmp/a /tmp/firstread
echo "b vs. firstread"
diff /tmp/b /tmp/firstread
echo "Flushing cache"
ls -lR /
sum * > /tmp/newread
echo "firstread vs. newread:"
diff /tmp/firstread /tmp/newread
The idea here is that data read from the disk should always have the same sum (assuming a quiescent system). The data files were small enough that everything read would stay cached, so every "sum" after the first would read only from cache. Therefore, if there was any change in the sums, cache would be the problem. Indeed, after twenty minutes or so, the script exited, announcing a difference.
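The same compare-the-checksums idea can also report which files changed, not just that something did. What follows is only a minimal sketch, not the script used at the time: it swaps the POSIX "cksum" utility in for "sum", the function names snapshot and changed_files are made up for illustration, and it assumes file names contain no whitespace.

```shell
#!/bin/bash
# Sketch only: snapshot and changed_files are hypothetical helper names.
# Assumes file names without whitespace; uses cksum in place of sum.

# snapshot DIR OUTFILE
# Write one "checksum size name" line for each regular file in DIR.
snapshot() {
    local dir=$1 out=$2
    (
        cd "$dir" || exit 1
        for f in *; do
            if [ -f "$f" ]; then
                cksum "$f"
            fi
        done
    ) > "$out"
}

# changed_files OLD NEW
# Print the name of every file in NEW whose checksum or size differs
# from OLD, or which does not appear in OLD at all.
changed_files() {
    awk 'FILENAME == ARGV[1] { old[$3] = $1 " " $2; next }
         !($3 in old) || old[$3] != ($1 " " $2) { print $3 }' "$1" "$2"
}
```

Used the way the original loop uses "sum", that would be something like "snapshot /yourdatadir /tmp/a", a sleep, "snapshot /yourdatadir /tmp/b", and then "changed_files /tmp/a /tmp/b" to list the files that read back differently. The snapshot files should live outside the directory being checked.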
As there was no difference between "firstread" and "newread", nothing had changed on the disk itself (unless it coincidentally switched back; rather unlikely): cache definitely was looking very guilty. But which cache? Was it the system buffer cache or the RAID controller? To determine that, I disabled the disk cache (fortunately easy to do with that controller). The test was repeated, and no errors were observed after an hour. I then re-enabled the disk cache, and was able to repeat the sum errors within a few minutes. That seemed to be pretty definite proof of where the problem was, so the hardware was replaced the following week and, as expected, the corruption problem disappeared.