When you write data, it doesn't necessarily get written to disk
right then. The kernel maintains caches of many things, and a lot
of work goes into keeping disk access in particular fast and
efficient. That's great for performance, but sometimes
you want to know that data really has gotten to the disk drive. This
could be because you want to test the performance of the drive, but
could also be when you suspect a drive is malfunctioning: if you
just write and read back, you'll be reading from cache, not from
actual disk platters.
So how can you be sure you are reading data from the disk? The
answer actually gets a little complicated, particularly if you are testing
for integrity, so bear with me.
Obviously the first thing you need to do is get the data in the cache
sent on its way to the disk. That's "sync", which tells the kernel
that you want the data written. But that doesn't mean that a subsequent
read comes from disk: if the requested data is still in cache, that's
where it will be fetched from. It also doesn't necessarily mean that
the kernel actually has sent the data along to the disk controller: a
"sync" is a request, not a command that says "stop everything else you
are doing and write your whole buffer cache to disk right now!". No,
"sync" just means that the cache will be written, as and when the kernel
has time to do so.
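
As a minimal sketch of that first step, the script below writes a scratch file and then asks for it to be flushed. The file name is hypothetical, and "conv=fsync" is a GNU dd option, so this assumes GNU coreutils. One caveat: the Linux sync(2) man page says that Linux waits for the I/O to complete, which is stronger than the traditional behavior described above. Even so, nothing here forces a later read to come from the disk.

```shell
#!/bin/sh
# Hypothetical scratch file; adjust the path to the filesystem under test.
testfile="${TMPDIR:-/tmp}/sync_demo.$$"

# Write 1 MiB of zeros. conv=fsync (GNU dd) makes dd call fsync() on the
# output file before exiting, so this file's data is pushed to the device.
dd if=/dev/zero of="$testfile" bs=1024 count=1024 conv=fsync 2>/dev/null

# Ask the kernel to write out everything else that is still dirty.
sync

size=$(wc -c < "$testfile")
echo "wrote $size bytes"
rm -f "$testfile"
```

A read of the same file right after this would still be served from cache, which is the problem the rest of this article deals with.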
Traditionally, the only way to be sure you were not reading back from
the cache was to overwrite the cache with other data. That required
two things: knowing how big the cache is at this moment, and having
unrelated data of sufficient size to overwrite with. On older
Unixes with fixed-size buffer caches, the first part was easy enough,
and since memory was often expensive and in shorter supply than it is
now, the cache wasn't apt to be all that large anyway. That's changed
radically: modern systems allocate cache memory dynamically and
while the total cache is still small compared to disk drives, it
can now be gigabytes of data that you need to overwrite.
Well, that's not always so hard: for a large filesystem and
relatively small memory, a simple "ls -lR" might be enough. If
not, a "dd" redirected to /dev/null can fill it up. Just make
sure that you are looking at different disk blocks than what you
first wrote. Note that you don't really even need the "sync"
if this is what you are doing: the overwrite forces the sync itself.
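
A rough sketch of the overwrite approach, assuming Linux, where /proc/meminfo reports the current page cache size on its "Cached" line. The helper names cache_kb and read_through and the TARGET variable are made up for this illustration. TARGET defaults to /dev/null only so that the script runs harmlessly as written.

```shell
#!/bin/sh
# Print the kernel page cache size in kB (the "Cached" line of /proc/meminfo).
# Prints 0 where /proc/meminfo is not available.
cache_kb() {
    if [ -r /proc/meminfo ]; then
        awk '/^Cached:/ { print $2 }' /proc/meminfo
    else
        echo 0
    fi
}

# Read a file or device end to end and throw the data away, which pulls
# its blocks through the cache.
read_through() {
    dd if="$1" of=/dev/null bs=1024k 2>/dev/null
}

# Stand-in target so the sketch runs anywhere. In practice, set TARGET to a
# file or partition larger than the cache, on blocks other than the ones
# you are testing.
target="${TARGET:-/dev/null}"

echo "page cache before: $(cache_kb) kB"
read_through "$target"
echo "page cache after:  $(cache_kb) kB"
```

For real use you would run it with something like TARGET=/path/to/large/file, choosing a file bigger than the "before" figure.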
Modern Linux kernels make this a bit easier: in /proc/sys/vm/
you'll find "drop_caches". You simply echo a number to it:
To free pagecache:
- echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
- echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
- echo 3 > /proc/sys/vm/drop_caches
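
Those commands can be wrapped in a small helper. This is only a sketch: the function name drop_caches is my own, and it has to run as root. The kernel's sysctl documentation describes drop_caches as non-destructive, meaning dirty objects are not freed, so the helper runs "sync" first to make as much of the cache droppable as possible.

```shell
#!/bin/sh
# drop_caches MODE: flush dirty data, then ask the kernel to drop clean caches.
# MODE is 1 (pagecache), 2 (dentries and inodes) or 3 (both).
# Returns 2 for a bad mode, 1 if the control file is not writable.
drop_caches() {
    case "$1" in
        1|2|3) ;;
        *) echo "usage: drop_caches 1|2|3" >&2; return 2 ;;
    esac
    ctl=/proc/sys/vm/drop_caches
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (need root on Linux)" >&2
        return 1
    fi
    sync                # write out dirty pages first
    echo "$1" > "$ctl"
}
```

As root, "drop_caches 3" would then do the sync and the drop in one step.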
You absolutely need to call "sync" before doing that. I haven't
looked at how this is implemented; I assume that the pending syncs
would be done before the cache is actually thrown away, and
that in the meantime the cache is now seen as invalid so subsequent
reads would have to wait for the sync write before returning. It
would be simple enough to test this.
Actually, maybe not. I tried testing this on a Suse instance
in a virtual machine, and couldn't do it. The script I used looked like this:
date > /tmp/t
echo 3 > /proc/sys/vm/drop_caches
# this sets ctrl-alt-del not to call sync
echo 1 > /proc/sys/kernel/ctrl-alt-del
echo "ctrl-alt del now"
What I expected was for /tmp/t not to have the latest date. However,
it always did, probably because Reiserfs would fix up partial
transactions. You'd need a system without a journaled file system to
test this.
But even that didn't seem to work: I created an ext2 fs on another
virtual hard drive and tried this:
date > /hdc/t
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/hda3 of=/dev/null
mount /dev/hdc1 /hdc
But that didn't behave as I thought it would either. Possibly VM
caching is throwing this off? Nope: I tried the same thing on
a real system; the file doesn't lose its updates. So I'm
not sure you can trust drop_caches.
However, if testing for integrity, and perhaps even if doing serious
performance testing, this isn't enough: disk drives almost always
do their own caching. If we really need to be certain that our
reads came directly from the platters and not from ram on the
controller, we still need to go back to the idea of knowing how
big that cache is and writing enough data to force it to be flushed.
So we are still going to do "dd"s or "ls -lR"s or something similar.
If you are examining integrity and suspect corruption, keep
in mind that aging can affect your results: you might need
data to sit in cache (kernel or disk hardware) for some period
before the problem occurs. Quick overwrites might mask it.
Tracking down this kind of problem can be very difficult.
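
One way to give the data time to age is to checksum it, wait, and checksum it again. The sketch below does that with a hypothetical scratch file. DELAY and DROP_CACHES are made-up environment variables for this illustration, and the cache drop only happens if you ask for it and are root. A match proves nothing about the drive's own cache; it only shows the shape of such a test.

```shell
#!/bin/sh
# Hypothetical scratch file and settings; adjust for the disk under test.
testfile="${TMPDIR:-/tmp}/aging_demo.$$"
delay="${DELAY:-1}"            # seconds to let the data sit

# Write 256 kB of pseudo-random data and record its checksum.
dd if=/dev/urandom of="$testfile" bs=1024 count=256 2>/dev/null
before=$(cksum < "$testfile")

sync
sleep "$delay"

# Optionally drop the kernel caches so the re-read has to go to the device.
# Only attempted when DROP_CACHES=1 and the control file is writable (root).
if [ "${DROP_CACHES:-0}" = 1 ] && [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
fi

after=$(cksum < "$testfile")
if [ "$before" = "$after" ]; then
    result=match
else
    result=MISMATCH
fi
echo "$result"
rm -f "$testfile"
```

For a real investigation you would use a much longer DELAY and far more data, and repeat the run many times.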
See also Caches and cache data corruption.
By the way, if your aim is simply to bypass cache buffering, you
can do that: Raw Disk I/O is what you want. And (as some databases do) you
could simply write data to a raw partition (no filesystem).
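
On current Linux systems, one common way to get unbuffered access is O_DIRECT, which GNU dd exposes as "oflag=direct" and "iflag=direct". That is a related technique, not necessarily what the raw I/O article describes. The sketch below assumes GNU dd and a hypothetical scratch file. Some filesystems refuse O_DIRECT, so it falls back to a normal buffered write and reports which mode it used.

```shell
#!/bin/sh
# Hypothetical scratch file; point it at the filesystem you want to test.
testfile="${TMPDIR:-/tmp}/direct_demo.$$"

# Try an O_DIRECT write of 1 MiB (GNU dd oflag=direct). If the filesystem
# refuses O_DIRECT, fall back to an ordinary buffered write.
if dd if=/dev/zero of="$testfile" bs=4096 count=256 oflag=direct 2>/dev/null; then
    mode=direct
else
    dd if=/dev/zero of="$testfile" bs=4096 count=256 2>/dev/null
    mode=buffered
fi

# Read it back the same way; iflag=direct bypasses the page cache on reads.
if [ "$mode" = direct ]; then
    dd if="$testfile" of=/dev/null bs=4096 iflag=direct 2>/dev/null ||
        echo "direct read failed" >&2
fi

echo "write mode: $mode"
rm -f "$testfile"
```

Keep in mind that this only bypasses the kernel's cache. Any cache on the drive or controller is still in play.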
© 2012-08-01 Anthony Lawrence