Troubleshooting: Preserve the scene
You've seen it on television or in the movies: important
evidence is lost at a crime scene because someone wasn't careful
enough to preserve it. A crowd of curiosity seekers destroys important footprints; a helpful housekeeper thoughtfully washes the
bathroom sink, wiping away fingerprints. The police investigators
show up and are angry at the carelessness of the patrol officer who
should have protected the scene. A crime scene investigator always
wants to preserve and protect evidence.
Obviously, computers provide evidence too. If information on a
hard drive were part of a crime investigation, that hard drive
would immediately be removed and kept safe for analysis. The
forensic investigators would clone it, and work with the copy so
that their work didn't accidentally destroy original evidence. The
procedures for doing this are quite involved, and include taking MD5 hashes of the original and the copy so that integrity can be maintained and ensured.
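The hash check those investigators rely on can be sketched with ordinary tools. This is only an illustration - the device and image paths here are made up, and real forensic work follows far stricter procedures (and today would favor SHA-256 over MD5):

```shell
# Illustrative only: image a suspect drive, then hash the original
# and the copy to prove the clone is faithful.
# /dev/sdb and /evidence/disk.img are hypothetical paths.
dd if=/dev/sdb of=/evidence/disk.img bs=1M conv=noerror,sync
orig=$(md5sum < /dev/sdb | awk '{print $1}')
copy=$(md5sum < /evidence/disk.img | awk '{print $1}')
[ "$orig" = "$copy" ] && echo "hashes match: copy is faithful"
```

The same comparison works on any pair of files, which is how you would verify a copy of a single piece of evidence rather than a whole drive.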
In a way, any computer problem is a crime. Perhaps not in the
legal sense, but it is misbehaviour - maybe by a person using the
computer, or maybe by the computer itself. It may be petty, or
quite serious. It can cost every bit as much as a real crime might, or more. It only makes sense that you should approach any
problem with an eye toward not destroying information you might
need. Most computer problems aren't important enough to justify all
that a professional forensic investigator would do (or if they are
that important you probably don't have the luxury of time to do all
that), but before you jump in to fix things, you should sit back
for just a moment and think about whether there is any evidence
that needs to be preserved, and what you might need to do - or more
usually NOT do - to preserve it.
As an example of this, we'll take the case of a backup that did
not contain a file that should have been on it. The situation was
that Machine A was backed up to tape nightly, and that tape in turn
was restored to Machine B every morning. The purpose of this was to
have a reasonably current system that could be pressed into service
in the event of sudden failure of the main machine. Nowadays, most
companies with a similar need would employ some form of clustering
and redundant storage instead, but this backup-restore scheme is
inexpensive and does meet minimal needs for some situations.
Any redundancy plan is pointless if it isn't tested. It's not
enough to just restore data to the standby machine: you have to
verify that the system is capable of doing the work it may have to
do. Toward that end, this company had a procedure where the person
responsible for restoring the tape would also log in to the standby
and run a particular report. The report chosen was one that would
access many files, so if the correct output were produced, it could
be reasonably assumed that data was properly restored and the
system was capable of use in an emergency. This was just a final
check: the original backup would have also been verified while
still on Machine A, and of course the restoration to B would also
report any errors encountered. This report was just another
assurance that all was well.
But one morning things were not well. The backup on the live
machine had completed successfully, and had even read back
everything written to tape and verified that each file matched what
was stored on the hard drive. The restoration on the standby
reported no errors reading the tape there. The standby hard drive
didn't complain of any problems writing the data. But the report
didn't produce the right results. More specifically, it seemed to
be getting part of its data from an older file. Indeed, the
specific file was easily identified, and it could be seen the
version on Machine B was older than expected. This was not a file
that gets modified or even used at all every day, but it was still
important data. Fortunately, this problem was discovered by the
integrity test rather than by real need, but the next question
obviously was "How do we fix this?"
I bet you have answers, or at least guesses, already. This
shouldn't be a very difficult problem: either the file is on the
tape and it did not restore, or it was not on the tape at all. In
the first case we look for the problem on the standby box, and in
the second we go look at the backup procedures on the live box.
Simple stuff, this should be done in minutes, right?
Sure. It should be simple, and the overwhelming likelihood is
that it is a quick and easy diagnosis and fix. We can probably
knock this one off before our coffee gets cold. Piece of cake, no
sweat, why do they even bother me with such trivial stuff?
But what if it isn't? What if this somehow turns into one of
those nasty little problems where you keep muttering things like
"but it can't.." and your brain is bouncing off two apparently
contradictory observations? What then? Your coffee is getting cold,
your boss wants to know why this isn't fixed yet, and you have
absolutely no idea. Why do they always give you the hard ones?
So, before "piece of cake" turns into "Huh?", let's think about
what we can do to preserve evidence we might need if this thing
does turn nasty on us.
First thing, the tape. That should be pretty much self-preserving - reading it again doesn't hurt anything. Except.. tapes
can snap, or get jammed up so that you can't do a thing with them.
Not likely, of course, but if something like that did happen, there
goes our evidence. Is there anything else we could do? Sure - we
could look at the logs of the restoration to see if they even
mention this file. But they don't, so now we have to look at the tape itself, right?
Not just yet. Yes, we want to check the tape. But we also want
to look at some older tapes and see if that file was on those sets.
If it never has been on any recent tape, that's very different from
it having been there right up until last night. If it is only
missing from last night's tape, something has changed over on the
live machine very recently. That's important to know, because it
could save us time and trouble - we'd know we were looking for a
recent change. We do know that at some point this file did get
backed up and did get restored here, because we do have it - older
than it should be, but it is there.
Are you thinking we should go look at the live machine now? We
know that file isn't used every day, so maybe we should go run that
report over there and see what it says? NO! Do not do that. Don't
do it, don't even think it. We'll get over there in a minute. You
are absolutely right, and this whole thing might be a false alarm:
the file over there might be older than what we expect it to be.
That might still be a problem that needs solving, but it would
absolve the backup and restore of responsibility. But it's not time
to do that yet, tempting as it is. We started out with the theory
that the file is either on the tape or it isn't, and we need to
follow that through. If it is on last night's tape, and has an old
date, then we'll move on to wherever that leads us. But first we'll
list a couple of tapes. I did just that.
Good thing this isn't the old days where you had to list through
everything to get what you wanted, right? Nowadays tape backup
systems have indexes, and can fast forward to the spot we want. My
coffee was still warm and I had the answer: the file was not on the
tape. Not on last night's, the night before, or even last month's
archive set. Never backed it up. Case solved, the problem is over
at the live backup, go fix it.
There you go, getting cocky again. Yes, the problem is over at
the live backup. Maybe this is going to be simple and easy. But
maybe it isn't. So let's just keep on with the slow, methodical
path. Boring? Yes. Which is better: a little pedantic investigation
now or an exciting disaster later? Besides, we can't really say it
was NEVER backed up - it had to have been backed up and restored
fairly recently. All I really knew at this moment was that it
wasn't on the three tapes I chose to look at. It's the familiar old
story: if you don't see something right now, that doesn't prove it
wasn't there when you weren't looking. It's a human trait to
extrapolate from the data we have, and that often works very well.
But we do need to remind ourselves that extrapolation isn't proof.
The first thing to check at the backup machine is the logs. We
know the file wasn't backed up, and the logs may tell us why.
Unfortunately, they didn't. There was nothing in there that
referenced that important little file. I checked the configuration
next, but that said that the filesystem this file lives in is
supposed to be backed up, so was unhelpful. Exclusions? Yes,
certain files are excluded from the backup - /tmp for instance, and
other temporary files. But definitely not this file or anything
like it. What next? Well, Unix keeps track of the time a file was
last accessed (so do Windows systems - see the "Things to Know" at
the end of this article). The file doesn't have to be changed, just
looking at the contents of a file will change its last access date.
That could be very helpful to us here: if no one else has used that
file, the backup certainly should have. So its access date
shouldn't be older than the time of the backup, no matter what.
With luck, we might learn something from checking that date.
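Before we do, a quick sketch of how access times can be inspected and even set by hand - useful for experimenting safely on a scratch file rather than on evidence. This assumes GNU tools; note too that filesystems mounted "noatime" or "relatime" may not update the access time on every read:

```shell
# Set a scratch file's access time explicitly, then read it back.
f=$(mktemp)
touch -a -t 202001020304 "$f"   # -a: change only the access time
ls -lu "$f"                      # date column now shows Jan  2  2020
stat -c 'atime: %x' "$f"         # same timestamp, from stat
```

The `touch -a` trick is also handy for building test cases when you want to see how a backup program treats old or odd access dates.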
It's very easy to lose this kind of evidence. You should be
thinking about whether you might need access date information
before you do anything - even before you log in. For example,
suppose you have reason to suspect that someone may have
compromised an account and used it inappropriately. If that account
is "root" (the Unix all-powerful, full administrator account),
logging in as root yourself can destroy some of the evidence of the intrusion.
The Unix command to show access dates is "ls -lut", so we'll
just cd over to the directory the file is in and..
WAIT! One thing you might want to know before you do that. It's
just possible that we might be interested in the access date of
that directory itself. Suppose that whole directory has not been
backed up? That might be something worth knowing. Just cd'ing into
a directory won't change an access date, but doing a directory
listing will - because to get a listing, the directory has to be
opened and read. Watch this:
# ls -lud /tmp
drwxrwxrwt 9 root root 4096 Dec 23 12:01 /tmp
# cd /tmp
# ls -lud /tmp
drwxrwxrwt 9 root root 4096 Dec 23 12:01 /tmp
# ls /tmp
CPAN-1.76 orbit-root orbit-tony ssh-Qcl22102 ssh-zgPU1825 texconf.V1nM1i
# ls -lud /tmp
drwxrwxrwt 9 root root 4096 Dec 24 09:54 /tmp
(I'm using command line tools here. Modern Unix and Linux systems
have graphical environments just as Windows does, but for detective
work, the command line is usually the place to be.)
The "ls" is the Unix equivalent to a DOS "dir". The "-l" asks
for details, the "u" says I want to see that access date we've been
talking about, and the "d" says that I want all this for the
directory itself; I don't want to look inside. As you can see, just
cd'ing into /tmp doesn't change it, but doing the plain "ls" (which
lists the contents) absolutely does. So if we are going to find out
information about the directory our little problem file lives in,
we need to NOT do any listings within it until we get that
information. So, if the directory of interest is /data1, I might type:
# ls -lud /data1
In this case, the results weren't surprising. We had the
opportunity to do all this before the staff arrived to log in and do
work, so the access date was, as expected, within the backup
window. Although not absolute proof, it looks like the backup
program did at least look through this directory as it should have.
So now I needed to ask for the information about the actual file:
# ls -lu /data1/config.main
This time there was a different, and somewhat confusing result.
The time stamp was AFTER the backup. That was unexpected, because
supposedly this file is seldom used, and also no one else has
arrived at work yet. To be sure I had the whole picture, I needed
to look at the entire directory:
# ls -lut /data1
That added "t" says to sort the listing by the time I asked for
with "u". That makes it easy to pick out any unusual times, and
indeed, this file stood out separate from all the others. The rest
all showed access dates from the time of the backup; this one file
was the only one that was different. At least some of the other
files definitely would be accessed if any user had logged in and
started work without me noticing, so why did this file have a date
later than the backup? The coffee is definitely getting cold, isn't
it? This little puzzle isn't at all as easy as we expected it to be.
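That sort option is easy to demonstrate for yourself. In this little sketch (scratch directory, made-up file names) we backdate one file's access time and watch "ls -lut" float the recently-read file to the top:

```shell
cd "$(mktemp -d)"                # scratch directory to play in
touch older newer                # two files, both just "accessed" now
touch -a -t 202001010000 older   # push one access date back to 2020
ls -lut                          # newest access time first: "newer" tops the list
```

In a real directory full of files, anything that sorts away from the pack this way deserves a hard look.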
It's now time to refresh that coffee (I'll take tea, thanks) and
review what we know:
- The file definitely was not backed up, at least some of the time.
- Unless there is some strange bug in the backup program, there's
no reason it should have been skipped.
- The file has an unexpected access date.
When trying to solve problems, it's always a good idea to do
this kind of review periodically throughout the process. We humans
too easily forget things that we shouldn't. It's not that we've
necessarily forgotten fact X entirely, but we forget to take X into
account when we learn Z. Listing known facts, even if only in our
minds, can help us remember to examine these possible
relationships. For really complex problems, it can help to write
each "fact" on a separate piece of paper, and deliberately mix up
the order as we run through them yet another time. But back to the problem at hand.
Is it possible that the backup program just couldn't read the
file? You'd think something like that would get reported in the
logs, but just for the heck of it, let's check that we can copy the file ourselves:
# cp /data1/config.main /tmp
# diff /tmp/config.main /data1/config.main
Unix grew up in the days of teletype machines that had ugly
mechanical keys that took a lot of work to push down. That's why
Unix has "cp" instead of "copy" - if you used an old teletype
machine just once, you'd understand. The "diff" command checks two
text files and reports any differences. I did that just to be sure
that the file really was copied. It was. No errors. No reason the
backup couldn't read the file just as easily. Or could it?
Supposedly the backup runs with the same privileges as root (you
could guess that I'm logged in as root from the "#" signs that are
shown in the examples - that's not an absolute indication, but it
would be a fair guess on your part) so it should have no trouble
with any file - and if that weren't true there would have been
plenty of other things not backed up. Also, again, we'd expect
complaints in the logs.
But darn it, it sure looked like the backup had some issue with
this file. Painful as the thought was, that theory had to be
tested. Most backup programs have some way to do quick individual
backups, so I fired this up with a fresh tape and asked it to
backup /data1. There was nothing very large in there, so it didn't
take long. Checking both the tape and the logs showed that it had
included the config.main file in what it wrote to tape and was
perfectly happy to do so. No issues, no problems.
When you are in a similar fix, remember that this is where it's
OK to start talking to yourself. First, it makes you feel better
and helps convince you that this all isn't just a bad dream.
Secondly, it lets the people around you know that you can do more
than just sit staring dumbly at the screen. If you are talking, you
must be thinking. By the way (I speak from experience here), this
is probably not the best time to start humming that "If I only had a brain" tune from The Wizard of Oz. Muttering is fine. Keep it low
so folks won't know you aren't actually making any sense.
Seriously now, have we really tested the theory? The nightly
backup is scheduled by a "cron" job, which is how Unix tends to
repetitive tasks like this. While running under cron usually has
the same powers and privilege as being logged in, it is also
slightly different: the environment - the values of variables set
in the working shell - is not exactly the same. This can sometimes
cause problems for scripts run under cron, and thoroughly confuses
administrators who aren't aware of that difference. To be certain,
I'd need to run under the same conditions. That's fairly easy to
do: just change the schedule to run the regular backup now. To make
it less painful and faster, I temporarily changed the output for
the job to /dev/null rather than the tape. That /dev/null is a
special Unix file that just tosses out whatever you write to it -
much quicker than writing to tape, and the backup logs would show
me whether or not config.main was skipped.
It was not. Time for another review:
- The file definitely was not put on tape during at least some of the nightly runs.
- There is no apparent reason it should have been skipped.
- The file has an unexpected access date.
We have to qualify the "no reason it should have been skipped"
because our quick test with /dev/null isn't precisely the same as
the actual backup. It's possible that timing problems with the tape
had somehow caused this problem, but I couldn't think of any
circumstance that would cause that. The /dev/null test is
reasonable. We could always come back and revisit this if
necessary, but it looks like the backup should have archived the file.
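Incidentally, the environment difference between cron and a login shell is easy to see without waiting for a scheduled job. "env -i" starts a command with an empty environment, which is a rough stand-in for cron's sparse one (cron actually does set a few variables, such as HOME and a minimal PATH):

```shell
# Your login shell exports many variables; cron supplies very few.
# env -i crudely approximates that emptiness:
MYVAR=hello env -i sh -c 'echo "MYVAR is ${MYVAR:-unset}"'
# prints: MYVAR is unset
# A script that works fine at your prompt can still fail under cron -
# set PATH and anything else you depend on inside the script itself.
```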
That strange access date bothered me. Fortunately, it was still
early, so no one else had logged in yet. The "last" command (which
shows a log of who has logged in and out and what time they did so)
confirmed that no one but me had used this machine since the
previous afternoon. Yet the access time on config.main had shown as
being after the backup. I had been careful not to touch it before
getting that information, so what had changed that time stamp?
Lots of possibilities there. Just because no one is logged in
doesn't mean programs aren't running. Every operating system runs
daemons that take care of various things, and there's "at" (similar
to cron but for "one-offs") and, of course, "cron". Maybe a cron
job modifies this file? That would certainly affect the access
time. I went looking.
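The looking itself amounts to a handful of commands. Exact paths and tools vary between Unix flavors and Linux distributions, so treat this as a checklist rather than gospel:

```shell
# Where scheduled jobs hide (any of these may be absent on a given box):
crontab -l 2>/dev/null || echo "no crontab for this user"
cat /etc/crontab 2>/dev/null            # system-wide crontab
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily \
   /etc/cron.weekly /etc/cron.monthly 2>/dev/null
atq 2>/dev/null || echo "no pending at jobs (or at not installed)"
```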
Unix systems can have lots of different cron jobs. Each user
might have a crontab, even accounts nobody ever logs in as. Cron jobs
clean up temporary files, rotate logs, and do all sorts of useful
things. Linux systems have separate cron directories for things
that are supposed to run hourly, daily, weekly and monthly. There
can be a lot of different places to check, and you need to check
"at" jobs too because although these just run once at the time
specified, nothing stops them from rescheduling themselves. But
this turned out to be easy to find: a little script triggered
to fire off at 7:00 AM every morning. It was both a simple script,
and yet odd. All it did was copy a "config.safe" file to
"config.main". Why would that be necessary and why would that cause
a problem for the backup?
Well, logically, it wouldn't. The backup doesn't care about the
copy - even if it checked and saw that the file was being used by
that "cp" at the very instant it wanted to read the file, this
backup software is quite smart enough to skip the file momentarily
and come back for another try later. And it would log that
information anyway. So the cron job copy shouldn't hurt the backup
in any way.
What if the file just wasn't there?
That would explain a lot, wouldn't it? If a file isn't there,
the backup program sure as heck can't be responsible for not
backing it up. Maybe that's what that silly little copy program is
for - does something sometimes accidentally destroy this file and
that program's purpose is to recreate it every day? Yes, as it
turned out, that was exactly the case. Some bug in the application
could remove this file during its nightly reindexing which was
scheduled after work, but before the backup. It didn't happen all
the time, but often enough to be very annoying. The application
administrator wrote this little script to fix the problem early in
the morning before the users showed up to start doing work.
Apparently he made sure to duplicate any changes to config.main into config.safe before the reindexing started, and this
system thereby prevented the users from ever seeing any problem.
But if the file was deleted, it wouldn't get backed up. That was
exactly the problem.
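Pieced together from the story, the administrator's workaround probably looked something like this crontab entry. The 7:00 AM time comes from the account above; the paths are my guess:

```shell
# m  h  dom mon dow  command
  0  7  *   *   *    cp /data1/config.safe /data1/config.main
```

Nothing wrong with the entry itself - the trouble was that it papered over a file deletion during the very window when the backup ran.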
If I had not preserved the access time information, it might
have taken me much longer to realize what the problem was.
Actually, armed with that information, I should have realized it
sooner; that odd access time was a big red flag that should have
said "Look at me! Look at me!". That's often the biggest failing we
all have when doing this kind of work: we pay more attention to our
habitual assumptions than the things waving at us frantically from the sidelines.
Things to know
Unix systems keep two other dates: modify, and inode change. The
modified date is what you see when you use "ls -l" and is the date
that the file has been changed (or created, if nobody has changed
it since). The inode change date (ls -lc) reflects the time that
permissions or ownership have changed (but note that anything that
affects the modified time also affects this).
Some systems have a command line "stat" command that shows all
three times (and more useful info) at once. Here's the output from
"stat" on a Red Hat Linux system:
$ stat .bashrc
Size: 124 Filetype: Regular File
Mode: (0644/-rw-r--r--) Uid: ( 500/ tony) Gid: ( 500/ tony)
Device: 3,6 Inode: 19999 Links: 1
Access: Fri Aug 18 07:07:17 2000(00000.00:00:13)
Modify: Mon Oct 25 03:51:14 1999(00298.03:16:16)
Change: Mon Oct 25 03:51:14 1999(00298.03:16:16)
Note that "Modify" and "Change" are the same. That's because
"Change" will also reflect the time that a file was modified.
However, if we used chmod or chown on this file, the "Change" date
would reflect the time of that and "Modify" would not.
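A quick demonstration of that distinction, assuming GNU stat ("%Y" and "%Z" print the modify and change times as seconds since the epoch):

```shell
f=$(mktemp)                      # fresh scratch file
m1=$(stat -c %Y "$f")            # modify time at creation
sleep 1
chmod 600 "$f"                   # a metadata change only
m2=$(stat -c %Y "$f")
c2=$(stat -c %Z "$f")
# Modify is untouched, Change has moved forward:
[ "$m1" -eq "$m2" ] && [ "$c2" -gt "$m2" ] && echo "chmod changed ctime only"
```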
Windows (modern Windows, anyway) keeps the same three dates.
Unfortunately, Windows hasn't always been careful about keeping the access date from changing accidentally. You can see the access date by selecting it in the "Choose Details" dialog on Explorer's View menu, or you can conveniently see it just by right
clicking on the file and choosing "Properties". Make sure you write
it down, though, because choosing Properties changes the access
time to right now! You might not notice that without doing a manual
"Refresh" of the View, but it will change - unless Microsoft has
fixed that since I wrote this.
More Articles by Tony Lawrence