Troubleshooting: Preserve the scene
You've seen it on television or in the movies: important evidence is lost at a crime scene because someone wasn't careful enough to preserve it. A crowd of curiousity seekers destroys important footprints, a helpful housekeeper thoughtfully washes the bathroom sink, wiping away fingerprints. The police investigators show up and are angry at the carelessness of the patrol officer who should have protected the scene. A crime scene investigator always wants to preserve and protect evidence.
Obviously, computers provide evidence too. If information on a hard drive were part of a crime investigation, that hard drive would immediately be removed and kept safe for analysis. The forensic investigators would clone it, and work with the copy so that their work didn't accidentally destroy original evidence. The procedures for doing this are quite involved, and will include taking MD5 hashes of the original and the copy so that integrity can be maintained and insured.
In a way, any computer problem is a crime. Perhaps not in the legal sense, but it is misbehaviour - maybe by a person using the computer, or maybe by the computer itself. It may be petty, or quite serious. It can cost every bit as much or more than a real crime might. It only makes sense that you should approach any problem with an eye toward not destroying information you might need. Most computer problems aren't important enough to justify all that a professional forensic investigator would do (or if they are that important you probably don't have the luxury of time to do all that), but before you jump in to fix things, you should sit back for just a moment and think about whether there is any evidence that needs to be preserved, and what you might need to do - or more usually NOT do - to preserve it.
As an example of this, we'll take the case of a backup that did not contain a file that should have been on it. The situation was that Machine A was backed up to tape nightly, and that tape in turn was restored to Machine B every morning. The purpose of this was to have a reasonably current system that could be pressed into service in the event of sudden failure of the main machine. Nowadays, most companies with a similar need would employ some form of clustering and redundant storage instead, but this backup-restore scheme is inexpensive and does meet minimal needs for some situations.
Any redundancy plan is pointless if it isn't tested. It's not enough to just restore data to the standby machine: you have to verify that the system is capable of doing the work it may have to do. Toward that end, this company had a procedure where the person responsible for restoring the tape would also login to the standby and run a particular report. The report chosen was one that would access many files, so if the correct output were produced, it could be reasonably assumed that data was properly restored and the system was capable of use in an emergency. This was just a final check: the original backup would have also been verified while still on Machine A, and of course the restoration to B would also report any errors encountered. This report was just another assurance that all was well.
But one morning things were not well. The backup on the live machine had completed sucessfully, and had even read back everything written to tape and verified that each file matched what was stored on the hard drive. The restoration on the standby reported no errors reading the tape there. The standby hard drive didn't complain of any problems writing the data. But the report didn't produce the right results. More specifically, it seemed to be getting part of its data from an older file. Indeed, the specific file was easily identified, and it could be seen the version on Machine B was older than expected. This was not a file that gets modified or even used at all every day, but it was still important data. Fortunately, this problem was discovered by the integrity test rather than by real need, but the next question obviously was "How do we fix this?"
I bet you have answers, or at least guesses, already. This shouldn't be a very difficult problem: either the file is on the tape and it did not restore, or it was not on the tape at all. In the first case we look for the problem on the standby box, and in the second we go look at the backup procedures on the live box. Simple stuff, this should be done in minutes, right?
Sure. It should be simple, and the overwhelming likelihood is that it is a quick and easy diagnosis and fix. We can probably knock this one off before our coffee gets cold. Piece of cake, no sweat, why do they even bother me with such trivial stuff?
But what if it isn't? What if this somehow turns into one of those nasty little problems where you keep muttering things like "but it can't.." and your brain is bouncing off two apparently contradictory observations? What then? Your coffee is getting cold, your boss wants to know why this isn't fixed yet, and you have absolutely no idea. Why do they always give you the hard ones?
So, before "piece of cake" turns into "Huh?", let's think about what we can do to preserve evidence we might need if this thing does turn nasty on us.
First thing, the tape. That should be pretty much self preserving - reading it again doesn't hurt anything. Except.. tapes can snap, or get jammed up so that you can't do a thing with them. Not likely, of course, but if something like that did happen, there goes our evidence. Is there anything else we could do? Sure - we could look at the logs of the restoration to see if they even mention this file. But they don't, so now we have to look at the tape, right?
Not just yet. Yes, we want to check the tape. But we also want to look at some older tapes and see if that file was on those sets. If it never has been on any recent tape, that's very different from it having been there right up until last night. If it is only missing from last night's tape, something has changed over on the live machine very recently. That's important to know, because it could save us time and trouble - we'd know we were looking for a recent change. We do know that at some point this file did get backed up and did get restored here, because we do have it - older than it should be, but it is there.
Are you thinking we should go look at the live machine now? We know that file isn't used every day, so maybe we should go run that report over there and see what it says? NO! Do not do that. Don't do it, don't even think it. We'll get over there in a minute. You are absolutely right, and this whole thing might be a false alarm: the file over there might be older than what we expect it to be. That might still be a problem that needs solving, but it would absolve the backup and restore of responsibility. But it's not time to do that yet, tempting as it is. We started out with the theory that the file is either on the tape or it isn't, and we need to follow that through. If it is on last night's tape, and has an old date, then we'll move on to wherever that leads us. But first we'll list a couple of tapes. I did just that.
Good thing this isn't the old days where you had to list through everything to get what you wanted, right? Nowadays tape backup systems have indexes, and can fast forward to the spot we want. My coffee was still warm and I had the answer: the file was not on the tape. Not on last night's, the night before, or even last month's archive set. Never backed it up. Case solved, the problem is over at the live backup, go fix it.
There you go, getting cocky again. Yes, the problem is over at the live backup. Maybe this is going to be simple and easy. But maybe it isn't. So let's just keep on with the slow, methodical path. Boring? Yes. Which is better: a little pedantic investigation now or an exciting disaster later? Besides, we can't really say it was NEVER backed up - it had to have been backed up and restored fairly recently. All I really knew at this moment was that it wasn't on the three tapes I chose to look at. It's the familiar old story: if you don't see something right now, that doesn't prove it wasn't there when you weren't looking. It's a human trait to extrapolate from the data we have, and that often works very well. But we do need to remind ourselves that extrapolation isn't fact.
The first thing to check at the backup machine is the logs. We know the file wasn't backed up, and the logs may tell us why. Unfortunately, they didn't. There was nothing in there that referenced that important little file. I checked the configuration next, but that said that the filesystem this file lives in is supposed to be backed up, so was unhelpful. Exclusions? Yes, certain files are excluded from the backup - /tmp for instance, and other temporary files. But definitely not this file or anything like it. What next? Well, Unix keeps track of the time a file was last accessed (so do Windows systems - see the "Things to Know" at the end of this article). The file doesn't have to be changed, just looking at the contents of a file will change its last access date. That could be very helpful to us here: if no one else has used that file, the backup certainly should have. So its access date shouldn't be older than the time of the backup, no matter what. With luck, we might learn something from checking that date.
It's very easy to lose this kind of evidence. You should be thinking about whether you might need access date information before you do anything - even before you log in. For example, suppose you have reason to suspect that someone may have compromised an account and used it inappropriately. If that account is "root" (the Unix all-powerful, full administrator account), logging in as root yourself can destroy some of the evidence of the other login.
The Unix command to show access dates is "ls -lut", so we'll just cd over to the directory the file is in and..
WAIT! One thing you might want to know before you do that. It's just possible that we might be interested in the access date of that directory itself. Suppose that whole directory has not been backed up? That might be something worth knowing. Just cd'ing into a directory won't change an access date, but doing a directory listing will - because to get a listing, the directory has to be opened and read. Watch this:
# ls -lud /tmp drwxrwxrwt 9 root root 4096 Dec 23 12:01 /tmp # cd /tmp # ls -lud /tmp drwxrwxrwt 9 root root 4096 Dec 23 12:01 /tmp # ls /tmp CPAN-1.76 orbit-root orbit-tony ssh-Qcl22102 ssh-zgPU1825 texconf.V1nM1i # ls -lud /tmp drwxrwxrwt 9 root root 4096 Dec 24 09:54 /tmp #(I'm using command line tools here. Modern Unix and Linux systems have graphical environments just as Windows does, but for detective work, the command line is usually the place to be.)
The "ls" is the Unix equivalent to a DOS "dir". The "-l" asks for details, the "u" says I want to see that access date we've been talking about, and the "d" says that I want all this for the directory itself; I don't want to look inside. As you can see, just cd'ing into /tmp doesn't change it, but doing the plain "ls" (which lists the contents) absolutely does. So if we are going to find out information about the directory our little problem file lives in, we need to NOT do any listings within it until we get that information. So, if the directory of interest is /data1, I might do:
# ls -lud /data1
In this case, the results weren't surprising. We had the opportunity to do all this before the staff arrived to login and do work, so the access date was, as expected, within the backup window. Although not absolute proof, it looks like the backup program did at least look through this directory as it should have. So now I needed to ask for the information about the actual file:
# ls -lu /data1/config.main
This time there was a different, and somewhat confusing result. The time stamp was AFTER the backup. That was unexpected, because supposedly this file is seldom used, and also no one else has arrived at work yet. To be sure I had the whole picture, I needed to look at the entire directory:
# ls -lut /data1
That added "t" says to sort the listing by the time I asked for with "u". That makes it easy to pick out any unusual times, and indeed, this file stood out separate from all the others. The rest all showed access dates from the time of the backup; this one file was the only one that was different. At least some of the other files definitely would be accessed if any user had logged in and started work without me noticing, so why did this file have a date later than the backup? The coffee is definitely getting cold, isn't it? This little puzzle isn't at all as easy as we expected it to be.
It's now time to refresh that coffee (I'll take tea, thanks) and review what we know:
- The file definitely was not backed up, at least some of
- Unless there is some strange bug in the backup program, there's no reason it should have been skipped.
- The file has an unexpected access date.
When trying to solve problems, it's always a good idea to do this kind of review periodically throughout the process. We humans too easily forget things that we shouldn't. It's not that we've necessarily forgotten fact X entirely, but we forget to take X into account when we learn Z. Listing known facts, even if only in our minds, can help us remember to examine these possible relationships. For really complex problems, it can help to write each "fact" on a separate piece of paper, and deliberately mix up the order as we run through them yet another time. But back to the current problem:
Is it possible that the backup program just couldn't read the file? You'd think something like that would get reported in the logs, but just for the heck of it, let's check that we can copy it:
# cp /data1/config.main /tmp # diff /tmp/config.main /data1/config.main
Unix grew up in the days of teletype machines that had ugly mechanical keys that took a lot of work to push down. That's why Unix has "cp" instead of "copy" - if you used an old teletype machine just once, you'd understand. The "diff" command checks two text files and reports any differences. I did that just to be sure that the file really was copied. It was. No errors. No reason the backup couldn't read the file just as easily. Or could it? Supposedly the backup runs with the same privileges as root (you could guess that I'm logged in as root from the "#" signs that are shown in the examples - that's not an absolute indication, but it would be a fair guess on your part) so it should have no trouble with any file - and if that weren't true there would have been plenty of other things not backed up. Also, again, we'd expect complaints in the logs.
But darn it, it sure looked like the backup had some issue with this file. Painful as the thought was, that theory had to be tested. Most backup programs have some way to do quick individual backups, so I fired this up with a fresh tape and asked it to backup /data1. There was nothing very large in there, so it didn't take long. Checking both the tape and the logs showed that it had included the config.main file in what it wrote to tape and was perfectly happy to do so. No issues, no problems.
When you are in a similar fix, remember that this is where it's OK to start talking to yourself. First, it makes you feel better and helps convince you that this all isn't just a bad dream. Secondly, it lets the people around you know that you can do more than just sit staring dumbly at the screen. If you are talking, you must be thinking. By the way (I speak from experience here), this is probably not the best time to start humming that "If I only had a brain" tune from the Wizard of OZ. Muttering is fine. Keep it low so folks won't know you aren't actually making any sense.
Seriously now, have we really tested the theory? The nightly backup is scheduled by a "cron" job, which is how Unix tends to repetitive tasks like this. While running under cron usually has the same powers and privilege as being logged in, it is also slightly different: the environment - the values of variables set in the working shell - is not exactly the same. This can sometimes cause problems for scripts run under cron, and thoroughly confuses administrators who aren't aware of that difference. To be certain, I'd need to run under the same conditions. That's fairly easy to do: just change the schedule to run the regular backup now. To make it less painful and faster, I temporarily changed the output for the job to /dev/null rather than the tape. That /dev/null is a special Unix file that just tosses out whatever you write to it - much quicker than writing to tape, and the backup logs would show me whether or not config.main was skipped.
It was not. Time for another review:
- The file definitely was not put on tape during at least some
- There is no apparent reason it should have been skipped.
- The file has an unexpected access date.
We have to qualify the "no reason it should have been skipped" because our quick test with /dev/null isn't precisely the same as the actual backup. It's possible that timing problems with the tape had somehow caused this problem, but I couldn't think of any circumstance that would cause that. The /dev/null test is reasonable. We could always come back and revisit this if necessary, but it looks like the backup should have archived the file.
That strange access date bothered me. Fortunately, it was still early, so no one else had logged in yet. The "last" command (which shows a log of who has logged in and out and what time they did so) confirmed that no one but me had used this machine since the previous afternoon. Yet the access time on config.main had shown as being after the backup. I had been careful not to touch it before getting that information, so what had changed that time stamp?
Lot's of possibilities there. Just because no one is logged in doesn't mean programs aren't running. Every operating system runs daemons that take care of various things, and there's "at" (similar to cron but for "one-offs") and, of course, "cron". Maybe a cron job modifies this file? That would certainly affect the access time. I went looking.
Unix systems can have lots of different cron jobs. Each user might have a cron job, even if they aren't a login user. Cron jobs clean up temporary files, rotate logs, and do all sorts of useful things. Linux systems have separate cron directories for things that are supposed to run hourly, daily, weekly and monthly. There can be a lot of different places to check, and you need to check "at" jobs too because although these just run once at the time specified, nothing stops them from rescheduling themselves. But this turned out out to be easy to find: a little script triggered to fire off at 7:00 AM every morning. It was both a simple script, and yet odd. All it did was copy a "config.safe" file to "config.main". Why would that be necessary and why would that cause a problem for the backup?
Well, logically, it wouldn't. The backup doesn't care about the copy - even if it checked and saw that the file was being used by that "cp" at the very instant it wanted to read the file, this backup software is quite smart enough to skip the file momentarily and come back for another try later. And it would log that information anyway. So the cron job copy shouldn't hurt the backup in any way.
What if the file just wasn't there?
That would explain a lot, wouldn't it? If a file isn't there, the backup program sure as heck can't be responsible for not backing it up. Maybe that's what that silly little copy program is for - does something sometimes accidentally destroy this file and that program's purpose is to recreate it every day? Yes, as it turned out, that was exactly the case. Some bug in the application could remove this file during its nightly reindexing which was scheduled after work, but before the backup. It didn't happen all the time, but often enough to be very annoying. The application administrator wrote this little script to fix the problem early in the morning before the users showed up to start doing work. Apparently he made sure that he duplicated any changes of config.main to config.safe before the reindexing started, and this system thereby prevented the users from ever seeing any problem. But if the file was deleted, it wouldn't get backed up. That was exactly the problem.
If I had not preserved the access time information, it might have taken me much longer to realize what the problem was. Actually, armed with that information, I should have realized it sooner; that odd access time was a big red flag that should have said "Look at me! Look at me!". That's often the biggest failing we all have when doing this kind of work: we pay more attention to our habitual assumptions than the things waving at us frantically from the sidelines.
Things to know
Unix systems keep two other dates: modify, and inode change. The modified date is what you see when you use "ls -l" and is the date that the file has been changed (or created, if nobody has changed it since). The inode change date (ls -lc) reflects the time that permissions or ownership have changed (but note that anything that affects the modified time also affects this).
Some systems have a command line "stat" command that shows all three times (and more useful info) at once. Here's the output from "stat" on a Red Hat Linux system:
$ stat .bashrc File: ".bashrc" Size: 124 Filetype: Regular File Mode: (0644/-rw-r--r--) Uid: ( 500/ tony) Gid: ( 500/ tony) Device: 3,6 Inode: 19999 Links: 1 Access: Fri Aug 18 07:07:17 2000(00000.00:00:13) Modify: Mon Oct 25 03:51:14 1999(00298.03:16:16) Change: Mon Oct 25 03:51:14 1999(00298.03:16:16)
Note that "Modify" and "Change" are the same. That's because "Change" will also reflect the time that a file was modified. However, if we used chmod or chown on this file, the "Change" date would reflect the time of that and "Modify" would not.
Windows (modern Windows, anyway) keeps the same three dates. Unfortunately, Windows isn't always been careful about preserving the access date from changing accidentally. You can see the access date by selecting it in the "Choose Details" from the View menu dialog of Explorer, or you can conveniently see it just by right clicking on the file and choosing "Properties". Make sure you write it down, though, because choosing Properties changes the access time to right now! You might not notice that without doing a manual "Refresh" of the View, but it will change - unless Microsoft has fixed that since I wrote this.
(OLDER) <- More Stuff -> (NEWER) (NEWEST)
Printer Friendly Version