From: Bela Lubkin <belal@caldera.com>
Subject: Re: Server crashes - need help! :(
Date: Thu, 2 Jan 2003 08:06:37 GMT
References: <3E0F2797.1040102@dniq-online.com> <20021229115029.I10531@mammoth.ca.caldera.com> <3E0FDE64.505@dniq-online.com> <20021230043506.K10531@mammoth.ca.caldera.com> <3E13B6D9.7090704@dniq-online.com>
Farlander wrote:
> Bela Lubkin wrote:
>
> > Well, that's clearly in the Dialog driver (and then a double-panic in
> > the panic dump writing code...!)
> >
> > When it hits this double-panic (2nd panic in bcpalign()), has it printed
> > any of the dump-in-progress dots?
>
> Yes it did. Usually it prints at least one or two points. Sometimes
> it prints almost all the points, then double panics - and then again
> prints a few points.
I need to be really clear here: a double-panic while printing the dots
is possible (it could panic in the host adapter driver). What I can't
make sense of is a double-panic with the stack you showed: a trap in
bcpalign() (which is bcopy()) called from dumpnextpage(), with a
different panic already on the stack.

Seeing that makes me think there is some fundamental problem with the
hardware -- something like cache coherency problems.
> > If the double-panics are happening after the dump has printed some dots,
> > there is something bad happening in the hardware. Something like a DMA
> > transfer being written to a wrong address, corrupting memory not owned
> > by the driver.
>
> On 13 servers in the same manner? I don't know... It would be kind of
> strange, wouldn't it?

Yep.
How similar is the hardware on the 13 servers? (Specifically:
motherboards & CPUs)
> > What is the value of register CR2 in these dumps? That's the address it
> > got the fault on. Should be the same as either %esi or %edi in the last
> > trap frame in the stack trace (si:dffda000 di:c0110000 in the 2nd
> > example).
>
> 0x0000000C in the first one, 0x0000000E - in the second one.
See, that doesn't make sense. bcpalign is a label within assembly code:
    allcopy+2D    nop
    bcpalign      movl   %ecx,%ebx
    bcpalign+2    shrl   %ecx,2
    bcpalign+5    rep movs (%esi),(%edi)
    bcpalign+7    andl   %ebx,3
    bcptail       movl   %ecx,%ebx
(scodb disassembly). There is only one memory reference in that code,
and it reads from %esi and writes to %edi. If we panic at that address,
%cr2 _has_ to match either %esi or %edi. Or you're looking at a corrupt
dump, or the hardware is seriously busted.
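
To make that reasoning concrete, here is a rough C sketch. It is an illustration, not the kernel source: `copy_words_then_tail()` and `cr2_matches_copy()` are hypothetical names, the first function only mirrors the word-then-tail shape visible in the disassembly, and the 4-byte window in the check is an assumption meant to cover a word access that straddles a page boundary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C rendering of the copy shape in the disassembly:
 * move count/4 32-bit words, then the remaining count%4 bytes. */
void copy_words_then_tail(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = count >> 2;  /* the shrl by 2: number of 4-byte words */
    size_t tail = count & 3;    /* the andl with 3: leftover bytes */

    /* Stands in for "rep movs": the only place memory is touched,
     * reading through the source pointer and writing through the
     * destination pointer. */
    for (size_t i = 0; i < words; i++) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        s += 4;
        d += 4;
    }
    while (tail--)
        *d++ = *s++;
}

/* True if a fault address falls inside a 4-byte access starting at ptr.
 * Unsigned subtraction wraps, so addresses below ptr never match. */
static bool within_access(uint32_t cr2, uint32_t ptr)
{
    return (uint32_t)(cr2 - ptr) < 4u;
}

/* A fault taken inside the word copy should report an address that
 * belongs to either the source (%esi) or the destination (%edi). */
bool cr2_matches_copy(uint32_t cr2, uint32_t esi, uint32_t edi)
{
    return within_access(cr2, esi) || within_access(cr2, edi);
}
```

With the values from the second example (si:dffda000, di:c0110000, CR2 reported as 0x0000000E), `cr2_matches_copy(0x0000000E, 0xdffda000, 0xc0110000)` returns false, which is exactly the inconsistency described above.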
> > For that matter, how are you displaying these stacks?! Those are
> > crash(ADM) output. To get crash output on a panic, you would need a
> > finished panic dump, but these show the system going down in flames in
> > mid-dump! I could understand scodb traces, you could be using a serial
> > console and capturing the output, but crash output from a double-panic
> > in the dump code?!?
>
> That's a very good question :) Please let me know if you find an
> answer :)
Hmmm. This isn't a real situation, you're making it up to see what kind
of response you get?

I would seriously like to know how you're showing crash(ADM) output of a
double-panic that aborted sysdump() in mid-dump.
> But seriously, what I'm going to do now is I want to remove a few
> boards from some of servers and see if they still will be crashing.
>
> By the way, as far as I know, they have a patched version of STREAMS
> driver installed, that SCO has made especially for us - some kind of
> patch for some error on multiple CPU systems, I don't know any details,
> unfortunately. What I know is when I've installed the same driver on the
> single CPU server, I got exactly the same kind of crashes - in 'freeb'
> and in 'msgcount'. When I rolled back the original driver the single CPU
> (P4) crashed in 'hlted' two times, but since I've removed three boards
> from it - seems to be pretty stable (but, on the other hand, the load on
> it has not yet been as high as it usually is, so lets wait till the
> holidays are over).
If you're using a patched STREAMS driver, you're going to have to
identify it much more strongly. That changes the whole picture --
you're running something unique, who knows what kind of behavior it has?

Then again, you say you crashed in `hlted'?! That's assembly again,
only has two instructions in it. To panic there, the CPU's `HLT'
instruction would have to somehow screw up.
Many years ago I helped someone debug a panic immediately after a `HLT'
instruction. The cause was a very low-level hardware problem. They
were the vendors of a 386SX "daughterboard" that plugged into a 286
socket. 386 Xenix was panicking. It turned out that they were not
following the CPU bus protocol for interrupts closely enough. A device
raises an interrupt, eventually the CPU strobes `INTA' (interrupt
acknowledge), the device presents the IRQ number, the CPU reads it, then
strobes `INTA' again. The 386SX daughterboard's timing for presenting
the IRQ number was off, so the CPU would read a nonsense IRQ number.
Result: CPU is sitting idle (in a `HLT' instruction), an IRQ comes in,
CPU reads a nonsense IRQ number, tries to jump to a nonexistent offset
in the interrupt vector table. Kernel sees this as a panic in `HLT'.
The panic wasn't seen in other instructions because if the CPU was
running an instruction that actually did something, the timing was
different and the daughterboard's late presentation of the IRQ number
wasn't late. Only when the CPU was sitting idle in `HLT', with nothing
on its "todo" list but handle the next incoming IRQ, was it fast enough
to see the problem.
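
The timing race in that story can be shown with a toy model in C. Everything in it is an assumption made for illustration: the tick counts are arbitrary, `FLOATING_BUS` is a made-up stand-in for whatever an undriven bus happens to read as, and `sample_tick()` and `latch_vector()` are hypothetical names rather than anything from real hardware or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value the CPU reads when nothing is driving the bus yet. */
#define FLOATING_BUS 0xFFu

/* Ticks (arbitrary units, counted from the moment the interrupt is
 * raised) until the CPU samples the IRQ number. A CPU in the middle of
 * an instruction first spends busy_ticks finishing it; a CPU idling in
 * HLT acknowledges right away. */
unsigned sample_tick(bool halted, unsigned ack_ticks, unsigned busy_ticks)
{
    return halted ? ack_ticks : ack_ticks + busy_ticks;
}

/* The CPU gets the real IRQ number only if the device is already
 * presenting it when the CPU samples; otherwise it reads nonsense. */
uint8_t latch_vector(unsigned sample_at, unsigned device_ready_tick,
                     uint8_t real_vector)
{
    return (device_ready_tick <= sample_at) ? real_vector : FLOATING_BUS;
}
```

For instance, with a device that needs 6 ticks to present IRQ number 0x21, `latch_vector(sample_tick(false, 4, 3), 6, 0x21)` yields 0x21 because the busy CPU samples at tick 7, while `latch_vector(sample_tick(true, 4, 3), 6, 0x21)` yields the nonsense value because the halted CPU samples at tick 4, before the device is ready.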
You should make sure you have the latest microcode updates for your
CPUs. I see our `p6update' supplement (oss471f) is from late 2000, so
it probably doesn't cover your CPUs. :-(

Meanwhile, if you're having trouble with a patched STREAMS driver, you
_need_ to go back up the support channel you got that patch from. I'm
not going to be able to help with that. Without knowing what you're
really running, I'm only guessing. It's not useful.
>Bela<
/Bofcusm/1929.html copyright 1997-2004 Bela Lubkin All Rights Reserved