APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

Unixware 7 Non stop Clusters

THE PRODUCT DESCRIBED HERE IS NO LONGER OFFERED BY SCO

However, this article is a decent introduction to distributed computing (although dated) . For the best overview of distributed computing, see In Search of Clusters by Gregory Pfister (though that's also out of date).

All of us need some level of reliable access to our programs and data. The level of that need will vary- I can easily put up with my personal system being unavailable for even several days, but I can't have my web server down for more than a few hours. Other people have circumstances where they don't want systems to ever be down for more than a few minutes, and the general trend is more toward that.

Fault Tolerance

The buzzwords that pop up here are Fault Tolerant and High Availability. A good example of Fault Tolerance is a UPS (Uninteruptible Power Supply): the power goes off, or drops below an acceptable voltage, and the UPS kicks in and takes over. There's no effect whatsoever on your computer; that's perfect fault tolerance.

Other examples may not be quite so perfect. Let's take a mirrored drive as our first example. Lose one drive, and the controller just reads and writes to its mirror- no loss of data. Perfect fault tolerance, right? No, not quite, because the mirrored drives also provided another benefit- faster reads (that's just because the controller can get the data from whichever drive has its heads closest to what the controller wants: see RAID). It would be even worse with a RAID-5 system- the loss of a drive really slows things down there. So the data is still available in either case, but if you need it faster than the system can provide it, your "fault tolerant" system isn't.

There are also boundary conditions to be considered. Every fault tolerant design makes certain assumptions. For example, let's look at the UPS again. Suppose it's plugged into a 15 amp circuit and has a capacity of 1200 watts. Great.A typical computer system might draw 2 amps or roughly 200 watts. If you plug 20 of them into this unit, you could draw more power than the 15 amp circuit can handle, so the fuse blows, and now the 1200 watt UPS kicks in- but it can't handle it either, so down you go.

The Raid systems make assumptions, too. In the examples I used, two of the assumptions are that only one drive will fail and that you don't absolutely need the speed benefits that the array provided. But there are more subtle assumptions. A hard drive that has corrupted data because of bad memory or a bad CPU isn't really "available". Physically, it's fine, but its data is useless. It isn't the fault of the drive or the Raid, but it's useless just the same. The drive depends on other parts of the system, so you need to make those fault tolerant as well.

Therefor there really is no such thing as a Fault Tolerant- there are only levels of fault tolerance, and the more fault tolerance there is, the more difficult (expensive) the design is. Although some elements of fault tolerance have become inexpensive (UPS's), high level fault tolerance is still out of the reach of most individuals and small businesses.

High Availability

Obviously a system that is Highly Available must be fault tolerant to some degree. That's the whole idea- keep the systems running. But the difference is that Fault Tolerance is usually taken to mean something more like Zero Tolerance- the wall power fails, but the UPS handles it. A drive fails, but the highly optimized triple deep striped and mirrored array handles it with no perceptible performance loss. There's a bit error in the memory, but parity circuitry takes care of it. A CPU dies, but its twin, which has been running in lockstep all this time, picks up without even a stutter (note- that's NOT an SMP system).

Many of us don't need that. I want my web pages running 24X7, and yes, it's annoying for visitors who want something RIGHT NOW if they can't get it, but it's really not so horrible if it's down for a little while every now and then. What we need is high availability- we need our systems up most of the time.

There's more than just being down, of course. There's also data loss. Many web pages are mostly static- shut 'em down for an hour or a day and nothing's really lost. But that's not the case if you are selling books or taking airline reservations. So "most of the time" means different things to different people.

So what do we do about it? For critical data, we do regular backups and use redundant disks (see RAID). But if the main system dies, we're down. That's the situation most small businesses have: a critical point of failure at the CPU itself. We want our systems to be available, and some of our systems need to be available more or most of the time or even just about all of the time, but we have an Achille's heel that can cripple us.

Some people like to have that spare machine ready to go- they originally bought two machines, configured the hardware identically, and the spare sits around waiting for the need. That need may never come, and it's a shame to have expensive hardware sitting around gathering dust, but if something awful does ever happen, they'll be glad they have it. We'll call that a Cold Standby.

Often, people who use this scheme keep the standby machine running, and regularly transfer data from the live machine to keep it current and ready to go. This is a Hot Standby, and how "hot" it is just depends on how current the data is. This can be a pretty reasonable set up- it requires some manual intervention, but it's pretty quick.

Of course, some people need more than that. It would be nice to have the switch to the standby machine be automatic, and it would even be nicer if that standby machine did some work rather than just waiting in the wings for a call that may never come. Some of the familiar names that offer that include Microsoft's Wolfpack, SCO's Reliant HA and Linux Beowolf.

Douglas Steel, a Clustering Engineer at SCO, offered some comments on this:

I probably wouldn't use Beowulf as an example of a failover cluster, it's more aimed at running large (specially written) parallel programs, instead have a look at LinuxHA or TurboClusters, other things that may be worth mentioning are SunClusters, IBM HACMP, Digital TruClusters, etc.

According to SCO, the problem with these solutions is that they don't scale well- that is, you don't get twice as much performance from a 4 machine system as from a 2 machine system. They also aren't all that easy to manage: while there are some provisions for managing the group of machines concurrently, there's plenty of room for improvement.

SCO thinks that Unixware 7 Non Stop Clusters is that improvement.

What's a Non Stop Cluster?

A non-stop cluster is 2 or more Unixware 7 systems connected together with specialized hardware and software. Disk storage should be external to both machines, and accessible by both, but that isn't absolutely necessary. Each machine is referred to as a node, and the entire cluster can be seen as one machine for many functions- this is called Single System Image (SSI). Each machine runs its own kernel, and its devices (CDROM's, local file systems like /stand, etc.) are visible from the SSI view. Machines don't have their own unique users; adding users to the cluster makes the user login available everywhere. Nor does each machine need its own user licenses: NUL's (Network User Licenses) float between the nodes. Each node also has its own IP address, but the cluster itself has a virtual IP address.

SCO says that they've tested clusters of up to 30 nodes. However, the speed of the interconnect (see below) becomes the limiting factor, and of course that depends also on how much inter-node communication is going on- there will always be some, but certain applications will demand more than others. Therefore it is difficult to say what the practical limits are for the maximum number of nodes.

The system is Highly Available- if one node fails, others will take over the load. It's upgradeable- you can deliberately take out a node and replace it with a higher performance machine. Note that this implies that the nodes in a cluster do not have to be identical-in theory you could have a cluster that ran for decades and every node had been replaced during that time, perhaps more than once. You might have started with 6 Pentium 400's and ended up with 3 Itanium 4 Gigahertz machines, and the cluster as a whole might never have stopped running. That's quite a thought, isn't it? All the machines replaced, maybe more than once, and absolutely no down time.

Even the OS is upgradeable, although that does require a reboot. But the upgrade itself does not- that can be installed (on a spare division) while the cluster is running. Note that the whole cluster comes down for this reboot: although each machine runs its own kernel, it's the same OS, so this is a condition where the cluster has to stop, at least long enough to reboot. The "shutdown" command has been modified so that it brings down the entire cluster. If you do need to bring down an individual node, you use "clusternode_shutdown". There's also a "nodedown" command, which is described as more like "haltsys".

Scalable

The cluster is scalable, too: that is, adding more nodes to the cluster gives more performance. SCO claims 96.5% scalability from one to six nodes, which means that you get almost the same performance as you would from 6 individual machines. Such claims have to be taken with large doses of salt, however. In the first place, they are arrived at through the use of bench marks that really don't have much meaning for clusters. Don't blame SCO, though: everybody uses these bench marks, so it would be really dumb to use anything different. And it's perfectly possible that your particular application mix might scale very well. Of course, it's also possible that you'd get mediocre results. Unfortunately, you could also get awful results: there are programs that run more slowly on clusters than they would just sitting by themselves on a lone machine. That's rare, but it's possible, so if you are counting on a cluster for improved performance in addition to high availability, you are advised to test first (SCO actually has Optimization Centers that can be used for this purpose).

But scalability in general is a complex subject. This is true whether we're dealing with SMP (Symmetrical Multiprocessing), clusters, or CC-NUMA (Cache Coherent Non-Uniform Memory Access- a whole other ball game, very similar to clusters, but not offering high availability- at least not in design.).

NUMA clusters work by extending memory across multiple machines. Conceptually, it's just more virtual memory, though slower because the CPU's access to it comes from another node rather than local paging. The advantage of this is obvious: program code doesn't change. A CPU loads and stores memory addresses in another node with the exact same instructions it uses for local memory. However, that simplicity has to be paid for in complex hardware to make this all work. That's a weakness, but then again hardware always gets faster and cheaper, so in the long term NUMA may spread into the lower ends of the market.

The first problem was recognized a long time ago (clusters have been being designed for a long time- main frame clusters were very common- it's only affordable clusters that are new) by Gene Amdahl. Gene recognized that part of every program is "serial"- that is, part of it has to run in one place, cannot be split up into multiple jobs. A very obvious example of that is printing. Let's say we have a program that sorts our Accounts Receivable into aging buckets and then prints it out in alphabetical order. You could have multiple machines or CPU's working on the aging. You could even divide the sorting part up. But when it comes to printing, there's no advantage to multiple processes. You could have multiple processes, even one for every line to be printed, but there's no advantage, because the lines have to be printed in alphabetical order, which means that the processor responsible for printing Zippy's Auto Body has to wait until everything before Zippy gets printed.

Let's say that the printing takes 15 seconds, and that on a single machine, the aging and sorting takes 45 seconds. No matter how many CPU's or machines you have to work on this, it's always going to take more than 15 seconds to do this work. Always. So that means that no matter what horsepower you have, you cannot get more than a 4X improvement- the job originally took 1 minute, and the very best you can ever do is something more than 15 seconds.

And of course, it isn't just printing that's serial. Every program has serial aspects. In our aging example, we could, in theory, assign every customer to a different processor. But where does that customer data come from? From disk, of course, so those processors are going to have to get in line again for access to the data files. Even operating systems have serial parts: sections of code that update data have to be carefully handled so that things don't get munged up. So even though you might, in theory, be able to scale infinitely, your overall throughput is going to hit a brick wall.

And you can't really scale infinitely, either, not even in theory. Let's go back to the aging again, and this time we'll pretend we have 2 CPU's (or two nodes- it doesn't really matter) working on the job. Each gets 50% of the customers, and each can do its part in 22.5 seconds. That means our total job gets done in 37.5 seconds, right? No, it isn't exactly, because the 2 jobs are not going to start running at the same time. In fact, just splitting them up takes some extra overhead and changes to the program, but let that go for just a second. We doubled the CPU's, and almost halved the job time- we got 75% scalability, in fact.

Let's add more CPU's or nodes. Make it two more. Now each gets 25% of the customers, does its part in 11.25 seconds, and the whole job gets done in 26.25 seconds. Oops. Not quite so good. Twice the horsepower, but we only cut our time by another 11 seconds or so.

Now add a fifth CPU or node, and split the work up to 20% each. Now each CPU can do its part in 9 seconds, so the whole job gets done in 24 seconds. Gee, this is fun, but it isn't really scaling very well, is it? Actually, it is, at least for our theoretical case here. We're pretending that the nodes actually scale 1 for 1 on the actual work to be done. But that isn't what people expect- they expect that the whole job will get done more quickly with more machines. And it does, but it's diminishing returns: the lone machine did the job in 60 seconds, 2 machines cut that to not quite half, and the fifth machine only cut 2 seconds off. But at least we keep cutting time off, right?

Not necessarily.

As I mentioned above, the printer isn't the only serial point in the program. The disk is also serial (not RS232 or USB, I mean serial in its access method), and no doubt the program will have other areas. But let's just look at the disk: at some number of processors, an individual CPU is going to wait long enough for its turn that some other CPU will have already finished its work.

Let's take that slowly. We have N processors, each trying to get its 1/N chunk of the A/R data. We'll pretend that processor 1 goes first, gets its data and starts aging. Processor N waits. Processor two gets it's chunk and goes to work. Processor N waits. Processor three.. OK, you get the idea. If N is large enough, or the disk is slow enough, processor 1 might get finished before N even gets started. Well, gee, if processor 1 is done, then it is available for more work, isn't it? What good does N do us when good old processor 1 is ready and able? We don't need N; in fact N is useless. Our A/R aging doesn't scale to N processors. If we had a faster disk subsystem, maybe it would, but we don't, so it doesn't.

But that isn't the only detriment to scaling. In the examples above, I pretended that if one processor could do the entire task in 45 seconds, then a processor could do half the job in half that time. That isn't true, not even if we ignore the system overhead of splitting the job off to another processor. There's also a fixed, finite amount of startup time- the program has to get to the point where it's ready to start calculating aging, and when it's done, it has to clean up, pass its data back, and shutdown. Whatever time those things take will remain constant no matter how small the actual work is, and if the actual work gets small enough, the overhead will consume more time than the actual work, so adding processors makes the total throughput slower rather than faster.

And then there's the architectural issues. With SMP and NUMA, the more CPU's you have, the more cache issues you have, and there's another little system bottleneck that adds up. With clusters, you don't have the same coherency problems, but you do have I/O issues: the disk drives are not necessarily attached to the CPU that needs the data, for instance. And there's always communication needs between the clusters, so the more nodes we add, the more overhead we have there, and the more demands we make of our interconnect hardware.

By the way, there's nothing that says clusters can't be built with SMP machines, or that you can't have clusters of NUMA machines, or clusters of SMP NUMA machines, so these little issues can get very complicated very fast.

No matter who says that their cluster-numa-smp-who knows what boxes scale better than someone else's, it isn't necessarily true for whatever your mix of applications is.

Doug Steel again:

One example of things that scale badly are programs that make extensive use of writing to shared memory, shared memory is implemented using a Distributed Shared Memory paradigm, and can lead to pages ping-ponging between nodes. Running an Oracle database spread out between nodes shows this behaviour, you are better off pinning Oracle to one node where it will run as if on a standard UW7 system (it can however failover should the node fail).

Highly Available

Well, here we're on firmer ground. Clusters are highly available. Not fault tolerant (though again, there's nothing that says you can't build clusters with fault tolerant nodes), but definitely highly available.

So is everybody else and their cousin Sam.

Sure, it's easy to say that CC-NUMA isn't highly available. But if you go to Sequent's web site (now in the process of being consumed by IBM; try http://www.sequent.com), you'll find their NUMA machines being promoted as Highly Available. The web site is a little fuzzy about the details of just how this comes to be (aren't they always?), and you might wonder why a lot of the fluff seems to be devoted to the quality and speed of their on-site service rather than architectural issues, but the fact is they are labeling these things as Highly Available.. Architecturally, failure of a NUMA node is failure of the entire cluster because it's one OS spread over more than one machine. But again, nothing says you can't make each node fault tolerant to whatever degree you can afford, and nothing says you have to stick flatly to the NUMA architecture- you could have multiple NUMA clusters forming a super-cluster. Is that what Sequent does? I don't know- the price of that stuff is too stratospheric for me, so I didn't dig through the fluff to find out.

It's also easy to show that Brand X only does fail-over, can't migrate this, is going to be slower at benchmark Y (but don't mention that it's faster at benchmark Z) and that the interconnect architecture that glues these machines together is inferior in design and execution to whatever we're selling this week. Trouble is, it all changes next week anyway, so you need to keep that salt shaker handy. For example, the literature I have from SCO says that Microsoft's WolfPack is a failover system only with no SSI, and that Beowolf is a networking level cluster designed for balancing Web farms and the like. Well, you sure wouldn't get that impression reading Microsoft's web site. Here's part of what they claim for MCS (Microsoft Cluster Server, aka Wolfpack):

  • Availability: MSCS can automatically detect the failure of an application or server, and quickly restart it on a surviving server. Users only experience a momentary pause in service.
  • Manageability: MSCS lets administrators quickly inspect the status of all cluster resources, and easily move workload around onto different servers within the cluster. This is useful for manual load balancing, and to perform "rolling updates" on the servers without taking important data and applications offline.
  • Scalability: "Cluster-aware" applications can use the MSCS services through the MSCS Application Programming Interface (API) to do dynamic load balancing and scale across multiple servers within a cluster.

Does that sound like a failover only system with no SSI? Sure doesn't, but let's give the SCO literature the benefit of the doubt and say that Microsoft is guilty of extreme exaggeration for their present product. Frankly, even allowing for my pro-Unix bias, it does look to me that Unixware 7 Non Stop Clusters is a more advanced product. Great. How long do you think that will continue to be true? If Wolfpack really doesn't have SSI management now, what do you think the chances are that will still be true in 6 months? And with Linux, the rate of change is even faster. I didn't even bother to check out the current state of Beowolf for this article, because whatever I find out probably won't be true by the time I finish writing it!

Doug Steel had a lot to say about this:

Different people define SSI as different things, see my paper at http://www.ocston.org/~douglass/underware.html for some thoughts. Pfister's book is also good here.

I would say that MSCS is more that just simple failover, it has an partial SSI at the middleware/application level and has some explicit APIs to allow the application some control over its clustered nature.

Building an SSI system is hard work (believe me I've had a part in building more than one). NSC has a history going back to the LOCUS project in UCLA (late 70's/early 80's) and has taken around 20 years to mature to the point it is today (it's had several rewrites, changes in ownership and names over the past 20 years - if you want I can give some more of the history). Sun had a research project SolarisMC that built a SSI Solaris cluster in 1995/6 and have been claiming for years that FullMoon will be SSI "real soon now", finally some parts of it (cluster file system) will show up this year - nearly 5 years after they started trying to productize it.

Two Linux developments to watch out for are MOSIX (which has a cluster process management system similar to NSC in that it allows process migration) and the Linux Cluster Cabal (which includes Larry McVoy (ex Sun/SGI/SCO) and Stephen Tweedie (RedHat)), unfortunately info on the LCC is hard to come by as they meet in secret, but are reportedly working on something SSI-like.

I did read a research paper last year about someone who has some partial SSI on NT (research was co-sponsored by Microsoft) that include process migration, but it looked like it was at a very early stage and I have no idea if MS are persuing this commercially.

No doubt the Unixware product will continue to evolve, too, and maybe it will even stay ahead of the (wolf) pack. The only point here is that you have to keep your eyes open and your hands on your wallet, because there's a tremendous amount of noise and confusion with all this stuff.

Architecture

The Unixware 7 Non Stop Cluster is based on Tandem's Non Stop Clusters. Tandem is now a division of Compaq, so it shouldn't be too much of a surprise to learn that, at least right now, the Unixware 7 Non Stop Cluster only runs on Compaq equipment. Why? Because it's more than software. There's a special high speed interconnect card that ties the machines together, and it's a Compaq card designed for a Compaq server. There are plans to release a NSC (Non Stop Cluster) version that will use standard gigabit ethernet cards in standard Intel hardware, but that's a few months down the line. If you want it now, you will be using Compaq.

Not quite true, it turns out. Presently the Servernet interconnect is required (and that is Compaq), but I've been advised that the hardware list now includes Siemens

If you just have a cluster of two machines, you need nothing else. You install Unixware on one of the nodes, and then add the cluster software. Part of that installation creates a floppy disk. After rebooting the first machine, you stick the floppy in the second and boot it. From there, the cluster software takes over. Neat.

If you have three nodes or more, you need what's called a ServerNet Hub to plug every machine into (once the Gigabit version is available, you would of course need a Gigabit hub. SCO tells me that the Servernet communicates at 50 MB/second (bidirectional)- so when the Gigabit version is available, that should be able to handle more nodes or run faster with the same number).

Gene Henriksen, who is one of the lead trainers for the NSC course, tells me:

Future improvements include Giganet (as you mentioned) and VIA hardware (probably last half of the year) and plain old Ethernet in May to eliminate the Compaq ServerNet. Siemens does sell NSC but uses a re-badged ServerNet. Another cool feature is using the Online Software Upgrade slice as a backup root. This can protect you from a heavy handed operator.

As I mentioned above, you'll probably connect an external, twin-tailed Raid box to two of the servers. You can't connect it to more; the SCSI interface is twin-tailed, not N-tailed (but nothing prevents you from having another Storage Array connected to two other nodes in the hub, or more than one array connected to each node pair). This though, is one of the scalability bugaboos discussed earlier: let's say we have a 6 node cluster with one storage array. Only two of the nodes actually have a physical connection to storage- all other nodes have to use the ServerNet to get to or from that storage. Further, the second node only takes over disk IO in the event the first node fails. If the IO node is heavily laden with other tasks, everybody's IO could suffer.

Understand that this doesn't limit your storage very much: each node pair could have 11 storage units attached-at present configurations that's over 1 TG (teragigabyte) per node pair!

The type of interconnect used for distributed computing is the cause of religious wars. The interconnect needs to be there for data transfer between nodes and for the heart beat functions that let the system know who's alive and who isn't. But how that interconnect is designed and even how it is used is the kind of thing that makes competing vendors yell at each other.

Load Balancing

Of course, the idea of the cluster is that nobody gets too tied down with work. Jobs can run on any node- every node has its own kernel, every node has access to the devices on every other node. The cluster administrator can manually move jobs around from node to node, or the jobs can move automatically based on the current work level of the individual machines.

There are two different ways a job can move from one node to another. One is by migration, and the other is by remote execution- old fashioned rexec(). Migration is conceptually the same thing that happens on an SMP machine if a process moves off one CPU and on to another. The new CPU gets the process state, open files. pointers, etc. that the old had, and nobody even notices that anything is different. Obviously migration is what you want to happen.

I originally wrote (based on information I had from some early SCO sales literature):

But it can't always be that slick. For example, a process that has shared memory can't migrate. It has to be restarted on the new node. The same is true for any process with an open socket or a FIFO. Threads can't migrate by themselves either; they are stuck on the node that their process is running on; if the process goes somewhere else the threads would need to go with it- I'm not clear whether a process with threads can migrate or has to be restarted.

Well, that was wrong- several of the SCO clustering engineers corrected me immediately:

 

A process with shared memory, open sockets, FIFOs, etc. can be migrated without needing to be restarted. The one wrinkle here is that the kernel object implementing the shared memory, open socket, FIFO, etc. is left behind where it was created. The process can continue to use them, however this may be a performance problem and if the node where the object is located goes down then the object is lost and your shared memory, socket, FIFO, etc. goes away. Threads are a different matter and migrate with the process.

Fail-Over

Obviously failover is a restart situation. You can't migrate a process from a machine that isn't running any more. Further, fail-over is something that has to be arranged for ahead of time. This isn't load balancing; this is load recovery. Something we need running has stopped working and we want to start it up somewhere else. That's done by "registering" the application with the "spawndaemon", which sets up parameters for "keepalive", which you can think of as a super init program that knows a lot more about your program and how to start and stop it. Spawndaemon registration isn't necessary for load balancing. There's a separate configuration file that can specifically include or exclude processes from load balancing. Conceptually, though, it's obvious that you'd need some facility like this to handle fail over. I guess you could just keep track of every process that a node had running and restart each one elsewhere if it fails, but the registration process gives the administrator more control.

But what about the event that caused the failure to start with? This can get really interesting. We'll ignore the obvious case of a dead node for just a second and consider some of the other things that might cause an application to fail.

How about those neutron stars that every now and then happen to score a direct hit on one of our memory chips? More seriously, how about one of those really deep bugs that only pops up when everything is lined up just so and nobody knows why? So the app fails mysteriously, but the cluster software restarts it, probably on a different node, and now it works, because that node doesn't push its buttons quite the same way. Or it's a momentary problem- it would have worked, but something else wasn't quite ready, so it failed. Restart it elsewhere and it finds what it needs. These are situations when it's useful to restart, though of course you don't necessarily need a cluster to do so. No, it's dead machines that let the cluster software be the hero.

But what makes a machine dead?

Or more to the point, how does the cluster know the machine is dead? And what does it do about it? These are not trivial questions at all, in fact they are very tough questions, with more than one possible answer and a lot of hemming and hawing and "but if"'s thrown in just to keep you confused.

The Interconnect, again

Part of the function of the high speed connections between cluster nodes (ServerNet or Gigabit Ethernet for Unixware 7 NSC) is the "heart-beat"- the message that says "I'm alive" (clusters can implement that in different ways, and in fact there may be multiple methods used to determine node health) . If other nodes don't get this message from Node 1, then apparently Node 1 has crashed, gone to sleep, or is busy pursuing other interests. So, it's time to take over Node 1's work.

What if Node 1 isn't really dead?

What if Node 1 has just lost its heartbeat, but is otherwise quite happily chunking along, sorting those A/R aging lists, downloading that 4 gigabyte jpeg collection, and otherwise being a good little cluster node? Oooh, that wouldn't be good, would it? Especially if Node 1 is the guy who has the connection to the storage array. We can't very well have two programs working on the same data without them knowing it. We need to kill Node 1.

It's called a "poison pill", but really it's just a node specific shutdown. At the time the cluster started, one of the nodes got elected the "boss". It's that node's job to add new nodes, take nodes out, and otherwise monitor the health of the cluster. The boss could get replaced by some other node, of course. Barry Southon of SCO's Engineering Group explained it this way:

There is no regular challenge for authority. There are special cluster process that run on the primary and secondary nodes. These processes on the secondary nodes are just backup processes in case the primary node fails. These secondary processes only take over when these processes receive the SIGCLUSTER signal. The CLM process that runs on each node ( with an addional master process on the primary node ) is the process that sends out the SIGCLUSTER signal. The CLMs may consult the other nodes' CLMs processes when a heartbeat fails to arrive, or when for example an internal RPC call has timed out. The CLMs then consult each other and try to contact the primary nodes CLMs process. If all is OK then nothing need happen. If there is a problem, then SIGCLUSTER is sent around the cluster and all the other "backup" processes take over.

The boss also has the job of deciding where to restart the programs that died with Node 1, and getting them running again. As there can also be a Load Leveler process that attempts to balance the load on the different nodes, the processes might end up anywhere, but you do get to express preferences when you originally registered these as things you need to keep alive (the actual process that does this is called "keepalive"; the "spawndaemon" registers processes with "keepalive").

What if Node 1 won't die? That's probably a very unusual situation; it means you've got a nutso machine that isn't taking orders anymore and is just doing whatever it pleases. It would be very dangerous to just restart its processes, because now you've got two machines trying to do the same task. This is a situation that calls for manual intervention; the cluster software is best to keep its hands off something like this.

Another really bad condition would be if Node 1 is the one that has control of the Storage Array. The only way to release it is to do a node shutdown- if it can't do that, it probably can't release the Array to the other connected node. This is another situation where the cluster might have to restart.

But that should be extremely unusual. In most cases, hardware failure is relatively quick and deadly- just the sort of thing a cluster can detect and react to.

However, you do have to keep in mind that restarting software isn't necessarily seamless. Some software, particularly higher end databases, keeps state information and it can restart with either no or very little indication to its clients that anything went wrong. Other software is not so smooth. The cluster software really can't help you with that.

But inexpensive clusters do seem to make sense for a lot of people. Combining the power of a real Unix OS with this sort of reliability is something that could potentially save a lot of money, keep users and customers happy, and keep the IT support staff out of crisis mode. For more information, see SCO's web pages.

Publish your articles, comments, book reviews or opinions here!

Copyright February 2000 A.P. Lawrence. All rights reserved



Got something to add? Send me email.





(OLDER)    <- More Stuff -> (NEWER)    (NEWEST)   

Printer Friendly Version

-> -> Unixware 7 Non stop Clusters




Increase ad revenue 50-250% with Ezoic


More Articles by

Find me on Google+

© Tony Lawrence



Kerio Samepage


Have you tried Searching this site?

Unix/Linux/Mac OS X support by phone, email or on-site: Support Rates

This is a Unix/Linux resource website. It contains technical articles about Unix, Linux and general computing related subjects, opinion, news, help files, how-to's, tutorials and more.

Contact us





The last bug isn't fixed until the last user is dead. (Sidney Markowitz)

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do. (Donald Knuth)












This post tagged: