This is the story of a long support incident that happened after a customer replaced an ailing Kerio Control firewall machine with a new server.
Let me preface this by telling you that it embarrasses me a bit that it took me several hours to even fix this, and it bothers me even more that I fixed it without really understanding why it needed to be fixed. There were some extenuating circumstances; I was overtired from having to take my wife to the emergency room after I had just arrived home from a 20-hour trip - yes, she's OK, it was an allergic reaction to an antibiotic. She's fine now, but this problem came up in the midst of all that, so I was tired and distracted.
I could give other excuses, too. I was working with a new customer and I didn't yet have a good mental image of his three offices, so I was sometimes getting them confused. I was also confused because I thought things had been working correctly prior to the firewall transplant - in fact, I was double-dog sure of that, and we all know what happens when we get double-dog sure.
Finally, I was slightly thrown off because two-thirds of his network uses non-private addressing (something the current IT guy inherited and hasn't yet found the time to change). I find it just a little harder to scan through firewall rules and recognize internal IP's when they aren't RFC-1918 addresses.
So, harrumph and fooey and all that, but the fact remains that I should have spotted the problem instantly and when I DID spot it, I should have realized why things had apparently been working. With all that in mind, join me in a fun ride through this mess.
As obliquely mentioned above, the customer's old firewall machine needed replacement. It was apparently in the habit of winking off now and then. That's inexcusable for a guard on watch during time of war and for a firewall, so a new machine was purchased.
I wish I had thought to mention the Kerio Firewall Software Appliance before they bought that new hardware. They purchased a new server with Windows 2008 Server OS, a rather pricey operating system and totally unneeded. In fact, more than unneeded: it's unwanted because you don't need it if your intent is to run the firewall and nothing else. As that darn well should be your intent (firewalls shouldn't be serving any other function), there is no reason to pay for a server OS. All you need is hardware and the downloadable ISO for the Kerio Firewall Software Appliance. Configuration data is transportable from the old firewall and it really is that easy. I didn't think to mention that, though, so it's my fault.
However, the transplant went fairly smoothly - some minor confusion with interfaces, but the IT guy solved that quickly on his own and come Monday morning everything seemed to be fine. Well, almost everything - one minor glitch was that their branch office firewall couldn't send email alerts to him. It was that small (but important) problem that led me and their IT guy into madness.
Just the facts, Ma'am
First, let's review some facts: the branch office firewall was configured to send alerts to the aforementioned IT guy. Let's call him "Rick" to save me some typing from now on. Rick certainly could receive mail: I could send him email, people in the branch office could send him email. The only thing that couldn't was the branch office firewall and as that had not been touched, obviously (yeah, double-dog sure, I know) the problem had to be in the newly rejuvenated firewall in the main office which sat in front of the mail server.
In retrospect, if those were the only facts at hand, we might have proceeded more quickly, but unfortunately there had also been some other recent experience involving email, all of which is related in "Troubleshooting a Scanner with an email gateway". You don't need to read that whole thing right now, but there is an important little piece of evidence in one paragraph:
Let's start with what I knew. I knew that we could telnet to port 25 from a machine on "New Company" to the Kerio mail server. I knew that because I had led their tech guy through a whole SMTP client impersonation drill last week.
I want to mention that because by the time we reach the end of this web page, you are going to say that the above paragraph is a lie, that it never happened, that it could NOT have happened. And yet, my memory (admittedly a fragile thing at times) insists that it did happen as related and moreover, Rick agrees with me. If it were just me and my unreliable brain that has suffered some disorientation from missing sleep and being stressed by my wife's problems, I'd agree that the facts are plain: the paragraph is wrong. It has to be wrong. But..
Oh well, leave that be other than to note that I *knew* that I should be able to telnet to their mail server from anywhere on port 25. I KNEW that, dammit!
Because I knew that I should be able to, and because I knew that the firewall was not able to send email, of course that was my first debugging step: I tried the telnet.
It failed. Not rejected, but it timed out, unable to reach its destination.
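The difference between "rejected" and "timed out" matters here, so here is a rough Python sketch of the same check I was doing by hand with telnet. The function name probe_tcp and the 10 second default are my own inventions for illustration; the only thing taken from the story is the idea of connecting to port 25 and watching how the attempt fails.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 10.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open" if the connection succeeds, "refused" if the far end
    actively rejects it, "timeout" if nothing answers within `timeout`
    seconds, and "error" for anything else (DNS failure, no route, ...).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered and said no: the host is reachable.
        return "refused"
    except socket.timeout:
        # Nothing answered at all, which is what a silently dropping
        # firewall rule typically looks like from the outside.
        return "timeout"
    except OSError:
        return "error"
```

Something like probe_tcp("theirmailserver.com", 25) is the scripted equivalent of my telnet. A "timeout" result points at packets being silently dropped somewhere along the way, while "refused" would have meant the host was reachable but nothing was listening.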
Let's review, shall we? Let's review the "facts". That is, the facts I knew as I knew them at that point, the facts I assumed to be true, the facts I was double-dog sure were true:
I should have been able to "telnet theirmailserver.com 25"
I could not; it would hang.
The firewall ahead of the mail server had been transplanted to new hardware.
Nothing else had changed.
Well, gosh, given those facts, it's bleeping obvious, isn't it: there's something funky about their firewall rules. Somehow, something changed. That's obvious enough, right?
So I started looking at the rules.
Kerio stores everything in an xml file called "winroute.cfg". A minimal version of that might contain a little less than 3,000 lines, but their file was 7,623 lines. In those lines were 99 traffic rules, which isn't overwhelming, but it can be a bit daunting, especially when one is stressed, tired and so on. Nor should we forget the "facts" that I knew, and especially the fact that this was working before the transplant. That's a fact, Jack. So there!
Hold your horses, partner
There was one tiny little thing. It's just the teeniest little inconvenience that we could just toss out of consideration because of all the other obvious facts.
Mail was working.
Even though I could not "telnet theirmailserver.com 25" or "telnet theirmailserver.com 465", I could send them mail, both through my own Kerio server and through Gmail. I could send mail to Rick, and in fact I SENT mail to him while debugging their apparently blocked mail server. Duh.. that is inconvenient for my other facts, isn't it?
Looking at the logs of MY mail server, I did notice a pattern: there was always a delay for mail I sent to Rick. It was as much as three minutes before the message would leave my queue. Could this possibly be a reverse DNS issue? I asked Rick to do a nslookup of my IP from his mail server and from his firewall. Nope, instant response. Dammit - I was going to have to look at the rules.
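For the curious, the reverse DNS check I asked Rick to run can be approximated in a few lines of Python. This is only a sketch: timed_reverse_lookup is a name I made up, and it uses whatever resolver the local machine is configured with, which may not be the one the mail server or firewall actually consults.

```python
import socket
import time


def timed_reverse_lookup(ip: str):
    """Return (hostname or None, seconds taken) for a PTR lookup of `ip`."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses;
        # a missing PTR record or a resolver failure ends up here.
        hostname = None
    return hostname, time.monotonic() - start
```

A slow or failing lookup would show up as a large elapsed time or a None hostname. In our case the answer came back instantly, so that theory was dead.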
I say "dammit" because I did not want to do this. Pawing through 99 rules isn't all that bad, but these things depend upon definitions elsewhere and sometimes definitions are nested inside other definitions and then there is the aforementioned unfamiliarity with his networks and the non-RFC-1918 addressing and, dammit, I was tired. Also, I had started to come down with a little "stomach flu", so I was in no mood for any of this.
But, as they say, the show must go on. When a customer has a problem, you stick with it until you fall down and die, or at least until the customer says "To hell with this, let's regroup tomorrow". So Rick and I started looking at the rules.
I don't know how much other crap we had looked at when Rick forced me to look at his "SMTP Rule". I HAD looked at it, but I hadn't LOOKED at it, because I already *knew* that it was right. It had to be right, first because mail had always worked before the transplant and second because it still WAS working, even if I could not presently "telnet theirmailserver.com 25". Rick was insistent that I should examine that rule again because, he said, "it bothers me".
The rule looked something like this:
Name: SMTP Rule
Source: Secondary MX, VPN tunnels
Service: SMTP (25), SMTPS (465)
Action: Allow
The real rule was actually a little more complicated. It didn't just "Allow"; it mapped the traffic to the mail server's internal IP and it had numeric IP's rather than symbolic names, but its purpose was plain enough: it was going to allow port 25/465 traffic from their secondary MX machine and from their VPN's and that was it.
Well, duh, that's not going to work. The world needs to be able to make SMTP connections and this rule wasn't going to do it. We looked through the rest of the rules and I couldn't find anything that would allow it. And yet, of course, I bloody well KNEW that it had to have been working in the past!
My supposition at this point was that somewhere in those 7,623 lines was something that was hiding from me and that had mysteriously broken during the transplant. What, I could not imagine, but remember that I *knew* that a few days back I had been able to "telnet theirmailserver.com 25" so that is what it HAD to be. I therefore made a reasonable suggestion to Rick:
Add a new rule at the very top that allows SMTP traffic from anywhere and maps it through to the mail server. Actually, not quite at the top: there was a "bad guys" rule that blocked certain well known annoyances, so I had Rick move that rule to the very top and then added my new rule under it.
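To make the before and after concrete, here is a toy model of the rule list in Python. It assumes rules are checked top to bottom, that the first match wins, and that anything unmatched is silently dropped; that is how I was reading the rule list, but treat it as an assumption of this sketch and not as a statement about Kerio internals. Every address is a made-up documentation-range value, and the Rule class and evaluate function are hypothetical names of mine.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class Rule:
    name: str
    sources: list              # CIDR strings; "0.0.0.0/0" means anywhere
    ports: Optional[set]       # None means any port
    action: str                # "allow" or "drop"


def evaluate(rules, src_ip: str, dst_port: int):
    """First matching rule wins; unmatched traffic is dropped (assumed)."""
    addr = ip_address(src_ip)
    for rule in rules:
        port_ok = rule.ports is None or dst_port in rule.ports
        if port_ok and any(addr in ip_network(net) for net in rule.sources):
            return rule.action, rule.name
    return "drop", "default"


# Hypothetical addresses standing in for the real (non-RFC-1918) ones.
smtp_rule = Rule("SMTP Rule", ["198.51.100.25/32", "10.8.0.0/24"], {25, 465}, "allow")
bad_guys = Rule("Bad guys", ["192.0.2.0/24"], None, "drop")
smtp_anywhere = Rule("SMTP from anywhere", ["0.0.0.0/0"], {25, 465}, "allow")

old_rules = [smtp_rule]
new_rules = [bad_guys, smtp_anywhere, smtp_rule]

print(evaluate(old_rules, "203.0.113.7", 25))    # an outside sender, before the fix
print(evaluate(old_rules, "198.51.100.25", 25))  # the secondary MX, before the fix
print(evaluate(new_rules, "203.0.113.7", 25))    # an outside sender, after the fix
print(evaluate(new_rules, "192.0.2.66", 25))     # a "bad guy", after the fix
```

Keeping the "bad guys" rule above the new allow rule is the whole point of the ordering: if the allow-from-anywhere rule sat first, the known annoyances would get SMTP access along with everyone else.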
As I expected, the "telnet theirmailserver.com 25" now worked and mail that I sent to Rick no longer got delayed. Problem solved for the moment. I suggested to Rick that as we were both tired and sick (yeah, he had some bug too), we should stop right there and revisit the rules later to see why this had happened. That was agreed to immediately.
Duh! The secondary MX!!
The next day, I had email from Rick. I'll reproduce it here with only minimal obfuscation:
Is it possible that ALL ALONG we have been getting mail "second hand"?
What I see is that when mail failed trying our primary MX, after
x amount of time it went to our secondary, which then transferred
the mail to our primary, which the old rules let in.
OMG, indeed. Yes, of course, that is exactly why mail worked, exactly why it had that several minute delay and exactly why I could not "telnet theirmailserver.com 25". That explains everything.
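Rick's theory is easy to model. The Python sketch below is a simulation only, with no real DNS or SMTP involved: a sender tries MX hosts in order of preference, gives up on any it cannot reach, and the secondary then relays to the primary if it can. The hostnames and the route_message function are hypothetical, and real mail servers have their own timeout and retry behavior that this ignores.

```python
def route_message(mx_records, sender_can_reach, relay_can_reach):
    """Trace the path a message takes under simple MX fallback.

    mx_records: list of (preference, host); lowest preference is the primary.
    sender_can_reach: hosts the outside sender can connect to on port 25.
    relay_can_reach: hosts the secondary MX can connect to on port 25.
    """
    primary = min(mx_records)[1]
    path = []
    for _pref, host in sorted(mx_records):
        if host not in sender_can_reach:
            # The sender waits out a connection timeout before moving on.
            path.append(f"sender -> {host} (timed out)")
            continue
        path.append(f"sender -> {host}")
        if host != primary and primary in relay_can_reach:
            path.append(f"{host} -> {primary}")
        return path
    return path


# Made-up hostnames for the primary and secondary MX.
MX = [(10, "mx1.example.com"), (20, "mx2.example.com")]

# Before the fix: the world cannot reach the primary, but the secondary can.
before = route_message(MX, {"mx2.example.com"}, {"mx1.example.com"})

# After the fix: the world can reach the primary directly.
after = route_message(MX, {"mx1.example.com", "mx2.example.com"}, {"mx1.example.com"})

print(before)
print(after)
```

The "timed out" hop in the first trace is where my several minute delay came from: my server had to give up on the primary before it would even try the secondary.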
Well, not quite everything. It didn't explain why the branch firewall couldn't send him alerts and it didn't explain how both of us thought we HAD done a "telnet theirmailserver.com 25" from another branch just days earlier. But it sure explained everything else.
We went back to play with the alert problem but couldn't get it to work, so I kicked that upstairs to Kerio support - maybe the solution will be interesting enough for its own article; if not I'll add it to the comments here later.
But why? WHY??
Why was it configured that way? I can only assume an accident. It makes absolutely no sense and in fact defeats the purpose of having a secondary MX. If this sort of configuration was wanted because the other mail server needed to process all emails, then it should have been the primary, not the secondary. Somebody goofed up these rules way back when and that's how it has been working all along.
I THINK it was because their secondary was once their primary and when the Kerio mailserver was put in, it was initially configured just to receive mail from that machine. Maybe they did that while testing the Kerio and just forgot to fix the rule once it was ready to roll out completely. That's probably it - they just forgot.
As to the "telnet theirmailserver.com 25" mystery, I don't know. Maybe I had him telnet to my mail server, maybe we telnetted to his secondary.. but neither of us believes that. Hmm.. if the Mac we did that from hasn't used Terminal too much since then, the commands we typed would still be in the bash or csh history.. maybe I'll ask Rick to check that. It won't show what happened, but it will show what we actually typed..
Nope. Rick sent the history. We definitely were trying to go to their primary MX and our memories say that it worked before the firewall upgrade. THAT is impossible so obviously Rick and I are both insane.. or were in some alternate reality. I don't know - some things are just too baffling to bother with.