Kerio Control Require3WayHandshake dropping packets


2012/09/26

If you can keep your head when all about you are losing theirs...

- If (Rudyard Kipling)

It's not always easy, is it? I confess that I get into situations where I lose sight of the big picture and go off on a tangent when I should have taken a deep breath and stuck to basics. Such was the case last week when we installed a brand new Kerio Control 3110 firewall to replace an old SonicWall.

This particular firewall would be installed in a data center. It had a brother that would later go into the company offices, but that had to wait until the following week because the company was moving: the data center firewall would go in first, the office firewall would follow.

We had looked at the SonicWall configuration and pre-configured the Kerio, so I wasn't expecting major problems. I knew there might be some small glitches because we were deliberately ignoring a T1 line that was going away with the move, to be replaced by a site-to-site VPN on the Kerio, but I wasn't expecting anything really difficult.

Oh, how the computer gods must have laughed at my folly!

The Data Center

Fortuitously, this data center was located just a few hundred feet from the present company offices. At least we wouldn't have to go far, I thought happily. The gods smiled knowingly as we walked across the parking lot carrying the newly configured firewall, a laptop and some spare patch cords.

After checking in, we found their cage and started looking at the machines configured there. Most of them were in a Dell blade server. I just wanted to have a look-see to make sure I could flush arp caches as necessary and verify connectivity. We also needed to confirm IP addresses of the various machines.

First problem: we had trouble with passwords. The people I was with couldn't log in to all the servers, so we had to call another employee and have her walk over to the data center to assist us. I felt a little twinge of worry there, but it went away as soon as she arrived: she did have access.

So far, so good. We had some Tomcat servers on the same subnet as the firewall and two webservers on another subnet. The built-in switch on the blade enclosure provided VLANs to separate those subnets. I had created a DMZ subnet on one port of the new firewall and provided the rules for the webservers to query the Tomcat servers on the ports the SonicWall configuration said were needed.

There did seem to be a problem, though. On the webservers, we couldn't run basic commands: ssh, scp, sudo and shutdown would all just hang. That put a little knot in my stomach, but on the other hand the webservers were working, and supposedly the outside consultant who originally set them up had "locked them down" for security. Maybe this was some little bit of extra confusion he had engineered? It seemed odd to me, but as I said, they were working... The gods winked at one another and settled into their couches to watch the show.

Switcheroo

So, we mounted the firewall in the cage, readied the cables, took a deep breath and turned it on.

Of course I knew that arp caches would have to clear, but we could hurry that up by flushing cache on the blade machines. I wouldn't be able to do that on the webservers, but rebooting them would certainly do it.

Reboot? My companions didn't like that idea. They expressed doubt that the webservers would come back up correctly. What? That put another knot in my stomach and the computer gods clinked their glasses together, chuckling at my discomfort.

Well, the arp cache would time out eventually on its own - and if it did not, we could use "arping" to force it. Whatever - we had to do what we had to do.
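
For the record, clearing a stale entry on a Linux box is quick work. This is only a generic sketch - the addresses and interface names are placeholders, not the real ones from that night:

# Placeholder IP and interface - substitute the firewall's address and the real NIC
arp -d 192.168.131.1                  # drop the cached entry for the firewall's IP
ip neigh flush dev eth0               # or flush everything learned on that interface
ping -c 1 192.168.131.1 && arp -n     # a reply repopulates the cache with the new MAC
arping -U -I eth0 -c 3 192.168.131.1  # gratuitous ARP, run on the box that owns the IP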

So... everything came up and, after some arp clearing, things were basically working. There were a few minor typos in some traffic rules, but those were quickly fixed. But then the weirdness set in.

Dropped packets

A simple test was to bring up the main web page, which has a login link for their customers. That login depends on the Tomcat servers, and it was sometimes working and sometimes not.
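
Had I wanted to quantify the flakiness instead of just clicking reload, a quick loop would have done it. The URL here is a placeholder, of course:

# Hit the login page repeatedly and watch the status codes (placeholder URL)
for i in $(seq 1 20); do
  curl -s -o /dev/null --max-time 5 -w "%{http_code}\n" https://www.example.com/login
done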

Right there is where I should have stopped and thought it through. In retrospect, the only logical suspect was the blade switch, and I realized that, but we had lost access to it after bringing up the new firewall. Why? I don't know, but there it was - or there it was not, more accurately.

Let's take a minute to consider some environmental issues too. We were in a noisy data center with no cell phone access. The data center makes wireless phones available, but they were not charged, so making any call meant getting a long distance away from the equipment. We all felt the switch had to be at fault - probably it was holding on to arp entries for the old SonicWall - but what could we do? Lacking any other ideas, we decided to power cycle it.

A horrendously bad idea, as it turns out

Oh, was that ever a bad idea! We had been at least partially working before that; after, we had nothing. My stomach was in knots, and even in the dim light of the data center I could see that my companions' faces had gone white with fear.

As a little bit of extra comedy, the computer gods also arranged for someone else to enter the data center now, needing to replace some equipment in his cage. In a 25,000 square foot data center, what do you think the chances would be that his cage was right under ours? Ayup, now we had to trip over each other and give him access too!

It was getting late. We decided to regroup (at least virtually) in the morning. Their IT guy would study the blade documentation and call Dell in hopes of getting that fixed.

Another day

There's got to be a morning after
If we can hold on through the night
We have a chance to find the sunshine
Let's keep on looking for the light

- The Morning After (Maureen McGovern)

Well, the call to Dell was fruitful. By morning they had the switch reconfigured with the desired VLANs and things were partially working again. I say partially because we were still seeing the login flakiness.

Right here is where the customer may have made a mistake. They decided to wipe the webservers and reinstall. The rationale was all those commands that wouldn't work - they couldn't do any debugging, so a reinstall seemed like the only option. That may or may not have been true - they still hadn't been able to get hold of the guy who had originally installed those machines - but they went ahead with that plan.

Unfortunately, that didn't help. We still had the same flaky, sometimes-it-works, sometimes-it-doesn't behavior, and I still had not screwed my head on straight enough to see what the cause had to be. I called Kerio and they suggested changing the Debug log to show "Packets dropped for any reason". After that we would see entries like this:

[23/Sep/2012 11:50:42] {pktdrop} packet dropped: 3-way handshake
not completed (from Port 2, proto:TCP, len:92, 192.168.131.101:22 ->
10.189.241.5:49465, flags:[ ACK PSH ], seq:2673204299 ack:173509679,
win:27, tcplen:52)
 

Let's stop for a moment. Some of the gods have fallen on the floor and are contorted by laughter. Others are taking bets on whether I will miss this broad hint.

Yes, I missed it. Well, not entirely. I knew this was most likely the blade switch at fault, but it COULD be the Kerio firewall too. That latter possibility was slim, but I couldn't discount it entirely. I also knew that it was possible to tell the firewall NOT to be fussy about 3-way handshakes, but my brain stopped there and it should not have.

It's possible that I interrupted my thinking because the customer had now found the original developer and he was logged in, fixing things on the webservers. He seemed to be finding plenty to fix, so I had hopes that he would resolve the problem.

Second thoughts

However, over the next few hours, I had second thoughts. Looking at this logically, the webservers had to be making requests of the Tomcat machines - those machines wouldn't randomly acknowledge packets unless they had seen packets from the webservers. The Kerio firewall obviously had NOT seen the original packets. Again, that could be its fault, but I had a very easy way to find out. If I disabled the 3-way checking, only two things could happen: either everything would start working, which would mean the switch was sometimes leaking packets between the VLANs without passing them through the firewall, or nothing would change, which would prove the firewall was defective.

If you want to understand what this three-way handshake is all about, I recommend TCP Connection Establishment Process: The "Three-Way Handshake", but conceptually it's just this:

We have charged the Control box with mediating all conversations between the webservers and the Tomcat machines. What's happened is that the firewall has suddenly seen a Tomcat server say "Hi, Miss Webserver. Yes, I'd be delighted to open a session with you on port 27779!".

The Control box looked at that and said "The hell you will. I never saw any packet from the webserver requesting that session. You guys must be passing notes under the table or something and that's a security breach in my book, so there is no way I'm sending that packet to the Webserver!"

So it doesn't, and the login fails.

Understand that the webserver DID request the session. It's just that the packet that held that request was never seen by the firewall. The Tomcat server saw it; that's why it replied. Nothing "wrong" here except the path of the packets.
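
If you want to watch that handshake for yourself, tcpdump will show it from any Linux box the traffic actually touches. This is just an illustrative command - the interface, addresses and port 27779 are only the examples used above, not something we ran that day:

# The full exchange the firewall wants to see:
#   webserver -> tomcat   SYN       "may I open a session on 27779?"
#   tomcat -> webserver   SYN,ACK   "yes, delighted to"
#   webserver -> tomcat   ACK       "thanks, here we go"
# This filter shows the opening SYN and the SYN,ACK reply for each new connection:
tcpdump -ni eth0 'tcp port 27779 and (tcp[tcpflags] & tcp-syn) != 0'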

Disabling the 3-way requirement

As I said, disabling that security check could fix this. How do we do that?

If this were not a Control appliance box, you could log in at the console and do this:

Stop Control:  'sudo /etc/boxinit.d/60winroute stop'
Edit the 'winroute.cfg' file
Change 'Require3WayHandshake' from a '1' to a '0'
Start Control: 'sudo /etc/boxinit.d/60winroute start'
 

If you tried that while ssh'ed into a Control box, you'd lose ssh access the instant you ran the first command and you wouldn't get it back until someone physically reset the power on the box!

Fortunately there is a relatively easy solution. You can export the configuration to your local machine. It comes down as a gzipped tar bundle and when unzipped it contains these files:

-rw-r--r--  1 tony  staff   10696 Sep 24 09:15 UserDB.cfg
-rw-r--r--  1 tony  staff       7 Sep 24 09:15 hardware
-rw-r--r--  1 tony  staff     218 Sep 24 09:15 host.cfg
-rw-r--r--  1 tony  staff     187 Sep 24 09:15 interfaces
-rw-r--r--  1 tony  staff   21474 Sep 24 09:15 logs.cfg
-rw-r--r--  1 tony  staff       2 Sep 24 09:15 version
-rw-r--r--  1 tony  staff  126636 Sep 24 09:20 winroute.cfg
 

I edited winroute.cfg and made this change:

<variable name="Require3WayHandshake">0</variable>
 

I then tarred it all up again to a gzipped bundle and was ready to import it. However, the developers were still making changes, so I had to wait another few hours to avoid causing problems for them. Eventually I got the all clear, imported the configuration and a few minutes later we were working with no flakiness at all.
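
If you want to repeat that round trip, it amounts to no more than this. The file names are mine, not anything Kerio dictates:

# Unpack the exported bundle (whatever you named it), edit, repack:
mkdir kerio-config
tar -xzf control-export.tar.gz -C kerio-config
cd kerio-config
vi winroute.cfg                         # flip Require3WayHandshake from 1 to 0
tar -czf ../control-modified.tar.gz *
# ...then import control-modified.tar.gz from the Control admin interface.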

Verdict: it's the switch passing notes where it should not. This is a bandaid, of course: we should NOT leave that check disabled, but until somebody fixes the switch, we have to.

I'll tell you that this experience does make me reevaluate my feelings about integrated blades. Like a multifunction printer, it seems like a great idea, but when an important part of it malfunctions, debugging would be a lot easier with separate machines on separate switches instead of everything on a common backplane separated only by software.

So, for now we will run this way. As noted, I have no choice: the customer needs his server working. Here's an interesting thing though: I'm not sure SonicWall does that check by default. From what I find on the web, you need to specifically enable that check. If that's true, this leak could have been going on for years and nobody would have known.

The customer will get it fixed. When he does, I'll update his configuration to enable the 3-way check again. Actually, I've recommended that he put in a couple of VMware servers instead and virtualize all this. That's much more flexible and a heck of a lot easier to debug.











© Anthony Lawrence











