Is this network congestion or some other problem? Can you see what is confusing me? Join me in a network troubleshooting problem.
I'm going to start by describing the root problem and then we'll move into the details and observations.
A customer has a site to site VPN between two Kerio Control firewalls, one at the main office and one at a warehouse. The performance on that has never been great and the lack of performance was always attributed to having an asymmetric Internet connection at the main office. We'll call that connection at the main office "Cable" and say that it is nominally 100 Mb down and 15 Mb up. The connection at the warehouse is also nominally 100/15.
They were able to get a 100/100 Fiber connection at the office. Call that "Fiber". We expected that would help the VPN problems. It did not - in fact, performance became worse when we tried using the Fiber for the site to site VPN.
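To make that expectation concrete, here is a rough sketch of the arithmetic. It assumes the only limit on one-way VPN throughput is the sender's nominal upload speed and the receiver's nominal download speed, and it ignores VPN overhead, latency and everything else on the path. The names `LINKS`, `vpn_ceiling` and `ceilings` are made up for this illustration.

```python
# Nominal link speeds in Mb, taken from the description above.
LINKS = {
    "Cable": {"down": 100, "up": 15},      # main office, original connection
    "Fiber": {"down": 100, "up": 100},     # main office, new connection
    "Warehouse": {"down": 100, "up": 15},  # warehouse connection
}


def vpn_ceiling(sender, receiver):
    """Best-case one-way rate: limited by sender upload and receiver download."""
    return min(LINKS[sender]["up"], LINKS[receiver]["down"])


def ceilings(office_link):
    """Best-case rates in both directions for a given office connection."""
    return {
        "office_to_warehouse": vpn_ceiling(office_link, "Warehouse"),
        "warehouse_to_office": vpn_ceiling("Warehouse", office_link),
    }


if __name__ == "__main__":
    for link in ("Cable", "Fiber"):
        print(link, ceilings(link))
```

Under those assumptions the Fiber should lift the office-to-warehouse ceiling well above what Cable allows, while warehouse-to-office stays capped by the warehouse's 15 Mb upload. Nothing in this simple model predicts performance getting worse.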
Locking down the VPN
As the normal configuration attempts load balancing between Cable and Fiber, we thought perhaps that the VPN was still sending some traffic through Cable. If so, that certainly shouldn't have caused worse performance, but just the same we locked the VPN traffic to Fiber using this Traffic Rule:
To my surprise, that broke the VPN completely. The speed was so bad we had to immediately abandon that rule. But why? What was happening here?
Speed tests from within the LAN using Fiber showed us better than 90 Mb down and never better than 60 Mb up. While disappointing, even 60 Mb still should have improved the VPN performance.
The laptop test
While attempting to debug that, a test was made where a laptop was connected directly to the Fiber. The speed tests jumped to slightly over 100 Mb down and 80-85 Mb up. Obviously the network or firewall was adding significant overhead.
Move VPN back to Cable
Because the VPN was unusable over Fiber, we moved it back to Cable. There it was slow, but still useful.
Attempting to debug that, we turned on debugging at both ends for "packets dropped for some reason".
Immediately the debug was filled with anti-spoofing messages.
Anti-spoofing comes into play when a packet is seen at an interface it shouldn't be at. Note that first packet? That's a LAN packet appearing on the Cable interface. The second is a LAN packet on the Fiber interface.
The third one is another VPN to another office. How do LAN packets appear to be coming into a WAN interface?
They get there because of topology. Everything in this network comes through two interconnected 48 port switches. There are no VLANs - the Cable is plugged into one port, the Fiber is in another and the LAN machines are in the rest.
Perhaps worse, some LAN machines are dual homed and have Cable public IPs on another NIC and those cards are also plugged into the very same switch.
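The anti-spoofing idea described above can be sketched in a few lines. This is a simplified model for illustration only, not Kerio Control's actual algorithm. The LAN subnet and the sample addresses are hypothetical, since the real addressing is not shown here, and the function name `is_spoofed` is made up.

```python
import ipaddress

# Hypothetical addressing: the real subnets are not given in this post.
INTERFACE_NETWORKS = {
    "LAN": ipaddress.ip_network("192.168.1.0/24"),
}
WAN_INTERFACES = {"Cable", "Fiber"}


def is_spoofed(arrival_interface, source_ip):
    """Return True if the source address should not show up on this interface."""
    src = ipaddress.ip_address(source_ip)
    if arrival_interface in WAN_INTERFACES:
        # A source inside a local network should never arrive on a WAN interface.
        return any(src in net for net in INTERFACE_NETWORKS.values())
    # On an internal interface, the source must belong to that interface's network.
    return src not in INTERFACE_NETWORKS[arrival_interface]


if __name__ == "__main__":
    print(is_spoofed("Cable", "192.168.1.50"))  # LAN source seen on Cable
    print(is_spoofed("LAN", "192.168.1.50"))    # LAN source seen on LAN
```

In this model a LAN-sourced packet that the switch delivers to the Cable or Fiber port is flagged, which matches the kind of message filling the debug log.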
But a switch is supposed to switch
Here's where my knowledge gets shaky. A switch is supposed to "switch" packets. That is, it is supposed to learn that one MAC address is at port 6 and another is at port 7, so if it gets a non-broadcast packet that is supposed to go to that first computer, it puts it on port 6 and only port 6. The other ports don't see it - that's part of what distinguishes a switch from a hub.
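The learning behavior just described can be sketched as a toy model. This is a generic textbook learning switch, not the firmware of the switches at this site. The class name, port numbers and MAC strings are invented for the example, and real switches also age entries out of the table, which this sketch leaves out.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model of a MAC-learning Ethernet switch (no VLANs, no aging)."""

    def __init__(self, num_ports):
        self.ports = list(range(1, num_ports + 1))
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the list of ports it is sent out of."""
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = in_port
        # Broadcasts and unknown destinations go to every port except the source.
        if dst_mac == BROADCAST or dst_mac not in self.mac_table:
            return [p for p in self.ports if p != in_port]
        out_port = self.mac_table[dst_mac]
        # Known destination: forward to that one port (or drop if it is the source port).
        return [] if out_port == in_port else [out_port]


if __name__ == "__main__":
    sw = LearningSwitch(8)
    print(sw.receive(6, "aa:aa", "bb:bb"))  # bb:bb not learned yet: flooded
    print(sw.receive(7, "bb:bb", "aa:aa"))  # aa:aa is known: port 6 only
    print(sw.receive(6, "aa:aa", "bb:bb"))  # bb:bb is now known: port 7 only
```

Note that even a correctly working switch floods a unicast frame when the destination MAC is not in its table. Whether that is what these switches are doing is not something this sketch can answer.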
That's plainly not happening here. Most (but not all) of the anti-spoofing messages are about UDP traffic and some of it is broadcasts, but thousands and thousands of non broadcast packets are being seen at the wrong interfaces.
Is this flooding slowing us down?
I'm assuming that the switch must be flooding these as though they were all broadcast packets. If it were actually dropping all these packets, this wouldn't work at all. So does this activity slow us down?
It definitely makes debugging hard: there is so much noise from this stuff that we could easily miss more significant messages.
It turns out that the switches are capable of creating VLANs. Therefore we created a VLAN for the Fiber. That eliminated messages in the log about Fiber and it also marginally improved the Speed test. The VPN was moved back to Fiber and it was again too slow (4 Mb) to be of any use.
We also tried taking that interface on the firewall directly to the Fiber, bypassing the VLAN. No help.
I'm confused right now and have several questions I can't wrap my head around.
Why are the switches flooding traffic?
Why is it mostly or maybe all UDP flooding? Why not TCP?
Are the anti-spoofing messages just noise or do they cause congestion?
Does having dual homed machines on the switches contribute to flooding?
Why is bypassing the switches (laptop test) so much faster?
And another thing
In addition to all that stuff, there's another oddity here. We had trouble creating the VPN on the Fiber.
With site to site VPNs, you make one side Passive and one side Active. That wasn't working, but making both sides Active did work. That makes no sense to me at all unless some rule I didn't notice is interfering. If a rule IS causing interference, maybe that's the source of the poor performance, but I don't see anything and I have had Kerio look these over also.
So this is where we are right now. Very, very confused. We intend to VLAN the Cable side also, but the dual homed machines make that a little more complicated - one will have to be physically relocated before that can happen.
Any thoughts are welcome.
Hurricane Sandy interrupted us, but we eventually got back to this.
Having exhausted everything else, I felt it HAD to be the Windows box, so I convinced the customer to take the time to put up Control in the Linux version.
Upload speeds immediately climbed to 73 Mb - still a bit shy of the 90 Mb download, but a big, big jump. He used an old piece of hardware for this; something newer might do better. At least it is now fast enough to use!
I don't think it's a Linux vs. Windows issue. I think it's a NIC problem on the Windows box that Windows just isn't seeing, but the happy thing is that it is fixed!