Towards the end of March, my internet connection started acting flaky. I noticed it happening during WoW raids, and my wife caught Netflix and Spotify cutting out during the day. The internet would drop for a minute or so, and then come back online. It was never out for very long, and never in close succession.
I chalked it up to Comcast doing some maintenance work in our region. However, after two weeks of drops that seemed to be becoming more frequent, I decided to dig into the issue. Checking the logs on my cable modem, I never saw any loss of signal or connection, which ruled out any type of physical issue. Simple ping checks confirmed that it wasn’t an internal connectivity problem. The issue was somewhere between my firewall and Comcast’s upstream router, at a logical level.
Logs don’t lie
The revelation came when my wife noticed a pattern in the disconnects. Our Mumble voice chat server logs every time someone joins or leaves. She had been disconnected several times that day, and saw in the join/leave logs that the disconnects occurred about 1 hour apart. Example logs:
2:05pm - Mory has left the server
2:06pm - Mory has joined the server
3:06pm - Mory has left the server
3:07pm - Mory has joined the server
This gave me further confirmation that it was a logical issue, as physical problems rarely show up on such a regular schedule.
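The same check can be done without eyeballing the log. The following is a minimal Python sketch that assumes the log lines look exactly like the example above; `disconnect_intervals` and the other names are made up for illustration and are not part of Mumble.

```python
import re

# Assumed line format, taken from the example log lines above:
#   "2:05pm - Mory has left the server"
LINE_RE = re.compile(
    r"^(\d{1,2}):(\d{2})(am|pm) - (\S+) has (left|joined) the server$"
)


def minutes_since_midnight(hour, minute, meridiem):
    """Convert a 12-hour clock time to minutes since midnight."""
    hour = hour % 12          # 12am -> 0, 12pm -> 0 (then +12 below)
    if meridiem == "pm":
        hour += 12
    return hour * 60 + minute


def disconnect_intervals(lines, user):
    """Return the gaps, in minutes, between consecutive 'left' events for user.

    Assumes all lines fall on the same day and are in chronological order.
    """
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(4) == user and match.group(5) == "left":
            times.append(
                minutes_since_midnight(
                    int(match.group(1)), int(match.group(2)), match.group(3)
                )
            )
    return [later - earlier for earlier, later in zip(times, times[1:])]


sample_log = [
    "2:05pm - Mory has left the server",
    "2:06pm - Mory has joined the server",
    "3:06pm - Mory has left the server",
    "3:07pm - Mory has joined the server",
]

print(disconnect_intervals(sample_log, "Mory"))  # prints [61]
```

Gaps that cluster tightly around one value, as they do here, point to a timer somewhere rather than a flaky cable.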
Armed with the knowledge of when the next outage should occur, I started monitoring all of my logs. I had my firewall’s log monitor open, I had pings running to a variety of sources, and I had Wireshark capturing my packets.
The next disconnect arrived exactly as predicted. I immediately noticed a flurry of activity in my firewall’s logs. As my pings started dropping, my firewall was negotiating its DHCP lease with Comcast’s DHCP server. When the firewall was unable to renew the lease on that interface, it killed all active sessions and started the DHCP lease process from the beginning. This was the ultimate cause of the internet drops.
After doing some more investigation, I learned the details of Comcast’s DHCP process. My firewall would request an initial lease, and would receive the same public IP address from Comcast’s DHCP server (which happened to be in Oregon). The lease was for 7200 seconds, or 2 hours. DHCP clients will typically wait until their lease is halfway through, and then they’ll try to refresh proactively.
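The timing can be worked out directly. RFC 2131 puts the default renewal timer (T1) at half the lease and the default rebinding timer (T2) at 87.5% of it, unless the server supplies other values. The sketch below assumes those defaults apply here; `lease_timers` is a hypothetical helper, not something taken from the firewall.

```python
def lease_timers(lease_seconds, t1_fraction=0.5, t2_fraction=0.875):
    """Return when a DHCP client would renew, rebind, and expire.

    The fractions are the RFC 2131 defaults. A server can override them,
    so treat these numbers as an assumption, not a measurement.
    """
    return {
        "t1_renew": int(lease_seconds * t1_fraction),
        "t2_rebind": int(lease_seconds * t2_fraction),
        "expiry": lease_seconds,
    }


timers = lease_timers(7200)
for name, seconds in timers.items():
    print(f"{name}: {seconds} s ({seconds / 60:.0f} min)")
```

With a 7200-second lease, the first renewal attempt lands at the one-hour mark, which is consistent with disconnects showing up roughly an hour apart.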
My firewall would send the DHCPRENEW packet to initiate the refresh. Comcast’s DHCP server would reply with a negative acknowledgement, which indicates that the server did not accept the renewal request. That caused the refresh to fail, and my firewall to dump the lease and try a DHCPREQUEST from scratch. When my firewall fell back to requesting a “new” lease, the process succeeded. The strange part is that the Comcast DHCP server would respond to the DHCPREQUEST with the same IP from the original lease.
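The sequence described here can be modeled as a toy simulation. This is only a sketch of the observed behavior, not an implementation of DHCP: `RejectingServer`, `Client`, and the address `203.0.113.10` (from a documentation-only range) are all hypothetical, and no packets are sent.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    ip: str
    seconds: int


class RejectingServer:
    """Hypothetical server: refuses every renewal, grants every fresh request."""

    def __init__(self, ip, lease_seconds):
        self.ip = ip
        self.lease_seconds = lease_seconds

    def renew(self, lease):
        # Models the server not accepting the renewal request.
        return None

    def request_new(self):
        # A fresh request succeeds and hands back the same address.
        return Lease(self.ip, self.lease_seconds)


class Client:
    """Hypothetical client that mirrors the firewall behavior described above."""

    def __init__(self, server):
        self.server = server
        self.lease = server.request_new()
        self.sessions_dropped = 0
        self.events = []

    def on_renewal_timer(self):
        renewed = self.server.renew(self.lease)
        if renewed is not None:
            self.lease = renewed
            self.events.append("renewed")
            return
        # Renewal refused: dump the lease, which kills active sessions,
        # then start the lease process from the beginning.
        self.lease = None
        self.sessions_dropped += 1
        self.events.append("lease dropped")
        self.lease = self.server.request_new()
        self.events.append("new lease")


server = RejectingServer("203.0.113.10", 7200)
client = Client(server)
original_ip = client.lease.ip
client.on_renewal_timer()
print(client.events)
print("same IP after outage:", client.lease.ip == original_ip)
```

The model shows why the outage is short but disruptive: the address never changes, yet every session is torn down each time the renewal is refused.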
Unfortunately, it seemed like my firewall was acting properly, and Comcast’s DHCP server wasn’t respecting the renewal request. There’s no way to turn off DHCP renewal requests on my side (for good reason), and I didn’t have any visibility into Comcast’s server side. When one side of a protocol doesn’t work in the manner specified by the RFC, all bets are off.
At this point, I tried calling Comcast tech support and explaining my issue. Despite multiple attempts on the phone, I couldn’t get any help beyond a tech trying to reboot my modem. I explained that the issue was occurring on a regular schedule and that I had tracked down the cause. I could tell them the date and time the issue started, and the exact time it would happen next, but none of that mattered: my internet was always “working” by the time I was talking to a tech, so there wasn’t anything they could do. I received a similar response from their Twitter support account.
Luckily for me, I asked around, and a friend of a friend happened to be a Comcast technician. When I talked with them, they explained that the 2-hour lease wasn’t typical, and was usually set before maintenance. Unfortunately, after looking into my account details, there wasn’t anything they could do. They did have one suggestion though: change the MAC address of my router to try getting a different public IP leased to me.
Now, many consumers won’t be able to do this, since their router won’t let them set the MAC address arbitrarily or they don’t have configurable WAN ports. I am using a Palo Alto in a virtual machine, so I swapped to a different vNIC (ethernet 3 instead of ethernet 1) and that did the trick. I received a new public IP address from the same DHCP server, also with a 2-hour lease. However, as the halfway point of the lease approached, the renewal occurred and went through without a hitch. My issue was solved.
The challenging part of this issue was realizing that even though I had done the technical work to figure out exactly what was happening, when it would happen, how it would happen, what servers it was happening on, etc., it was impossible to reach out and talk technically with Comcast. If it weren’t for a friend of a friend, I would still be in the dark. I remembered the XKCD comic about this issue, and realized it had happened to me.
How would a normal user troubleshoot this issue? They wouldn’t have access to the detailed firewall DHCP logs, their hardware wouldn’t have the capability to change its WAN MAC address, etc. I can imagine they would’ve been told to try getting a new router, which would’ve “fixed” the issue but also been a waste of money.
I was lucky to be able to troubleshoot and fix this issue, but it sucks that it can be so hard to have a technical conversation with a company providing a technical service. We should try to do better both at designing systems that give users the information they need to troubleshoot, and at providing channels for receiving technical feedback about the health of our systems.