Wednesday, February 13, 2008

FI Live in Production 2008

This FI cutover is definitely way overdue. The souped-up GGSN was due to be launched before the Christmas embargo last year to handle the sudden surge of mobile broadband traffic. Due to LIG and alarming issues, we couldn't launch it in time. We managed to solve all the pending issues, and it was due for cutover in mid-January. On the morning of the planned cutover date, the box just hung by itself. This caused a huge commotion with the customer. An emergency case was raised all the way to Finland. A patch was found, installed, and tested. The customer was still not convinced that the problem was fixed and that the FI was stable enough. We had no way of proving otherwise, so we had to write a customized script to check the logs every hour and raise an alarm if the fault happened again.
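For the curious, an hourly log check of that kind can be sketched roughly as below. This is not the script we actually ran; the log path, the fault pattern, and the print-based alarm are all made-up placeholders, since the real ones depend on the box and the customer's alarm system.

```python
import re
import time
from pathlib import Path

# Hypothetical values: the real log path and fault signature are not in this post.
LOG_PATH = Path("/var/log/fi/system.log")
FAULT_PATTERN = re.compile(r"process hang|watchdog", re.IGNORECASE)
CHECK_INTERVAL_SECONDS = 3600  # check once an hour


def find_faults(lines, pattern=FAULT_PATTERN):
    """Return the lines that match the fault signature."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def read_new_lines(path, offset):
    """Read lines appended since the last check; return (lines, new_offset)."""
    if path.stat().st_size < offset:
        offset = 0  # log was rotated or truncated, so start from the top
    with path.open("rb") as handle:
        handle.seek(offset)
        raw_lines = handle.readlines()
        new_offset = handle.tell()
    lines = [raw.decode("utf-8", errors="replace") for raw in raw_lines]
    return lines, new_offset


def raise_alarm(matches):
    """Placeholder alarm: a real setup would page someone or send a trap."""
    for match in matches:
        print(f"ALARM: fault signature seen in log: {match}")


def main():
    offset = 0
    while True:
        lines, offset = read_new_lines(LOG_PATH, offset)
        matches = find_faults(lines)
        if matches:
            raise_alarm(matches)
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

Remembering the file offset between checks means each hourly pass only looks at new log lines, so an old fault does not keep raising the same alarm.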

After a week of no alarms, we were go for another cutover attempt. The DNS entries were modified; the SGSN was reconfigured; production traffic started flowing through the FI. All seemed well until the customer noticed that there were actually 12 L2TP tunnels going to the LNS when we were expecting just one, as with the GGSNs. Not expected behaviour, so rollback.

Another big commotion. Another emergency case raised. More investigation done. Our PL experts came back and confirmed that's how it's supposed to be. With that, the FI was scheduled to go live yesterday. Good thing it didn't push through because I was at the Tsai Chin concert last night. If something had gone wrong with the cutover, I'm sure I would have been called in. What happened was that Operations wasn't convinced about the FI's HA feature. This guy started questioning the FI's ability to switch between the multiple L2TP tunnels in case one of the service blades goes down. This was not tested during the SAT due to time constraints, and now they're flagging it as critical before they sign off. Can we test the resiliency with a couple of CPEs? No, they want a full stress test with a simulator. That means I'll have to find me a Linux laptop and start brushing up on my SGSN simulator skills again.

So I went to the site today. My counterpart told me the testbed FI was not working. To test on the production FI, we had to let Operations know in advance in case we generated alarms. We set up the CPE to do a continuous ping and a YouTube download. I shut down the active blade while the session was on. Traffic stalled for a few seconds, then picked up again without a hitch. The session was switched from the active tunnel to the backup tunnel, and the LNS was none the wiser. Now for the stress test. For some strange reason, it just wasn't working. I was able to create 50 simultaneous PDP contexts, but they weren't pinging. After half an hour of playing around with the parameters, I realized that maybe the box was actually pinging but the output was not being sent to stdout. I tested my theory using tcpdump and was proven correct. I showed my counterpart the evidence, and he in turn called up his superiors to convince them that the thing actually works. Cutover was scheduled for 11:30pm that same night.
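I just eyeballed the tcpdump output on the day, but the same check can be sketched as a small script that tallies echo requests and replies from saved tcpdump text output. This assumes the capture was taken at a point where the ICMP is visible as plain IP in tcpdump's default one-line format; inside the GTP tunnel the packets may show up as UDP instead. The function name and sample addresses are made up for illustration.

```python
import re

# Matches tcpdump's default one-line text output for ICMP echo traffic, e.g.
# "12:00:01.000000 IP 10.0.0.5 > 10.0.0.1: ICMP echo request, id 7, seq 1, length 64"
ICMP_LINE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>[^:\s]+): ICMP echo (?P<kind>request|reply), "
    r"id (?P<id>\d+), seq (?P<seq>\d+)"
)


def summarize_pings(lines):
    """Return {requester: (requests_sent, requests_answered)} from tcpdump lines."""
    requests = set()  # (requester, target, id, seq) for each echo request seen
    answered = set()  # same key, recorded when the matching echo reply is seen
    for line in lines:
        match = ICMP_LINE.search(line)
        if not match:
            continue  # not an ICMP echo line, ignore it
        ident, seq = match.group("id"), match.group("seq")
        if match.group("kind") == "request":
            requests.add((match.group("src"), match.group("dst"), ident, seq))
        else:
            # A reply travels target -> requester, so swap the addresses.
            answered.add((match.group("dst"), match.group("src"), ident, seq))

    summary = {}
    for key in requests:
        sent, replied = summary.get(key[0], (0, 0))
        summary[key[0]] = (sent + 1, replied + (1 if key in answered else 0))
    return summary


if __name__ == "__main__":
    # Hypothetical capture lines, e.g. saved with: tcpdump -n icmp > capture.txt
    sample = [
        "12:00:01.000000 IP 10.0.0.5 > 10.0.0.1: ICMP echo request, id 7, seq 1, length 64",
        "12:00:01.020000 IP 10.0.0.1 > 10.0.0.5: ICMP echo reply, id 7, seq 1, length 64",
        "12:00:02.000000 IP 10.0.0.5 > 10.0.0.1: ICMP echo request, id 7, seq 2, length 64",
    ]
    for source, (sent, replied) in summarize_pings(sample).items():
        print(f"{source}: {sent} requests, {replied} answered")
```

If every simulated context shows requests with matching replies, the pings are going out and coming back even when the simulator prints nothing to stdout.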

I didn't hear anything back during the maintenance window - no alarms, no emergency page. I reckon no news is good news.
