Saturday, June 21, 2008

Melbourne - Day Two

Even before the activity started, I was already planning for tomorrow. I mean, how hard could it be to shut down some processes, power down the machine, get the Sun engineer to upgrade the CPU and memory, power up, and re-run the processes? As with most things, it's only a problem when it doesn't work.

One, the Sun engineer didn't want to go by the implementation plan. He said he had encountered enough systems that had problems recovering from a reboot, so he wanted to see a proper reboot first before proceeding. After shutting down the system, he started checking the contents of the delivered boxes, then referred to his online documentation to see whether the hardware was compatible and how to actually install the CPU and memory chips. Shouldn't he have done all of this before shutting down the server? All this time, the other server was already acting up because it couldn't see its partner.

Anyway, the hardware upgrade for the first node went well. The database and all the processes came up properly. After rebooting the second node, everything went pear-shaped. Couldn't restart the database. Checked the first node: ORA-03114, not connected to the database anymore. Restarted both nodes - same thing. Oracle simply said "ORA-27041: unable to open file", which told me nothing. After some investigation, it looked like the oscracdg disk group was disabled. I know nothing about Veritas Volume Manager, and the Sun engineer was just about ready to flee the crime scene. I did have the option of calling our CSI or even Optus' HP support for Sun (strange, I know), but I figured that would just make things worse. Thankfully, the Sun engineer was kind enough to dig around his laptop, and we found some Veritas cheat sheets. Even those didn't help, because we had no idea what the commands actually did. He got out his Telstra datacard and logged on to their solution database. After finding an exact match to our scenario, we were back in business. It took us a while to figure out which commands to run, and in what sequence, but after a few hours of working through all the volumes, all the disks were enabled again.
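
For my own future reference, the recovery boiled down to a short sequence of VxVM commands along these lines. I'm reconstructing this from memory after a very long night, so take the exact flags and ordering as a rough sketch rather than the actual runbook:

    # rescan the devices so VxVM can see the disks again
    vxdctl enable
    # check which disks and disk groups are unhappy
    vxdisk list
    vxdg list
    # reattach disks that dropped out of their disk group
    vxreattach
    # if the disk group itself shows as disabled, deport and re-import it
    # (flags and sequence from memory; oscracdg is our disk group)
    vxdg deport oscracdg
    vxdg import oscracdg
    # restart the volumes and let stale plexes resync in the background
    vxrecover -g oscracdg -sb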

But wait, there's more. The volumes were up, the DB came up, the processes came up, and I could see traffic going through, but something still felt wrong. The application threads kept hanging. Had to restart pcore and osc_core every few minutes to handle incoming traffic. Left the site at 5am. Got back to the hotel and continued troubleshooting till 9:30am. Called up CP for support; she asked me to call TKC. Got him involved around 10am. Had to stop work at 11am as I needed to catch my flight back to Sydney. Had a brunch of fried dumplings in the cab. Started talking to TKC at the airport about the problem and asked him to keep troubleshooting. GF went to the airport to pick me up. She had spent the morning packing lunch boxes for WYD 2008. Continued troubleshooting at home.

Turns out pcore was causing osc_core to hang whenever it was called. Leave pcore off and osc_core keeps working. Together with TKC, I granted pcore new privileges on some tables. Now pcore can run, AND delete entries from the db (pc_disconnect_ctx). Both Sunshine OSCs were pointed to osc1's pcore. Fixed that one, too. Stopped work around 7:30pm. CP said she's not available tomorrow. My suggestion to everyone: stop work and just monitor tomorrow. We work again come Monday.
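
The fix for pcore itself was really just a grant, something along these lines. The schema owner and the pcore account name are from memory, so treat them as guesses; only the pc_disconnect_ctx table is straight from my notes:

    sqlplus / as sysdba <<'EOF'
    -- let the pcore account clean up its own disconnect-context rows
    -- (schema name "osc" and account name "pcore" are guesses)
    GRANT SELECT, DELETE ON osc.pc_disconnect_ctx TO pcore;
    EOF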
