Server Downtime (Sept 29, 2006)

John T. Haller
Admin, Developer, Moderator, Translator
Joined: 2005-11-28 22:21
Server Downtime (Sept 29, 2006)

As many of you probably noticed, the PortableApps.com server was inaccessible today from around 3:30AM to 5:30PM New York time. This was due to an odd chain of events, but the issue has now been corrected and the server should be operating even faster than it was previously. For those interested in the technical details, read on...

Technical Details

As I'd mentioned previously, one of the hard drives on the server failed. This isn't normally a huge deal as the server uses a RAID 5 drive array (multiple hard drives acting as one so that if one fails, the server keeps on going and a new drive can be popped in). While the drive was being replaced this morning, the box needed to be rebooted as a RAID BIOS update was applied. While shutting down, MySQL apparently failed to properly close 3 tables used by Drupal: cache, accesslog and sessions. When the box was brought back up with the new drive in place, the RAID array rebuilt itself as expected. MySQL, on the other hand, attempted to recover from a crash and failed to recover the 3 tables that weren't closed properly. So, whenever someone attempted to access the site, they either got nothing or got a page (or half a page) after about 2 minutes.
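For anyone who wants to look for this kind of damage on their own box, here is a minimal sketch that just builds the SQL you could paste into phpMyAdmin or the mysql client. It assumes the tables are MyISAM, where CHECK TABLE and REPAIR TABLE apply; the storage engine isn't stated above, and the fix described in this post was recreating the tables, not repairing them. The function names are made up for illustration.

```python
# Tables named in the post as not having been closed properly.
UNCLOSED_TABLES = ["cache", "accesslog", "sessions"]


def check_statements(tables):
    """Build one CHECK TABLE statement per table, with backtick-quoted names."""
    return [f"CHECK TABLE `{name}`;" for name in tables]


def repair_statements(tables):
    """Build one REPAIR TABLE statement per table.

    Assumption: the tables are MyISAM. Back up first; a repair can lose rows.
    """
    return [f"REPAIR TABLE `{name}`;" for name in tables]


if __name__ == "__main__":
    # Print the statements so they can be reviewed before being run by hand.
    for statement in check_statements(UNCLOSED_TABLES):
        print(statement)
    for statement in repair_statements(UNCLOSED_TABLES):
        print(statement)
```

Since cache, accesslog and sessions hold data Drupal can regenerate, recreating them empty (as was done here) is also a reasonable way out if a repair fails.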

Misdiagnosis Due To Popularity

While attempting to diagnose the issue, RackSpace's techs thought the server might be under a DDoS attack. It turned out the apparent flood was simply because connections were hanging on the corrupted tables and because the server is so busy that, as soon as Apache came back up, it was flooded with connection attempts. Unfortunately, this misdiagnosis wound up costing a bit of time as we attempted to set up rules in iptables and things of that nature.

A Solution... Now Better, Stronger, Faster

While working through the issue with one of the techs in one of the many hour-long phone calls I was on today, we were able to get the load down enough so that the site came back. The RackSpace tech accomplished this by limiting connections and by enabling the query cache in MySQL (which had been overlooked previously). The site was back, but logins didn't work, and the behavior of the URLs when attempting them led me to believe there was some database corruption. Now that the site was accessible enough to log into phpMyAdmin, I checked out the database and, sure enough, found corrupt tables. I manually recreated each of them and we were back in business.

An interesting side effect of this is that the site is now (correctly) using the MySQL query cache (set at 200MB, to be precise), so you should notice more "pop" while viewing pages on the site.
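For the curious, a rough sketch of what a 200MB query cache setting might look like in a MySQL option file. Only the 200MB figure comes from this post; the option names query_cache_type and query_cache_size are the standard MySQL ones for versions that still have the query cache (it was removed in MySQL 8.0), and the exact lines used on this server are an assumption. The helper names are hypothetical.

```python
def query_cache_bytes(size_mb):
    """Convert a size in megabytes to bytes (1 MB = 1024 * 1024 bytes)."""
    return size_mb * 1024 * 1024


def render_query_cache_config(size_mb):
    """Render a [mysqld] option-file fragment that turns the query cache on.

    Assumption: a MySQL version that still supports the query cache.
    """
    lines = [
        "[mysqld]",
        "query_cache_type = 1",             # 1 = cache cacheable SELECT results
        f"query_cache_size = {size_mb}M",   # total memory set aside for the cache
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 200 is the size mentioned in the post.
    print(render_query_cache_config(200))
```

The cache mostly helps when the same SELECTs are repeated and the underlying tables change rarely, since a write to a table invalidates the cached results that depend on it.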

Well, that's the long and short of it... ok, the long of it. Sorry if this interfered with anyone else's day. I know it destroyed mine. (The last thing I needed was another day without working on paying work.) I think I need a nap.

Regards,
John

Ryan McCue
Joined: 2006-01-06 21:27
John,

We made a page to monitor your server status.
http://liberta-project.org/portableapps/servers
(Sign up, and I can transfer the ownership of the page to you)
So, if the server screws up again, check that page to make sure.
----
R McCue
PortaBlog Home and My Website
And before anyone complains about the grammar, I'm so jetlagged that my
hands aren't even in the same time zone...

"If you're not part of the solution, you're part of the precipitate."

John T. Haller
Admin, Developer, Moderator, Translator
Joined: 2005-11-28 22:21
Thanks, but

Thanks, but I was updating my homepage, johnhaller.com, with the information as I had it. Most portable apps users correctly assumed they might be able to check there. (All these apps were hosted there up until 10 months ago, which is why I still get a couple hundred thousand visits a month.) I had status updates as well as links to all the SourceForge projects so people could download.

I'm bringing a secondary server online in the semi-near future as a hot backup so I can have something available in the event of an outage longer than an hour.

Sometimes, the impossible can become possible, if you're awesome!

Ryan McCue
Joined: 2006-01-06 21:27
Well

I'll still keep it there in case.

strider_mt2k
Joined: 2006-02-15 12:35
Thanks for your work

Great job and a good read to boot.
Thanks for all you do to keep the site up and running!

Topic locked