Andy McKay

Oct 22, 2010

When App Engine went wrong


Misery

About a month ago Clearwind launched a new site on App Engine. It's a pretty straightforward site for a great client of ours. We'd been working on the site for many months, a little bit here and there, building out new features as the client requested them. Within a few hours of starting work we had a base site running on App Engine as our test.

So that site sat there, quite happily serving out those odd few requests as we added in new features. There were occasional errors from things like data migrations and testing of new features as we went, but nothing major.

And then in September, the site went live - and there was much rejoicing. And then it went down. I did an update (this is a push of the code to the server) and it came back up. And then it went down. And so on. It couldn't stay up for more than about 10 requests or so, sometimes more, sometimes less.

I started tweeting; I was pretty upset. The site that had been working fine just didn't seem to work any more. We were getting errors similar to issue 772 and issue 1409. As a result of reading those tickets, I did learn that your app needs to be able to cope with a DeadlineExceededError killing it at any point, even during imports, monkey patching and so on... another interesting App Engine issue to cope with.
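
A minimal sketch of what coping with that error inside request code might look like. The `guarded` decorator, the `(status, body)` return shape and the two example views are hypothetical, not part of the site or of App Engine; only the `google.appengine.runtime` import path is the real one, and a stand-in class is defined so the sketch runs elsewhere. Note that a wrapper like this only covers handler code. It does nothing for a deadline that hits while the module itself is still importing.

```python
import logging

try:
    # Real exception class when running on the App Engine Python runtime.
    from google.appengine.runtime import DeadlineExceededError
except ImportError:
    # Stand-in (assumption) so the sketch runs outside App Engine.
    class DeadlineExceededError(Exception):
        pass


def guarded(handler):
    """Wrap a handler so a deadline error becomes a quick 503 response.

    The handler signature (request -> (status, body)) is hypothetical.
    """
    def wrapper(request):
        try:
            return handler(request)
        except DeadlineExceededError:
            # Do as little as possible here: the request is already out of time.
            logging.warning("deadline exceeded while handling %r", request)
            return 503, "Temporarily unavailable, please retry."
    return wrapper


@guarded
def slow_view(request):
    # Simulates the runtime killing the request part-way through.
    raise DeadlineExceededError()


@guarded
def ok_view(request):
    return 200, "hello"
```

Calling `slow_view("/x")` returns the 503 pair instead of letting the exception escape, while `ok_view("/")` passes through unchanged.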

The result of those threads was not helpful (and 772 is still not resolved), but grasping at straws I thought I'd try ripping google-app-engine-django out. This is a helper that we also use in Arecibo that allows you to use Django a little more out of the box with App Engine (and gives you the all-important command line integration). So I ripped it out that night and pushed up. Site went down.

Around this time Google got in touch after seeing my tweets and offered to help. That was pretty nice of them and I was quite impressed, thanks. I explained the problem and, after chatting to the client, decided we'd call it quits for the night, give it about 24 hrs from when the problems started and see if Google could help.

Next day

No reply from Google by this point, and the site was still down. I tried more changes; nothing worked. What was clear was that App Engine was having real problems on the import. Here's the start of main.py (which is the main entry point in your app):

import logging 
import sys
import os
...

I was getting a DeadlineExceededError on line 3. That's importing os. If it couldn't even import that before the DeadlineExceededError hit, what could I do?

It also became clear it was affecting a whole bunch of people too.

Rackspace

That night, with the site still down and the client upset but understanding, we moved to Rackspace. That meant 2 nights and some of the weekend (this was a period when I was crazily busy trying to meet other clients' deadlines too) ripping everything App Engine-specific out of the site and making it plain old Django with a Postgres back end. A day of testing and migration (the App Engine migration tool broke at this time too) and the site was up and running.

So we've got a site that had a large amount of stuff written specifically to cope with App Engine's design, rapidly ported to standard Django. We could probably rip about 30% of the site out and still be functioning, and there are a few ForeignKey oddities in there as a result, but it's working and working well.

We've lost the (theoretical) speed, scalability and lack of maintenance of App Engine, but gained a working site. We also gained technical support that both the client and I have been able to mail and phone for quick, helpful answers. I've actually been quite impressed with Rackspace in the meantime.

Where now with App Engine?

It's not just this one client site we run on App Engine, it's also all the Arecibo instances out there, some of which were hit by similar problems. That really sucked, since instance X generated an error and sent it to instance Y, which was also down. Once I've got the features I want in Arecibo for the next release, the next feature will be providing a standard Django, non-App Engine instance.
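
The X-reports-to-Y failure above is the classic problem of an error reporter sharing a platform with the thing it monitors. As a rough sketch only, assuming a hypothetical `report_error` helper, a made-up JSON payload and an `opener` parameter added purely for testing (none of which is Arecibo's actual API), a client can at least fall back to the local log when the receiving instance is unreachable:

```python
import json
import logging
import urllib.error
import urllib.request


def report_error(payload, endpoint, timeout=2.0, opener=urllib.request.urlopen):
    """Try to POST an error report; fall back to the local log on failure.

    Returns True if the remote endpoint answered with a 2xx status.
    The payload format and endpoint are illustrative assumptions.
    """
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        # Keep the report locally so it is not lost with the remote instance.
        logging.error("error report not delivered (%s): %s", exc, payload)
        return False
```

The short timeout matters as much as the fallback: a reporter that blocks on a dead endpoint eats into the request deadline of the site doing the reporting.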

It took from Sep 23rd (most of the original errors occurred around this time) to Sep 28th for some start of a resolution, and it looks like things were under control by Sep 29th. None of this ever showed up on the App Engine status page.

Would I use App Engine again? Maybe, but I'd need a paid support line - someone I can get help and resolution from - perhaps App Engine for Business is that. All systems have their bad days, and even Rackspace occasionally has outages, but I've never seen anything last 3-4 days and not allow me to do anything. It made me look bad and cost me an awful lot of time.