Amazon EC2
I was very impressed when Amazon launched its EC2 cloud infrastructure. Eager to test it, I started up some servers and tried to install my koopjeszoeker application on them. Until then this Java application had been running on a private server (in Brussels). That was almost 2 years ago.
Everything went reasonably well, and I liked being able to install a new version on a separate server and then just use the Elastic IP address feature to switch production to this new server. The problem I had was running a database server that could also scale with the application. Luckily, Amazon seemed to read my mind every time I needed something: they released SimpleDB, a scalable database that was enough for my needs. Later on, they released Relational Database Service, but I haven’t needed this yet.
The whole setup for my site was maybe a bit overkill, but it was a nice test setup for learning more about working with this infrastructure. During the next 2 years, I added Cloudfront and used S3 as a backup solution. I also set up Amazon Elastic Load Balancing with autoscaling enabled for traffic peaks. I wanted a server solution that just worked so I wouldn’t need to spend too much time on system maintenance.
Switching to Appengine
I was able to lower my monthly bill for hosting the zamtam sites (koopjeszoeker.be, koopjeszoeker.com, fr.zamtam.be, zamtam.fr and recently beta sites zamtam.co.uk and zamtam.de) by switching from Amazon EC2 to Google Appengine. The monthly Amazon bill (a constantly running High-CPU Medium instance with S3 traffic, SimpleDB, Cloudfront and now and then a test instance) was around $ 180 a month. My Amazon server ran for almost 2 years. My Google Appengine bill is now around 40 cents a day, which makes around $ 12 a month. This is 15 times less!
I think the main benefit of Appengine versus EC2 in my case was that I don’t need a constantly running server, but I do need enough capacity to handle peak traffic (mainly in the evening and the weekends). In the EC2 case, this means you need to start more servers (manually or with elastic load balancing) while Appengine handles this automatically. You (roughly) only pay for the extra CPU time consumed.
For me the only reason not to use Appengine until a month ago was the lack of non-blocking IO support for Java. Luckily, this issue was resolved (silently; I only found out about it by reading the detailed release notes) and you can now use URLFetchService.fetchAsync()!
Lessons Learned
Some things I’d like to share about my experience with AppEngine:
1 GB is a lot of space. Don’t optimize for storage size when you have 200 GB a month for $ 1 a day. A typical application won’t need more than 10 GB which costs $ 1.5 a month. Similarly, one million tasks a day is a lot. Don’t prematurely optimize by putting a lot of work in one task when you can spread it over many small concurrent tasks. As Chris Anderson puts it in his book “Free” (I couldn’t find the exact quote since I listened to the audio book in the car and this isn’t searchable yet): “when something’s free, people tend to treat it like it’s indefinitely available”.
6.5 free CPU hours already allow for a lot of work. I handle around 10,000 visitors a day, a lot of URL fetches and many image transformations, and only now and then do I need more than this.
Startup time can be an issue, so I removed all unneeded jars from WEB-INF/lib and did some lazy loading. Startup time is mainly an issue during low-traffic periods, however, because Appengine stops and starts instances according to the traffic. A visitor who hits an app that is just starting needs to wait longer and sometimes gets an error page. Once your app is up and handles a steady amount of traffic, the server instances seem to stay up. You can monitor this in the logs by using a ServletContextListener and logging the event in the contextInitialized() and contextDestroyed() methods.
The task queues are really useful for doing work asynchronously, like cleaning up the datastore (removing all thumbnails older than 30 days) or executing long-running cron jobs. Requests called by the task queue carry some headers, one of which is useful for limiting a task to 3 retries. I check this header in the catch block and when it is equal to 3, I don’t throw an exception anymore so the task is removed from the queue.
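As a rough sketch of that retry cap (not the actual site code): the header name X-AppEngine-TaskRetryCount is an assumption based on the Appengine task queue documentation, and TaskRetryPolicy with its methods is a hypothetical helper. The idea is to rethrow while retries remain, and to return normally once the limit is reached so the queue drops the task.

```java
/** Hypothetical helper that caps task-queue retries; all names are illustrative. */
final class TaskRetryPolicy {
    // Header assumed to carry the number of retries so far; verify against the docs.
    static final String RETRY_HEADER = "X-AppEngine-TaskRetryCount";
    static final int MAX_RETRIES = 3;

    /** Parses the header value; a missing or malformed header counts as 0 retries. */
    static int parseRetryCount(String headerValue) {
        if (headerValue == null) {
            return 0;
        }
        try {
            return Integer.parseInt(headerValue.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /** True while the task has been retried fewer than MAX_RETRIES times. */
    static boolean shouldRetry(String headerValue) {
        return parseRetryCount(headerValue) < MAX_RETRIES;
    }

    /**
     * Runs the work. On failure it rethrows while retries remain (the queue
     * then retries the task) and returns false once the limit is reached.
     */
    static boolean runTask(Runnable work, String headerValue) {
        try {
            work.run();
            return true;
        } catch (RuntimeException e) {
            if (shouldRetry(headerValue)) {
                throw e;
            }
            return false; // give up quietly so the task leaves the queue
        }
    }
}
```

In a servlet, the header value would come from something like request.getHeader(TaskRetryPolicy.RETRY_HEADER).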
There are ways around the 30-second execution limit. My workaround is to do a small amount of work in a Servlet (Spring Controller) and then add the same URL with some other parameters (like a database cursor) to a task queue.
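The chaining pattern can be simulated in plain Java. This is only a sketch under stated assumptions: an in-memory queue stands in for the Appengine task queue, an integer offset stands in for a datastore cursor, and the /cleanup URL, the batch size and the class name ChunkedJob are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

/** Simulates splitting a long job into short requests that re-enqueue themselves. */
final class ChunkedJob {
    static final int BATCH_SIZE = 100; // small enough to finish well within one request

    /** Handles one batch starting at cursor; returns the next cursor, or -1 when done. */
    static int processBatch(List<String> items, int cursor, List<String> processed) {
        int end = Math.min(cursor + BATCH_SIZE, items.size());
        for (int i = cursor; i < end; i++) {
            processed.add(items.get(i)); // placeholder for the real work
        }
        return end < items.size() ? end : -1;
    }

    /** Drains the stand-in queue; returns how many short requests were needed. */
    static int runAll(List<String> items, List<String> processed) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("/cleanup?cursor=0"); // hypothetical URL of the worker servlet
        int requests = 0;
        while (!queue.isEmpty()) {
            String url = queue.poll();
            int cursor = Integer.parseInt(url.substring(url.indexOf('=') + 1));
            int next = processBatch(items, cursor, processed);
            requests++;
            if (next >= 0) {
                queue.add("/cleanup?cursor=" + next); // same URL, new cursor
            }
        }
        return requests;
    }
}
```

With 250 items and a batch size of 100, the job completes in three short requests instead of one long one.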
You don’t need a database for everything. I moved some tables that would never change to my Spring config XML which avoids datastore lookups.
Your application needs to be able to handle sudden shutdowns and startups without error. A user may arrive on a different server instance for every request. I decided not to use HttpSessions (I almost never use this).
The URLFetchService caches responses by default. You need to add your own no-cache request headers to get fresh results.
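A minimal sketch of such headers, using the standard java.net API rather than the URLFetchService classes (the assumption being that the same headers can be added to a URL Fetch request object). The class name FreshFetch and the exact header values are illustrative choices, not a documented requirement.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper that prepares a request asking caches for a fresh copy. */
final class FreshFetch {
    /** Opens (but does not yet send) a request carrying no-cache headers. */
    static HttpURLConnection openUncached(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setUseCaches(false); // skip any local cache
        // Ask intermediate caches to revalidate instead of serving a stored copy.
        conn.setRequestProperty("Cache-Control", "no-cache, max-age=0");
        conn.setRequestProperty("Pragma", "no-cache"); // for HTTP/1.0 caches
        return conn;
    }
}
```

Nothing is sent over the network until the caller reads from the connection, for example with getInputStream().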
Subscribe to the Appengine downtime notification feed; you can also check the system status. In keeping with Murphy’s law, during my first week on Appengine the whole thing that’s not supposed to die went down. Google did provide a detailed post mortem explaining everything. As long as they’re the ones who need to solve the infrastructure problems and not me, I’m happy with that. I’m modest enough to know I couldn’t possibly match their expertise.
It is possible to set up multiple custom domains, so you’re not stuck with myapp.appspot.com. I also use 4 hostnames for thumbs, like thumbs1.zamtam.com, thumbs2.zamtam.com, … and a hash on the filename to determine which hostname should serve the image.
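One way to pick the hostname from a filename hash, as a sketch: the class ThumbHosts and the use of String.hashCode() are assumptions, since the post does not say which hash is used. What matters is that the same filename always maps to the same hostname, so browser caches stay effective.

```java
/** Hypothetical helper that spreads thumbnails over thumbs1..thumbs4. */
final class ThumbHosts {
    static final int HOST_COUNT = 4;
    static final String DOMAIN = "zamtam.com";

    /** Maps a filename to one of the thumbnail hostnames, always the same one. */
    static String hostFor(String filename) {
        // floorMod keeps the bucket in 0..3 even when hashCode() is negative.
        int bucket = Math.floorMod(filename.hashCode(), HOST_COUNT);
        return "thumbs" + (bucket + 1) + "." + DOMAIN;
    }

    /** Builds the full thumbnail URL for a filename. */
    static String urlFor(String filename) {
        return "http://" + hostFor(filename) + "/" + filename;
    }
}
```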
I created a small Java class, AppengineUtils.java, with some useful methods; feel free to use it. I add the app version to the URL of my JavaScript file, so it has a different URL each time I deploy a new version and the cache headers for this URL can be set to a much longer time. I also check whether I am running in the development server, to show some buttons in the HTML that don’t show up in the production version.
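The two tricks could look roughly like this. This is not the content of AppengineUtils.java: the class AssetUrls is hypothetical, and the two system property names are what Appengine is assumed to set at runtime, so check them against the SDK documentation before relying on them.

```java
/** Hypothetical sketch of version-stamped asset URLs and a dev-server check. */
final class AssetUrls {
    // Property names assumed to be set by the Appengine runtime.
    static final String VERSION_PROPERTY = "com.google.appengine.application.version";
    static final String ENVIRONMENT_PROPERTY = "com.google.appengine.runtime.environment";

    /** Appends the version as a query parameter so each deploy gets a new URL. */
    static String versionedUrl(String path, String appVersion) {
        String separator = path.contains("?") ? "&" : "?";
        return path + separator + "v=" + appVersion;
    }

    /** Reads the deployed version, falling back to "dev" outside Appengine. */
    static String currentVersion() {
        return System.getProperty(VERSION_PROPERTY, "dev");
    }

    /** True when the environment property says this is the development server. */
    static boolean isDevelopmentServer() {
        return "Development".equals(System.getProperty(ENVIRONMENT_PROPERTY));
    }
}
```

A page template would then reference AssetUrls.versionedUrl("/js/site.js", AssetUrls.currentVersion()), where /js/site.js is a made-up path.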
Improvements
The dashboard resets every morning at 9 AM CET. There is no way to see the quota details for the previous days.
The time mentioned in the logs is confusing since it is not my local time. An option in the Appengine settings to set the local time would be handy.
The blobstore (still in beta) lacks some features, like an easy way to store data fetched with the URLFetchService in the blobstore. Luckily my URL fetches are smaller than 1 MB so I can store them in the datastore.
The Google Accounts integration is sometimes confusing. I use Appengine from my Google Apps domain (onthoo.com) but my site runs on different hostnames (koopjeszoeker.be, zamtam.fr, …). So I needed to add (verify) these domains to my Google Apps domain. This part succeeded. The problem is that I want to send e-mail through the Mail API, but this service only allows outgoing mail from accounts that are developers for the app. I can’t seem to add a developer who has an e-mail address like noreply at zamtam.com (an extra domain for my onthoo.com Google Apps domain) instead of noreply at onthoo.com. I get the developer confirmation e-mail, but the link goes through a series of redirects and ends in an error page. I think my whole Appengine setup is a bit messed up, since I currently have 9 apps deployed and it still shows I have 4 remaining (the maximum is 10 apps). It may have something to do with the fact that I have a Google Apps account and a Google Account with the same e-mail address. I have to be careful to log in through https://appengine.google.com/a/<YOURDOMAIN.COM>/ instead of https://appengine.google.com .
The URLFetchService is limited to 10 asynchronous fetches at a time, while I need 12 at the moment. An increase would be nice, although I know my case is probably an exceptional one.
The limit of 30 active dynamic requests is sometimes an issue for me, since I use the Image API to generate thumbs on the fly, which takes a bit longer (fetch the image URL, resize it, store it in the datastore and return it). Since I’m using different hostnames for the thumbs (like thumbs1.zamtam.com, thumbs2.zamtam.com, …) I get up to 10 requests at a time for a page. You see the problem when I have 3 users requesting a page at the same time… I cache the thumbs so they’re only generated once, but this doesn’t handle all the cases. This is something I need to investigate further and maybe I should ask for an increase?
‘Naked domains’ are not supported anymore, so using zamtam.co.uk for example isn’t possible. This makes the DNS setup a bit more complex.
Conclusion
A lot of exciting things can be done with Appengine. Especially when you run a website rather than long-running batch operations, Appengine can turn out to be a lot cheaper than Amazon EC2. While EC2 allows you to do much more, and in the way you prefer, Appengine pushes you a bit to do things its way, which makes it easier for you. With Appengine, you also don’t need to think about scaling MySQL, load-balancing Apache or updating Linux.
One benefit can’t be stressed enough: you don’t need to plan your server capacity beforehand since Appengine does this automatically. Also, deploying a new version is easy: upload it, test it and when ready, switch the default version to the new version. No downtime, no worries (you can always go back to the previous version if something shows up later with the new version).