I’ve got a client running survey software, Sawtooth WinCati, that was originally written in the ’80s for phone interviewing. The latest version has an ASP web app attached to the original software. Now, it’s hard for a small business to scale a call center because of the costs of labor, space, hardware, phone lines, etc. But a web survey should be able to handle a large load w/ cheap hardware and the relatively cheap internet access available these days.
I pitched running the software on EC2, but the client wanted a dedicated machine and didn’t trust a virtual instance running in a huge Amazon data center. (Also, see the problems w/ transferring from EC2 small to medium instances I blogged about earlier. We’d need to pay licensing for another server installation to have it set up for both small and medium/large instances. Amazon, you’re killing me with this decision!)
Anyway, the survey company’s tech support couldn’t tell me any stats about scaling for the web. They guessed it could handle 50 concurrent users before having serious problems. I called back, got one of their programmers, and we went over the architecture and the bottlenecks. It seems this software uses a FoxPro database. I wasn’t in the software world when FoxPro was popular, but there’s a reason it’s gone. Each ASP session making a write request to the DB needs a full file lock to make the write, i.e. it uses database-level locking instead of table or row locking. Ugh! This is the bottleneck everyone sees. Bandwidth and CPU are pretty negligible, though I have seen this software chew through memory due to ASP session leaks from poor programming and testing. They said that one day they will move to a client/server DB, but obviously that doesn’t help now.
So, I asked the obvious question: if we loaded the survey into two instances on the same box, would that double throughput? They hadn’t considered this approach before for some reason. However, they did have a function to merge datasets from two identical questionnaires. Perfect!
Essentially, I’m making multiple masters and synchronizing them manually once a day before the data is converted into a report. Due to the nature of this app (non-conflicting writes, and reads from tables that don’t change during a survey), this is really easy to scale. In fact, if this were using MySQL, it would be very fast with a single DB, since no queries should be attempting to write to the same row at the same time. This assumes the DB design done by Sawtooth is decent.
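To show why non-conflicting writes make the daily merge easy, here’s a rough Python sketch. This is not Sawtooth’s merge function (I’m using theirs for the real thing), just the idea behind it. The `respondent_id` key and the list-of-dicts record format are my own assumptions for illustration.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def merge_datasets(datasets: Iterable[List[Record]],
                   key: str = "respondent_id") -> List[Record]:
    """Combine records from several survey instances into one dataset.

    Assumes each respondent only ever writes to one instance, so the same
    key should never show up twice. If it does, fail loudly instead of
    silently picking a winner. ("respondent_id" is a hypothetical field.)
    """
    merged: Dict[str, Record] = {}
    for instance_number, records in enumerate(datasets, start=1):
        for record in records:
            record_key = record[key]
            if record_key in merged:
                raise ValueError(
                    f"respondent {record_key!r} appears more than once "
                    f"(seen again in instance {instance_number})"
                )
            merged[record_key] = record
    # Sort by key so the merged output is stable from day to day.
    return [merged[k] for k in sorted(merged)]
```

Since every respondent lands on exactly one instance, the merge is just a union, and a duplicate key means something went wrong upstream (like a link pointing at the wrong instance).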
Anyway, I’ve got JMeter loaded on EC2 and hitting a sample survey at the highest rate expected for a very successful project. Putting each duplicate survey instance into its own IIS6 application pool with its own database file appears to scale well so far. The separate DB file is most important, but I experimented w/ separate app pools and got improved performance. Plus, this should help mitigate the impact of the session leaks. I still have to test it out on a full-size survey on the production server I started setting up on Friday, but I feel confident we can scale now.
We send out emails w/ links to the survey, so I’m just splitting up those emails to point to N survey instances, and using JMeter to figure out how many instances I’ll need to handle the max load.
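Here’s a minimal Python sketch of the splitting step. The function names and the URLs are made up for illustration, and the per-instance capacity is whatever number the JMeter runs produce (I haven’t settled on one yet).

```python
import math
from typing import Dict, List


def instances_needed(peak_users: int, users_per_instance: int) -> int:
    """Number of survey instances needed to cover the expected peak load.

    users_per_instance should come from load testing, not guesswork.
    """
    if users_per_instance <= 0:
        raise ValueError("users_per_instance must be positive")
    return max(1, math.ceil(peak_users / users_per_instance))


def assign_urls(emails: List[str], instance_urls: List[str]) -> Dict[str, str]:
    """Deal email addresses out to the instance URLs like cards,
    so every instance gets an (almost) equal share of invitations."""
    if not instance_urls:
        raise ValueError("at least one instance URL is required")
    return {
        email: instance_urls[i % len(instance_urls)]
        for i, email in enumerate(emails)
    }


# Hypothetical usage: the addresses and URLs below are placeholders.
urls = [
    "http://surveys.example.com/survey1/",
    "http://surveys.example.com/survey2/",
]
links = assign_urls(["a@example.com", "b@example.com", "c@example.com"], urls)
```

Dealing the addresses out in order keeps the shares within one invitation of each other, which is good enough when every instance runs the same survey.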
Now, as I write this, I’m thinking that a load balancer would be nice. One that checks the open sessions per app pool and keeps everyone balanced. That way we only distribute a single URL to the 32K survey takers. Less management of URLs, etc. Hmmmm. Well, there is one requirement on this project that actually precludes a simple load balancer like this. However, the requirement is kind of unique, so future projects could probably benefit from LB. I haven’t looked into LB options, but may in the future. I’m thinking I could set all this up on EC2 and offer a scalability service to any company requiring this software for their web surveys. (Apparently, it’s entrenched in a lot of companies around the world.)
More manual management of data and creation of multiple URLs, so more room for error.
The merging of the data will be a pain sometimes.
Successful execution of the project for the client.
I checked out the Win2k3 built-in Network Load Balancing (NLB) service. It appears to only work at the IP level to distribute load among different machines. I want to distribute load on one machine across multiple application pools.
I suppose I could write a simple ASP landing page to round-robin among all the app pools. Not sure how much effort it takes to collect info from each app pool about active sessions, request queue length, etc. (Basically all the info a load balancer is supposed to use for me automatically.) However, these are all identically sized surveys, so in the spirit of KISS, round robin would probably do the trick well enough as a simple load balancer.
Wrote the simple load balancer to sit in front of the instances. It handles a JMeter deluge well.
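The real thing is an ASP landing page, but the round-robin idea is easier to show in a few lines of Python. This is a sketch of the technique only, not the page I deployed; the class name and the URLs are made up.

```python
import itertools
import threading
from typing import Iterable


class RoundRobinRedirector:
    """Hands out survey instance URLs in strict rotation.

    No session counts or queue lengths are consulted. That is fine as long
    as every instance runs an identically sized survey.
    """

    def __init__(self, instance_urls: Iterable[str]) -> None:
        urls = list(instance_urls)
        if not urls:
            raise ValueError("at least one instance URL is required")
        self._cycle = itertools.cycle(urls)
        # Requests arrive concurrently, so guard the shared iterator.
        self._lock = threading.Lock()

    def next_url(self) -> str:
        """Return the URL the next visitor should be redirected to."""
        with self._lock:
            return next(self._cycle)


# Hypothetical usage: each call would back one HTTP redirect.
balancer = RoundRobinRedirector([
    "http://surveys.example.com/survey1/",
    "http://surveys.example.com/survey2/",
    "http://surveys.example.com/survey3/",
])
first_target = balancer.next_url()
```

The landing page just redirects each new visitor to whatever `next_url()` returns, so only one URL ever goes out in the emails.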