Anyone who was a victim of the DDoS attack on the registration system last year remembers all too well how the convention staff were caught with their pants down. (For anyone who doesn't recall, this thread details the exact event and our responses to it.)
We had failures on both the technical side of our system (network and computers) and the human side (putting the backup plan into place fast enough, congoer and staff attitudes). In the past several months, Bloo09 has talked a lot on these forums about the changes and fixes that are being put into place on the human side of our system. She is an ADH of Registration and a friend of mine, and I know she takes your complaints seriously and how committed she and Tevva (the Reg DH) are to making absolutely sure that side of our system doesn't fail again.
What there hasn't been is a lot of talk about the changes and fixes on the technical side of our system. I'd like to change that. While professionalism prevents me from discussing everything in perfect detail, I want you guys to know that we're listening to your concerns.
First: The size of the IT department has been increased.
Second: We are planning out our network for next year in great detail, months ahead of time. I want to emphasize that the plan is a work in progress: it has not been finalized and may change. But as it currently stands, some features of the plan that you may appreciate include:
- Our application and database servers will be on-site. So if the entire internet blows up, we can still continue running (except for credit-card transactions). This was an obvious choice, and should have been obvious last year, but it's one I know many of you were hoping to hear.
- There will be hot spares for our application and database servers on-site and running, so that if the at-con server fails, no data is lost and our downtime won't be more than about five or ten seconds.
- There will be cold spares for network hardware and cables, so that if one of them fails, our downtime is only as long as it takes to connect and power up the new one. (In a worst-case failure -- one of our core switches -- this should be about two minutes. That's a very rough ballpark though.)
- There is an alternate method for credit-card transactions, so that if the application server still fails, the finance staff can continue to securely and quickly process credit cards (provided the internet still exists).
- Our database will be backed up at not one, but two off-site locations on a periodic basis.
- All application traffic, both inside and outside the network, will be encrypted.
- Dedicated servers will continuously monitor the status and health of the network, and will allow the IT department to see problems as they arise, instead of hours later.
- The physical security of all the hardware is being taken into account -- no more being able to sneak behind the curtain and get to the core switches!
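For the technically curious, here is a minimal sketch of how a monitoring server could watch the primary application server and switch to the hot spare after a few failed health checks. It is an illustration only, not the actual setup: the `Server` and `FailoverMonitor` names, the TCP-connect health check, and the three-failure threshold are all assumptions made for this example.

```python
import socket
from dataclasses import dataclass


@dataclass
class Server:
    name: str   # hypothetical label, e.g. "app-primary"
    host: str   # hypothetical address on the at-con network
    port: int   # hypothetical application port


def is_healthy(server, timeout=1.0):
    """Return True if a TCP connection to the server succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((server.host, server.port), timeout=timeout):
            return True
    except OSError:
        return False


class FailoverMonitor:
    """Tracks consecutive failed checks and fails over from primary to spare."""

    def __init__(self, primary, spare, max_failures=3, check=is_healthy):
        self.primary = primary
        self.spare = spare
        self.max_failures = max_failures  # assumed threshold, not from the plan
        self.check = check                # injectable so drills can fake outages
        self.active = primary
        self.failures = 0

    def poll(self):
        """Run one health check and return the server that should take traffic."""
        if self.check(self.active):
            self.failures = 0
        else:
            self.failures += 1
            # Only fail over once, from the primary to the spare.
            if self.failures >= self.max_failures and self.active is self.primary:
                self.active = self.spare
                self.failures = 0
        return self.active
```

If `poll()` were called about once a second, three consecutive failures would trigger the switch within a few seconds, which is in the same range as the five-to-ten-second downtime mentioned above. The `check` parameter can be replaced with a fake during mock tests so that a failure can be simulated without unplugging anything.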
Third: We are writing a testing plan to ensure that our network (and the staff) are reliable. This is in coordination with the Reg department, but the plan is to set up the entire network several months ahead of time (early next year) and run full mock tests with Registration staff using the network. We plan to test as many modes of failure as we reasonably can, see how the network performs, train the staff on use of the application, and make changes to the plan if necessary, months before con.
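As a rough idea of how mock-test results could be tracked, the sketch below compares the recovery time measured in each drill against a target and reports the drills that missed. The `Drill` record and `failed_drills` helper are hypothetical names invented for this example; the targets in the usage note come from the ballpark figures above (about ten seconds for a server failure, about two minutes for a core switch).

```python
from dataclasses import dataclass


@dataclass
class Drill:
    failure_mode: str        # what was deliberately broken in the mock test
    target_seconds: float    # recovery time we are aiming for
    measured_seconds: float  # recovery time actually observed in the drill


def failed_drills(drills):
    """Return the failure modes whose measured recovery exceeded the target."""
    return [d.failure_mode for d in drills if d.measured_seconds > d.target_seconds]
```

For example, `failed_drills([Drill("app server", 10, 7), Drill("core switch", 120, 150)])` would flag only the core switch drill. The measured values in that call are made up for illustration; real numbers would come from the mock tests themselves.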
Fourth: In the event that, despite all of the above, our system still boffs the pooch, the Registration department is preparing a network-less, computer-less backup plan that can be put into effect quickly. (This isn't an IT change, but I wanted to make it clear.)
There is still a lot of planning to be done and there may be other changes made. But I wanted to reassure those of you who had complaints last year that we are listening to you and working hard to prevent those failures from ever happening again. We are very grateful for your support and your patience, and we recognize that that respect has to work both ways. As we get closer to con, I plan to write a few more updates to keep you informed.
This post has been edited by Riker: 06 December 2008 - 10:55 AM
Reason for edit: More IT staff.