The Best Laid Plans of Mice and Men

Transmitted by Max Torps | YC-112-06-26

Let's start at the beginning. First, a move to new hardware was announced for Eve Online: TQ Level Up.

Fantastic. It's all part of the continued growth of Eve Online and something to be applauded.

The estimated downtime was somewhere in the region of 6 hours, details of which are posted here:
http://www.eveonline.com/ingameboard.asp?a=topic&threadID=1337468&page=1...

Then, disaster. Anyone who works in IT will tell you that disaster can and will occur when you least expect it. There was a protracted delay in getting services back online, and throughout it the player base was provided with timely information on progress and ETAs as and when it became available.

If you have dealt with a major IT incident, as I have, you will know that when something is broken and someone quite reasonably asks when it will be fixed, you often, frustratingly, cannot answer, because you first need to know:

1. What is actually broken
2. What is then required to fix it

If you are at stage 1, the diagnostics stage, you cannot give an answer. Some people I have dealt with in my environment, who are otherwise reasonable and intelligent, seem to turn into petulant babies when an answer cannot be given. Others understand and accept; the split is probably about 50/50. That percentage is a guesstimate and highly reliant on moods!

Some posters on the Eve Online forums also comment from the perspective of an IT professional, claiming that disasters like this would never occur in their workplace. Well, if that is true, I congratulate you. You obviously have all the funding and executive backing, with robust policies and procedures in place, to have multiple fail-overs, built-in redundant systems and the total eradication of single points of failure.

Back on topic: we are talking about a game company moving a game server cluster. While moving a server cluster is no mean feat, we must also put this into perspective and remember that we are talking about a game here. Entertainment. Not a health, government, military or financial organisation. A game.

This is not to say a brief outage is not important to the customer. Far from it: it is very important to the customer. I would also imagine it is of critical importance to CCP too! CCP have recognised how important it is to us by offering a form of compensation for lost time. But one has to wonder about the level of noise on the forums, and whether the customer has the right sense of perspective.

That is not to say CCP could not learn from this. From the only information available to us, we can see that the move has uncovered some bugs still affecting account management, which could possibly be attributed to undocumented elements of code. I would imagine that CCP have some form of Change Management in place, ITIL-based or not, that could theoretically have caught this. If not, it's a good idea to refresh documentation and think about implementing such a process for next time.

Having said that, I have the utmost confidence that CCP will learn from this. They always seem to improve over time and don't particularly strike me as a company that rests on its laurels; judging by this post, they won't.

Either way, when you think about the actual scope of the change undertaken and the relatively pain-free result, I can only congratulate CCP on a great job regardless. Hopefully you can too.