
Today I had an interesting experience. At about noon the two main webservers of a major client went down under massive load. We immediately assembled a team to address the issue.

Our system is based on web scripts backed by a database. As the application designers, we immediately attacked those two variables. The database seemed to be operating normally, and no queries were fouling things up there, so we ruled it out.

Second, we attacked the code. It had to be the code because it wasn’t the database. Also, we had moved a new version of the code to the production servers barely 30 minutes before the problem occurred. We spent almost 6 hours troubleshooting this, convinced we had a coding problem. It turned out it wasn’t a code problem at all: an affiliate had increased their load on the server by over 1000%. We were being DoS’d, in effect, and because of the nature of the requests, our normal rate limiting system hadn’t kicked in.
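For context on how that can happen, here is a minimal sketch of the kind of naive per-client rate limiter that misses this pattern. This is purely illustrative, not our actual system; the names and thresholds are assumptions.

```python
import time
from collections import defaultdict

# Hypothetical fixed-window rate limiter, keyed per client.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 600  # per client key, per window

_counters = defaultdict(lambda: [0.0, 0])  # key -> [window_start, count]

def allow_request(client_key: str) -> bool:
    """Return True if this request is under the per-key limit."""
    now = time.time()
    window_start, count = _counters[client_key]
    if now - window_start >= WINDOW_SECONDS:
        # New window: reset the counter for this key.
        _counters[client_key] = [now, 1]
        return True
    if count < MAX_REQUESTS_PER_WINDOW:
        _counters[client_key][1] = count + 1
        return True
    return False

# The blind spot: a limiter like this only sees request *counts* per key.
# Traffic spread across many keys, or requests that stay under the per-key
# threshold but are individually expensive, can multiply total server load
# many times over without ever being throttled.
```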

Early in the process of working the problem, I saw this affiliate hitting the server very frequently, but I ignored the evidence and went on trying to find a problem in the code. Why? Several reasons:

  • We had recently updated the code. Correlation is causation, right?
  • Some of my code was questioned from the start. My ego was defending my skills, so I was invested in working the ‘code’ angle of the problem, hoping to find a problem in code other than mine to redeem myself.
  • Everyone else thought it was a code problem. Group influence. No one else had any alternative to the ‘busted code’ explanation. This also played into my ego problem, so the two reinforced each other.
  • As a group that writes code, we carry a stigma: our problems are usually with the code or the database. Those are the big variables in the system, and we had eliminated the database as a possibility, so it had to be the code. No one could think of other variables.

So, analyzing this situation, what could I have done better? Probably a lot of things. Primarily, I fell into the trap of first impressions. As someone experienced with the technology, I often have pretty accurate first impressions of what is wrong with the system. In this case, though, the as-always brief set of data I had to go on was misleading, and I bought into what it suggested.

This shaped my perspective in a way that blocked out competing information. I even saw the evidence of an affiliate hitting the system really hard, much harder than any legitimate use should require.

I even showed this to another engineer who was involved with the affiliate system. With little thought or evidence, we both agreed it was unlikely to be the problem; we were already convinced the problem lay elsewhere, even though we hadn’t found it.

Lesson learned: Beware of snap judgements. They are ALWAYS formed on insufficient data. Experts in an area seem more likely to make this kind of mistake than beginners, simply because beginners don’t know enough to make snap judgements yet. They question themselves more by default. Maybe that is a good thing.

When approaching a problem like this, keep a young mind. Look at it as a newb would.