Thursday, July 5, 2012

Developers Do the Darndest Things - Episode 1


We had an issue today where some code went crazy and started slamming one of our engines with new connection requests, which led to all kinds of fun problems with dynamically allocated shared memory segments, out-of-memory errors and eventually an engine reboot.

I was taking a look at the code for the application to try and figure out what went wrong and I saw something like the following. It wasn't the cause of the problem, but I don't think it helped.

   # Try up to 5 times, back to back, with no pause in between
   for i in range(5):
      ret = connect_to_database()

      if ret == SUCCESS:
         break

   if ret != SUCCESS:
      print("unable to connect to database")

Well, it was something like that. Basically try to connect to the database and retry a few times before giving up and throwing some kind of error.

So, what's the big deal? It looks reasonable enough. In my opinion there really isn't much use in retrying the connection attempt in a loop like this unless you're going to go to sleep in between attempts.

Look at what happens: you've just been told your attempt to connect failed. What are the chances the problem is resolved a nanosecond later when you loop back and try to connect again? Almost zero. All you've done is throw an additional 4 connection requests at the engine while it is having some kind of problem, possibly making the problem worse (especially if you have multiple apps/clients doing the same thing).

I realize you can't sleep forever, but sleeping here for even 1 second would have spread the 5 connection attempts over roughly 5 seconds instead of 5 nanoseconds. That would have had a much better chance of doing what the loop was intended to do: recover gracefully from an intermittent DB problem rather than cause more problems.
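
For what it's worth, here is a rough sketch of the same retry loop with a pause between attempts. It is only an illustration, not the application's actual code: `connect_with_retry`, its parameters, and the idea that the connect call is passed in as a function returning True on success are all my own assumptions.

```python
import time


def connect_with_retry(connect, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call connect() up to `attempts` times, pausing between failures.

    `connect` is a hypothetical callable that returns True on success.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    for attempt in range(attempts):
        if connect():
            return True
        # Pause only if another attempt is coming; no point sleeping
        # after the final failure.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

With the pseudocode above, the call would look something like `connect_with_retry(lambda: connect_to_database() == SUCCESS)`, followed by the same "unable to connect" message if it returns False. A fixed 1 second delay is just the simplest choice; the point is only that the attempts get spread out over time.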

