Wednesday 16 April 2008

Never ever synchronize threads without specifying a timeout value

Whenever there is more then one thread and more then one shared resource there must be some synchronization in place to make sure that the overall state of the application is consistent. Synchronization is not easy as it very often involves locking which very easily might lead to all sorts of deadlocks and performance bottlenecks. One of the ways of keeping out of trouble is to follow a set of guidelines. I can list at least a few sources of information worth getting familiar with:
And of course :) my two cents or rather lessons I've learnt writing and/or debugging multithreaded code:
  1. Minimize locking  -  Basically lock as little as possible and never execute code that is not related to a given shared resource in its critical section. The most problems I've seen were related to the fact that code in a critical section did more then it was absolutely needed.
  2. Always use timeout - Surprisingly all synchronization primitives tend to encourage developers to use overloads that never time out. One of the drawbacks of this approach is the fact that if there is a problem with a piece of code then an application hangs and nobody has an idea why. The only way to figure that out is to create a dump of a process (if you are lucky enough and the process is still hanging around) and debug it using  Debugging Tools for Windows. I can tell you that this is not the best way of tackling production issues when every minute matters. But if you use only API that lets you specify a timeout then whenever a thread fails to acquire a critical section within a given period of time it can throw an exception and it's immediately obvious what went wrong.

    Default
    Preferred
    Monitor.Enter(obj)
    Monitor.TryEnter(obj, timeout)
    WaitHandle.WaitOne()
    WaitHandle.WaitOne(timeout, context)

    The same logic applies to all classes that derive from WaitHandle: Semaphore, Mutex, AutoResetEvent, ManualResetEvent.
  3. Never call external code when in a critical section - Calling a piece of code that was passed to a critical section handler from outside is a big risk because there is always a good chance that at the time the code was designed nobody even thought that it might be run in a critical section. Such code might try to execute a long running task or to acquire another critical section. If you do something like that you simply ask for trouble :)
I suppose it's easy to figure out which one has bitten me the most :).

10 comments:

  1. What would timeout value would you put on a process that can be locked for an undetermined time ? I have been programming multi-process and multithreaded application for 15 years and most of the time it was impossible to determine what was that value.

    Personnaly, what I normalyI do is that I create an event that is used to unlock the wating thread. If the user feels that the processs is hung, he just have to press a button that trigger that event. At the same time an exception is raised and the error recovery system takes over. This system can used to shutdown the application.

    ReplyDelete
  2. I understand your reasoning, but this isn't good advice. First, as just pointed out, you can't always choose a reasonable timeout. More importantly, however, in situations like this you're left with a program in an indeterminate state. The only safe course of action is to terminate the application, which an exception won't always do. Anything else is going to lead to further deadlocks, race conditions or catastrophic errors due to the state of the application.

    There are better ways to detect deadlock situations (not as simple, though), and in those situations what you need to do is log and abort, not throw an exception.

    ReplyDelete
  3. Martinos,

    You are saying that an application can be locked for "an undetermined time". Well in this case I give up but to be honest I haven't seen a project that would specify that an activity, a process or a call can wait for a resource forever.

    I agree that it's not easy to find the right value but it doesn't prove that it's impossible. From my experience I can say that if I can't find a good enough value then it means that my design is not well thought out. I would even say that every project is based on some assumptions regarding performance and it doesn't matter if they are specified explicitly or implicitly. I just can not imagine any business process that would accept that an operation might take 50 ms, 50 days or 50 years.

    The way you get out of a deadlock might work with application that provide UI. I work mainly on back-end systems and there is nobody there to do anything.

    Don't get me wrong, I'm not saying that figuring out the correct value of timeout in every case is simple but I would rather specify a value that is slightly incorrect and get a meaningful notification when it fails than not specify it at all and get a call at night that a service stopped responding. I'm just trying to stick as much as possible to a rule that I borrowed from Rico Mariani: "If you aren't measuring you aren't engineering".

    ReplyDelete
  4. wekempf,

    If the error handling is correct an application won't be left in an inconsistent state. Let's assume that there are 100 concurrent requests going on and one of them fails. In a incorrectly designed application the other 99 requests are not affected. I can not imagine that a multi-thread application goes down because of a single thread that misbehaved. And the best way of detecting deadlocks it to avoid them :)

    ReplyDelete
  5. You've traded a deadlock for a race condition. Two threads want to use a lock, and only one gets to. That's a race. And the vast majority of the time, that means you're in an inconsistent state.

    ReplyDelete
  6. From my perspective in this particular case deadlock and race condition don't differ much as in both cases a resource has not been acquired within a given period of time. That's it. If a resource can't be acquired then a thread can throw an exception and an application should roll back to the last know consistent state. If this is a web application then you might want to log out the user, if it is a persistent workflow you might roll it back to the previous stage, if it is a remote call you just return a fault and it's up to the caller to figure out what to do. In all cases a single local problem does not bring down the whole application. The local state might get corrupted but the global state should stay valid. I'm not saying that it is always possible but I think that there are more shades of grey.

    ReplyDelete
  7. I really dont think that all wating thread should have a timeout. Every day, you use applications that don't have any timeout. Any Windows application is waiting forever for a user input.. If you leave your computer alone for 51 days your application main thread will not timeout.


    ReplyDelete
  8. A timeout value makes sense only in the context of a critical section/shared resource. If an app waits for user input(or any other external trigger) then more than likely it doesn't do it inside of a critical section.

    ReplyDelete
  9. In fact that was my point,

    You said in your "Always use timeout" section:

    "The same logic applies to all classes that derive from WaitHandle: Semaphore, Mutex, AutoResetEvent, ManualResetEvent".

    It's not on all sychronisation primitive that need a timeout.

    In fact, a part of code protected by a critical section that waits directly or indirectly on a even (eg. user input), you should not use a timeout.

    Regards

    ReplyDelete
  10. If you don't wait in a critical section and the contract specifies wait value as infinite then I agree.

    ReplyDelete