Saturday 4 August 2012

Race Condition

A race condition occurs when the outcome of a system depends on the relative timing of several processes. If one process finishes ahead of the others, an undesirable result can occur. Some race conditions only show up on certain hardware combinations.
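To make the definition concrete, here is a minimal single-threaded sketch in C. It is not taken from the converter firmware: run_schedule() and the schedule letters are hypothetical names invented for this illustration. Two simulated processes, A and B, each increment a shared counter in two steps (load, then store), and the final value depends only on the order in which those steps are interleaved.

```c
#include <assert.h>

/*
 * Replays one interleaving of two simulated processes.
 * 'A' = process A loads the counter, 'a' = process A stores counter + 1,
 * 'B' and 'b' are the same two steps for process B.
 * All names here are hypothetical and exist only for this sketch.
 */
int run_schedule(const char *schedule)
{
    int shared = 0;   /* the shared counter */
    int tmp_a = 0;    /* process A's private copy */
    int tmp_b = 0;    /* process B's private copy */

    for (const char *s = schedule; *s != '\0'; s++) {
        switch (*s) {
        case 'A': tmp_a = shared;      break;
        case 'a': shared = tmp_a + 1;  break;
        case 'B': tmp_b = shared;      break;
        case 'b': shared = tmp_b + 1;  break;
        default:  assert(0 && "unknown step in schedule");
        }
    }
    return shared;
}
```

With the schedule "AaBb" (A finishes before B starts) the counter ends at 2. With "ABab" (both load before either stores) one increment is lost and the counter ends at 1.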

Could a race condition happen in a 100-meter sprint?

This is a case study of a GPIB EOI race condition on a LAN/GPIB converter. EOI stands for "end or identify". The EOI line has two purposes: the GPIB Talker uses it to mark the end of a message string, and the Controller uses it to tell devices to identify their response in a parallel poll. In our case, the GPIB instrument is the Talker.

PC -------- LAN/GPIB converter -------- GPIB instrument
                    |
                    |     <-------     +1
                    |     <-------     EOI
                    |
   (normal case: +1 is forwarded to the PC; error case: a timeout occurs)
                    |
   <-------  +1     |
                    |     ------->     REN
                    |

The converter is treated as a GPIB card in the PC. The PC test program calls itimeout(...), which sets the timeout value, and this value is passed to the converter. In our case, the timeout occurs no matter what timeout value we set.

In the converter firmware, the read_buf() function:
       while (buf < bufend) {
           if (re_rdonly_intr(INT0_END)) {
               if (re_rdonly_intr(INT0_BI)) {
                   re_rddes_intr(INT0_BI);
                   buf_eoi = true;    /* added to fix the race condition */
                   if (len != 1)
                       modify_cnt = true;
               }
           } else {
               re_rddes_intr(INT0_BI);
           }
       }

  notify_pressed()
     buf_sem.wakeup();    ------>  (1)

  read_buf()
     buf_sem.sleep();       ------->  (2)

(1) wakes up (2). (2) wakes up and checks its condition, finds it not fulfilled, and goes back to sleep. Eventually (2) times out.

In detail: because of the hardware response timing, INT0_BI is not cleared when a short GPIB message is received. The next time around, INT0_BI is already set, so buf_eoi is not set. This causes read_buf() to wait for a data byte that never arrives, and the semaphore times out.
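The wake, check, sleep-again, timeout cycle can be reproduced in a small single-threaded model. This is only a sketch under assumptions, not the firmware. The bit values, the reader struct, and the names rdonly(), rddes(), check_after_wakeup(), wait_for_read() and simulate_short_message() are all hypothetical. It also assumes that "rdonly" means a read that leaves the flag set, that "rddes" means a read that clears it, and that a read completes when either buf_eoi is set or the buffer is full.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real register layout is not given. */
enum { INT0_BI = 0x01, INT0_END = 0x02 };

typedef enum { READ_OK, READ_TIMEOUT } read_result;

typedef struct {
    unsigned status;   /* simulated interrupt status register */
    bool buf_eoi;      /* end of message has been recorded */
    int received;      /* bytes received so far */
    int wanted;        /* size of the caller's buffer */
} reader;

/* Assumed semantics: a read-only access leaves the flag set. */
static bool rdonly(const reader *r, unsigned bit)
{
    return (r->status & bit) != 0;
}

/* Assumed semantics: a destructive read returns the flag and clears it. */
static bool rddes(reader *r, unsigned bit)
{
    bool was_set = rdonly(r, bit);
    r->status &= ~bit;
    return was_set;
}

/* The check the reader performs each time it is woken up. */
static bool check_after_wakeup(reader *r, bool with_fix)
{
    if (rdonly(r, INT0_END)) {
        if (rdonly(r, INT0_BI)) {
            (void)rddes(r, INT0_BI);
            if (with_fix)
                r->buf_eoi = true;   /* the one-line fix */
        }
    } else {
        (void)rddes(r, INT0_BI);
    }
    return r->buf_eoi || r->received >= r->wanted;
}

/* Each loop pass stands for one wakeup; running out of passes is the timeout. */
read_result wait_for_read(reader *r, bool with_fix, int max_wakeups)
{
    assert(max_wakeups > 0);
    for (int i = 0; i < max_wakeups; i++) {
        if (check_after_wakeup(r, with_fix))
            return READ_OK;
        /* condition not fulfilled: go back to sleep */
    }
    return READ_TIMEOUT;
}

/* A one-byte message whose last byte and EOI arrive together. */
read_result simulate_short_message(bool with_fix)
{
    reader r = { INT0_END | INT0_BI, false, 1, 16 };
    return wait_for_read(&r, with_fix, 5);
}
```

In this model, simulate_short_message(false) returns READ_TIMEOUT however many wakeups are allowed, which matches the observation that changing the timeout value did not help, while simulate_short_message(true) returns READ_OK on the first wakeup.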

The GPIB instrument runs on a Pentium 800 embedded processor, which speeds up its GPIB bus response, and that is why the race condition appears.

When debugging a race condition, log timestamps should be shown down to the millisecond, and a circular log buffer should be used.
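A circular log buffer with millisecond timestamps might look like the following sketch. None of it comes from the converter firmware: ring_log, LOG_SLOTS, LOG_MSG_LEN and the function names are hypothetical, and the caller is assumed to supply the timestamp in milliseconds from whatever clock the platform provides. New entries overwrite the oldest ones, so logging stays cheap and the buffer always holds the events just before the failure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOG_SLOTS   4    /* hypothetical capacity, kept small for illustration */
#define LOG_MSG_LEN 48   /* hypothetical maximum message length */

typedef struct {
    uint32_t t_ms;              /* timestamp in milliseconds */
    char msg[LOG_MSG_LEN];
} log_entry;

typedef struct {
    log_entry slot[LOG_SLOTS];
    unsigned next;              /* index of the slot to write next */
    unsigned count;             /* number of valid entries, at most LOG_SLOTS */
} ring_log;

void ring_log_init(ring_log *log)
{
    memset(log, 0, sizeof *log);
}

/* Stores one entry, overwriting the oldest entry once the buffer is full. */
void ring_log_put(ring_log *log, uint32_t t_ms, const char *msg)
{
    log_entry *e = &log->slot[log->next];
    e->t_ms = t_ms;
    snprintf(e->msg, sizeof e->msg, "%s", msg);   /* truncates long messages */
    log->next = (log->next + 1) % LOG_SLOTS;
    if (log->count < LOG_SLOTS)
        log->count++;
}

/* Returns the i-th retained entry; i = 0 is the oldest. */
const log_entry *ring_log_get(const ring_log *log, unsigned i)
{
    assert(i < log->count);
    unsigned oldest = (log->next + LOG_SLOTS - log->count) % LOG_SLOTS;
    return &log->slot[(oldest + i) % LOG_SLOTS];
}

/* Prints the retained entries, oldest first, as seconds.milliseconds. */
void ring_log_dump(const ring_log *log, FILE *out)
{
    for (unsigned i = 0; i < log->count; i++) {
        const log_entry *e = ring_log_get(log, i);
        fprintf(out, "%lu.%03lu %s\n",
                (unsigned long)(e->t_ms / 1000),
                (unsigned long)(e->t_ms % 1000),
                e->msg);
    }
}
```

A typical use would be to call ring_log_put() at each wakeup and each interrupt flag read, then call ring_log_dump() once the timeout fires.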