Coding Approaches to Fault Tolerance in Dynamic Systems
Fault tolerance has been a long standing necessity in system
design and operation. In systems
with memory (state),
however, modular redundancy and other traditional approaches to fault tolerance
are undesirable not only because they are expensive but also because they rely
heavily on the assumption that the error-correcting (e.g. voting) mechanism is fault-free. This talk presents a general framework
that systematically addresses these issues in fault-tolerant discrete-time
dynamic systems. Our approach
replaces the original system with a redundant implementation in which the state
and next-state transition mechanism of the original system are encoded. Error detection, correction and/or
reconfiguration are then based on detecting and identifying violations on the
state encoding of this redundant implementation. One of the implications and advantages of this approach is
that, unlike traditional methodologies that rely on concurrent checking at the
end of each time step, this approach allows the construction of redundant
systems in which detection and identification of errors is based on
non-concurrent (e.g. periodic) checks. In the resulting design, the checker
can operate at a slower speed than the rest of the system, which relaxes the
stringent requirements on its reliability. We demonstrate this approach in the context of linear
dynamic systems and arbitrary finite-state machines. We also discuss how our framework can handle failures in the
error correction mechanism, enabling in this way the construction of reliable
systems exclusively from unreliable components. In particular, we construct reliable LFSMs using unreliable
XOR gates and voters in a way that requires only a bounded amount of redundancy per machine and
achieves arbitrarily low probability of failure during any finite time
interval. The talk will
conclude with open problems and directions for future research.
Friday, March 21, 2003
3:30 – 4:30 p.m.