Chapter 7. Components for fault-tolerance

Table of Contents
Introduction
Tools and components to be developed
Components and tools descriptions

Introduction

Fault-tolerance can be provided by the combined use of a set of methods and mechanisms, starting at design and covering the whole development process through to run-time.

The fault-tolerance objectives of the project are twofold. During the first phase of the project, tools and mechanisms will be implemented to support degraded mode management and dynamic reconfiguration of tasks (based on dynamic scheduling policies) on a local basis (one node). During the second phase, redundancy mechanisms will be implemented to support dynamic reconfiguration in a distributed system.

In the project we do not consider that fault-tolerance can only be provided by isolated, separate fault-tolerant components. The methodology we will adopt is to build a fault-tolerant system using, as far as possible, the other OCERA components (scheduling, resource management and communication) and exploiting appropriately the new high-level facilities they will implement.

The Fault-Tolerance work package will thus not only provide some basic specific run-time FT components, but it will also provide other OCERA components with requirements for new features so that these components can contribute to the overall application fault-tolerance. It will also provide a methodology and associated supporting tools to help application users specify and implement fault-tolerance.

Once specialized, OCERA components will contribute to fault-tolerance, provided that a few additional features are added (such as temporal fault signaling), a specific configuration is applied (within the QoS scheduler), or a specific protocol is implemented (in communication).

The main reason for splitting the objectives into two sets is related to the project planning. We have to reach a first objective rapidly so that a consistent set of components can be demonstrated at the end of the first phase. Local management of degraded modes can be achieved in such a short period of time and still provides the basic building blocks for the more complete set of fault-tolerance mechanisms. Full implementation of fault-tolerance will require the cooperation of almost every other OCERA component. It will thus take time to specify all the needed requirements precisely and to implement them, which could not reasonably be achieved within the first period of the project.

While most OCERA components will be run-time components, we have identified a need for design tools related to fault-tolerance. These design tools will first allow the user to specify non-functional features of the application, namely to define temporal constraints, declare critical tasks, and specify exception handling and alternative behaviors. They will also support the specification of mode changes. A second form of help will concern the configuration of the target system.

Indeed, since fault-tolerance will require the cooperation of several interrelated components, it is important that the configuration and use of the various OCERA components are consistent. Moreover, some additional code will have to be added to provide support for dynamic reconfiguration. A building tool will therefore be developed to configure OCERA components, instantiate fault-tolerance policies and generate additional code to support fault-tolerance.
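As an illustration only, the design-time information listed above (temporal constraints, critical tasks, alternative behaviors) could be captured in a per-task descriptor that a building tool checks for consistency before generating code. The sketch below is not an OCERA interface: every name in it (ft_task_spec_t, ft_task_spec_check, the criticality levels, the example task) is hypothetical, and the rule that a deadline may not exceed the period is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical criticality levels; not an OCERA definition. */
typedef enum { FT_TASK_NON_CRITICAL = 0, FT_TASK_CRITICAL = 1 } ft_criticality_t;

/* Hypothetical entry-point type for a task behavior. */
typedef void (*ft_behavior_fn)(void *arg);

/* Hypothetical descriptor of the non-functional features of one task. */
typedef struct {
    const char      *name;
    uint32_t         period_us;    /* temporal constraint: activation period */
    uint32_t         deadline_us;  /* temporal constraint: relative deadline */
    ft_criticality_t criticality;  /* declared critical or not */
    ft_behavior_fn   nominal;      /* normal behavior */
    ft_behavior_fn   alternative;  /* degraded behavior, may be NULL */
} ft_task_spec_t;

/*
 * Consistency check a building tool might run on each descriptor.
 * Returns 0 when the descriptor is consistent, a negative code otherwise.
 */
int ft_task_spec_check(const ft_task_spec_t *spec)
{
    assert(spec != NULL);
    if (spec->nominal == NULL)
        return -1;  /* a task needs a nominal behavior */
    if (spec->period_us == 0 || spec->deadline_us == 0)
        return -2;  /* temporal constraints must be set */
    if (spec->deadline_us > spec->period_us)
        return -3;  /* assumption: deadline within the period */
    if (spec->criticality == FT_TASK_CRITICAL && spec->alternative == NULL)
        return -4;  /* assumption: critical tasks need an alternative */
    return 0;
}

/* Hypothetical example task with a nominal and an alternative behavior. */
static void control_loop(void *arg) { (void)arg; }
static void safe_stop(void *arg)    { (void)arg; }

static const ft_task_spec_t example_spec = {
    "control", 10000u, 8000u, FT_TASK_CRITICAL, control_loop, safe_stop
};
```

Keeping the descriptor as plain data means the same information could serve both the design tool (for validation) and the building tool (for scheduler configuration), although the actual OCERA formats may differ.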

Fault-tolerance will thus be considered at three stages in the life cycle of an application:

Design

Application building

  • Build tasks

  • Configure schedulers

  • Instantiate specific fault-tolerance mechanisms

Runtime

  • Monitoring and control

  • Collecting logs (for tuning)
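To make the run-time stage more concrete, the following minimal sketch shows one way monitoring, control and log collection could fit together on a single node: each job completion is logged, deadline misses are counted, and a switch to degraded mode is requested after a number of consecutive misses. This is an assumption-laden illustration, not the OCERA implementation. All names (ft_monitor_t, ft_monitor_record, FT_LOG_CAPACITY) and the consecutive-miss policy are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FT_LOG_CAPACITY 8u  /* hypothetical ring-buffer size */

typedef enum { FT_MODE_NOMINAL, FT_MODE_DEGRADED } ft_mode_t;

/* One log record per job, kept for later tuning. */
typedef struct {
    uint32_t release_us;
    uint32_t finish_us;
    int      missed;  /* 1 if the job overran its deadline */
} ft_log_entry_t;

/* Hypothetical per-task monitor state. */
typedef struct {
    uint32_t       deadline_us;         /* relative deadline */
    uint32_t       miss_threshold;      /* consecutive misses before degrading */
    uint32_t       consecutive_misses;
    ft_mode_t      mode;
    ft_log_entry_t log[FT_LOG_CAPACITY];
    size_t         log_count;           /* total number of jobs recorded */
} ft_monitor_t;

void ft_monitor_init(ft_monitor_t *m, uint32_t deadline_us,
                     uint32_t miss_threshold)
{
    assert(m != NULL);
    m->deadline_us = deadline_us;
    m->miss_threshold = miss_threshold;
    m->consecutive_misses = 0;
    m->mode = FT_MODE_NOMINAL;
    m->log_count = 0;
}

/* Record one job completion and return the mode the task should run in. */
ft_mode_t ft_monitor_record(ft_monitor_t *m, uint32_t release_us,
                            uint32_t finish_us)
{
    assert(m != NULL);
    /* Unsigned subtraction stays correct if the timestamp counter wraps. */
    int missed = (uint32_t)(finish_us - release_us) > m->deadline_us;

    /* Monitoring: store the record, overwriting the oldest when full. */
    ft_log_entry_t *e = &m->log[m->log_count % FT_LOG_CAPACITY];
    e->release_us = release_us;
    e->finish_us = finish_us;
    e->missed = missed;
    m->log_count++;

    /* Control: request degraded mode after repeated temporal faults. */
    if (missed) {
        m->consecutive_misses++;
        if (m->consecutive_misses >= m->miss_threshold)
            m->mode = FT_MODE_DEGRADED;
    } else {
        m->consecutive_misses = 0;
    }
    return m->mode;
}
```

In this sketch the monitor only reports the requested mode. Actually switching a task to its alternative behavior would be the job of the scheduling and reconfiguration components, which are outside the scope of the example.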