Fault Tolerance in Operating System
Al Anoud Filfilan
Master Program Student
Eshraq Bin Hureib
Master Program Student
E-mail: [email protected]
“A distributed system is one in which the failure of a computer you didn’t even know existed
can render your own computer unusable” Leslie Lamport, May 1987
Our dependence on technology grows every day, and today's computer systems are
interconnected through a variety of communication media. The use of distributed
systems in everyday activities has grown along with the distribution of
information. Distributed systems allow nodes to coordinate and share their
resources among connected systems and devices, which gives people access to
geographically distributed computing facilities. A distributed system may,
however, lose service availability when failures occur at any of its multiple
points of failure. This article surveys the fault tolerance mechanisms used in
distributed systems to prevent such failures, with attention to replication,
high redundancy and high availability of the distributed services.
Computing systems consist of a variety of hardware and software components.
Failure of any of these components can lead to unexpected, potentially
disruptive behavior and to loss of service availability. An insufficient level
of fault tolerance has in the past caused critical applications to fail.
What do we mean by fault tolerance, and what is a suitable definition? The
best definition is the ability of a computer system, electronic system or network
to deliver continued service in spite of the failure of one or more of its
components. Fault tolerance also addresses possible service interruption
caused by software or logic faults. The aim is to prevent the catastrophic
failure that could result from a single point of failure.
In other words, fault-tolerant computing is the art and science of building
computing systems that continue to operate satisfactorily in the presence of
faults. A fault-tolerant system may be able to tolerate one or more
fault types, including:
Transient, intermittent or permanent faults.
Software and hardware design errors.
Operator errors.
Externally induced upsets or physical damage.
Fault tolerance and dependable systems research covers a wide range of
applications, from embedded real-time systems and commercial transaction systems
to transportation systems and military/space systems. The supporting research
includes system architecture, validation, testing, proof of correctness,
modeling, design techniques, software reliability, operating systems, coding
theory, parallel processing, and real-time processing. These areas often draw
on widely different core expertise, ranging from formal logic, the mathematics
of stochastic modeling and graph theory to hardware design and software
engineering.
Dependable systems are closely related to fault tolerance. A system is
considered dependable when it is trustworthy enough that reliance can be placed
on the service it delivers. For a system to be dependable, it must have the
following properties:
1- It must be available (i.e., ready for use
when we need it).
2- It must be reliable (i.e., able to provide
continuity of service while we are using it).
3- It must be safe (i.e., it does not have a catastrophic
effect on the environment).
4- It must be secure (e.g., able to maintain
There are a few terms that are closely related to the dependability of a system
and its behavior:
Fault – A defect at the lowest level of abstraction. It can lead to an incorrect
system state. Faults may be classified as transient, intermittent or permanent.
Faults can be of the following types:
Processor Faults (Node Faults):
Processor faults occur when the processor behaves in an unexpected manner. They
may be classified into three kinds:
a) Fail-Stop – where a processor is either active and participating
in distributed protocols, or has failed completely and will never respond. In
this case the neighboring processors can detect the failed processor.
b) Slowdown – where a processor
may run in degraded mode or may fail completely.
c) Byzantine – where a processor can
fail, run in degraded mode for some time, or execute at normal speed while
trying to make the computation fail.
Network Faults (Link Faults): Network
faults occur when (live and working) processors are prevented from communicating
with each other. Link faults can cause new kinds of problems, such as:
a) One-way Links – Here one processor can send messages to another but is not
able to receive messages from it. This kind of problem is similar to that seen due to
b) Network Partition – Here a portion of the network
is completely isolated from the rest.
Error – Undesirable system state
that may lead to failure of the system.
Failure – Faults due to unintentional
Types of failure:
Crash failure – A server halts, but is working correctly until it halts.
Omission failure – A server fails to respond to incoming requests.
   Receive omission – A server fails to receive incoming messages.
   Send omission – A server fails to send messages.
Timing failure – A server’s response lies outside the specified time interval.
Response failure – A server’s response is incorrect.
   Value failure – The value of the response is wrong.
   State transition failure – The server deviates from the correct flow of control.
Arbitrary failure – A server may produce arbitrary responses at arbitrary times.
Fault Tolerance – Ability of a system to behave
in a well-defined manner upon the occurrence of faults.
Recovery – Recovery is a passive
approach in which the state of the system is maintained and is used to roll
back the execution to a predefined checkpoint.
Redundancy – With respect to fault
tolerance, it is the replication of hardware, software components or computation.
Security – Robustness of the system
characterized by privacy, integrity, availability, accuracy and benignity over
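The idea of recovery by rolling back to a checkpoint, as defined in the list
above, can be illustrated with a minimal Python sketch. The class
CheckpointedCounter, the function run_batch and the rule that a negative input
counts as a fault are all hypothetical, chosen only for illustration; a real
system would write its checkpoints to stable storage rather than keep them in
memory.

```python
import copy


class CheckpointedCounter:
    """Toy stateful service that can roll back to its last checkpoint."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a private copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def process(self, value):
        # Assumption for this sketch: a negative input represents a fault.
        if value < 0:
            raise ValueError("negative input treated as a fault")
        self.state["processed"] += 1
        self.state["total"] += value


def run_batch(service, values):
    """Apply all values; on a fault, roll back to the checkpoint taken first."""
    service.checkpoint()
    try:
        for v in values:
            service.process(v)
    except ValueError:
        service.rollback()
        return False
    return True


svc = CheckpointedCounter()
ok_first = run_batch(svc, [1, 2, 3])      # completes normally
ok_second = run_batch(svc, [10, -1, 5])   # fails after a partial update
print(ok_first, ok_second, svc.state)
# prints: True False {'processed': 3, 'total': 6}
```

The second batch changes the state before the fault is hit, yet the final state
equals the one saved at the checkpoint, which is the behavior the definition of
recovery describes.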
Types of Fault Tolerance and Failure
1- Types of Fault Tolerance
Safety: a safety property
means that some particular “bad thing” never happens within a system.
Formally, this can be characterized by specifying when an execution e is “not
safe” for a property p: if e ∉ p there must be an
identifiable discrete event within e that prevents all possible system executions
from being safe. e.g.: simultaneous updating of a shared object. A distributed
program is safe if the system always remains within the set of safe states.
Liveness: a liveness property claims that some “good thing” will eventually happen
during the system execution. Formally, a partial execution of a system is
live for property p if and only if it can be legally extended to still remain
in p. “Legally” here means that the extension must be allowed by the system
itself. e.g.: a process waiting for access to a shared object will
eventually be allowed to do so.
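The difference between the two properties can be made concrete with a small
Python sketch that checks an event trace for the shared-object example. The
trace format (pairs of process name and one of the events "request", "enter",
"exit") and the function names are assumptions made for this illustration. A
finite trace can refute safety, but it can never refute liveness: it can only
show which requests are still waiting to be granted in some legal extension.

```python
def is_safe(trace):
    """Safety: at no point are two processes inside the shared object."""
    inside = set()
    for proc, event in trace:
        if event == "enter":
            inside.add(proc)
            if len(inside) > 1:
                return False  # the "bad thing" has happened
        elif event == "exit":
            inside.discard(proc)
    return True


def pending_requests(trace):
    """Liveness obligations: processes that requested access but have not
    entered yet. A live system must eventually empty this set."""
    waiting = set()
    for proc, event in trace:
        if event == "request":
            waiting.add(proc)
        elif event == "enter":
            waiting.discard(proc)
    return waiting


good = [("A", "request"), ("A", "enter"), ("B", "request"),
        ("A", "exit"), ("B", "enter"), ("B", "exit")]
bad = [("A", "request"), ("B", "request"), ("A", "enter"), ("B", "enter")]

print(is_safe(good), is_safe(bad))   # prints: True False
print(pending_requests(good[:3]))    # prints: {'B'}
print(pending_requests(good))        # prints: set()
```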
A distributed program A must satisfy both its safety and its liveness property
to behave correctly. Which of the two properties still hold when faults occur
determines the type of fault tolerance:
Types of Fault Tolerance
              live           not live
safe          masking        fail safe
not safe      non-masking    none
The masking type of fault tolerance is the most desirable but also
the most costly to implement. Applications with this type of fault
tolerance can tolerate faults in a transparent
manner. The last case, where neither safety nor liveness is guaranteed,
is the least desirable.
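One classic way to obtain masking behavior is to run several replicas of a
computation and take a majority vote over their outputs, so that a faulty
minority stays invisible to the caller. The Python sketch below is only an
illustration under that assumption; the names majority_vote, replicated_call,
square and faulty_square are hypothetical, and the replicas are plain
functions standing in for independent hardware or software units.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def replicated_call(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def square(x):
    return x * x


def faulty_square(x):
    # Simulated faulty replica that always returns a wrong value.
    return x * x + 1


result = replicated_call([square, faulty_square, square], 7)
print(result)  # prints: 49
```

With three replicas one faulty unit is masked completely. The cost is visible
in the sketch as well: every request is computed three times, which is why
masking is the most expensive type to implement.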
Of the two intermediate types, fail safe is
preferable to non-masking (and is an active area of research) because of
the importance of leaving the system in a safe state. In the
non-masking case the output of the system may not be desirable
or correct, but a result is still delivered. Recently, a specialization
of non-masking fault tolerance called self-stabilization has received
considerable attention. Programs of this kind can tolerate any kind of transient fault.
However, such programs are hard to construct and verify.
2- Failure Detection
Failure detection is fundamental to
achieving the safety and liveness properties of a system. Researchers
have made many efforts to efficiently detect and identify the type of fault in
a system. One of the significant contributors has been Chandra, who addressed
the consensus and atomic broadcast problems with unreliable failure detectors
using a rigorous formalization.
Gallet et al. have further studied their
efficiency, measured as the longest message chain before a decision,
and established lower bounds for the main classes of failure detectors. Yang
et al. related different system models and failure detectors by converting a
consensus algorithm proposed for one model to another. Beauquier et al. have proposed
a hierarchy of transient failure detectors that detect the occurrence of
transient faults, and the resources needed to implement them.
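In practice a failure detector is often approximated with heartbeats and a
timeout. The Python sketch below shows this under simplifying assumptions: the
class HeartbeatDetector and its methods are hypothetical names, and time is
passed in explicitly so the example is deterministic. It is not the formal
construction from the literature mentioned above. It only shows why such a
detector is unreliable: a slow node can be suspected wrongly and cleared again
later.

```python
class HeartbeatDetector:
    """Suspects every node whose last heartbeat is older than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        # Record the time at which a heartbeat from `node` arrived.
        self.last_seen[node] = now

    def suspected(self, now):
        # A suspicion is only a guess: the node may be slow, not crashed.
        return {node for node, seen in self.last_seen.items()
                if now - seen > self.timeout}


det = HeartbeatDetector(timeout=3.0)
det.heartbeat("A", now=0.0)
det.heartbeat("B", now=0.0)
det.heartbeat("A", now=2.0)
det.heartbeat("A", now=4.0)

early = det.suspected(now=4.5)   # B has been silent for 4.5 time units
det.heartbeat("B", now=5.0)      # B was only slow; its heartbeat arrives late
later = det.suspected(now=5.5)

print(early, later)  # prints: {'B'} set()
```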
Fault-Tolerance in General-Purpose Computers
People may not realize the extent to which
fault-tolerance mechanisms are applied in general-purpose computers to increase
their reliability. Techniques used in general-purpose computers are also
applied in more specialized fault-tolerant computers, so these computers are a
good starting point for a survey.
A computer is commonly divided into three
main parts: processor, primary memory and I/O. These parts
predominantly use slightly different fault-tolerance mechanisms. Error
detection and recovery techniques in a typical system are shown in table
(a). On memory data, parity is used. In the more expensive computers, and
at present increasingly in inexpensive computers, double-error-detecting
codes are also applied. In addition, parity is applied on address and control
information. Recovery can be done with single-error-correcting codes on data
and with retry on address and control information parity errors. Memory can,
under software control, in several systems (e.g. VAX 8600) be dynamically
reconfigured to exclude bad pages.
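To make the memory codes above concrete, the following Python sketch computes
an even parity bit and implements the textbook Hamming(7,4)
single-error-correcting code. It is an illustration only: real memory
controllers use wider codes implemented in hardware, and the function names are
hypothetical. Hamming(7,4) on its own corrects one flipped bit; adding one
overall parity bit to it is the usual way to also detect double errors.

```python
def parity_bit(bits):
    """Even parity: the extra bit that makes the number of 1s even."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit; return (data bits, error position)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 means no error, else 1-based position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)       # [0, 1, 1, 0, 0, 1, 1]
stored[4] ^= 1                        # simulate a single-bit memory fault
recovered, position = hamming74_decode(stored)
print(recovered, position)            # prints: [1, 0, 1, 1] 5
```

Parity alone would only have reported that something is wrong with the word;
the syndrome of the Hamming code also says which bit to flip back.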
Many of the mechanisms used on
memory can also be applied to I/O. Retry is often used extensively here;
on devices such as disks in particular, it is an effective technique. A processor
contains many registers. To provide fault tolerance here, the same mechanisms as
those used on memory can be used. In addition, duplication of control logic
is commonly applied. To raise availability, repair time has to be reduced. One
way to do this is remote diagnostics. When an error is detected, either the
computer or an operator contacts a service center, possibly located far away
from the computer site.
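The retry technique mentioned for I/O devices can be sketched in a few lines of
Python. The helper retry and the class FlakyDisk are hypothetical and exist
only to simulate a transient read error; the sketch assumes that transient
faults show up as OSError and that a bounded number of attempts is acceptable.

```python
import time


def retry(operation, attempts=3, delay=0.0):
    """Call `operation` until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as err:        # treated as a possibly transient fault
            last_error = err
            if attempt < attempts:
                time.sleep(delay)     # give the device time to recover
    raise last_error                  # the fault looks permanent: give up


class FlakyDisk:
    """Simulated device whose first `failures` reads fail."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read_block(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError("simulated transient read error")
        return b"block-data"


disk = FlakyDisk(failures=2)
data = retry(disk.read_block, attempts=3)
print(data, disk.calls)  # prints: b'block-data' 3
```

Retry masks transient faults only. When the fault is permanent the last error
is raised again, and other mechanisms such as reconfiguration or repair have to
take over.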
Memory, detection: Parity and double-error-detecting code
Memory, recovery: Single error-correction code, retry and dynamic reconfiguration
Processor, detection: Parity, duplication and comparison
Table a: Error detection and recovery
The service center can connect to the
computer and run diagnostic programs if necessary. The personnel at
the service center can either resolve the issue from their site (in the case
of software problems) or ship a replacement module (in the case of hardware failure)
to the failed site. Software mechanisms are also widely applied
to raise fault tolerance. One widely used mechanism is transaction
processing, e.g. in databases. A computer, power or
disk failure should not be sufficient to damage the database if the necessary
precautions are taken.
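The protection that transactions give a database can be demonstrated with
Python's built-in sqlite3 module. The sketch below is an illustration only: the
table account, the function transfer and the balances are made up, and an
exception raised in the middle of an update stands in for a failure that
interrupts the work. Either both updates of a transfer become visible or
neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?",
                (src,)).fetchone()
            if balance < 0:
                # Simulated failure after the first half of the update.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False


first = transfer(conn, "alice", "bob", 30)     # both updates are committed
second = transfer(conn, "alice", "bob", 500)   # fails halfway, rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(first, second, balances)
# prints: True False {'alice': 70, 'bob': 80}
```

The failed transfer had already debited the source account when the error
occurred, yet the stored balances are unchanged afterwards because the whole
transaction was rolled back.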
Fault tolerance is achieved by applying
a set of analysis and design techniques to create systems with dramatically
improved dependability. As new technologies are developed and new
applications arise, new fault-tolerance approaches are also needed.
In the early days of fault-tolerant computing, it was possible to craft
specific hardware and software solutions from the ground up, but now chips
contain complex, highly integrated functions, and hardware and software must
be crafted to meet a variety of standards to be economically viable.