Fault Tolerance (fault + tolerance)

Distribution by Scientific Domains

Information Science and Computing	60%
Engineering	40%

Selected Abstracts

Sensor network design for fault tolerant estimation

INTERNATIONAL JOURNAL OF ADAPTIVE CONTROL AND SIGNAL PROCESSING, Issue 1 2004
M. Staroswiecki
Abstract This paper addresses the problem of fault tolerant estimation and the design of fault tolerant sensor networks. Fault tolerance is defined with respect to a given estimation objective, namely a given functional of the system state should remain observable when sensor failures occur. Redundant and minimal sensor sets are defined and organized into an automaton which contains all the subsets of sensors such that the estimation objective can be achieved. Three criteria, which evaluate the system fault tolerance with respect to sensor failures when a reconfiguration strategy is used, are introduced: (strong and weak) redundancy degrees (RD), sensor network reliability (R), and mean time to non-observability (MTTNO). Sensor networks are designed by finding redundant sensor sets whose RD and/or R and/or MTTNO are larger than some specified values. A ship boiler example is developed for illustration. Copyright © 2004 John Wiley & Sons, Ltd. [source]
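The criteria named in this abstract can be illustrated on a toy network. The following is a minimal sketch, not the paper's algorithm: it assumes that observability is supplied as a hypothetical predicate `is_observable(working_sensors)`, that the full sensor set is observable, that adding sensors never destroys observability, and that sensors fail independently with known working probabilities. The strong redundancy degree is read here as the number of failures tolerated in the worst case, and the weak degree as the number tolerated in the best case. All function names are illustrative.

```python
from itertools import combinations


def observable_sets(sensors, is_observable):
    """Enumerate every sensor subset for which the estimation objective holds."""
    return [frozenset(c)
            for k in range(len(sensors) + 1)
            for c in combinations(sensors, k)
            if is_observable(frozenset(c))]


def minimal_sets(obs_sets):
    """Keep the observable sets that contain no smaller observable set."""
    return [s for s in obs_sets if not any(t < s for t in obs_sets)]


def redundancy_degrees(sensors, is_observable):
    """Return (strong, weak) redundancy degrees of the full sensor set.

    strong: largest k such that EVERY loss of k sensors keeps observability.
    weak:   largest k such that SOME loss of k sensors keeps observability.
    """
    full = frozenset(sensors)
    strong = weak = 0
    strong_open = True
    for k in range(1, len(full) + 1):
        survivors = [is_observable(full - frozenset(lost))
                     for lost in combinations(full, k)]
        if strong_open and all(survivors):
            strong = k
        else:
            strong_open = False
        if any(survivors):
            weak = k
    return strong, weak


def reliability(sensors, is_observable, p_working):
    """Probability that the set of working sensors is observable,
    assuming independent sensor failures (an assumption of this sketch)."""
    total = 0.0
    for k in range(len(sensors) + 1):
        for working in combinations(sensors, k):
            working = frozenset(working)
            if is_observable(working):
                prob = 1.0
                for s in sensors:
                    prob *= p_working[s] if s in working else 1.0 - p_working[s]
                total += prob
    return total


# Toy example (hypothetical): the objective is met by sensor "a" alone,
# or by sensors "b" and "c" together.
SENSORS = ["a", "b", "c"]


def toy_observable(working):
    return "a" in working or {"b", "c"} <= working
```

In the toy example, the minimal sensor sets are {a} and {b, c}; any single failure is tolerated, and the best-case pair of failures (b and c) is tolerated as well.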

Fault tolerance in Clos-Knockout multicast ATM switch

INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS, Issue 9 2002
K. S. Chan
Abstract In this paper, we propose a new architecture for multicast ATM switches with fault tolerant capability based on the Clos-Knockout switch. In the new architecture, each stage has one more redundant switch module. If one switch module is faulty, the redundant module would replace the faulty one. On the other hand, under the fault-free condition, the redundant modules in the second and third stages will provide additional alternative internal paths, and hence improve the performance. The performance analysis shows that the cell loss probability is lower than the original architecture when all modules are fault free, and the reliability of the original architecture is improved. Copyright © 2002 John Wiley & Sons, Ltd. [source]

High-performance hybrid information service architecture

CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 15 2010
Mehmet S. Aktas
Abstract We introduce a distributed high-performance Information Service Architecture, which forms a metadata replica hosting system to manage both highly dynamic small-scale metadata and relatively large static metadata associated with Grid/Web Services. We present an empirical evaluation of the proposed architecture and investigate its practical usefulness. The results demonstrate that the proposed system achieves high performance and fault tolerance with negligible processing overheads. The results also indicate that efficient decentralized hybrid Information Service Architectures can be built by utilizing publish-subscribe-based messaging schemes. Copyright © 2010 John Wiley & Sons, Ltd. [source]

High-level distribution for the rapid production of robust telecoms software: comparing C++ and ERLANG

CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 8 2008
J. H. Nyström
Abstract Currently most distributed telecoms software is engineered using low- and mid-level distributed technologies, but there is a drive to use high-level distribution. This paper reports the first systematic comparison of a high-level distributed programming language in the context of substantial commercial products. Our research strategy is to reengineer some C++/CORBA telecoms applications in ERLANG, a high-level distributed language, and make comparative measurements. Investigating the potential advantages of the high-level ERLANG technology shows that two significant benefits are realized. Firstly, robust configurable systems are easily developed using the high-level constructs for fault tolerance and distribution. The ERLANG code exhibits resilience: sustaining throughput at extreme loads and automatically recovering when load drops; availability: remaining available despite repeated and multiple failures; dynamic reconfigurability: with throughput scaling near-linearly when resources are added or removed. Secondly, ERLANG delivers significant productivity and maintainability benefits: the ERLANG components are less than one-third of the size of their C++ counterparts. The productivity gains are attributed to specific language features, for example, high-level communication saves 22%, and automatic memory management saves 11%, compared with the C++ implementation. Investigating the feasibility of the high-level ERLANG technology demonstrates that it fulfils several essential requirements. The requisite distributed functionality is readily specified, even though control of low-level distributed coordination aspects is abrogated to the ERLANG implementation. At the expense of additional memory residency, excellent time performance is achieved, e.g. three times faster than the C++ implementation, due to ERLANG's lightweight processes. ERLANG interoperates at low cost with conventional technologies, allowing incremental reengineering of large distributed systems. The technology is available on the required hardware/operating system platforms, and is well supported. Copyright © 2007 John Wiley & Sons, Ltd. [source]

Checkpointing BSP parallel applications on the InteGrade Grid middleware

CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 6 2006
Raphael Y. de Camargo
Abstract InteGrade is a Grid middleware infrastructure that enables the use of idle computing power from user workstations. One of its goals is to support the execution of long-running parallel applications that present a considerable amount of communication among application nodes. However, in an environment composed of shared user workstations spread across many different LANs, machines may fail, become inaccessible, or may switch from idle to busy very rapidly, compromising the execution of the parallel application in some of its nodes. Thus, to provide some mechanism for fault tolerance becomes a major requirement for such a system. In this paper, we describe the support for checkpoint-based rollback recovery of Bulk Synchronous Parallel applications running over the InteGrade middleware. This mechanism consists of periodically saving application state to permit the application to restart its execution from an intermediate execution point in case of failure. A precompiler automatically instruments the source code of a C/C++ application, adding code for saving and recovering application state. A failure detector monitors the application execution. In case of failure, the application is restarted from the last saved global checkpoint. Copyright © 2005 John Wiley & Sons, Ltd. [source]
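The checkpoint-and-restart cycle described in this abstract can be sketched in a few lines. This is a single-process illustration only, not InteGrade's mechanism: it does not model BSP supersteps, the precompiler, or coordinated global checkpoints. The function name, the `fail_at` parameter, and the use of `pickle` with in-memory storage are assumptions made for the example.

```python
import pickle


class SimulatedFailure(Exception):
    """Stands in for a node crash or a workstation becoming unavailable."""


def run_with_checkpoints(n_steps, checkpoint_every, fail_at=()):
    """Sum 0..n_steps-1, saving state periodically and rolling back on failure.

    fail_at lists step numbers at which a failure is injected once.
    Returns (total, number_of_restarts).
    """
    pending_failures = set(fail_at)
    state = {"step": 0, "total": 0}
    checkpoint = pickle.dumps(state)  # initial checkpoint before any work
    restarts = 0
    while state["step"] < n_steps:
        try:
            if state["step"] in pending_failures:
                pending_failures.discard(state["step"])
                raise SimulatedFailure(state["step"])
            # One unit of application work.
            state["total"] += state["step"]
            state["step"] += 1
            # Periodically save the application state.
            if state["step"] % checkpoint_every == 0:
                checkpoint = pickle.dumps(state)
        except SimulatedFailure:
            # Restart from the last saved checkpoint; work since then is redone.
            state = pickle.loads(checkpoint)
            restarts += 1
    return state["total"], restarts
```

The work performed between the last checkpoint and the failure is repeated after each restart, which is the usual trade-off between checkpoint frequency and recovery cost.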

Advanced eager scheduling for Java-based adaptive parallel computing

CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 7-8 2005
Michael O. Neary
Abstract Javelin 3 is a software system for developing large-scale, fault-tolerant, adaptively parallel applications. When all or part of their application can be cast as a master-worker or branch-and-bound computation, Javelin 3 frees application developers from concerns about inter-processor communication and fault tolerance among networked hosts, allowing them to focus on the underlying application. The paper describes a fault-tolerant task scheduler and its performance analysis. The task scheduler integrates work stealing with an advanced form of eager scheduling. It enables dynamic task decomposition, which improves host load-balancing in the presence of tasks whose non-uniform computational load is evident only at execution time. Speedup measurements are presented of actual performance on up to 1000 hosts. We analyze the expected performance degradation due to unresponsive hosts, and measure actual performance degradation due to unresponsive hosts. Copyright © 2005 John Wiley & Sons, Ltd. [source]
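The basic idea behind eager scheduling, reassigning tasks that are still uncompleted to hosts that ask for more work, can be sketched as a small round-based simulation. This is not Javelin 3's scheduler: it omits work stealing and dynamic task decomposition, and the function name, the round-based host model, and the `unresponsive` parameter are assumptions made for illustration.

```python
from collections import deque


def eager_schedule(tasks, hosts, work, unresponsive=frozenset()):
    """Simulate eager scheduling in rounds; each host requests one task per round.

    Unresponsive hosts accept tasks but never return results.
    Returns (results, assignments), where assignments counts every hand-out.
    """
    results = {}
    assignments = 0
    unassigned = deque(tasks)
    outstanding = deque()  # tasks handed out at least once; cleaned up lazily
    while len(results) < len(tasks):
        progressed = False
        for host in hosts:
            if unassigned:
                task = unassigned.popleft()
            else:
                # Eager step: no fresh work left, so re-issue a task that
                # was assigned earlier but has not been completed yet.
                while outstanding and outstanding[0] in results:
                    outstanding.popleft()
                if not outstanding:
                    break  # every task is already complete
                task = outstanding.popleft()
            outstanding.append(task)
            assignments += 1
            if host not in unresponsive:
                # The first result for a task wins; duplicates are ignored.
                results.setdefault(task, work(task))
                progressed = True
        if not progressed:
            raise RuntimeError("no responsive host available")
    return results, assignments
```

With tasks `[1, 2, 3, 4]`, hosts `["h1", "h2", "h3"]`, and `h2` unresponsive, the task first given to `h2` is later re-issued and completed by another host, so the computation finishes at the cost of extra assignments.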

Performance comparison of checkpoint and recovery protocols

CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE, Issue 15 2003
Himadri Sekhar Paul
Abstract Checkpoint and rollback recovery is a well-known technique for providing fault tolerance to long-running distributed applications. Performance of a checkpoint and recovery protocol depends on the characteristics of the application and the system on which it runs. However, given an application and system environment, there is no easy way to identify which checkpoint and recovery protocol will be most suitable for it. Conventional approaches require implementing the application with all the protocols under consideration, running them on the desired system, and comparing their performances. This process can be very tedious and time consuming. This paper first presents the design and implementation of a simulation environment, distributed process simulation or dPSIM, which enables easy implementation and evaluation of checkpoint and recovery protocols. The tool enables the protocols to be simulated under a wide variety of application, system, and network characteristics. The paper then presents performance evaluation of five checkpoint and recovery protocols. These protocols are implemented and executed in dPSIM under different simulated application, system, and network characteristics. Copyright © 2003 John Wiley & Sons, Ltd. [source]

Energy-efficient target detection in sensor networks using line proxies

INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS, Issue 3 2008
Jangwon Lee
Abstract One of the fundamental and important operations in sensor networks is sink-source matching, i.e. target detection. Target detection is about how a sink finds the location of source nodes observing the event of interest (i.e. target activity). This operation is very important in many sensor network applications such as military battlefields and environmental habitats. The mobility of both targets and sinks poses a significant challenge to target detection in sensor networks. Most existing approaches are either energy inefficient or lack fault tolerance in an environment of mobile targets and mobile sinks. Motivated by these shortcomings, we propose an energy-efficient line proxy target detection (LPTD) approach in this paper. The basic idea of LPTD is to use designated line proxies as rendezvous points (or agents) to coordinate mobile sinks and mobile targets. Instead of having rendezvous nodes for each target type as used by most existing approaches, we adopt a temporal-based hash function to determine the line at a given time. Then the lines are alternated over time in the entire sensor network. This simple temporal-based line rotation idea allows all sensor nodes in the network to serve as rendezvous points and achieves overall load balancing. Furthermore, instead of network-wide flooding, interests from sinks will be flooded only to designated line proxies within a limited area. The interest flooding can further decrease if the interest has geographical constraints. We have conducted extensive analysis and simulations to evaluate the performance of our proposed approach. Our results show that the proposed approach can significantly reduce overall energy consumption and target detection delay. Copyright © 2007 John Wiley & Sons, Ltd. [source]
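The temporal-based line rotation can be illustrated with a small sketch. The abstract does not specify the hash function or how time is divided, so the following assumes fixed-length time slots, lines numbered from 0 to `num_lines - 1`, and SHA-256 as the hash. The function name and all parameters are hypothetical.

```python
import hashlib


def line_for_time(timestamp_s, slot_length_s, num_lines):
    """Map a timestamp to the index of the line acting as proxy in that slot.

    Every node that knows the current time computes the same line index
    without any coordination messages.
    """
    slot = int(timestamp_s // slot_length_s)
    digest = hashlib.sha256(str(slot).encode("ascii")).digest()
    # Interpret the first 8 bytes as an integer and reduce to a line index.
    return int.from_bytes(digest[:8], "big") % num_lines
```

Because the result depends only on the time slot, sinks and sources agree on the active line at any moment, and the active line changes from slot to slot, which spreads the rendezvous load across the network.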

Towards a debugging system for sensor networks

INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT, Issue 4 2005
Nithya Ramanathan
Due to their resource constraints and tight physical coupling, sensor networks afford limited visibility into an application's behavior. As a result it is often difficult to debug issues that arise during development and deployment. Existing techniques for fault management focus on fault tolerance or detection; before we can detect anomalous behavior in sensor networks, we need first to identify what simple metrics can be used to infer system health and correct behavior. We propose metrics and events that enable system health inferences, and present a preliminary design of Sympathy, a debugging tool for pre- and post-deployment sensor networks. Sympathy will contain mechanisms for collecting system performance metrics with minimal memory overhead; mechanisms for recognizing application-defined events based on these metrics; and a system for collecting events in their spatiotemporal context. The Sympathy system will help programmers draw correlations between seemingly unrelated, distributed events, and produce graphs that highlight those correlations. As an example, we describe how we used a preliminary version of Sympathy to help debug a complex application, Tiny Diffusion. Copyright © 2005 John Wiley & Sons, Ltd. [source]