Optimal recovery schemes in fault tolerant distributed computing

Document type: Journal Articles
Article type: Original article
Peer reviewed: Yes
Author(s): Kamilla Klonowska, Håkan Lennerstad, Lars Lundberg, Charlie Svahnberg
Title: Optimal recovery schemes in fault tolerant distributed computing
Journal: Acta Informatica
Year: 2005
Volume: 41
Issue: 6
Pagination: 341-365
ISSN: 0001-5903
Publisher: Springer-Verlag
URI/DOI: 10.1007/s00236-005-0161-7
ISI number: 000228546000002
Organization: Blekinge Institute of Technology
Department: School of Engineering - Dept. of Mathematics & Natural Sciences, School of Engineering - Dept. of Systems and Software Engineering (Sektionen för ingenjörsvetenskap - Avd. för matematik och naturvetenskap, Sektionen för teknik - Avd. för programvarusystem)
School of Engineering S-371 79 Karlskrona, School of Engineering S-372 25 Ronneby
+46 455 38 50 00
http://www.bth.se/ing/; http://www.tek.bth.se/
Authors' e-mail: kamilla.klonowska@bth.se, hakan.lennerstad@bth.se, lars.lundberg@bth.se, charlie.svahnberg@bth.se
Language: English
Abstract: Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down, the load on these computers must be redistributed to other computers in the system. The redistribution is determined by the recovery scheme. The recovery scheme is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the computer that has maximal load even when the most unfavorable combinations of computers go down. We calculate the best possible such recovery schemes for any number of crashed computers by an exhaustive search, where brute force testing is avoided by a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has a high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for a maximum of twenty-one computers in the distributed system or cluster.
Subject: Computer Science\Distributed Computing
Mathematics\Discrete Mathematics
Computer Science\Computer Systems