FDiag

System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Because the hardware and software components of a system interact, the event logs of large cluster systems consist of streams of interleaved events, and only a small fraction of the events over a small time span are relevant to the diagnosis of a given failure. Furthermore, troubleshooting the causes of failures is largely manual and ad hoc. We have developed FDiag, a fault/failure diagnostics toolkit based on the analysis of cluster system logs. FDiag can (a) pre-process standard syslogs, Blue Gene/L logs, specific Cray XT logs and the rationalized logs by J. Hammond, T. Minyard and J. Browne, (b) establish probable cause-and-effect relationships given a symptom event, and (c) construct the sequence of events that are likely to have led up to the system failure.
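The idea that only events in a small time span before a failure matter can be illustrated with a minimal sketch. This is not FDiag's method: it is a plain time-window filter, and the `LogEvent` record, the `events_leading_to` function and the window parameter are all hypothetical names chosen for illustration. It assumes a C++11 compiler and that the logs have already been parsed into timestamped records.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical event record; the field names are assumptions for
// illustration, not FDiag's internal data model.
struct LogEvent {
    long long timestamp;   // seconds since the epoch
    std::string node;      // host that reported the event
    std::string message;   // free-text log message
};

// Return the events from `log` whose timestamps lie in the closed interval
// [symptom.timestamp - window_seconds, symptom.timestamp], sorted by time.
// Events after the symptom or older than the window are dropped.
std::vector<LogEvent> events_leading_to(const std::vector<LogEvent>& log,
                                        const LogEvent& symptom,
                                        long long window_seconds) {
    std::vector<LogEvent> selected;
    const long long start = symptom.timestamp - window_seconds;
    for (const LogEvent& e : log) {
        if (e.timestamp >= start && e.timestamp <= symptom.timestamp) {
            selected.push_back(e);
        }
    }
    // Interleaved streams may arrive out of order; a stable sort keeps the
    // original relative order of events that share a timestamp.
    std::stable_sort(selected.begin(), selected.end(),
                     [](const LogEvent& a, const LogEvent& b) {
                         return a.timestamp < b.timestamp;
                     });
    return selected;
}
```

A filter like this only narrows the candidate set. Deciding which of the remaining events are probable causes of the symptom is the part a diagnostics toolkit has to do with correlation analysis, which this sketch does not attempt.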

FDiag is the result of an international collaboration between the Institute of High Performance Computing (IHPC), the University of Texas at Austin, the Texas Advanced Computing Center (TACC) and the A*STAR Computational Resource Centre (ACRC). Of the two papers listed below, the first presents the details of the first-generation FDiag framework and the second describes the advances made since.

1. E. Chuah, S.-h. Kuo, P. Hiew, W.-C. Tjhi, G. Lee, J. Hammond, M. T. Michalewicz, T. Hung, J. C. Browne, "Diagnosing the Root-Causes of Failures from Cluster Log Files", in Proceedings of the 17th IEEE International Conference on High Performance Computing (HiPC), pp. 1-10, 2010.

2. E. Chuah, G. Lee, W.-C. Tjhi, S.-h. Kuo, T. Hung, J. Hammond, T. Minyard, J. C. Browne, "Establishing Hypothesis for Recurrent System Failures from Cluster Log Files", in Proceedings of the 9th IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC), pp. 15-22, 2011.

Downloads


Please make sure you have read and understood our flexible licensing scheme before you download.

Ver.  Link            Size   Date        Changes
2.0   FDiag (tar.gz)  5.3MB  2012-01-26  1. Installation and User Manual
      FDiag (7z)      2.5MB              2. Three archive file types have been provided for your convenience.
      FDiag (zip)     5.2MB

ANSI C++, Serial Software

*If you wish to participate in FDiag development, please contact the project maintainer.*

Description


The architecture of the second-generation FDiag toolkit, a fault/failure diagnostics toolkit based on the analysis of cluster system logs, is shown in Fig 1.

Fig 1: Architecture of the second-generation FDiag framework.
