System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Because of interactions among hardware and software components, the event logs of large cluster systems consist of streams of interleaved events, and only a small fraction of the events within a short time span are relevant to diagnosing a given failure. Furthermore, troubleshooting the causes of failures is largely manual and ad hoc. We have developed FDiag, a fault/failure diagnostics toolkit based on the analysis of cluster system logs. FDiag can (a) pre-process standard syslogs, Blue Gene/L logs, specific Cray XT logs, and the rationalized logs by J. Hammond, T. Minyard and J. Browne, (b) establish probable cause-and-effect relationships given a symptom event, and (c) construct the sequence of events that are likely to have led up to the system failure.
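To make the idea of working backwards from a symptom event concrete, here is a minimal Python sketch. It is not FDiag's code or algorithm. It only assumes the classic BSD syslog line layout ("Mon DD HH:MM:SS host tag: message"), and the names `Event`, `parse_syslog` and `events_before`, the sample log lines, the year and the 10-minute window are all hypothetical choices made for illustration. It parses the lines, then keeps the events from the same host that fall in a short window before the symptom.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Classic BSD-syslog layout: "Mon DD HH:MM:SS host tag[pid]: message".
SYSLOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:\[\s]+)(?:\[\d+\])?: (?P<msg>.*)$"
)


@dataclass
class Event:  # hypothetical record type, not an FDiag structure
    time: datetime
    host: str
    tag: str
    message: str


def parse_syslog(lines, year=2010):
    """Parse syslog lines into time-sorted events; unparseable lines are skipped.

    Classic syslog timestamps carry no year, so one is assumed here.
    """
    events = []
    for line in lines:
        m = SYSLOG_RE.match(line.strip())
        if m is None:
            continue
        ts = " ".join(m["ts"].split())  # collapse the padding in "Mar  3"
        when = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
        events.append(Event(when, m["host"], m["tag"], m["msg"]))
    return sorted(events, key=lambda e: e.time)


def events_before(events, symptom, window=timedelta(minutes=10)):
    """Return same-host events within `window` before the symptom event."""
    start = symptom.time - window
    return [
        e for e in events
        if e.host == symptom.host and start <= e.time < symptom.time
    ]


# Made-up sample lines, for illustration only.
sample = [
    "Mar  3 10:00:01 node12 kernel: EDAC MC0: CE page 0x1f",
    "Mar  3 10:02:30 node07 sshd[411]: Accepted publickey for user",
    "Mar  3 10:04:10 node12 kernel: Machine check exception",
    "Mar  3 10:05:00 node12 kernel: Kernel panic - not syncing",
    "Mar  3 08:00:00 node12 crond[99]: job started",
]

events = parse_syslog(sample)
symptom = events[-1]  # treat the latest event (the panic) as the symptom
candidates = events_before(events, symptom)
for e in candidates:
    print(e.time.time(), e.host, e.tag, e.message)
```

A fixed time window on a single host is the crudest possible filter. It narrows the interleaved stream to a handful of candidate events but says nothing about which of them actually caused the failure, which is the harder problem that capabilities (b) and (c) address.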
The development of FDiag is the result of an international collaboration between the Institute of High Performance Computing (IHPC), the University of Texas at Austin, the Texas Advanced Computing Center (TACC) and the A*STAR Computational Resource Centre (ACRC). The first paper presents the details of the first-generation FDiag framework, and the second paper describes the advances made since then.
Please make sure you have read and understood our flexible licensing scheme before you download.
*If you wish to participate in FDiag development, please contact the project maintainer.*
The architecture of the second-generation FDiag toolkit is shown below:
Fig 1: Architecture of the second-generation FDiag framework.