System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Because of interactions among hardware and software components, the event logs of large cluster systems consist of streams of interleaved events, and only a small fraction of the events within a small time span are relevant to diagnosing a given failure. Moreover, troubleshooting the causes of failures remains largely manual and ad hoc. We have developed FDiag, a fault/failure diagnostics toolkit based on the analysis of cluster system logs. FDiag can (a) pre-process standard syslogs, Blue Gene/L logs, specific Cray XT logs and the rationalized logs by J. Hammond, T. Minyard and J. Browne, (b) establish probable cause-and-effect relationships given a symptom event, and (c) construct the sequence of events likely to have led up to a system failure.
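To make the idea of pre-processing a log and narrowing it to a small time span concrete, here is a minimal Python sketch. It is not FDiag's implementation: it assumes the classic BSD syslog line layout, and the names `Event`, `parse_syslog` and `preceding_events`, the sample log lines and the five-minute window are all hypothetical choices made for illustration. It only selects events that precede a symptom event; it makes no attempt to establish cause-and-effect relationships.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Optional


class Event(NamedTuple):
    """One parsed log line (hypothetical structure, not FDiag's)."""
    timestamp: datetime
    host: str
    source: str
    message: str


# Assumed layout: "Mar  3 14:02:11 host source[pid]: message"
SYSLOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<source>[^\s:\[]+)(?:\[\d+\])?:\s"
    r"(?P<msg>.*)$"
)


def parse_syslog(lines: Iterable[str], year: int) -> List[Event]:
    """Parse syslog lines into events sorted by time; skip unparseable lines.

    Classic syslog timestamps carry no year, so the caller supplies one.
    """
    events = []
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match is None:
            continue
        # Collapse the padding in single-digit days ("Mar  3" -> "Mar 3").
        ts_text = " ".join(match.group("ts").split())
        timestamp = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
        events.append(Event(timestamp, match.group("host"),
                            match.group("source"), match.group("msg")))
    return sorted(events, key=lambda e: e.timestamp)


def preceding_events(events: Iterable[Event], symptom: Event,
                     window: timedelta,
                     host: Optional[str] = None) -> List[Event]:
    """Return events in [symptom time - window, symptom time], symptom excluded.

    If `host` is given, only events from that host are kept.
    """
    start = symptom.timestamp - window
    selected = [
        e for e in events
        if start <= e.timestamp <= symptom.timestamp
        and e != symptom
        and (host is None or e.host == host)
    ]
    return sorted(selected, key=lambda e: e.timestamp)


# Made-up sample lines, interleaved across hosts and out of order.
SAMPLE_LOG = [
    "Mar  3 14:00:05 node12 kernel: EDAC MC0: CE page 0x1a2b",
    "Mar  3 14:01:40 node07 sshd[2211]: Accepted publickey for alice",
    "Mar  3 13:10:00 node12 ntpd[801]: time reset +0.12 s",
    "Mar  3 14:02:11 node12 kernel: Kernel panic - not syncing",
]
```

Calling `parse_syslog(SAMPLE_LOG, year=2010)`, taking the last event as the symptom, and passing it to `preceding_events` with `timedelta(minutes=5)` and `host="node12"` keeps only the memory-error line from the same node; dropping the `host` filter also keeps the unrelated `sshd` event, which illustrates why interleaved streams need further filtering.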

FDiag is the result of an international collaboration between the Institute of High Performance Computing (IHPC), the University of Texas at Austin, the Texas Advanced Computing Center (TACC) and the A*STAR Computational Resource Centre (ACRC). Of the two papers listed below, the first presents the first-generation FDiag framework in detail and the second describes the subsequent advances.

1. E. Chuah, S.-h. Kuo, P. Hiew, W.-C. Tjhi, G. Lee, J. Hammond, M. T. Michalewicz, T. Hung, J.C. Browne, "Diagnosing the Root-Causes of Failures from Cluster Log Files", in Proceedings of the 17th IEEE International Conference on High Performance Computing (HiPC), pp. 1-10, 2010.

2. E. Chuah, G. Lee, W.-C. Tjhi, S.-h. Kuo, T. Hung, J. Hammond, T. Minyard, J.C. Browne, "Establishing Hypothesis for Recurrent System Failures from Cluster Log Files", in Proceedings of the 9th IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC), pp. 15-22, 2011.


The architecture of the second-generation FDiag toolkit is shown below:

Fig 1: Architecture of the second-generation FDiag framework.

