Seminar Topic WS18/19: Fault Tolerance for HPC

Seminar Style

Attendance at all seminar presentations is mandatory for every participant.

Successful participation consists of

  1. choosing, reading and understanding 1-2 papers from a list,
  2. presenting the papers to the other participants (slides, 30 minutes),
  3. and writing a summary of the papers (10-15 pages).

The number of meetings depends on the number of participants. There will usually be two talks per meeting.

ECTS points: 3.0
Seminar ECTS points will be assigned where the presented topic fits best:

Key Dates

Registration

Register in TISS by October 17, 2018. (The procedure will be explained during the first meeting.)

Topics/Papers

Papers/topics are advised by:
SH - Sascha Hunold
JLT - Jesper Larsson Träff

Topic | Advisor | Paper | ECTS | Comment
1 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (1 [Checkpointing]) | SE, TI, AL |
2 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (1 [ABFT]) | SE, TI, AL |
3 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (2 [Silent Errors]) | SE, TI, AL |
4 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (2 [Failures in Large Scale Machine]) | SE, TI, AL |
5 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (3 [Fault Tolerant MPI - Logging]) | SE, TI, AL |
6 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (3 [Fault Tolerant MPI - ULFM]) | SE, TI, AL |
7 | SH/JLT | T. Herault and Y. Robert, eds. Fault-Tolerance Techniques for High-Performance Computing. 1st. Springer, 2015. isbn: 3319209426, 9783319209425 (4 [Replication for Resilience]) | SE, TI, AL, TH |
8 | SH/JLT | Z. Chen and J. J. Dongarra. “Algorithm-Based Fault Tolerance for Fail-Stop Failures”. In: IEEE Trans. Parallel Distrib. Syst. 19.12 (2008), pp. 1628–1641. doi: 10.1109/TPDS.2008.58 | SE, TI, AL |
9 | SH/JLT | J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. B. Ferreira, and C. Engelmann. “Combining Partial Redundancy and Checkpointing for HPC”. In: Proceedings of the 32nd IEEE International Conference on Distributed Computing Systems (ICDCS). IEEE Computer Society, 2012, pp. 615–626. doi: 10.1109/ICDCS.2012.56 | SE, TI, AL |
10 | SH/JLT | N. El-Sayed and B. Schroeder. “Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies”. In: IEEE Trans. Dependable Sec. Comput. 15.2 (2018), pp. 336–350. doi: 10.1109/TDSC.2016.2548463 | SE, TI, AL |
11 | SH/JLT | C. George and S. S. Vadhiyar. “Fault Tolerance on Large Scale Systems using Adaptive Process Replication”. In: IEEE Trans. Computers 64.8 (2015), pp. 2213–2225. doi: 10.1109/TC.2014.2360536 | SE, TI, AL |
12 | SH/JLT | M. Gamell, K. Teranishi, J. Mayo, H. Kolla, M. A. Heroux, J. Chen, and M. Parashar. “Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales”. In: IEEE Trans. Parallel Distrib. Syst. 28.10 (2017), pp. 2881–2895. doi: 10.1109/TPDS.2017.2696538 | SE, TI, AL |
13 | SH/JLT | X. Tang, J. Zhai, B. Yu, W. Chen, W. Zheng, and K. Li. “An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL”. In: IEEE Trans. Parallel Distrib. Syst. 29.4 (2018), pp. 758–771. doi: 10.1109/TPDS.2017.2781257 | SE, TI, AL |
14 | SH/JLT | J. Ansel, K. Arya, and G. Cooperman. “DMTCP: Transparent checkpointing for cluster computations and the desktop”. In: Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing (IPDPS). IEEE, 2009, pp. 1–12. doi: 10.1109/IPDPS.2009.5161063 | SE, TI, AL |
15 | SH/JLT | Z. Chen. “Algorithm-based recovery for iterative methods without checkpointing”. In: Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing (HPDC). ACM, 2011, pp. 73–84. doi: 10.1145/1996130.1996142 | SE, TI, AL |

Dates

Contact

In case you have further questions about the seminar, please contact Assistant Prof. Dr. Sascha Hunold.