BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230831T095745Z
LOCATION:Seehorn
DTSTART;TZID=Europe/Stockholm:20230626T170000
DTEND;TZID=Europe/Stockholm:20230626T173000
UID:submissions.pasc-conference.org_PASC23_sess161_msa120@linklings.com
SUMMARY:Task-Level Resilience for Dynamically Generated Tasks under Work S
 tealing in Clusters
DESCRIPTION:Minisymposium\n\nClaudia Fohry (University of Kassel)\n\nPerma
 nent hardware failures of cluster nodes cause processes to abort and, if n
 o precautions are taken, all previous compute results will be lost. Resili
 ence can be achieved through checkpointing, which allows restarting applic
 ations from a saved state. However, writing checkpoints to a file system i
 s costly and there is a delay before restarting. Therefore, alternative te
 chniques store checkpoints in the main memory of other cluster nodes (in-m
 emory checkpointing), reduce the checkpoint size by selecting the checkpoi
 nt data (application-level checkpointing), or recover and continue running
  on the intact nodes (shrinking localized recovery). These approaches can 
 be nicely combined for Asynchronous Many-Task (AMT) programs, where a runt
 ime system automatically assigns execution units called tasks to processes
  and threads. Since tasks have clean interfaces, the runtime can automatic
 ally select checkpoint data. Moreover, it can reassign tasks that were aff
 ected by a failure. Keeping track of tasks is not trivial, though, when ta
 sks may dynamically generate new tasks and work stealing is used to balanc
 e the load by moving tasks from busy to idle processes and threads. The ta
 lk outlines a task-level checkpointing scheme for this environment. The sc
 heme can handle independent, side-effect-free tasks under multiple failure
 s with a runtime overhead below 1%.\n\nDomain: Computer Science, Machine L
 earning, and Applied Mathematics\n\nSession Chair: Nicolas Morales (Sand
 ia National Laboratories)
END:VEVENT
END:VCALENDAR
