BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230831T095746Z
LOCATION:Hall
DTSTART;TZID=Europe/Stockholm:20230627T193000
DTEND;TZID=Europe/Stockholm:20230627T213000
UID:submissions.pasc-conference.org_PASC23_sess116_pos131@linklings.com
SUMMARY:P28 - GPU-Optimized Tridiagonal and Pentadiagonal System Solvers f
 or Spectral Transforms in QuiCC
DESCRIPTION:Poster\n\nDmitrii Tolmachev, Philippe Marti, and Giacomo Casti
 glioni (ETH Zurich); Daniel Ganellari (ETH Zurich / CSCS); and Andrew Jack
 son (ETH Zurich)\n\nQuiCC is a code under development designed to solve th
 e equations of magnetohydrodynamics in a full sphere and other geometries.
  It uses a fully spectral approach to the problem, with the Jones-Worland 
 polynomials as a radial basis and spherical harmonics as a spherical basis
 . We present an alternative to the quadrature approach to their evaluation
  - the polynomial connection approach, which is more accurate and requires
  less memory. In this work, we demonstrate an efficient GPU implementation
  of this algorithm. This poster focuses on the efficient tridiagonal and p
 entadiagonal GPU solvers used to evaluate the polynomial connections. Base
 d on the Parallel Cyclic Reduction algorithm, they are optimized to exclus
 ively perform on-chip data transfers through the warp shuffling instructio
 ns, exchanging data directly between thread registers. This results in th
 e best occupancy (more registers per thread, more threadblocks per streami
 ng multiprocessor) and full dispatch latency mitigation (no kernel synchro
 nization during execution). The warp-shuffle approach to thread data excha
 nge can be adapted to many other GPU algorithms because it is implemented
  in a runtime code generation platform designed for future algorithm reuse
  and originally based on the VkFFT library.
END:VEVENT
END:VCALENDAR
