This is a bi-lingual blog of the members of the ADAMIS team at Laboratoire APC and invited guests. We comment on selected papers and events that explore, or are relevant to, the interface between physics, cosmology, applied math, statistics, and numerical algorithms, and which we have found interesting.

The opinions expressed in this blog are those of their authors, and not those of the ADAMIS group as a whole, nor of Laboratoire APC.

Monday, October 25, 2010

Fast(er) and not furious ...
[arXiv:1010.1260, arXiv:1010.2084, & others]

Calculating spherical harmonic transforms is a common task for all of us who deal with data distributed over a sphere, be it the nearly spherical surface of the Earth for geophysicists and oceanographers, or the sky, i.e., the apparent positions of astronomical objects as seen by an Earth-bound observer, for astronomers and astrophysicists. Doing it fast therefore matters, as clearly does doing it even faster.


Owing to a special choice of sphere sampling, or pixelization, and to the fact that spherical harmonics can be factored into a product of two functions, one depending on only one of the two spherical angles and the other only on the other angle, spherical harmonic transforms are typically done in two steps, each involving a transform over one of the two angles, φ and θ. The φ-transform is usually performed with Fast Fourier Transforms and its cost is therefore usually subdominant. It is the "θ-transform" which takes all the CPU and wall-clock time. For n points, given or desired, on the sphere, such transforms typically require ~ n^(3/2) floating point operations (FLOPs). Not bad, one may think. Indeed, if you look at the transform as a direct application of a spherical harmonics matrix to a data vector, it would normally require ~ n^2 FLOPs. We do, then, much better than that. Nonetheless it is surely not as good as, say, the Fast Fourier Transform, whose computational cost scales more like n ln n.
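To make the two-step structure concrete, here is a minimal, purely illustrative Python sketch of the synthesis (coefficients-to-map) direction on a simple iso-latitude grid. The function, grid, and normalization choices are ours and not those of any of the codes discussed in the papers; the point is only to show where the ~ n^(3/2) θ-transform and the (usually subdominant) FFT-based φ-transform enter.

```python
import numpy as np
from scipy.special import lpmv, gammaln

def synthesize(alm, lmax, ntheta, nphi):
    """Toy two-step spherical harmonic synthesis on an iso-latitude grid.

    alm[l, m] holds complex coefficients for m >= 0 only (a real map is
    assumed, so negative m follow by conjugate symmetry). Illustrative of
    the cost structure only; not an optimized or properly sampled transform.
    """
    thetas = (np.arange(ntheta) + 0.5) * np.pi / ntheta       # ring colatitudes
    Fm = np.zeros((ntheta, nphi), dtype=complex)              # per-ring spectra in m

    # Step 1: the theta-transform: for every ring and every m, a sum over l.
    # Cost ~ ntheta * lmax^2 ~ n^(3/2), since ntheta ~ lmax and n ~ lmax^2 pixels.
    for j, theta in enumerate(thetas):
        x = np.cos(theta)
        for m in range(lmax + 1):
            ls = np.arange(m, lmax + 1)
            norm = np.sqrt((2 * ls + 1) / (4 * np.pi)
                           * np.exp(gammaln(ls - m + 1) - gammaln(ls + m + 1)))
            Fm[j, m] = np.dot(alm[m:, m], norm * lpmv(m, ls, x))

    # Step 2: the phi-transform: one FFT over m per ring.
    # Cost ~ ntheta * nphi * ln(nphi), usually subdominant.
    rings = np.fft.ifft(Fm, axis=1) * nphi                    # sum_m Fm * exp(i m phi)
    return 2.0 * rings.real - Fm[:, :1].real                  # fold in m < 0, undo doubled m = 0

# Example: a band-limited random real map with lmax = 31 on a 32 x 64 grid.
lmax = 31
alm = np.tril(np.random.randn(lmax + 1, lmax + 1)
              + 1j * np.random.randn(lmax + 1, lmax + 1))
alm[:, 0] = alm[:, 0].real                                    # m = 0 must be real
sky = synthesize(alm, lmax, ntheta=32, nphi=64)
```

Production codes do the same two steps but replace the explicit Legendre evaluations with stable recurrences, careful quadrature, and heavy low-level optimization; that last part is what the rest of this post is about.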

Better algorithms would clearly not hurt, and indeed some have been proposed. Interesting as they are for "fundamental reasons", they have not as of now had the practical impact on this area that one might have expected. First, they have turned out to have typically huge prefactors which, for the typical sizes of the problems we deal with, seem to offset their advantageous scaling. Second, however esthetically pleasing they may be, and indeed are, their mathematical and algorithmic complexity is rather intimidating, and their stability for cases of practical interest is still unclear.

Pragmatic as we, as a community, are, before embarking on the rather steep task of implementing one of these new algorithms, we considered it prudent to look around for other, simpler, even if less elegant, approaches which could compensate for that (roughly) n^(1/2) extra cost (over linear scaling) of the standard calculations.
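For concreteness, a rough back-of-the-envelope estimate, taking n ∼ 10^8 as quoted below for some of the largest future data sets:

\[
\frac{n^{3/2}}{n\,\ln n} \;=\; \frac{\sqrt{n}}{\ln n} \;\approx\; \frac{10^{4}}{\ln 10^{8}} \;\approx\; 5\times 10^{2},
\]

so a speed-up of a few hundred to a few thousand would already bring the standard n^(3/2) algorithm into the same ballpark as a hypothetical n ln n one.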

And we have found things to try.
  • We could think of parallelizing the transforms. If we are lucky, no extra overhead will come as a consequence, and we may gain a speed-up of up to a factor of n_proc if all goes perfectly. Maybe not as much as the n^(1/2) needed for the values of n expected from some of the largest future data sets, which look towards n ∼ 10^8, but it would help nonetheless. For the gain to be significant, we would need to be able to use on the order of 1,000 processors simultaneously, and thus a distributed-memory programming model (MPI) is a must.
  • We could try to use "accelerators", such as GPU cards. As mentioned in one of the previous posts, this may be something we need to get accustomed to anyway. So why not have a go now?!
  • We may stick to single, standard CPUs, but try to get the most out of them by using the new opportunities the hardware producers provide us with, such as intra-processor vectorization.
  • Or we could try to use a combination of those.
The first three options listed above have already been tried. And not without some success.

Parallelizing spherical harmonic transforms would be rather straightforward if one could afford to store all the harmonic coefficients in the memory of every single processor. Indeed, the first attempts did exactly that. This, however, comes at a rather heavy price: the total memory used by a transform grows proportionally to the number of processors involved! A real bummer then!
A simple and robust way around has been devised, implemented, and shown to work very well: it avoids the issue in part via a specific but flexible data distribution, in part thanks to a specially designed global communication step, and it lives up to the nearly perfect best-case scaling one could have hoped for. And it does so up to a few thousand processors!
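For the curious, below is a minimal mpi4py sketch of one common way to organize such a distributed transform: harmonic coefficients are split across processes by m, the map by θ-rings, and a single collective call re-shuffles the intermediate per-ring spectra between the two steps. The sizes, the even splits, and the stand-in for the Legendre sum are assumptions made for brevity; this is not the actual scheme or code of the paper.

```python
# A toy distributed synthesis: run with e.g. "mpirun -n 4 python this_file.py".
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

# Hypothetical sizes; assume (lmax + 1) and ntheta are divisible by nproc.
lmax, ntheta, nphi = 255, 256, 512
m_per_rank = (lmax + 1) // nproc
rings_per_rank = ntheta // nproc

# Each rank owns a_lm for a slab of m's only, so memory per rank shrinks as
# 1/nproc instead of every rank holding all ~ (lmax + 1)^2 coefficients.
my_ms = np.arange(rank * m_per_rank, (rank + 1) * m_per_rank)
my_alm = np.random.randn(lmax + 1, m_per_rank) + 0j        # dummy local coefficients

# Step 1 (local): theta-transform for the local m's, but for *all* rings.
# The inner sum is only a placeholder for sum_l a_lm * Lambda_lm(theta).
Fm_local = np.zeros((ntheta, m_per_rank), dtype=complex)
thetas = (np.arange(ntheta) + 0.5) * np.pi / ntheta
for j, theta in enumerate(thetas):
    for k, m in enumerate(my_ms):
        Fm_local[j, k] = (my_alm[m:, k] * np.cos(m * theta)).sum()   # placeholder

# Step 2 (global): one all-to-all redistributes the intermediate spectra so that
# each rank ends up with *all* m's, but only for its own slab of theta rings.
send = np.ascontiguousarray(Fm_local.reshape(nproc, rings_per_rank, m_per_rank))
recv = np.empty_like(send)
comm.Alltoall(send, recv)
# recv[p, j, k] = F_m at local ring j, for m = p * m_per_rank + k (owned by rank p).
Fm_rings = recv.transpose(1, 0, 2).reshape(rings_per_rank, lmax + 1)

# Step 3 (local): the phi-transform, an FFT over m for each locally owned ring.
spectra = np.zeros((rings_per_rank, nphi), dtype=complex)
spectra[:, :lmax + 1] = Fm_rings
my_rings = np.fft.ifft(spectra, axis=1) * nphi
print(f"rank {rank}: holds {my_rings.shape[0]} rings of {nphi} pixels")
```

The communication is essentially a distributed transpose; whether a single collective like this keeps scaling to 10,000 processes is exactly the kind of question raised at the end of this post.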

The first shot at the GPUs has also just been taken by members of our very own ANR-MIDAS'09 collaboration and is reported here. We have achieved a speedup of roughly 18 times over a single-core CPU, which is no doubt promising. (But note this is not your laptop GPU ... Not yet in any case.) Interestingly, it has turned out that -- unlike in the CPU codes -- it is the φ-transform which is now the performance bottleneck. This is in part because of a new, custom-made, and thus efficient algorithm devised in that work for the θ-transform (running nearly 60 times faster than its CPU counterpart), while for the φ-transform we have resorted to the off-the-shelf Fast Fourier Transform packages available for either GPUs or CPUs. There is therefore still a way to go before all the actual limits are explored and understood. But the first step has been made ...

In yet another recent paper, single-CPU transforms are accelerated by clever coding relying on intra-processor vectorizing instructions, the so-called SSE2 extensions, which are featured by many of the newest-generation processors. For those, a speed-up by a factor of 2 or more has been obtained. Not a revolution, one may think. True, but nothing to scoff at while waiting for your interactive run to finish with no more on-line news left to read ...

All in all, we have learned something and can do the transforms faster than ever. Some issues still remain, though. Will the simple MPI-parallelization scheme start choking for a larger number of processors? Say, 10,000? Will attempts to capitalize on both the parallel multi-CPU/GPU structure and GPU speed be as successful as the parallel CPU code, or will the communication rather turn out to be the ultimate bottleneck? Will that already be the case for SSE2-based parallel algorithms?

Furious, not yet, but not bored either ...
