FWF P19939-N13: "Compiler Technology for Top-Performance Signal Transforms"
This page presents
stand-alone project P19939-N13 "Compiler Technology for
Top-Performance Signal Transforms", which was carried out by
Stefan Kral between August 2007 and February 2011.
Project funding was provided by the Austrian Science Fund
("Fonds zur Förderung der wissenschaftlichen Forschung", FWF),
Austria's central funding organization for basic research.
This project focused on the development of compilation
techniques for speeding up top-performance signal transform
codes running on Intel64/AMD64 processors. All compilation and
optimization techniques were implemented in NXyn (pronounced
"neck-sin"), a synergistic compiler for Intel64/AMD64 processors.
The NXyn compiler comprises two main components:
- nxas: an assembly-level source-to-source code optimizer which aims at reducing code size and run time.
- nxcc: a C compiler driver, which can be used as a drop-in replacement for the GNU C compiler or the Intel C compiler. nxcc uses another C compiler for generating assembly code, which it passes on to nxas for post-processing.
This structure allows NXyn to process hand-written assembly code
and to cooperate with proprietary, closed-source compilers like the
state-of-the-art Intel C compiler.
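The division of labour between nxcc and nxas can be pictured as a three-step build plan: a backend C compiler emits assembly code, an assembly-level optimizer rewrites it, and the result is assembled into an object file. The following is a minimal sketch of that idea, not NXyn's actual implementation. The function `plan_build`, the file-naming scheme, and the default flags are assumptions made for illustration. The optimizer step is deliberately left as a pair of file names, because the nxas command line is not described on this page. Only the widely supported compiler options `-S`, `-c`, and `-o` are used.

```python
import os

def plan_build(source, backend_cc="gcc", opt_flags=("-O3",)):
    """Plan (but do not run) a compile -> optimize -> assemble pipeline.

    Hypothetical helper: all names and the file layout are illustrative.
    """
    stem, _ = os.path.splitext(source)
    raw_asm = stem + ".s"        # assembly emitted by the backend compiler
    opt_asm = stem + ".opt.s"    # assembly after post-processing
    obj = stem + ".o"
    return {
        # Step 1: let the existing C compiler generate assembly code.
        "compile": [backend_cc, *opt_flags, "-S", source, "-o", raw_asm],
        # Step 2: assembly-to-assembly optimization (input file, output file).
        # The optimizer's real invocation is not specified here, so none is given.
        "optimize": (raw_asm, opt_asm),
        # Step 3: assemble the optimized code into an object file.
        "assemble": [backend_cc, "-c", opt_asm, "-o", obj],
    }

plan = plan_build("fft.c", backend_cc="icc")
for step, detail in plan.items():
    print(step, detail)
```

Because the middle step only reads and writes assembly text, the backend compiler in such a scheme can be closed-source, and hand-written assembly can enter the pipeline at step 2.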
To cooperate with existing compiler technology and state-of-the-art
signal transform libraries in the best way, NXyn focuses on compiler
backend optimizations. Its optimization methods are fully orthogonal
to established optimization techniques present in modern C
compilers. In particular, NXyn features:
- Address code optimization which is tailored to signal
transform access patterns. For large codes, this optimization reduces the address instruction count
by more than 80%.
- Stack offset assignment which reduces code size by up to 10%.
- SIMD vector register reassignment which, in many cases, reduces code size by more than 5%.
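This page does not describe the algorithms behind the last two items. The following sketch only illustrates why such assignments can affect code size on Intel64/AMD64. It relies on two encoding facts: a memory operand whose displacement fits into a signed byte (-128 to 127) needs 1 displacement byte instead of 4, and an SSE instruction that references xmm8 to xmm15 needs an extra REX prefix byte. The greedy "most frequently used first" heuristic, all function names, and the 16-byte slot size are assumptions chosen for illustration. They are not NXyn's actual method, and the model ignores general-purpose registers and instructions with implicit register operands.

```python
from collections import Counter

SLOT_SIZE = 16  # assumed size of one SSE spill slot in bytes

def disp_bytes(offset):
    """Displacement bytes needed to encode [rsp + offset]."""
    if offset == 0:
        return 0
    return 1 if -128 <= offset <= 127 else 4

def frame_cost(slot_order, uses):
    """Total displacement bytes if slots are laid out in the given order."""
    return sum(uses[s] * disp_bytes(i * SLOT_SIZE)
               for i, s in enumerate(slot_order))

def hot_slots_first(uses):
    """Heuristic: the most frequently accessed slots get the smallest offsets."""
    return sorted(uses, key=lambda s: -uses[s])

def count_rex(instrs):
    """Number of instructions that touch xmm8..xmm15 (one REX byte each)."""
    return sum(any(r >= 8 for r in ins) for ins in instrs)

def renumber(instrs):
    """Heuristic: rename registers so the most used ones get the lowest numbers."""
    freq = Counter(r for ins in instrs for r in ins)
    mapping = {old: new for new, (old, _) in enumerate(freq.most_common())}
    return [tuple(mapping[r] for r in ins) for ins in instrs]

# Twelve spill slots; slot t11 is accessed most often, t0 least often.
uses = {f"t{i}": i + 1 for i in range(12)}
naive_cost = frame_cost(list(uses), uses)
tuned_cost = frame_cost(hot_slots_first(uses), uses)
print("displacement bytes:", naive_cost, "->", tuned_cost)

# Each tuple lists the xmm register numbers used by one instruction.
instrs = [(8, 9)] * 5 + [(0, 1), (2, 3), (4, 5), (6, 7)]
print("REX prefixes:", count_rex(instrs), "->", count_rex(renumber(instrs)))
```

In both cases the instruction sequence itself is unchanged; only the assignment of values to stack offsets or register numbers differs, which is why such transformations can be applied on top of the output of an existing compiler.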
NXyn is exclusively available for Intel64/AMD64 processors
running the GNU/Linux operating system in 64-bit mode.
Processors based on the following micro-architectures were used for development and testing:
- Intel Core: Intel Core 2 "Conroe"/"Merom", Intel Core i3 "Arrandale".
- AMD K10: AMD Phenom "Barcelona", AMD Phenom II "Shanghai".
In addition to these processors, NXyn is likely to be useful for other Intel64/AMD64 processors, including
Intel Core i5 "Clarkdale", Intel Core i7 "Nehalem", Intel Atom, and VIA Nano "Isaiah".
Code optimizations implemented in NXyn particularly benefit program
codes comprising long basic blocks.
NXyn works with both scalar code and
SIMD code (Intel SSE, Intel SSE2, Intel SSE3, Intel SSSE3, AMD SSE4a, Intel SSE4.1, and Intel SSE4.2), equally supporting integer and floating-point code.
Compiling the widely used discrete Fourier transform
library FFTW with NXyn (in
combination with the Intel C compiler version 11.1)
consistently reduces both run time and code size.
The following performance diagrams show that NXyn consistently
improves the performance of FFTW routines running on Intel64 and AMD64
processors. All measurements have been performed using a single
The plots compare three configurations:
- icc(EAcalc): Intel C compiler with maximum optimizations
- icc(EAtable): As above, but with the FFTW library code caching address calculations in a table, which works around a limitation present in most compilers for Intel64/AMD64 processors.
- nxicc(EAcalc): Intel C compiler with maximum optimizations plus NXyn optimizations
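The difference between the EAcalc and EAtable variants can be pictured with a small sketch. In the first form, every access computes its element offset as index times stride. In the second form, the offsets are computed once, stored in a table, and looked up on each access. The function names below are hypothetical, and the code is not taken from FFTW, which is written in C. Python is used here only to show the logic and says nothing about the relative speed of the two forms.

```python
def load_calc(data, stride, n):
    """EAcalc-style access: compute each offset as i * stride on the fly."""
    return [data[i * stride] for i in range(n)]

def make_offset_table(stride, n):
    """Precompute the offsets once so that later accesses can reuse them."""
    return [i * stride for i in range(n)]

def load_table(data, table):
    """EAtable-style access: look each offset up instead of multiplying."""
    return [data[off] for off in table]

data = list(range(100, 164))      # 64 sample values
stride, n = 8, 8
table = make_offset_table(stride, n)
print(load_calc(data, stride, n))
print(load_table(data, table))
```

Both functions return the same elements; the variants differ only in where the multiplication happens. The nxicc(EAcalc) configuration keeps the on-the-fly form and leaves it to the compiler to produce efficient address code.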
As the performance plots show, NXyn gives significant performance
improvements for both Intel and AMD processors, for different problem sizes,
and for different instruction sets (SSE, SSE2). More performance plots featuring other problem sizes
and problem types are available here.
To maximize FFTW performance, consider generating a larger set of codelets than the one included in the standard FFTW distribution. Information on how to do this is available on the FFTW web page here.
The following publications are related to details of the NXyn compiler
and the signal-transform-specific compilation techniques that it
implements:
- BlueGene/L applications: Parallelism On a Massive Scale (2008)
  - B. R. de Supinski, M. Schulz, V. V. Bulatov, W. Cabot, B. Chan, A. W. Cook, E. W. Draeger, J. N. Glosli, J. A. Greenough, K. Henderson, A. Kubota, S. Louis, B. J. Miller, M. V. Patel, T. E. Spelce, F. H. Streitz, P. L. Williams, R. K. Yates, A. Yoo, G. Almasi, G. Bhanot, A. Gara, J. A. Gunnels, M. Gupta, J. Moreira, J. Sexton, B. Walkup, C. Archer, F. Gygi, T. C. Germann, K. Kadau, P. S. Lomdahl, C. Rendleman, M. L. Welcome, W. McLendon, B. Hendrickson, F. Franchetti, S. Kral, J. Lorenz, C. W. Ueberhuber, E. Chow, and Ü. Çatalyürek.
  - In the International Journal of High Performance Computing Applications, Volume 22, No. 1, Spring 2008, pages 33-51.
- Smaller and faster Intel SSE Code (2011)
  - Stefan Kral
  - Submitted to Euro-Par 2011 -- International Conference on Parallel and Distributed Computing.
NXyn is open-source software available under the GNU General Public License (GPL), version 2.
The current version of NXyn is available here (release notes).
Information about installing NXyn is available here.
FFTW 3.2.2 pre-compiled with Intel icc and NXyn is available here (double precision) and here (single precision).
Last update: Sun Jun 26 18:44:18 CEST 2011