CUDA: What is it?
CUDA is NVIDIA’s parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit) for applications including image and video processing, computational biology and chemistry, fluid dynamics simulation, CT image reconstruction, seismic analysis, ray tracing, and much more. The current Nvidia (NASDAQ:NVDA) "Fermi" line of GPUs puts incredibly powerful parallel computing within reach of most individual users and businesses through quite affordable graphics cards (and the upcoming "Kepler" GPUs, due in early 2012, promise to be faster and more efficient still).
Note: many of these latest CUDA features require a "Fermi"-based GPU (using the LLVM-based compiler, for instance, does). Such a card is worth investing in if you plan to do any CUDA development, and affordable, power-efficient Fermi-based CUDA-capable graphics cards are readily available: I rather like my Quadro 600 (~$160.00), which draws only 40W for its 96 CUDA cores and has handled all of my development work very capably.
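If you have never seen CUDA C, here is a minimal sketch of my own (not from Nvidia's materials) showing the programming model: a kernel that adds two vectors, with each GPU thread computing a single element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements; the essence of CUDA is launching
// thousands of these lightweight threads in parallel on the GPU.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // 4 blocks of 256 threads

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[10] = %f\n", hc[10]);  // expect 30.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```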
New in Nvidia CUDA Toolkit 4.1
LLVM Compiler / Toolchain Support
The first notable benefit of the LLVM-based compiler is Nvidia's claim that it delivers up to 10% faster performance for many applications, compared to their prior in-house-developed C/C++ compiler.
But what strikes me as the (potentially) most important aspect of this move to LLVM is that we could soon see support for programming languages beyond C/C++, and/or for additional CPU architectures, when using CUDA. Nvidia has apparently used the Clang C and C++ front-ends within the LLVM framework and hooked in support for the CUDA parallel development environment.
Although Nvidia's CUDA C and CUDA C++ compiler modifications are not open-sourced, LLVM provides a foundation for more easily adding language and processor support. Given Apple's use of LLVM on the ARM platform, I have to wonder whether ARM will become a build target in the not-too-distant future. There are also open-source projects that bring other programming languages onto the LLVM toolchain, so the potential exists for other languages, and even domain-specific ones (perhaps Java, Python, etc.), to eventually access CUDA / GPU support directly.
Other Major New Features in CUDA Toolkit 4.1
(from Nvidia website, with some added comments and details)
New & Improved “Drop-In” Acceleration With GPU-Accelerated Libraries
- Over 1000 new image processing functions in the NPP (Nvidia Performance Primitives) library, bringing the total number of NPP functions to 2200+. These GPU-accelerated building blocks for image and signal processing cover arithmetic, logic, conversions, statistics, filters, and more, and can execute on the GPU at up to 40x (yes, 40 times!) the speed of Intel IPP (Integrated Performance Primitives). This is great for media, entertainment, and visual processing applications.
- New Boost-style placeholders in the Thrust CUDA C++ template library now allow inline functors (see the placeholder sketch just after this list). Thrust also includes optimized functions for sort, reduce, scan operations and so on.
- New cuSPARSE tri-diagonal solver, up to 10x faster than MKL on a 6-core CPU; this also includes up to 2x faster sparse matrix-vector multiplication using the ELL hybrid format
- New support in cuRAND for the MRG32k3a and Mersenne Twister (MTGP11213) RNG algorithms (a generator-selection sketch follows this list)
- Bessel functions now supported in the CUDA standard Math library
- The cuFFT (Fast Fourier Transform) library now has a thread-safe API (callable from multiple host threads); also, substantial improvements in speed!
- cuBLAS level-3 performance improvements of up to 6x over Intel MKL (Math Kernel Library)
- Batched-GEMM API for more efficient processing of many small matrices (e.g., 4x4 through 128x128 matrices; up to 4x speedup over MKL); up to 1 TFLOPS sustained performance (yes, a teraflop! Wow)
- Average and rounded-average functions (e.g., hadd / rhadd - signed and unsigned); a short device-code sketch follows this list
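To illustrate the new Thrust placeholders, here is a minimal sketch mirroring the canonical SAXPY-style usage (the vector contents and the constant 2.0f are my own arbitrary choices):

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>  // defines thrust::placeholders

int main() {
    using namespace thrust::placeholders;
    thrust::device_vector<float> x(4, 1.0f);
    thrust::device_vector<float> y(4, 2.0f);
    // y = 2*x + y, written inline with Boost-style placeholders;
    // no hand-written functor struct is needed.
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      2.0f * _1 + _2);
    return 0;
}
```

Previously, the same operation required declaring a small struct with an operator(); the placeholder expression builds that functor for you at compile time.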
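Selecting one of the new cuRAND generators is a one-line change in the host API. A minimal sketch (the buffer size and seed are arbitrary):

```cuda
#include <cuda_runtime.h>
#include <curand.h>

int main() {
    const size_t n = 1000;
    float *devNums;
    cudaMalloc(&devNums, n * sizeof(float));

    curandGenerator_t gen;
    // New in 4.1: the L'Ecuyer MRG32k3a generator; the Mersenne Twister
    // variant is selected with CURAND_RNG_PSEUDO_MTGP32 instead.
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, devNums, n);  // uniform floats in (0, 1]

    curandDestroyGenerator(gen);
    cudaFree(devNums);
    return 0;
}
```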
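And the averaging functions are available as device intrinsics; a minimal kernel sketch of my own (__hadd() truncates, __rhadd() rounds, and both avoid the intermediate overflow that a plain (a + b) / 2 can suffer):

```cuda
__global__ void averages(const int *a, const int *b,
                         int *avg, int *ravg, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        avg[i]  = __hadd(a[i], b[i]);   // truncated average, overflow-safe
        ravg[i] = __rhadd(a[i], b[i]);  // rounded average
        // Unsigned variants: __uhadd() / __urhadd()
    }
}
```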
Enhanced & Redesigned Developer Tools (On Windows, Mac, & Linux)
- Redesigned Visual Profiler with automated performance analysis and expert guidance (a guided workflow with drill-down expert advice); during an online presentation, this was described as "almost like having an Nvidia engineer in a box", which sure sounds handy! The idea is that these built-in automated analyses let you benefit from those engineers' experience and guide you toward best-practice results.
- assert() in device code - helpful for debugging! (a short example follows this list)
- CUDA_GDB support for multi-context debugging and assert() in device code
- CUDA-MEMCHECK now detects out of bounds access for memory allocated in device code
- Parallel Nsight 2.1 CUDA warp watch visualizes variables and expressions across an entire CUDA warp
- Parallel Nsight 2.1 CUDA profiler now analyzes kernel memory activities, execution stalls and instruction throughput
- Learn more about debugging and performance analysis tools for GPU developers on Nvidia's CUDA Tools and Ecosystem Summary Page
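Here is a minimal sketch of device-side assert() (my own illustration; it requires a Fermi-class GPU, i.e. compute capability 2.x). A failed assertion prints the file, line, expression, and failing thread/block coordinates, and a subsequent synchronization returns cudaErrorAssert:

```cuda
#include <assert.h>
#include <cuda_runtime.h>

__global__ void checkPositive(const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // assert() works in device code on sm_20 and later hardware.
    if (i < n) assert(data[i] >= 0.0f);
}

int main() {
    const int n = 256;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));  // all zeros, so the asserts pass
    checkPositive<<<1, n>>>(d, n);
    // If any assert had fired, this sync would return cudaErrorAssert.
    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```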
Advanced Programming Features
- Access to 3D surfaces and cube maps from device code
- Enhanced no-copy pinning of system memory: the cudaHostRegister() alignment and size restrictions have been removed (see the sketch after this list)
- Peer-to-peer communication between processes
- Support for resetting a GPU without rebooting the system in nvidia-smi
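A minimal sketch of that no-copy pinning (the buffer size is arbitrary): an ordinary malloc'd buffer is pinned in place with cudaHostRegister(), and in 4.1 the pointer no longer has to satisfy the old alignment and size restrictions:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    // Any ordinary pageable allocation; with 4.1 it no longer needs to be
    // page-aligned or a page-size multiple before being pinned in place.
    void *buf = malloc(bytes);
    cudaHostRegister(buf, bytes, cudaHostRegisterDefault);

    // buf now behaves like page-locked (cudaMallocHost) memory: eligible
    // for faster, asynchronous cudaMemcpyAsync() transfers with no
    // intermediate staging copy.

    cudaHostUnregister(buf);
    free(buf);
    return 0;
}
```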
New & Improved SDK Code Samples
- simpleP2P sample now supports peer-to-peer communication with any Fermi GPU
- New grabcutNPP sample demonstrates interactive foreground extraction using iterated graph cuts (this is really neat!)
- New samples showing how to implement the Horn-Schunck method for optical flow, perform volume filtering, and read cube-map textures
Parallel Nsight
Parallel Nsight is a powerful IDE-integrated development tool that allows you to perform the following kinds of tasks from within Microsoft Visual Studio:
- Debug CUDA Kernels directly on the GPU hardware
- Examine (potentially thousands of) threads that are executing in parallel
- Use on-target conditional breakpoints to locate errors
- Use the CUDA memory-checker
- Perform System-Trace activities to review CUDA activities that span your CPU(s) and GPU(s)
- Perform deep kernel analysis to find performance bottlenecks, so you can realize the speedup that CUDA and massively parallel code make possible.
- Profiling capabilities including advanced experiments to measure memory utilization, instruction throughput, and stall conditions
Some of the new capabilities include:
- a "warp watch" ability to watch variables and expressions across an entire CUDA warp (a particular level of granularity that is very useful to watch)
- analyzing kernel memory (alloc/dealloc events, execution stalls, etc)