Tuesday, June 07, 2011

Nvidia CUDA Toolkit 4.0 and Parallel Nsight 2.0

It has been not quite six months since I last blogged about Nvidia's CUDA technology, when CUDA Toolkit 3.2 was released.  Now it is time for a substantial upgrade to both the Nvidia CUDA Toolkit (version 4.0) and Parallel Nsight (now at version 2.0).  Nvidia is positioning this upgrade as a way to garner "broader developer adoption", which hopefully it will.

CUDA is NVIDIA’s parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit) for applications including image and video processing, computational biology and chemistry, fluid dynamics simulation, CT image reconstruction, seismic analysis, ray tracing, and much more.  With Nvidia's (NASDAQ:NVDA) recent "Fermi" line of GPUs, powerful parallel computing is within reach of most individual users and businesses through a range of rather affordable Nvidia graphics cards.

Note: many of these latest features require a "Fermi"-based GPU; given how affordable these newest cards are, and how much more performance-per-Watt they deliver, if you plan to do any CUDA development, it is going to be very much worth your while to get a Fermi-based CUDA-Capable Graphics Card.

New in Nvidia CUDA Toolkit 4.0

I have been waiting for one particular feature of Toolkit 4.0, one that arguably has little to do with the Nvidia CUDA technology itself and everything to do with the ease of developing software with CUDA: this release finally supports Visual Studio 2010 in addition to Visual Studio 2008 SP1.  I still need to try this out and make sure it works, since I found it frustrating to have to manually install the older VC++ 2008 compiler just to build my CUDA 3.2 apps, even when working in Visual Studio 2010.  That was simply annoying!

The Major Features in CUDA Toolkit 4.0

One of the most notable changes is the introduction of NVIDIA GPUDirect 2.0, which enables peer-to-peer memory access and thus faster multi-GPU programming.  Put simply, if you have more than one Nvidia GPU and want to copy data between them, that data no longer has to pass through your computer's CPU and system memory; it moves directly GPU-to-GPU, conceptually like cudaMemcpy(GPU2, GPU1).  For applications that communicate within a node, these peer-to-peer memory access, transfer, and synchronization abilities should lead to less code and more productivity.
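
Here is a minimal sketch of what such a GPU-to-GPU copy might look like with the runtime API, assuming a machine with at least two peer-capable GPUs.  The helper names peers_can_talk and round_trip_via_peer are my own inventions for illustration; only the cuda* calls come from the toolkit.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: true when GPUs `a` and `b` can address each other's memory.
bool peers_can_talk(int a, int b) {
    int a_to_b = 0, b_to_a = 0;
    if (cudaDeviceCanAccessPeer(&a_to_b, a, b) != cudaSuccess) return false;
    if (cudaDeviceCanAccessPeer(&b_to_a, b, a) != cudaSuccess) return false;
    return a_to_b && b_to_a;
}

// Hypothetical helper: uploads `in` to GPU `src`, copies it GPU-to-GPU to `dst`,
// then downloads it from `dst`. Returns an empty vector if any CUDA call fails.
std::vector<float> round_trip_via_peer(const std::vector<float>& in, int src, int dst) {
    const size_t bytes = in.size() * sizeof(float);
    std::vector<float> out(in.size());
    float *d_src = nullptr, *d_dst = nullptr;

    bool ok = cudaSetDevice(src) == cudaSuccess
           && cudaMalloc(&d_src, bytes) == cudaSuccess
           && cudaMemcpy(d_src, in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    ok = ok && cudaSetDevice(dst) == cudaSuccess
            && cudaMalloc(&d_dst, bytes) == cudaSuccess;

    if (ok && peers_can_talk(src, dst)) {
        // Let the current device (dst) access memory on src directly.
        // An "already enabled" error is harmless here, so it is cleared and ignored.
        cudaDeviceEnablePeerAccess(src, 0);
        cudaGetLastError();
    }

    // The peer copy itself; without peer access the runtime can still perform it,
    // just not as a direct GPU-to-GPU transfer.
    ok = ok && cudaMemcpyPeer(d_dst, dst, d_src, src, bytes) == cudaSuccess
            && cudaMemcpy(out.data(), d_dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_src);
    cudaFree(d_dst);
    if (!ok) out.clear();
    return out;
}
```

Calling round_trip_via_peer(data, 0, 1) should hand back the same values it was given.  Note that everything runs from a single host thread, switching GPUs with cudaSetDevice(), which is another of the 4.0 conveniences.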

Next up is Unified Virtual Addressing (UVA), which provides a single flat memory address space for CPU and GPU resources, with the objective of enabling quicker and easier parallel programming.  Having just one address space for all CPU and GPU memory lets libraries simplify their interfaces.  Take cudaMemcpy(): the single copy kind cudaMemcpyDefault can take the place of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice, and the location of the data involved in the copy becomes an implementation detail that the runtime works out from the pointers.  If you were wondering, you can still determine the actual physical memory location from a pointer value (the runtime exposes this through cudaPointerGetAttributes()).
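
As a rough sketch of what this buys you, the following round-trips a buffer through the GPU using nothing but cudaMemcpyDefault, and then asks the runtime where a pointer lives.  It assumes a UVA-capable setup (64-bit host, Fermi or newer GPU); the function names uva_round_trip and is_device_pointer are hypothetical, made up for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copies host -> device -> host using only cudaMemcpyDefault,
// so the runtime infers the direction of each copy from the pointer values.
bool uva_round_trip(const std::vector<int>& in, std::vector<int>& out) {
    const size_t bytes = in.size() * sizeof(int);
    int* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return false;

    out.resize(in.size());
    bool ok = cudaMemcpy(d_buf, in.data(), bytes, cudaMemcpyDefault) == cudaSuccess
           && cudaMemcpy(out.data(), d_buf, bytes, cudaMemcpyDefault) == cudaSuccess;

    cudaFree(d_buf);
    return ok;
}

// Hypothetical helper: asks the runtime whether `ptr` refers to device memory.
bool is_device_pointer(const void* ptr) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) return false;
    // Assumption: a recent toolkit, where this field is named `type`.
    // In the 4.0-era headers the same field was called `memoryType`.
    return attr.type == cudaMemoryTypeDevice;
}
```

The point is that neither copy says which side is the host and which is the device; under UVA the pointer value alone carries that information.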

Parallel programming in C++ also gets easier with the Thrust library of templatized C++ algorithms and data structures.  Thrust is an open source collection of parallel algorithms and data structures that resembles the C++ Standard Template Library (STL).  You will find algorithms like "thrust::sort" and "thrust::reduce" at your disposal, along with data structures like "thrust::device_vector" and "thrust::host_vector".  According to Nvidia, Thrust automatically chooses the fastest code path at compile time, can divide work between GPUs and multi-core CPUs, and enables 5x to 100x faster parallel sorting operations.
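
To give a feel for how STL-like this is, here is a small sketch that sorts and sums a vector on the GPU.  The wrapper function sort_and_sum is a hypothetical name of my own; the thrust:: calls are the library's.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <vector>

// Hypothetical helper: sorts `values` on the GPU (in place, from the caller's
// point of view) and returns the sum of all elements.
int sort_and_sum(std::vector<int>& values) {
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());                    // parallel sort on the device
    int total = thrust::reduce(d.begin(), d.end(), 0);   // parallel sum on the device

    // Copy the sorted data back into the caller's host vector.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

Notice there is not a single explicit cudaMalloc() or cudaMemcpy() in sight; the containers and iterators take care of the transfers.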

I find all these improvements and new features in CUDA 4.0 to be very nice!  I am just starting to play with all these things and working to alter some of my existing code to take advantage of the latest CUDA improvements.  And, I still need to test that Visual Studio 2010 support/integration out in detail (essentially, I need to remove my prior "workarounds" for VS2010 and CUDA 3.2).

List of Features in CUDA Toolkit 4.0
(from Nvidia website, with some added comments)

Easier Application Porting
  • Share GPUs across multiple threads 
  • Use all GPUs in the system concurrently from a single host thread 
  • No-copy pinning of system memory, a faster alternative to cudaMallocHost() — this should allow you to reduce system memory usage and CPU memcpy() overhead as well as make it easier to add CUDA acceleration to existing applications.

    Just register malloc’d system memory for async operations and then call cudaMemcpy() as usual.  In longer form: we can drop an extra allocation and the extra copy steps that earlier versions of CUDA required.

    That is, cudaHostRegister(a) takes the place of cudaMallocHost(b) followed by memcpy(b, a) before performing a cudaMemcpy() to the GPU, launching kernels, and performing a cudaMemcpy() from the GPU.  Afterwards, cudaHostUnregister(a) takes the place of the memcpy(a, b) and cudaFreeHost(b) previously required.  Whew!  Make sense?
  • C++ new and delete operators (for dynamic memory allocation / management) and support for C++ virtual functions 
  • Support for inline PTX assembly — enables assembly-level optimization, for when you have to get into very low-level details
  • Thrust library of templated performance primitives such as sort, reduce, etc. (discussed above in more detail) 
  • NVIDIA Performance Primitives (NPP) library for image/video processing — these give you access to 10x to 36x faster image processing via imaging and video related primitives 
  • Layered Textures for working with same size/format textures at larger sizes and higher performance; these should be ideal for processing multiple textures with same size/format, including very large ones on GPU devices like the Tesla T20 (up to 16k x 16k x 2k density)
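
To make the no-copy pinning item in the list above a little more concrete, here is a minimal sketch of the register / copy / launch / copy / unregister sequence.  The kernel add_one and the wrapper add_one_in_place are hypothetical names invented for this example, and I am assuming a recent toolkit where cudaHostRegisterDefault is available.  I use cudaMemcpyAsync() on the default stream here, since pinned memory is what makes asynchronous copies possible.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical kernel: adds 1 to each of the first n elements.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: increments an existing malloc'd buffer on the GPU.
// The buffer is pinned in place with cudaHostRegister() instead of being
// staged through a separate cudaMallocHost() allocation and memcpy().
bool add_one_in_place(float* host, int n) {
    const size_t bytes = n * sizeof(float);
    float* dev = nullptr;

    if (cudaHostRegister(host, bytes, cudaHostRegisterDefault) != cudaSuccess)
        return false;

    bool ok = cudaMalloc(&dev, bytes) == cudaSuccess;
    ok = ok && cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0) == cudaSuccess;
    if (ok) add_one<<<(n + 255) / 256, 256>>>(dev, n);
    ok = ok && cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, 0) == cudaSuccess;
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;   // wait for the stream to drain

    cudaFree(dev);
    cudaHostUnregister(host);   // the buffer goes back to being ordinary pageable memory
    return ok;
}
```

The caller keeps using its own malloc'd buffer before and after the call, which is what should make this approach handy when bolting CUDA onto an existing application.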

Faster Multi-GPU Programming (see my discussion above)
  • Unified Virtual Addressing 
  • GPUDirect v2.0 support for Peer-to-Peer Communication 

New and Improved Developer Tools
  • Automated Performance Analysis in Visual Profiler 
  • C++ debugging in CUDA-GDB for Linux and MacOS 
  • GPU binary disassembler for Fermi architecture (cuobjdump) 
  • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features. 

New in Nvidia Parallel Nsight 2.0

Parallel Nsight 2.0 specifically shows Visual Studio 2010 support as one of its main features, and this is what I am certainly ready for.  Nvidia's website summarizes this release with the following description:
Parallel Nsight 2.0 makes parallel programming easier than ever, giving developers access to more tools and workflows they expect from developing on the CPU. The new release provides a number of new and enhanced features, including full support for Microsoft Visual Studio 2010, support for CUDA Toolkit version 4.0, attach to process support, PTX/SASS assembly debugging, other advanced debugging and analysis capabilities, graphics performance and stability enhancements.

Summary: CUDA 4.0 Has a Lot to Offer

This latest release of the CUDA Toolkit from Nvidia should make life easier for any of us who are into parallel programming with modern GPUs.  It is perhaps a bit overwhelming, and it requires a different mindset than programming desktop applications or designing a website, but if you have an application that can benefit from the power of simultaneous operations, this is a technology worth diving into: it is nothing short of a transformational technology.  Many people will never know *how* this technology is benefiting them, but I guarantee the end-results will be right in front of them in all sorts of applications ranging from games to financial-analysis software to medical-imaging software and more.

As discussed in a previous blog I wrote about VMware ESXi 5.0 and virtual machine technology, I am still waiting for CUDA support to emerge in VMware ESXi 5.0 and other recent or next-generation virtualization products so I can take even more advantage of this technology.  At my last review of the feature list (with ESXi 5.0 coming soon), I still could find no mention of CUDA support in ESXi.  Hopefully VMware recognizes the importance of this technology sooner rather than later.  I would think some of the changes in CUDA 4.0 make direct GPU programming inside a virtual machine simpler to support and implement (I would not care if support were ONLY for CUDA v4.0+ in VMs, if that were the requirement... I just want support for it).