NVIDIA CUDA Toolkit Changelog

New in NVIDIA CUDA Toolkit 12.4.0 (Mar 6, 2024)

  • CUDA Components:
  • Starting with CUDA 11, the various components in the toolkit are versioned independently.
  • CUDA Driver:
  • Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 3. For more information on GPU products that are CUDA capable, visit https://developer.nvidia.com/cuda-gpus.
  • Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.
  • More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades.
  • Note: Starting with CUDA 11.0, the toolkit components are individually versioned, and the toolkit itself is versioned as shown in the table below.
  • General CUDA:
  • Access-counter-based memory migration for Grace Hopper systems is now enabled by default. As this is the first release with the capability enabled, developers may find that applications that had been optimized for the earlier memory migration algorithms see a performance regression. Should this occur, we introduce a supported but temporary flag to opt out of this behavior. You can control the enablement of this feature by unloading and reloading the NVIDIA UVM driver, as follows:
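  • A sketch of the opt-out (the modprobe reload mechanism is standard; the exact UVM module parameter name below is an assumption for illustration, so check the driver documentation for the authoritative flag):
        # Reload the UVM kernel module with access-counter migration disabled.
        # NOTE: the parameter name below is illustrative, not authoritative.
        $ sudo modprobe -r nvidia_uvm
        $ sudo modprobe nvidia_uvm uvm_perf_access_counter_mmap_migration_enable=0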
  • This release introduces support for the following new features in CUDA graphs:
  • Graph conditional nodes (enhanced from 12.3)
  • Device-side node parameter update for device graphs
  • Updatable graph node priorities without recompilation
  • Enhanced monitoring capabilities through NVML and nvidia-smi:
  • NVJPG and NVOFA utilization percentage
  • PCIe class and subclass reporting
  • dmon reports are now available in CSV format
  • More descriptive error codes returned from NVML
  • dmon now reports gpm-metrics for MIG (that is, nvidia-smi dmon --gpm-metrics runs in MIG mode)
  • NVML running against older drivers will report FUNCTION_NOT_FOUND in some cases, failing gracefully if NVML is newer than the driver
  • NVML APIs to query protected memory information for Hopper Confidential Computing
  • This release introduces nvFatbin, a new library to create CUDA fat binary files at runtime.
  • Confidential Computing General Access:
  • Starting in 12.4 with R550.54.14, Hopper Confidential Computing will move to General Access for discrete GPU usage.
  • All EA RIM certificates prior to this release will be revoked with status PrivilegeWithdrawn 30 days after posting.
  • CUDA Compilers:
  • For changes to PTX, refer to https://docs.nvidia.com/cuda/parallel-thread-execution/#ptx-isa-version-8-4.
  • Added the __maxnreg__ kernel function qualifier to allow users to directly specify the maximum number of registers to be allocated to a single thread in a thread block in CUDA C++.
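  • For example (a minimal sketch; the kernel body and the register cap of 32 are illustrative):
        // Cap register allocation at 32 registers per thread for this kernel.
        __global__ void __maxnreg__(32) scale(float *x, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= 2.0f;
        }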
  • Added a new flag -fdevice-syntax-only that ends device compilation after front-end syntax checking. This option can provide rapid feedback (warnings and errors) of source code changes as it will not invoke the optimizer. Note: this option will not generate valid object code.
  • Added a new flag -minimal for NVRTC compilation. The -minimal flag omits certain language features to reduce compile time for small programs. In particular, the following are omitted (see the sketch after this list):
  • Texture and surface functions and associated types (for example, cudaTextureObject_t).
  • CUDA Runtime Functions that are provided by the cudadevrt device code library, typically named with prefix “cuda”, for example, cudaMalloc.
  • Kernel launch from device code.
  • Types and macros associated with CUDA Runtime and Driver APIs, provided by cuda/tools/cudart/driver_types.h, typically named with the prefix “cuda”, for example, cudaError_t.
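  • A minimal sketch of passing -minimal to NVRTC (error handling is omitted and the kernel source is illustrative):
        #include <nvrtc.h>
        #include <stdio.h>
        int main(void) {
            const char *src = "__global__ void k(float *x) { x[0] = 1.0f; }";
            nvrtcProgram prog;
            nvrtcCreateProgram(&prog, src, "k.cu", 0, NULL, NULL);
            const char *opts[] = { "-minimal" };  // skip textures, device runtime, etc.
            nvrtcResult res = nvrtcCompileProgram(prog, 1, opts);
            printf("compile result: %d\n", (int)res);
            nvrtcDestroyProgram(&prog);
            return 0;
        }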
  • Starting in CUDA 12.4, PTXAS enables position independent code (-pic) by default when the compilation mode is whole-program compilation. Users can opt out by specifying the -pic=false option to PTXAS. Debug compilation and separate compilation continue to have position independent code disabled by default. In the future, position independent code will allow the CUDA Driver to share a single copy of the text section across contexts and reduce resident memory usage.
  • CUDA Developer Tools:
  • For changes to nvprof and Visual Profiler, see the changelog.
  • For new features, improvements, and bug fixes in Nsight Systems, see the changelog.
  • For new features, improvements, and bug fixes in Nsight Visual Studio Edition, see the changelog.
  • For new features, improvements, and bug fixes in CUPTI, see the changelog.
  • For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
  • For new features, improvements, and bug fixes in Compute Sanitizer, see the changelog.
  • For new features, improvements, and bug fixes in CUDA-GDB, see the changelog.
  • Resolved Issues:
  • General CUDA:
  • Fixed a compiler crash that could occur when inputs to MMA instructions were used before being initialized.
  • CUDA Compilers
  • In certain cases, dp4a or dp2a instructions would be generated in PTX and cause incorrect behavior due to integer overflow. This has been fixed in CUDA 12.4.
  • Deprecated or Dropped Features:
  • Features deprecated in the current release of the CUDA software still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
  • Deprecated Architectures:
  • CUDA Toolkit 12.4 deprecates NVIDIA CUDA support for the PowerPC architecture. Support for this architecture is considered deprecated and will be removed in an upcoming release.
  • Deprecated Operating Systems:
  • CUDA Toolkit 12.4 deprecates support for Red Hat Enterprise Linux 7 and CentOS 7. Support for these operating systems will be removed in an upcoming release.
  • Deprecated Toolchains:
  • CUDA Toolkit 12.4 deprecates support for the following host compilers:
  • Microsoft Visual C/C++ (MSVC) 2017
  • All GCC versions prior to GCC 7.3
  • CUDA Libraries:
  • This section covers CUDA Libraries release notes for 12.x releases.
  • CUDA Math Libraries toolchain uses C++11 features, and a C++11-compatible standard library (libstdc++ >= 20150422) is required on the host.
  • Support for the following compute capabilities is removed for all libraries:
  • sm_35 (Kepler)
  • sm_37 (Kepler)
  • cuBLAS: Release 12.4:
  • New Features:
  • cuBLAS adds experimental APIs to support grouped batched GEMM for single precision and double precision. Single precision also supports the math mode CUBLAS_TF32_TENSOR_OP_MATH. Grouped batch mode allows you to concurrently solve GEMMs of different dimensions (m, n, k), leading dimensions (lda, ldb, ldc), transpositions (transa, transb), and scaling factors (alpha, beta). Please see cublas<t>gemmGroupedBatched (https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-gemmgroupedbatched) for more details.
  • Known Issues:
  • When the current context has been created using cuGreenCtxCreate(), cuBLAS does not properly detect the number of SMs available. The user may provide the corrected SM count to cuBLAS using an API such as cublasSetSmCountTarget().
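  • A sketch of the workaround (assumes a cuBLAS handle created via cublasCreate(); the SM count value is illustrative and should come from your green context configuration):
        int smCount = 16;                        // illustrative SM count
        cublasSetSmCountTarget(handle, smCount); // override cuBLAS's detected SM count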
  • BLAS level 2 and 3 functions might not treat alpha in a BLAS compliant manner when alpha is zero and the pointer mode is set to CUBLAS_POINTER_MODE_DEVICE. This is the same known issue documented in cuBLAS 12.3 Update 1.
  • cublasLtMatmul with K equal to 1 and epilogue CUBLASLT_EPILOGUE_D{RELU,GELU}_BGRAD could access the workspace out of bounds. The issue has existed since cuBLAS 11.3 Update 1.
  • cuFFT: Release 12.4:
  • New Features
  • Added Just-In-Time Link-Time Optimized (JIT LTO) kernels for improved performance in FFTs with 64-bit indexing.
  • Added per-plan properties to the cuFFT API. These new routines can be leveraged to give users more control over the behavior of cuFFT. Currently they can be used to enable JIT LTO kernels for 64-bit FFTs.
  • Improved accuracy for certain single-precision (fp32) FFT cases, especially involving FFTs for larger sizes.
  • Known Issues:
  • A routine from the cuFFT LTO EA library was added by mistake to the cuFFT Advanced API header (cufftXt.h). This routine is not supported by cuFFT, and will be removed from the header in a future release.
  • Resolved Issues:
  • Fixed an issue that could cause overwriting of user data when performing out-of-place real-to-complex (R2C) transforms with user-specified output strides (i.e. using the ostride component of the Advanced Data Layout API).
  • Fixed inconsistent behavior between libcufftw and FFTW when both inembed and onembed are nullptr / NULL. From now on, as in FFTW, passing nullptr / NULL as inembed/onembed parameter is equivalent to passing n, that is, the logical size for that dimension.
  • cuSOLVER: Release 12.4:
  • New Features:
  • cusolverDnXlarft and cusolverDnXlarft_bufferSize APIs were introduced. cusolverDnXlarft forms the triangular factor of a real block reflector, while cusolverDnXlarft_bufferSize returns its required workspace sizes in bytes.
  • Known Issues:
  • cusolverDnXtrtri_bufferSize returns an incorrect required device workspace size. As a workaround, the returned size can be multiplied by the size of the data type (for example, 8 bytes if matrix A is of type double) to obtain the correct workspace size.
  • cuSPARSE: Release 12.4:
  • New Features:
  • Added the preprocessing step for sparse matrix-vector multiplication cusparseSpMV_preprocess().
  • Added support for mixed real and complex types for cusparseSpMM().
  • Added a new API cusparseSpSM_updateMatrix() to update the sparse matrix between the analysis and solving phase of cusparseSpSM().
  • Known Issues:
  • cusparseSpMV() introduces invalid memory accesses when the output vector is not aligned to 16 bytes.
  • Resolved Issues:
  • cusparseSpVV() provided incorrect results when the sparse vector has many non-zeros.
  • CUDA Math: Release 12.4:
  • Resolved Issues:
  • Host-specific code in cuda_fp16/bf16 headers is now free from type-punning and shall work correctly in the presence of optimizations based on strict-aliasing rules.
  • NPP: Release 12.4:
  • New Features:
  • Enhanced large file support with size_t.

New in NVIDIA CUDA Toolkit 12.2.0 (Jun 28, 2023)

  • New Features:
  • This release introduces Heterogeneous Memory Management (HMM), allowing seamless sharing of data between host memory and accelerator devices. HMM is supported on Linux only and requires a recent kernel (6.1.24+ or 6.2.11+); see the sketch after the limitations list below.
  • HMM requires the use of NVIDIA’s GPU Open Kernel Modules driver.
  • As this is the first release of HMM, some limitations exist:
  • GPU atomic operations on file-backed memory are not yet supported.
  • Arm CPUs are not yet supported.
  • HugeTLBfs pages are not yet supported on HMM (this is an uncommon scenario).
  • The fork() system call is not fully supported yet when attempting to share GPU-accessible memory between parent and child processes.
  • HMM is not yet fully optimized, and may perform slower than programs using cudaMalloc(), cudaMallocManaged(), or other existing CUDA memory management APIs. The performance of programs not using HMM will not be affected.
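  • A minimal sketch of what HMM enables (assuming an HMM-capable system; the kernel is illustrative): system-allocated memory is passed directly to a kernel, with no cudaMalloc or cudaMemcpy staging:
        #include <cstdlib>
        __global__ void inc(int *data, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] += 1;
        }
        int main() {
            int n = 1 << 20;
            int *data = (int *)malloc(n * sizeof(int)); // plain malloc, no CUDA allocator
            for (int i = 0; i < n; i++) data[i] = i;
            inc<<<(n + 255) / 256, 256>>>(data, n);     // GPU touches host-allocated memory
            cudaDeviceSynchronize();
            free(data);
            return 0;
        }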
  • The Lazy Loading feature (introduced in CUDA 11.7) is now enabled by default on Linux with the 535 driver. To disable this feature on Linux, set the environment variable CUDA_MODULE_LOADING=EAGER before launch. Default enablement for Windows will happen in a future CUDA driver release. To enable this feature on Windows, set the environment variable CUDA_MODULE_LOADING=LAZY before launch.
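  • For example, to force eager loading on Linux for a single run (the application name is illustrative):
        $ CUDA_MODULE_LOADING=EAGER ./my_app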
  • Host NUMA memory allocation: allocate CPU memory targeting a specific NUMA node using either the CUDA virtual memory management APIs or the CUDA stream-ordered memory allocator. Applications must ensure that device accesses to pointers backed by host allocations from these APIs are performed only after they have explicitly requested accessibility for the memory on the accessing device. It is undefined behavior to access these host allocations from a device without accessibility for the address range, regardless of whether the device supports pageable memory access.
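  • A driver-API sketch of the VMM path (a minimal sketch assuming CUDA 12.2+; the NUMA node ID, device ordinal, and size are illustrative, and error checks are omitted):
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
        prop.location.id = 0;                           // target NUMA node (illustrative)
        size_t gran = 0;
        cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
        CUmemGenericAllocationHandle h;
        cuMemCreate(&h, gran, &prop, 0);
        CUdeviceptr ptr;
        cuMemAddressReserve(&ptr, gran, 0, 0, 0);
        cuMemMap(ptr, gran, 0, h, 0);
        CUmemAccessDesc desc = {};                      // explicitly request device access
        desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        desc.location.id = 0;                           // accessing device ordinal
        desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(ptr, gran, &desc, 1);            // only now may the device touch ptr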
  • Added per-client priority mapping at runtime for CUDA Multi-Process Service (MPS). This allows multiple processes running under MPS to arbitrate priority at a coarse-grained level between multiple processes without changing the application code.
  • We introduce a new environment variable CUDA_MPS_CLIENT_PRIORITY, which accepts two values: NORMAL priority, 0, and BELOW_NORMAL priority, 1.
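  • For example, to start an MPS client at below-normal priority (the client binary is illustrative):
        $ export CUDA_MPS_CLIENT_PRIORITY=1   # 0 = NORMAL, 1 = BELOW_NORMAL
        $ ./my_mps_client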
  • CUDA Compilers:
  • LibNVVM samples have been moved out of the toolkit and made publicly available on GitHub as part of the NVIDIA/cuda-samples project. Similarly, the nvvmir-samples have been moved from the nvidia-compiler-sdk project on GitHub to the new location of the libNVVM samples in the NVIDIA/cuda-samples project.
  • Resolved Issues:
  • Resolved potential soft lock-ups around rm_run_nano_timer_callback(). A Linux kernel device driver API used for timer management in the Linux kernel interface of the NVIDIA GPU driver was susceptible to a race condition under multi-GPU configurations.
  • Fixed potential GSP-RM hang in kernel_resolve_address().
  • Removed a potential GPUDirect RDMA driver crash in nvidia_p2p_put_pages(). The legacy non-persistent memory APIs allowed a third-party driver to invoke nvidia_p2p_put_pages with a stale page_table pointer that had already been freed by the RM callback as part of the process shutdown sequence. This behavior broke when persistent memory support was added to the legacy nvidia_p2p APIs. We resolved the issue by providing new APIs, nvidia_p2p_get/put_pages_persistent, for persistent memory; the original behavior of the legacy APIs for non-persistent memory is thus restored. This is essentially a change in the API, so although nvidia-peermem has been updated accordingly, external consumers of persistent memory mappings will need to be changed to use the new dedicated APIs.
  • Resolved an issue in watchcat syscall.
  • Fixed potential incorrect results in optimized code under high register pressure. NVIDIA has found that under certain rare conditions, a register spilling optimization in PTXAS could result in incorrect compilation results. This issue is fixed for offline compilation (non-JIT) in the CUDA 12.2 release and will be fixed for JIT compilation in the next enterprise driver update.
  • NVIDIA believes this issue to be extremely rare, and applications relying on JIT that are working successfully should not be affected.

New in NVIDIA CUDA Toolkit 12.1.0 (Mar 1, 2023)

  • New meta-packages for Linux installation:
  • cuda-toolkit
  • Installs all CUDA Toolkit packages required to develop CUDA applications.
  • Handles upgrading to the latest version of CUDA when it’s released.
  • Does not include the driver.
  • cuda-toolkit-12
  • Installs all CUDA Toolkit packages required to develop CUDA applications.
  • Handles upgrading to the next 12.x version of CUDA when it’s released.
  • Does not include the driver.
  • New CUDA API to enable mini core dump programmatically is now available. Refer to https://docs.nvidia.com/cuda/cuda-gdb/index.html#gpu-core-dump-support and https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__COREDUMP.html#group__CUDA__COREDUMP for more information.
  • CUDA Compilers:
  • NVCC has added support for the following host compilers: GCC 12.2, NVC++ 22.11, Clang 15.0, and VS2022 17.4.
  • Breakpoint and single-stepping behavior for a multi-line statement in device code has been improved when code is compiled with nvcc using a gcc/clang host compiler, or when compiled with NVRTC on non-Windows platforms. The debugger will now correctly break and single-step on each source line of a multi-line source code statement.
  • PTX exposes a new special register in the public ISA that can be used to query the total size of shared memory, which includes user shared memory and software-reserved shared memory.
  • NVCC and NVRTC now show preprocessed source line and column info in a diagnostic to help users to understand the message and identify the issue causing the diagnostic. The source line and column info can be turned off with --brief-diagnostics=true.
  • CUDA Developer Tools:
  • For changes to nvprof and Visual Profiler, see the changelog.
  • For new features, improvements, and bug fixes in CUPTI, see the changelog.
  • For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
  • For new features, improvements, and bug fixes in Compute Sanitizer, see the changelog.
  • For new features, improvements, and bug fixes in CUDA-GDB, see the changelog.
  • Deprecated or Dropped Features:
  • Features deprecated in the current release of the CUDA software still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
  • General CUDA:
  • CentOS Linux 8 reached End-of-Life on December 31, 2021. Support for this OS is now removed from the CUDA Toolkit and is replaced by Rocky Linux 8.
  • Windows Server 2016 support has been deprecated and will be removed in a future release.
  • Kepler architecture support is removed from CUDA 12.0.
  • CUDA 11 applications that relied on Minor Version Compatibility are not guaranteed to work in CUDA 12.0 onwards. Developers will either need to statically link their applications, or recompile within the CUDA 12.0 environment to ensure continuity of development.
  • From 12.0, JIT LTO support is now part of CUDA Toolkit. JIT LTO support in the CUDA Driver through the cuLink driver APIs is officially deprecated. Driver JIT LTO will be available only for 11.x applications. The following enums supported by the cuLink Driver APIs for JIT LTO are deprecated:
  • CU_JIT_INPUT_NVVM
  • CU_JIT_LTO
  • CU_JIT_FTZ
  • CU_JIT_PREC_DIV
  • CU_JIT_PREC_SQRT
  • CU_JIT_FMA
  • CU_JIT_REFERENCED_KERNEL_NAMES
  • CU_JIT_REFERENCED_KERNEL_COUNT
  • CU_JIT_REFERENCED_VARIABLE_NAMES
  • CU_JIT_REFERENCED_VARIABLE_COUNT
  • CU_JIT_OPTIMIZE_UNUSED_DEVICE_VARIABLES
  • Existing 11.x CUDA applications using JIT LTO will continue to work on the 12.0/R525 and later driver. The driver cuLink API support for JIT LTO is not removed but will only support 11.x LTOIR. The cuLink driver API enums for JIT LTO may be removed in the future so we recommend transitioning over to CUDA Toolkit 12.0 for JIT LTO.
  • 12.0 LTOIR will not be supported by the driver cuLink APIs. 12.0 or later applications must use the nvJitLink shared library to benefit from JIT LTO.
  • Refer to the CUDA 12.0 blog on JIT LTO for more details.
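  • A minimal sketch of runtime linking with nvJitLink (assumes CUDA 12.0+; ltoirData/ltoirSize stand in for LTO-IR produced with nvcc -dlto, the options are illustrative, and error checks are omitted):
        #include <nvJitLink.h>
        void link_ltoir(const void *ltoirData, size_t ltoirSize) {
            nvJitLinkHandle h;
            const char *opts[] = { "-lto", "-arch=sm_80" };  // illustrative options
            nvJitLinkCreate(&h, 2, opts);
            nvJitLinkAddData(h, NVJITLINK_INPUT_LTOIR, ltoirData, ltoirSize, "module0");
            nvJitLinkComplete(h);
            size_t cubinSize;
            nvJitLinkGetLinkedCubinSize(h, &cubinSize);
            // copy out with nvJitLinkGetLinkedCubin(), then load via cuModuleLoadData()
            nvJitLinkDestroy(&h);
        }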
  • CUDA Tools:
  • CUDA-MEMCHECK is removed from CUDA 12.0, and has been replaced with Compute Sanitizer.
  • CUDA Compiler:
  • 32-bit compilation (native and cross-compilation) is removed from the CUDA 12.0 and later Toolkit. Use the CUDA Toolkit from earlier releases for 32-bit compilation. The CUDA Driver will continue to support running existing 32-bit applications on existing GPUs except Hopper. Hopper does not support 32-bit applications. Ada will be the last architecture with driver support for 32-bit applications.

New in NVIDIA CUDA Toolkit 12.0.0 (Dec 9, 2022)

  • General CUDA:
  • CUDA 12.0 exposes programmable functionality for many features of the Hopper and Ada Lovelace architectures:
  • Many tensor operations now available via public PTX:
  • TMA operations
  • TMA bulk operations
  • 32x Ultra xMMA (including FP8/FP16)
  • Membar domains in Hopper, controlled via launch parameters
  • Smem sync unit PTX and C++ API support
  • Introduced C intrinsics for Cooperative Grid Array (CGA) relaxed barrier support
  • Programmatic L2 Cache to SM multicast (Hopper-only)
  • Public PTX for SIMT collectives - elect_one
  • Genomics/DPX instructions now available for Hopper GPUs to provide faster combined-math arithmetic operations (three-way max, fused add+max, etc.)
  • Enhancements to the CUDA graphs API:
  • You can now schedule graph launches from GPU device-side kernels by calling built-in functions. With this ability, user code in kernels can dynamically schedule graph launches, greatly increasing the flexibility of CUDA graphs.
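  • A device-side sketch (a minimal sketch assuming the graph was instantiated for device launch and uploaded beforehand; error handling is omitted):
        __global__ void launcher(cudaGraphExec_t exec) {
            // Schedule the graph to run after this kernel completes.
            cudaGraphLaunch(exec, cudaStreamGraphTailLaunch);
        }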
  • The cudaGraphInstantiate() API has been refactored to remove unused parameters.
  • Added the ability to use virtual memory management (VMM) APIs such as cuMemCreate() with GPUs masked by CUDA_VISIBLE_DEVICES.
  • Application and library developers can now programmatically update the priority of CUDA streams.
  • CUDA 12.0 adds support for revamped CUDA Dynamic Parallelism APIs, offering substantial performance improvements vs. the legacy CUDA Dynamic Parallelism APIs.
  • Added new APIs to obtain unique stream and context IDs from user-provided objects:
  • cuStreamGetId(CUstream hStream, unsigned long long *streamId)
  • cuCtxGetId(CUcontext ctx, unsigned long long *ctxId)
  • Added support for read-only cuMemSetAccess() flag CU_MEM_ACCESS_FLAGS_PROT_READ.
  • CUDA Compilers:
  • JIT LTO support is now officially part of the CUDA Toolkit through a separate nvJitLink library. A technical deep dive blog will go into more details. Note that the earlier implementation of this feature has been deprecated. Refer to the Deprecation/Dropped Features section below for details.
  • New host compiler support:
  • GCC 12.1 (Official) and 12.2.1 (Experimental)
  • VS 2022 17.4 Preview 3 fixes compiler errors mentioning an internal function std::_Bit_cast by using CUDA’s support for __builtin_bit_cast.
  • NVCC and NVRTC now support the C++20 dialect. Most of the language features are available in host and device code; some, such as coroutines, are not supported in device code. Modules are not supported in either host or device code. Host compiler minimum versions: GCC 10, Clang 11, VS2022, Arm C/C++ 22.x. Refer to the individual host compiler documentation for other feature limitations. Note that a compilation issue in C++20 mode with the <complex> header mentioning an internal function std::_Bit_cast is resolved in VS2022 17.4.
  • NVRTC default C++ dialect changed from C++14 to C++17. Refer to the ISO C++ standard for reference on the feature set and compatibility between the dialects.
  • NVVM IR Update: with CUDA 12.0 we are releasing NVVM IR 2.0 which is incompatible with NVVM IR 1.x accepted by the libNVVM compiler in prior CUDA toolkit releases. Users of the libNVVM compiler in CUDA 12.0 toolkit must generate NVVM IR 2.0.

New in NVIDIA CUDA Toolkit 11.7.0 (May 12, 2022)

  • General CUDA:
  • To best ensure the security and reliability of our RPM and Debian package repositories, NVIDIA is updating and rotating the signing keys used by apt, dnf/yum, and zypper package managers beginning April 27, 2022. Failure to update your repository signing keys will result in package management errors when attempting to access or install packages from CUDA repositories.
  • NVIDIA Open GPU Kernel Modules: With CUDA 11.7 and R515 driver, NVIDIA is open sourcing the GPU kernel mode driver under dual GPL/MIT license. Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#open-gpu-kernel-modules for more information.
  • Lazy Loading: Delay kernel loading from host to GPU to the point where the kernel is called. This also only loads used kernels, which may result in a significant device-side memory savings. This also defers load latency from the beginning of the application to the point where a kernel is first called—overall binary load latency is usually significantly reduced, but is also shifted to later points in the application.
  • CUDA Compilers:
  • Grid private constants.
  • NVCC host compiler support for clang13
  • CUDA Developer Tools:
  • For changes to nvprof and Visual Profiler, see the changelog.
  • For new features, improvements, and bug fixes in CUPTI, see the changelog.
  • For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
  • For new features, improvements, and bug fixes in Compute Sanitizer, see the changelog.
  • For new features, improvements, and bug fixes in CUDA-GDB, see the changelog.
  • Resolved Issues:
  • General CUDA:
  • All color formats are now supported for Vulkan-CUDA interop on L4T and Android.
  • Resolved a linking issue that could be encountered on some systems when using libnvfm.so.
  • CUDA Compilers:
  • There was a compiler bug due to which a function marked __forceinline__ in a CUDA C++ program (or a function marked with the NVVM IR alwaysinline attribute for libNVVM compilation) was incorrectly given static linkage by the compiler in certain compilation modes. This incorrect behavior has been fixed and the compiler will not change the linkage in these compilation modes. As a result, if the static linkage is appropriate for such a function, then the program itself should set the linkage.
  • Updated the libNVVM API documentation to include the library version and a note regarding thread safety.

New in NVIDIA CUDA Toolkit 11.6.0 (Jan 13, 2022)

  • CUDA Components:
  • Starting with CUDA 11, the various components in the toolkit are versioned independently.
  • CUDA Driver:
  • Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 3. For more information on GPU products that are CUDA capable, visit https://developer.nvidia.com/cuda-gpus.
  • Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.
  • More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades.
  • Note: Starting with CUDA 11.0, the toolkit components are individually versioned, and the toolkit itself is versioned as shown in the table below.
  • The minimum required driver version for CUDA minor version compatibility is shown below. CUDA minor version compatibility is described in detail in https://docs.nvidia.com/deploy/cuda-compatibility/index.html
  • For convenience, the NVIDIA driver is installed as part of the CUDA Toolkit installation. Note that this driver is for development purposes and is not recommended for use in production with Tesla GPUs.
  • For running CUDA applications in production with Tesla GPUs, it is recommended to download the latest driver for Tesla GPUs from the NVIDIA driver downloads site at http://www.nvidia.com/drivers.
  • During the installation of the CUDA Toolkit, the installation of the NVIDIA driver may be skipped on Windows (when using the interactive or silent installation) or on Linux (by using meta packages).
  • For more information on customizing the install process on Windows, see http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#install-cuda-software.
  • For meta packages on Linux, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas
  • General CUDA:
  • Added a new API, cudaGraphNodeSetEnabled(), to allow disabling nodes in an instantiated graph. Support is limited to kernel nodes in this release. A corresponding API, cudaGraphNodeGetEnabled(), allows querying the enabled state of a node.
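  • A sketch of toggling a kernel node in an instantiated graph (assumes graphExec, node, and stream were obtained when building and instantiating the graph; error checks are omitted):
        unsigned int enabled;
        cudaGraphNodeGetEnabled(graphExec, node, &enabled); // query current state
        cudaGraphNodeSetEnabled(graphExec, node, 0);        // disable the kernel node
        cudaGraphLaunch(graphExec, stream);                 // the disabled node is skipped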
  • Full release of 128-bit integer (__int128) data type including compiler and developer tools support. The host-side compiler must support the __int128 type to use this feature.
  • The cooperative groups namespace is updated with new functions to improve consistency in naming, function scope, and unit dimension/size.
  • Added the ability to disable NULL kernel graph node launches.
  • Added new NVML public APIs for querying functionality under Wayland.
  • Added L2 cache control descriptors for atomics.
  • Large CPU page support for UVM managed memory.
  • CUDA Compilers:
  • VS2022 Support: CUDA 11.6 officially supports the latest VS2022 as host compiler. A separate Nsight Visual Studio installer (2022.1.1) must be downloaded. A future CUDA release will integrate VS2022 support into the Nsight Visual Studio installer.
  • New instructions in public PTX: New instructions for bit mask creation - BMSK and sign extension - SZEXT are added to the public PTX ISA. You can find documentation for these instructions in the PTX ISA guide: BMSK and SZEXT.
  • Unused Kernel Optimization: In CUDA 11.5, unused kernel pruning was introduced, with the potential benefits of reducing binary size and improving performance through more efficient optimizations. This was an opt-in feature, but in 11.6 it is enabled by default. As mentioned in the 11.5 blog, there is an opt-out flag that can be used if it becomes necessary for debug purposes or for other special situations, for example:
  • $ nvcc -rdc=true user.cu testlib.a -o user -Xnvlink -ignore-host-info
  • In addition to the -arch=all and -arch=all-major options added in CUDA 11.5, NVCC introduced -arch=native in CUDA 11.5 update 1. The -arch=native option is a convenient way for users to let NVCC determine the right target architecture to compile the CUDA device code to, based on the GPU installed on the system. This can be particularly helpful for testing when applications are run on the same system they are compiled on.
  • Generate PTX from nvlink: Using the following command line, device linker, nvlink will produce PTX as an output in addition to CUBIN: nvcc -dlto -dlink -ptx.
  • Device linking by nvlink is the final stage in the CUDA compilation process. Applications that have multiple source translation units have to be compiled in separate compilation mode. LTO (introduced in CUDA 11.4) allowed nvlink to perform optimizations at device link time instead of at compile time so that separately compiled applications with several translation units can be optimized to the same level as whole program compilations with a single translation unit. However, without the option to output PTX, applications that cared about forward compatibility of device code could not benefit from Link Time Optimization or had to constrain the device code to a single source file.
  • With the option for nvlink that performs LTO to generate the output in PTX, customer applications that require forward compatibility across GPU architectures can span across multiple files and can also take advantage of Link Time Optimization.
  • Bullseye support: NVCC-compiled source code will work with the code coverage tool Bullseye. Code coverage applies only to CPU (host) functions; coverage for device functions is not supported through Bullseye.
  • INT128 developer tool support: In 11.5, CUDA C++ support for the 128-bit integer type was added. In this release, developer tools support the data type as well. With the latest version of libcu++, the __int128 data type is supported by math functions.
  • Resolved Issues:
  • CUDA Compilers:
  • When using the --fmad=false compiler option, even the explicitly requested fused multiply-add instructions were decomposed into separate multiply and add, leading to loss of algorithm semantics intended by the programmer. One of the consequences was that CUDA Math APIs could not be trusted to deliver correct results; worst case errors became unbounded. This issue was introduced in 11.5, and is now resolved.
  • Fixed a compiler optimization bug that could move memory access instructions across memory barriers, which could lead to incorrect runtime results with certain synchronization dependencies.
  • An issue in the PTX optimizer sometimes produced incorrect results. This issue is resolved.
  • Linking with cubins larger than 2 GB is now supported.
  • Certain C++17 features that were backported to C++14 in MSVC are now supported.
  • An issue with the use of a lambda function when an object is passed by value is resolved (https://github.com/Ahdhn/nvcc_bug_maybe).
  • Deprecated Features:
  • The following features are deprecated in the current release of the CUDA software. The features still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
  • General CUDA:
  • The cudaDeviceSynchronize() function used for on-device fork/join parallelism is deprecated in preparation for a replacement programming model with higher performance. These functions continue to work in this release, but the tools will emit a warning about the upcoming change.
  • Known Issues:
  • General CUDA:
  • Intermittent crashes were seen when CUDA binaries were running on a system with a GLIBC version older than 2.17-106.el7_2.1. This is due to a known bug in older versions of GLIBC (Bug reference: https://bugzilla.redhat.com/show_bug.cgi?id=1293976) and has been fixed in later versions (>= glibc-2.17-107.el7).
  • CUDA Libraries:
  • This section covers CUDA Libraries release notes for 11.x releases:
  • CUDA Math Libraries toolchain uses C++11 features, and a C++11-compatible standard library (libstdc++ >= 20150422) is required on the host.
  • CUDA Math libraries are no longer shipped for SM30 and SM32.
  • Support for the following compute capabilities is deprecated for all libraries:
  • sm_35 (Kepler)
  • sm_37 (Kepler)
  • sm_50 (Maxwell)
  • CuBLAS Library:
  • CuBLAS: Release 11.6:
  • New Features
  • New epilogue options have been added to support fusion in DL training: CUBLASLT_EPILOGUE_{DRELU,DGELU}, which are similar to CUBLASLT_EPILOGUE_{DRELU,DGELU}_BGRAD but don't compute the bias gradient.
  • Resolved Issues
  • Some syrk-related functions (cublas{D,Z}syrk, cublas{D,Z}syr2k, cublas{D,Z}syrkx) could fail for matrices whose size is greater than 2^31.
  • CuBLAS: Release 11.4 Update 3:
  • Resolved Issues:
  • Some cublas and cublasLt functions sometimes returned CUBLAS_STATUS_EXECUTION_FAILED if the dynamic library was loaded and unloaded several times during application lifetime within the same CUDA context. This issue has been resolved.
  • CuBLAS: Release 11.4 Update 2:
  • New Features:
  • Vector (and batched) alpha support for per-row scaling in TN int32 math matmul with int8 output. See CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST and CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE.
  • New epilogue options have been added to support fusion in DL training: CUBLASLT_EPILOGUE_BGRADA and CUBLASLT_EPILOGUE_BGRADB, which compute bias gradients based on matrices A and B, respectively.
  • New auxiliary functions cublasGetStatusName() and cublasGetStatusString() have been added to cuBLAS; they return the string representation and the description of the cuBLAS status (cublasStatus_t), respectively. Similarly, cublasLtGetStatusName() and cublasLtGetStatusString() have been added to cuBLASLt.
  • Known Issues
  • cublasGemmBatchedEx() and cublas<t>gemmBatched() check the alignment of the input/output arrays of pointers as if they were pointers to the actual matrices. These checks are irrelevant and will be disabled in future releases. This mostly affects half-precision input GEMMs, which might require 16-byte alignment, while an array of pointers can only be guaranteed 8-byte alignment.
  • Resolved Issues
  • cublasLtMatrixTransform can now operate on matrices with dimensions greater than 65535.
  • Fixed out-of-bound access in GEMM and Matmul functions, when split K or non-default epilogue is used and leading dimension of the output matrix exceeds int32_t limit.
  • NVBLAS now uses lazy loading of the CPU BLAS library on Linux to avoid issues caused by preloading libnvblas.so in complex applications that use fork and similar APIs.
  • Resolved a symbol name conflict when using the cuBLASLt static library with static TensorRT or cuDNN libraries.
  • CuBLAS: Release 11.4:
  • Resolved Issues:
  • Some gemv cases were producing incorrect results if the matrix dimension (n or m) was large, for example 2^20.
  • CuBLAS: Release 11.3 Update 1:
  • New Features:
  • Some new kernels have been added for improved performance but have the limitation that only host pointers are supported for scalars (for example, alpha and beta parameters). This limitation is expected to be resolved in a future release.
  • New epilogues have been added to support fusion in ML training. These include:
  • ReLuBias and GeluBias epilogues that produce an auxiliary output which is used on backward propagation to compute the corresponding gradients.
  • DReLuBGrad and DGeluBGrad epilogues that compute the backpropagation of the corresponding activation function on matrix C, and produce bias gradient as a separate output. These epilogues require auxiliary input mentioned in the bullet above.
  • Resolved Issues:
  • Some tensor core accelerated strided batched GEMM routines would result in misaligned memory access exceptions when batch stride wasn't a multiple of 8.
  • Tensor core accelerated cublasGemmBatchedEx (pointer-array) routines would use slower variants of kernels assuming bad alignment of the pointers in the pointer array. Now it assumes that pointers are well aligned, as noted in the documentation.
  • Known Issues:
  • To be able to access the fastest possible kernels through cublasLtMatmulAlgoGetHeuristic() you need to set CUBLASLT_MATMUL_PREF_POINTER_MODE_MASK in search preferences to CUBLASLT_POINTER_MODE_MASK_HOST or CUBLASLT_POINTER_MODE_MASK_NO_FILTERING. By default, heuristics query assumes the pointer mode may change later and only returns algo configurations that support both _HOST and _DEVICE modes. Without this, newly added kernels will be excluded and it will likely lead to a performance penalty on some problem sizes.
  • Deprecated Features:
  • Linking with static cublas and cublasLt libraries on Linux now requires gcc 5.2 or higher (or a compatible compiler) due to C++11 requirements in these libraries.
  • CuBLAS: Release 11.3:
  • Known Issues:
  • The planar complex matrix descriptor for batched matmul has inconsistent interpretation of batch offset.
  • Mixed precision operations with reduction scheme CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE (might be automatically selected based on problem size by cublasSgemmEx() or cublasGemmEx() too, unless CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION math mode bit is set) not only stores intermediate results in output type but also accumulates them internally in the same precision, which may result in lower than expected accuracy. Please use CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK or CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION if this results in numerical precision issues in your application.
  • CuBLAS: Release 11.2:
  • Known Issues:
  • cublas<s/d/c/z>Gemm() with very large n and m=k=1 may fail on Pascal devices.
  • CuBLAS: Release 11.1 Update 1:
  • New Features:
  • cuBLASLt Logging is officially stable and no longer experimental. cuBLASLt Logging APIs are still experimental and may change in future releases.
  • Resolved Issues:
  • cublasLt Matmul fails on Volta architecture GPUs with CUBLAS_STATUS_EXECUTION_FAILED when the n dimension is > 262,137 and the epilogue bias feature is being used. This issue exists in the 11.0 and 11.1 releases but has been corrected in 11.1 Update 1.
  • CuBLAS: Release 11.1:
  • Resolved Issues:
  • A performance regression in the cublasCgetrfBatched and cublasCgetriBatched routines has been fixed.
  • The IMMA kernels do not support padding in matrix C and may corrupt the data when matrix C with padding is supplied to cublasLtMatmul. A suggested work around is to supply matrix C with leading dimension equal to 32 times the number of rows when targeting the IMMA kernels: computeType = CUDA_R_32I and CUBLASLT_ORDER_COL32 for matrices A,C,D, and CUBLASLT_ORDER_COL4_4R2_8C (on NVIDIA Ampere GPU architecture or Turing architecture) or CUBLASLT_ORDER_COL32_2R_4R4 (on NVIDIA Ampere GPU architecture) for matrix B. Matmul descriptor must specify CUBLAS_OP_T on matrix B and CUBLAS_OP_N (default) on matrix A and C. The data corruption behavior was fixed so that CUBLAS_STATUS_NOT_SUPPORTED is returned instead.
  • Fixed an issue that caused an Address out of bounds error when calling cublasSgemm().
  • CuBLAS: Release 11.0 Update 1:
  • New Features:
  • The cuBLAS API was extended with a new function, cublasSetWorkspace(), which allows the user to set the cuBLAS library workspace to a user-owned device buffer, which will be used by cuBLAS to execute all subsequent calls to the library on the currently set stream.
  • cuBLASLt experimental logging mechanism can be enabled in two ways:
  • By setting the following environment variables before launching the target application:
  • CUBLASLT_LOG_LEVEL=<level> -- where level is one of the following levels:
  • "0" - Off - logging is disabled (default)
  • "1" - Error - only errors will be logged
  • "2" - Trace - API calls that launch CUDA kernels will log their parameters and important information
  • "3" - Hints - hints that can potentially improve the application's performance
  • "4" - Heuristics - heuristics log that may help users to tune their parameters
  • "5" - API Trace - API calls will log their parameter and important information
  • CUBLASLT_LOG_MASK=<mask> -- where mask is a combination of the following masks:
  • "0" - Off
  • "1" - Error
  • "2" - Trace
  • "4" - Hints
  • "8" - Heuristics
  • "16" - API Trace
  • CUBLASLT_LOG_FILE=<value> -- where value is a file name in the format of "<file_name>.%i"; %i will be replaced with the process id. If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
  • By using the runtime API functions defined in the cublasLt header:
  • typedef void(*cublasLtLoggerCallback_t)(int logLevel, const char* functionName, const char* message) -- A type of callback function pointer.
  • cublasStatus_t cublasLtLoggerSetCallback(cublasLtLoggerCallback_t callback) -- Sets a callback function that will be called for every message that is logged by the library.
  • cublasStatus_t cublasLtLoggerSetFile(FILE* file) -- Sets the output file for the logger. The file must be open and have write permissions.
  • cublasStatus_t cublasLtLoggerOpenFile(const char* logFile) -- Gives a path at which the logger should create the log file.
  • cublasStatus_t cublasLtLoggerSetLevel(int level) -- Sets the log level to one of the above-mentioned levels.
  • cublasStatus_t cublasLtLoggerSetMask(int mask) -- Sets the log mask to a combination of the above-mentioned masks.
  • cublasStatus_t cublasLtLoggerForceDisable() -- Disables the logger for the entire session. Once this API is called, the logger cannot be reactivated in the current session.
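  • For example, combining the environment variables above to capture a full API trace in a per-process log file (the application name is illustrative):
        $ CUBLASLT_LOG_LEVEL=5 CUBLASLT_LOG_FILE=cublaslt.%i.log ./my_app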
  • CuBLAS: Release 11.0:
  • New Features:
  • cuBLASLt Matrix Multiplication adds support for fused ReLU and bias operations for all floating point types except double precision (FP64).
  • Improved batched TRSM performance for matrices larger than 256.
  • CuBLAS: Release 11.0 RC:
  • New Features:
  • Many performance improvements have been implemented for NVIDIA Ampere, Volta, and Turing Architecture based GPUs.
  • The cuBLASLt logging mechanism can be enabled by setting the following environment variables before launching the target application:
  • CUBLASLT_LOG_LEVEL=<level> - where level is one of the following levels:
  • "0" - Off - logging is disabled (default)
  • "1" - Error - only errors will be logged
  • "2" - Trace - API calls will be logged with their parameters and important information
  • CUBLASLT_LOG_FILE=<value> - where value is a file name in the format of "<file_name>.%i"; %i will be replaced with the process id. If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
  • For matrix multiplication APIs:
  • cublasGemmEx, cublasGemmBatchedEx, cublasGemmStridedBatchedEx and cublasLtMatmul added new data type support for __nv_bfloat16 (CUDA_R_16BF).
  • A new compute type TensorFloat32 (TF32) has been added to provide tensor core acceleration for FP32 matrix multiplication routines with full dynamic range and increased precision compared to BFLOAT16.
  • New compute modes Default, Pedantic, and Fast have been introduced to offer more control over compute precision used.
  • Tensor cores are now enabled by default for half- and mixed-precision matrix multiplications.
  • Double precision tensor cores (DMMA) are used automatically.
  • Tensor cores can now be used for all sizes and data alignments and for all GPU architectures:
  • Selection of these kernels through cuBLAS heuristics is automatic and will depend on factors such as math mode setting as well as whether it will run faster than the non-tensor core kernels.
  • Users should note that the new kernels that use tensor cores for unaligned cases are expected to perform faster than non-tensor-core kernels, but slower than the kernels that can be run when all buffers are well aligned.
  • Deprecated Features:
  • Algorithm selection in the cublasGemmEx APIs (including batched variants) is non-functional for NVIDIA Ampere architecture GPUs. Regardless of the selection, the library defaults to a heuristic selection. Users are encouraged to use the cublasLt APIs for algorithm selection functionality.
  • The matrix multiply math mode CUBLAS_TENSOR_OP_MATH is being deprecated and will be removed in a future release. Users are encouraged to use the new cublasComputeType_t enumeration to define compute precision.
  • CuFFT Library:
  • CuFFT: Release 11.5:
  • Known Issues:
  • FFTs of certain sizes in single and double precision (multiples of size 6) could fail on future devices. This issue will be fixed in an upcoming release.
  • CuFFT: Release 11.4 Update 2:
  • Resolved Issues:
  • Since cuFFT 10.3.0 (CUDA Toolkit 11.1), cuFFT may require the user to make sure that all operations on input and output buffers are complete before calling cufft[Xt]Exec* if:
  • sm70 or later, 3D FFT, batch > 1, total size of transform is greater than 4.5MB
  • FFT size for all dimensions is in the set of the following sizes: {2, 4, 8, 16, 32, 64, 128, 3, 9, 81, 243, 729, 2187, 6561, 5, 25, 125, 625, 3125, 6, 36, 216, 1296, 7776, 7, 49, 343, 2401, 11, 121}
  • Some V100 FFTs were slower than expected. This issue is resolved.
  • Known Issues
  • Some T4 FFTs are slower than expected.
  • Plans for FFTs of certain sizes in single precision (including some multiples of 1024 sizes, and some large prime numbers) could fail on future devices with less than 64 kB of shared memory. This issue will be fixed in an upcoming release.
  • CuFFT: Release 11.4 Update 1:
  • Resolved Issues:
  • Some cuFFT multi-GPU plans exhibited very long creation times.
  • cuFFT sometimes produced incorrect results for real-to-complex and complex-to-real transforms when the total number of elements across all batches in a single execution exceeded 2147483647.
  • Known Issues:
  • Some V100 FFTs are slower than expected.
  • Some T4 FFTs are slower than expected.
  • CuFFT: Release 11.4:
  • New Features:
  • Performance improvements.
  • Known Issues:
  • Some T4 FFTs are slower than expected.
  • cuFFT may produce incorrect results for real-to-complex and complex-to-real transforms when the total number of elements across all batches in a single execution exceeds 2147483647.
  • Some cuFFT multi-GPU plans may exhibit very long creation time. Issue will be fixed in the next update.
  • cuFFT may produce incorrect results for transforms with strides when the index of the last element across all batches exceeds 2147483647 (see Advanced Data Layout).
  • Deprecated Features
  • Support for callback functionality using separately compiled device code is deprecated on all GPU architectures. Callback functionality will continue to be supported for all GPU architectures.
  • CuFFT: Release 11.3:
  • New Features:
  • cuFFT shared libraries are now linked statically against libstdc++ on Linux platforms.
  • Improved performance of certain sizes (multiples of large powers of 3, powers of 11) in SM86.
  • Known Issues:
  • cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
  • Plans with strides, primes larger than 127 in FFT size decomposition and total size of transform including strides bigger than 32GB produce incorrect results.
  • CuFFT: Release 11.2 Update 2:
  • Known Issues:
  • cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
  • Plans with strides, primes larger than 127 in FFT size decomposition and total size of transform including strides bigger than 32GB produce incorrect results.
  • CuFFT: Release 11.2 Update 1:
  • Resolved Issues:
  • Previously, reduced performance of power-of-2 single precision FFTs was observed on GPUs with sm_86 architecture. This issue has been resolved.
  • Large prime factors in size decomposition and real to complex or complex to real FFT type no longer cause cuFFT plan functions to fail.
  • Known Issues:
  • cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
  • Plans with strides, primes larger than 127 in FFT size decomposition and total size of transform including strides bigger than 32GB produce incorrect results.
  • CuFFT: Release 11.2:
  • New Features:
  • Multi-GPU plans can be associated with a stream using the cufftSetStream API function call.
  • Performance improvements for R2C/C2C/C2R transforms.
  • Performance improvements for multi-GPU systems.
  • Resolved Issues:
  • cuFFT is no longer stuck in a bad state if previous plan creation fails with CUFFT_ALLOC_FAILED.
  • Previously, single dimensional multi-GPU FFT plans ignored user input on the whichGPUs argument of cufftXtSetGPUs and assumed that GPU IDs are always numbered from 0 to N-1. This issue has been resolved.
  • Plans with primes larger than 127 in FFT size decomposition or FFT size being a prime number bigger than 4093 do not perform calculations on second and subsequent cufftExecute* calls. The regression was introduced in cuFFT 11.1.
  • Known Issues:
  • cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
  • CuFFT: Release 11.1:
  • New Features:
  • cuFFT is now L2-cache aware and uses L2 cache for GPUs with more than 4.5MB of L2 cache. Performance may improve in certain single-GPU 3D C2C FFT cases.
  • After successfully creating a plan, cuFFT now enforces a lock on the cufftHandle. Subsequent calls to any planning function with the same cufftHandle will fail.
  • Added support for very large sizes (3k cube) to multi-GPU cuFFT on DGX-2.
  • Improved performance on multi-GPU cuFFT for certain sizes (1k cube).
  • Resolved Issues
  • Resolved an issue that caused cuFFT to crash when reusing a handle after clearing a callback.
  • Fixed an error which produced incorrect results / NaN values when running a real-to-complex FFT in half precision.
  • Known Issues
  • cuFFT will always overwrite the input for out-of-place C2R transform.
  • Single dimensional multi-GPU FFT plans ignore user input on the whichGPUs parameter of cufftXtSetGPUs() and assume that GPU IDs are always numbered from 0 to N-1.
  • CuFFT: Release 11.0 RC:
  • New Features:
  • cuFFT now accepts __nv_bfloat16 input and output data type for power-of-two sizes with single precision computations within the kernels.
  • Reoptimized power of 2 FFT kernels on Volta and Turing architectures.
  • Resolved Issues:
  • Reduced R2C/C2R plan memory usage to previous levels.
  • Resolved bug introduced in 10.1 update 1 that caused incorrect results when using custom strides, batched 2D plans and certain sizes on Volta and later.
  • Known Issues:
  • cuFFT modifies C2R input buffer for some non-strided FFT plans.
  • There is a known issue with certain cuFFT plans that causes an assertion in the execution phase of certain plans. This applies to plans with all of the following characteristics: real input to complex output (R2C), in-place, native compatibility mode, certain even transform sizes, and more than one batch.
  • CuRAND Library:
  • CuRAND: Release 11.5 Update 1:
  • New Features:
  • Improved performance of CURAND_RNG_PSEUDO_MRG32K3A pseudorandom number generator when using ordering CURAND_ORDERING_PSEUDO_BEST or CURAND_ORDERING_PSEUDO_DEFAULT.
  • Added a new type of order parameter: CURAND_ORDERING_PSEUDO_DYNAMIC.
  • Supported PRNGs:
  • CURAND_RNG_PSEUDO_XORWOW
  • CURAND_RNG_PSEUDO_MRG32K3A
  • CURAND_RNG_PSEUDO_MTGP32
  • CURAND_RNG_PSEUDO_PHILOX4_32_10
  • Improved performance compared to CURAND_ORDERING_PSEUDO_DEFAULT, especially on NVIDIA Ampere architecture GPUs.
  • The output ordering of generated random numbers for CURAND_ORDERING_PSEUDO_DYNAMIC depends on the number of SMs on a GPU, and thus can be different on different GPUs.
  • The CURAND_ORDERING_PSEUDO_DYNAMIC ordering can't be used with a host generator created using curandCreateGeneratorHost().
  • Resolved Issues:
  • Added information about cuRAND thread safety.
  • Known Issues:
  • CURAND_RNG_PSEUDO_XORWOW with ordering CURAND_ORDERING_PSEUDO_DYNAMIC can produce incorrect results on architectures newer than SM86.
  • CuRAND: Release 11.3:
  • Resolved Issues:
  • Fixed inconsistency between random numbers generated by GPU and host generators when CURAND_ORDERING_PSEUDO_LEGACY ordering is selected for certain generator types.
  • CuRAND: Release 11.0 Update 1:
  • Resolved Issues:
  • Fixed an issue that caused linker errors about the multiple definitions of mtgp32dc_params_fast_11213 and mtgpdc_params_11213_num when including curand_mtgp32dc_p_11213.h in different compilation units.
  • CuRAND: Release 11.0:
  • Resolved Issues:
  • Fixed an issue that caused linker errors about the multiple definitions of mtgp32dc_params_fast_11213 and mtgpdc_params_11213_num when including curand_mtgp32dc_p_11213.h in different compilation units.
  • CuRAND: Release 11.0 RC:
  • Resolved Issues:
  • Introduced CURAND_ORDERING_PSEUDO_LEGACY ordering. Starting with CUDA 10.0, the ordering of random numbers returned by the MTGP32 and MRG32k3a generators is no longer the same as in previous releases, despite being guaranteed by the documentation for the CURAND_ORDERING_PSEUDO_DEFAULT setting. CURAND_ORDERING_PSEUDO_LEGACY provides the pre-CUDA 10.0 ordering for the MTGP32 and MRG32k3a generators.
  • Starting with CUDA 11.0, CURAND_ORDERING_PSEUDO_DEFAULT is the same as CURAND_ORDERING_PSEUDO_BEST for all generators except MT19937. Only CURAND_ORDERING_PSEUDO_LEGACY is guaranteed to provide the same ordering across all future cuRAND releases.
  • CuSOLVER Library:
  • CuSOLVER: Release 11.4:
  • New Features:
  • Introducing cusolverDnXtrtri, a new generic API for triangular matrix inversion (trtri).
  • Introducing cusolverDnXsytrs, a new generic API for solving systems of linear equations using a given factorized symmetric matrix from SYTRF.
  • CuSOLVER: Release 11.3:
  • Known Issues:
  • For values N<=16, cusolverDn[S|D|C|Z]syevjBatched hits out-of-bound access and may deliver the wrong result. The workaround is to pad the matrix A with a diagonal matrix D such that the dimension of [A 0 ; 0 D] is bigger than 16. The diagonal entry D(j,j) must be bigger than maximum eigenvalue of A, for example, norm(A, ‘fro’). After the syevj, W(0:n-1) contains the eigenvalues and A(0:n-1,0:n-1) contains the eigenvectors.
  • CuSOLVER: Release 11.2 Update 2:
  • New Features:
  • New singular value decomposition (GESVDR) is added. GESVDR computes partial spectrum with random sampling, an order of magnitude faster than GESVD.
  • libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so. However, it breaks backward compatibility. The user has to link libcusolver.so with the correct version of libcublas.so.
  • CuSOLVER: Release 11.2:
  • Resolved Issues:
  • cusolverDnIRSXgels sometimes returned CUSOLVER_STATUS_INTERNAL_ERROR when the precision is ‘z’. This issue has been fixed in CUDA 11.2; now cusolverDnIRSXgels works for all precisions.
  • ZSYTRF sometimes returned CUSOLVER_STATUS_INTERNAL_ERROR due to insufficient resources to launch the kernel. This issue has been fixed in CUDA 11.2.
  • GETRF returned early without finishing the whole factorization when the matrix was singular. This issue has been fixed in CUDA 11.2.
  • CuSOLVER: Release 11.1 Update 1:
  • Resolved Issues:
  • cusolverDnDDgels reported IRS_NOT_SUPPORTED when m > n. The issue has been fixed in release 11.1 U1, so cusolverDnDDgels now supports m > n.
  • cusolverMgDeviceSelect could consume over 1 GB of device memory. The issue has been fixed in release 11.1 U1. The hidden memory allocation inside the cusolverMG handle is about 30 MB per device.
  • Known Issues:
  • cusolverDnIRSXgels may return CUSOLVER_STATUS_INTERNAL_ERROR when the precision is ‘z’, due to insufficient workspace which causes illegal memory access.
  • The cusolverDnIRSXgels_bufferSize() function does not report the correct size of workspace. To work around the issue, the user has to allocate more workspace than what is reported by cusolverDnIRSXgels_bufferSize(). For example, if x is the size of workspace returned by cusolverDnIRSXgels_bufferSize(), then the user has to allocate (x + min(m,n)*sizeof(cuDoubleComplex)) bytes.
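  • A sketch of this over-allocation workaround, assuming the release 11.1-era signature of cusolverDnIRSXgels_bufferSize() and omitting error checks:

        #include <algorithm>
        #include <cuda_runtime.h>
        #include <cusolverDn.h>

        // Query the reported workspace size, then pad it as suggested above.
        void *alloc_gels_workspace(cusolverDnHandle_t handle, cusolverDnIRSParams_t params,
                                   cusolver_int_t m, cusolver_int_t n, cusolver_int_t nrhs) {
            size_t lwork = 0;
            cusolverDnIRSXgels_bufferSize(handle, params, m, n, nrhs, &lwork);
            lwork += (size_t)std::min(m, n) * sizeof(cuDoubleComplex);  // extra padding
            void *d_work = nullptr;
            cudaMalloc(&d_work, lwork);
            return d_work;
        }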
  • CuSOLVER: Release 11.1:
  • New Features:
  • Added new 64-bit APIs:
  • cusolverDnXpotrf_bufferSize
  • cusolverDnXpotrf
  • cusolverDnXpotrs
  • cusolverDnXgeqrf_bufferSize
  • cusolverDnXgeqrf
  • cusolverDnXgetrf_bufferSize
  • cusolverDnXgetrf
  • cusolverDnXgetrs
  • cusolverDnXsyevd_bufferSize
  • cusolverDnXsyevd
  • cusolverDnXsyevdx_bufferSize
  • cusolverDnXsyevdx
  • cusolverDnXgesvd_bufferSize
  • cusolverDnXgesvd
  • Added a new SVD algorithm based on polar decomposition, called GESVDP which uses the new 64-bit API, including cusolverDnXgesvdp_bufferSize and cusolverDnXgesvdp.
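  • The new 64-bit APIs share a common query/allocate/execute pattern; a minimal sketch with cusolverDnXpotrf, assuming double-precision data and omitting error checks:

        #include <cstdlib>
        #include <cuda_runtime.h>
        #include <cusolverDn.h>

        // Factorize a dense n x n matrix on the device with the 64-bit API:
        // query device- and host-side workspace sizes, allocate, then execute.
        void potrf_64bit(cusolverDnHandle_t handle, cusolverDnParams_t params,
                         int64_t n, double *d_A, int64_t lda, int *d_info) {
            size_t devBytes = 0, hostBytes = 0;
            cusolverDnXpotrf_bufferSize(handle, params, CUBLAS_FILL_MODE_LOWER, n,
                                        CUDA_R_64F, d_A, lda, CUDA_R_64F,
                                        &devBytes, &hostBytes);
            void *d_work = nullptr, *h_work = nullptr;
            cudaMalloc(&d_work, devBytes);
            if (hostBytes > 0) h_work = malloc(hostBytes);
            cusolverDnXpotrf(handle, params, CUBLAS_FILL_MODE_LOWER, n,
                             CUDA_R_64F, d_A, lda, CUDA_R_64F,
                             d_work, devBytes, h_work, hostBytes, d_info);
            cudaFree(d_work);
            free(h_work);
        }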
  • Deprecated Features:
  • The following 64-bit APIs are deprecated:
  • cusolverDnPotrf_bufferSize
  • cusolverDnPotrf
  • cusolverDnPotrs
  • cusolverDnGeqrf_bufferSize
  • cusolverDnGeqrf
  • cusolverDnGetrf_bufferSize
  • cusolverDnGetrf
  • cusolverDnGetrs
  • cusolverDnSyevd_bufferSize
  • cusolverDnSyevd
  • cusolverDnSyevdx_bufferSize
  • cusolverDnSyevdx
  • cusolverDnGesvd_bufferSize
  • cusolverDnGesvd
  • CuSOLVER: Release 11.0:
  • New Features:
  • Added a 64-bit API for GESVD. The new routine cusolverDnGesvd_bufferSize() fills in the parameters missing from the 32-bit API cusolverDn[S|D|C|Z]gesvd_bufferSize() so that it can estimate the size of the workspace accurately.
  • Added the single-process multi-GPU Cholesky factorization capabilities POTRF, POTRS, and POTRI to the cusolverMG library.
  • Resolved Issues:
  • Fixed an issue where SYEVD/SYGVD would fail and return error code 7 if the matrix is zero and the dimension is bigger than 25.
  • CuSPARSE Library:
  • CuSPARSE: Release 11.6:
  • New Features:
  • Better performance for cusparseSpGEMM, cusparseSpGEMMreuse, cusparseCsr2cscEx2, and cusparseDenseToSparse routines.
  • Resolved Issues:
  • Fixed forward compatibility issues with axpby, rot, spvv, scatter, gather.
  • Fixed incorrect results in COO SpMM Alg1 which occurred in some rare cases.
  • CuSPARSE: Release 11.5 Update 1:
  • New Features:
  • New routine cusparseSpMMOp that exploits Just-In-Time Link-Time-Optimization (JIT LTO) for providing sparse matrix-dense matrix multiplication with custom (user-defined) operators. See https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm-op.
  • cuSPARSE now supports logging functionalities. See https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-logging.
  • Resolved Issues:
  • Added memory requirements, graph capture, and asynchronous notes for cusparseXcsrsm2_analysis.
  • CSR, CSC, and COO format descriptions wrongly reported a sorted-column-indices requirement. All routines support unsorted column indices, except where explicitly indicated.
  • Clarified cusparseSpSV and cusparseSpSM memory management.
  • cusparseSpSM produced wrong results in some cases when the matB operation is CUSPARSE_OPERATION_NON_TRANSPOSE or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE.
  • cusparseSpSM produced wrong results in some cases when the matrix layout is row-major.
  • CuSPARSE: Release 11.4 Update 1:
  • Resolved Issues:
  • cusparseSpSV and cusparseSpSM could produce wrong results.
  • cusparseSpSV and cusparseSpSM did not work correctly when vecX == vecY or matB == matC.
  • CuSPARSE: Release 11.4:
  • Known Issues:
  • cusparseSpSV and cusparseSpSM could produce wrong results.
  • cusparseSpSV and cusparseSpSM do not work correctly when vecX == vecY or matB == matC.
  • CuSPARSE: Release 11.3 Update 1:
  • New Features:
  • Introduced a new routine for sparse matrix - sparse matrix multiplication (cusparseSpGEMMreuse) where the output matrix structure is reused for multiple computations. The new routine supports CSR storage format and mixed-precision computation.
  • Sparse triangular solver adds support for COO format.
  • Introduced a new routine for sparse triangular solver with multiple right-hand sides cusparseSpSM().
  • cusparseDenseToSparse() routine adds the conversion from dense matrix (row-major/column-major) to Blocked-ELL format.
  • The Blocked-ELL format now supports empty blocks.
  • Better performance for Blocked-ELL SpMM with block sizes > 64, double data type, and alignments smaller than 128 bytes on NVIDIA Ampere (sm80).
  • All cuSPARSE APIs are now asynchronous on platforms that support stream ordered memory allocators https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-querying-memory-support.
  • Improved NVTX trace with a distinction between lightweight calls and kernel routines.
  • Resolved Issues:
  • cusparseCnnz_compress produced wrong results when the number of rows is greater than 128 * (number of resident CTAs).
  • cusparseSnnz produced wrong results for some particular sparsity patterns.
  • Deprecated Features:
  • cusparseXcsrsm2_zeroPivot, cusparseXcsrsm2_solve, cusparseXcsrsm2_analysis, and cusparseScsrsm2_bufferSizeExt have been deprecated in favor of the cusparseSpSM Generic APIs.
  • CuSPARSE: Release 11.3:
  • New Features:
  • Added the new routine cusparseSpSV for sparse triangular solve with better performance (see the sketch after this list). The new Generic API supports:
  • CSR storage format
  • Non-transpose, transpose, and transpose-conjugate operations
  • Upper, lower fill mode
  • Unit, non-unit diagonal type
  • 32-bit and 64-bit indices
  • Uniform data type computation
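  • A sketch of the cusparseSpSV phases (buffer query, analysis, solve) for op(A) * y = alpha * x with a lower-triangular CSR matrix; handle and descriptor creation are omitted, and the fill mode, diagonal type, and data type are illustrative:

        #include <cuda_runtime.h>
        #include <cusparse.h>

        void spsv_lower_csr(cusparseHandle_t handle, cusparseSpMatDescr_t matA,
                            cusparseDnVecDescr_t vecX, cusparseDnVecDescr_t vecY) {
            double alpha = 1.0;
            cusparseFillMode_t fill = CUSPARSE_FILL_MODE_LOWER;
            cusparseDiagType_t diag = CUSPARSE_DIAG_TYPE_NON_UNIT;
            cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_FILL_MODE, &fill, sizeof(fill));
            cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_DIAG_TYPE, &diag, sizeof(diag));

            cusparseSpSVDescr_t spsvDescr;
            cusparseSpSV_createDescr(&spsvDescr);
            size_t bufferSize = 0;
            cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                                    matA, vecX, vecY, CUDA_R_64F,
                                    CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, &bufferSize);
            void *d_buffer = nullptr;
            cudaMalloc(&d_buffer, bufferSize);
            // One-time analysis; the solve can then be repeated for new values of x.
            cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                                  matA, vecX, vecY, CUDA_R_64F,
                                  CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, d_buffer);
            cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                               matA, vecX, vecY, CUDA_R_64F,
                               CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr);
            cusparseSpSV_destroyDescr(spsvDescr);
            cudaFree(d_buffer);
        }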
  • Deprecated Features:
  • cusparseScsrsv2_analysis, cusparseScsrsv2_solve, cusparseXcsrsv2_zeroPivot, and cusparseScsrsv2_bufferSize have been deprecated in favor of cusparseSpSV.
  • CuSPARSE: Release 11.2 Update 2:
  • Resolved Issues:
  • cusparseDestroy(NULL) no longer crashes on Windows.
  • Known Issues:
  • cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with NULL argument could cause segmentation fault on Windows.
  • CuSPARSE: Release 11.2 Update 1:
  • New Features:
  • New Tensor Core-accelerated Block Sparse Matrix - Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
  • New algorithms for CSR/COO Sparse Matrix - Vector Multiplication (cusparseSpMV) with better performance.
  • Extended functionalities for cusparseSpMV:
  • Support for the CSC format.
  • Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
  • Support for mixed regular-complex data type computation.
  • Support for deterministic and non-deterministic computation.
  • New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix - Matrix Multiplication (cusparseSpMM) with better performance especially for small matrices.
  • New routine for Sampled Dense Matrix - Dense Matrix Multiplication (cusparseSDDMM) which deprecated cusparseConstrainedGeMM and provides better performance.
  • Better accuracy of cusparseAxpby, cusparseRot, cusparseSpVV for bfloat16 and half regular/complex data types.
  • All routines support NVTX annotation for enhancing the profiler timeline on complex applications.
  • Resolved Issues:
  • cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, cusparseSpMV now support zero-size matrices.
  • cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
  • cusparseXcsr2csr_compress now uses 2-norm for the comparison of complex values instead of only the real part.
  • Known Issues:
  • cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with NULL argument could cause segmentation fault on Windows.
  • Deprecated Features:
  • cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
  • cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
  • COO Array of Structure (CooAoS) format has been deprecated including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.
  • CuSPARSE: Release 11.2:
  • Known Issues:
  • cusparseXdense2csr provides incorrect results for some matrix sizes.
  • CuSPARSE: Release 11.1 Update 1:
  • New Features:
  • cusparseSparseToDense
  • CSR, CSC, or COO conversion to dense representation
  • Support row-major and column-major layouts
  • Support all data types
  • Support 32-bit and 64-bit indices
  • Provide performance 3x higher than cusparseXcsc2dense, cusparseXcsr2dense
  • cusparseDenseToSparse
  • Dense representation to CSR, CSC, or COO
  • Support row-major and column-major layouts
  • Support all data types
  • Support 32-bit and 64-bit indices
  • Provide performance 3x higher than cusparseXdense2csc, cusparseXdense2csr
  • Known Issues:
  • cusparseXdense2csr provides incorrect results for some matrix sizes.
  • Deprecated Features:
  • Legacy conversion routines: cusparseXcsc2dense, cusparseXcsr2dense, cusparseXdense2csc, cusparseXdense2csr
  • CuSPARSE: Release 11.0:
  • New Features:
  • Added new Generic APIs for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather), Givens rotation (cusparseRot). __nv_bfloat16/__nv_bfloat162 data types and 64-bit indices are also supported.
  • This release adds the following features for cusparseSpMM:
  • Support for row-major layout for cusparseSpMM for both CSR and COO format
  • Support for 64-bit indices
  • Support for __nv_bfloat16 and __nv_bfloat162 data types
  • Support for the following strided batch mode:
  • Ci=A⋅Bi
  • Ci=Ai⋅B
  • Ci=Ai⋅Bi
  • CuSPARSE: Release 11.0 RC:
  • New Features:
  • Added new Generic APIs for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather), Givens rotation (cusparseRot). __nv_bfloat16/__nv_bfloat162 data types and 64-bit indices are also supported.
  • This release adds the following features for cusparseSpMM:
  • Support for row-major layout for cusparseSpMM for both CSR and COO format
  • Support for 64-bit indices
  • Support for __nv_bfloat16 and __nv_bfloat162 data types
  • Support for the following strided batch mode:
  • Ci=A⋅Bi
  • Ci=Ai⋅B
  • Ci=Ai⋅Bi
  • Added new generic APIs and improved performance for sparse matrix-sparse matrix multiplication (SpGEMM): cusparseSpGEMM_workEstimation, cusparseSpGEMM_compute, and cusparseSpGEMM_copy.
  • SpVV: added support for __nv_bfloat16.
  • Deprecated Features:
  • The following functions have been removed:
  • cusparse<t>gemmi()
  • cusparseXaxpyi, cusparseXgthr, cusparseXgthrz, cusparseXroti, cusparseXsctr
  • Hybrid format enums and helper functions: cusparseHybPartition_t, cusparseHybMat_t, cusparseCreateHybMat, cusparseDestroyHybMat
  • Triangular solver enums and helper functions: cusparseSolveAnalysisInfo_t, cusparseCreateSolveAnalysisInfo, cusparseDestroySolveAnalysisInfo
  • Sparse dot product: cusparseXdoti, cusparseXdotci
  • Sparse matrix-vector multiplication: cusparseXcsrmv, cusparseXcsrmv_mp
  • Sparse matrix-matrix multiplication: cusparseXcsrmm, cusparseXcsrmm2
  • Sparse triangular-single vector solver: cusparseXcsrsv_analysis, cusparseCsrsv_analysisEx, cusparseXcsrsv_solve, cusparseCsrsv_solveEx
  • Sparse triangular-multiple vectors solver: cusparseXcsrsm_analysis, cusparseXcsrsm_solve
  • Sparse hybrid format solver: cusparseXhybsv_analysis, cusparseShybsv_solve
  • Extra functions: cusparseXcsrgeamNnz, cusparseScsrgeam, cusparseXcsrgemmNnz, cusparseXcsrgemm
  • Incomplete Cholesky Factorization, level 0: cusparseXcsric0
  • Incomplete LU Factorization, level 0: cusparseXcsrilu0, cusparseCsrilu0Ex
  • Tridiagonal Solver: cusparseXgtsv, cusparseXgtsv_nopivot
  • Batched Tridiagonal Solver: cusparseXgtsvStridedBatch
  • Reordering: cusparseXcsc2hyb, cusparseXcsr2hyb, cusparseXdense2hyb, cusparseXhyb2csc, cusparseXhyb2csr, cusparseXhyb2dense
  • The following functions have been deprecated:
  • SpGEMM: cusparseXcsrgemm2_bufferSizeExt, cusparseXcsrgemm2Nnz, cusparseXcsrgemm2
  • Math Library:
  • CUDA Math: Release 11.6:
  • New Features:
  • New half and bfloat16 APIs for addition/multiplication in round-to-nearest-even mode that do not get contracted into an fma instruction. Please see __hadd_rn, __hsub_rn, __hmul_rn, __hadd2_rn, __hsub2_rn, and __hmul2_rn in https://docs.nvidia.com/cuda/cuda-math-api/index.html.
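  • For illustration, a device kernel using one of the new intrinsics; unlike the plain + operator, __hadd_rn will not be contracted into an fma:

        #include <cuda_fp16.h>

        // Element-wise half-precision addition in round-to-nearest-even mode.
        __global__ void add_rn(const __half *a, const __half *b, __half *out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = __hadd_rn(a[i], b[i]);
        }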
  • CUDA Math: Release 11.5:
  • Deprecations:
  • The following undocumented CUDA Math APIs are deprecated and will be removed in a future release. Please consider switching to the similar intrinsic APIs documented at https://docs.nvidia.com/cuda/cuda-math-api/index.html; a brief migration sketch follows this list:
  • __device__ int mulhi(const int a, const int b)
  • __device__ unsigned int mulhi(const unsigned int a, const unsigned int b)
  • __device__ unsigned int mulhi(const int a, const unsigned int b)
  • __device__ unsigned int mulhi(const unsigned int a, const int b)
  • __device__ long long int mul64hi(const long long int a, const long long int b)
  • __device__ unsigned long long int mul64hi(const unsigned long long int a, const unsigned long long int b)
  • __device__ unsigned long long int mul64hi(const long long int a, const unsigned long long int b)
  • __device__ unsigned long long int mul64hi(const unsigned long long int a, const long long int b)
  • __device__ int float_as_int(const float a)
  • __device__ float int_as_float(const int a)
  • __device__ unsigned int float_as_uint(const float a)
  • __device__ float uint_as_float(const unsigned int a)
  • __device__ float saturate(const float a)
  • __device__ int mul24(const int a, const int b)
  • __device__ unsigned int umul24(const unsigned int a, const unsigned int b)
  • __device__ int float2int(const float a, const enum cudaRoundMode mode = cudaRoundZero)
  • __device__ unsigned int float2uint(const float a, const enum cudaRoundMode mode = cudaRoundZero)
  • __device__ float int2float(const int a, const enum cudaRoundMode mode = cudaRoundNearest)
  • __device__ float uint2float(const unsigned int a, const enum cudaRoundMode mode = cudaRoundNearest)
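  • A sketch of how a few of these calls might map onto the documented intrinsics; the pairings below follow the CUDA Math API documentation but should be verified for each call site:

        // Device-side migration examples: deprecated call -> documented intrinsic.
        __device__ void migrate_examples(int a, int b, float f, unsigned int u) {
            int   hi   = __mulhi(a, b);        // was: mulhi(a, b)
            int   bits = __float_as_int(f);    // was: float_as_int(f)
            float g    = __int_as_float(a);    // was: int_as_float(a)
            float s    = __saturatef(f);       // was: saturate(f)
            int   m24  = __mul24(a, b);        // was: mul24(a, b)
            int   r    = __float2int_rz(f);    // was: float2int(f, cudaRoundZero)
            float c    = __uint2float_rn(u);   // was: uint2float(u, cudaRoundNearest)
            (void)hi; (void)bits; (void)g; (void)s; (void)m24; (void)r; (void)c;
        }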
  • CUDA Math: Release 11.4:
  • Beginning in 2022, the NVIDIA Math Libraries' official hardware support will follow an N-2 policy, where N is an x100-series GPU.
  • CUDA Math: Release 11.3:
  • Resolved Issues:
  • Previous releases of CUDA were potentially delivering incorrect results in some Linux distributions for the following host Math APIs: sinpi, cospi, sincospi, sinpif, cospif, sincospif. If passed huge inputs like 7.3748776e+15 or 8258177.5 the results were not equal to 0 or 1. These have been corrected with this release.
  • CUDA Math: Release 11.1:
  • New Features:
  • Added host support for half and nv_bfloat16 conversions to/from integer types.
  • Added the __hcmadd() device-only API for fast half2- and nv_bfloat162-based complex multiply-accumulate.
  • CUDA Math: Release 11.0 Update 1:
  • Resolved Issues:
  • nv_bfloat16 comparison functions could trigger a fault with misaligned addresses.
  • Performance improvements in half and nv_bfloat16 basic arithmetic implementations.
  • CUDA Math: Release 11.0 RC:
  • New Features:
  • Added arithmetic support for the __nv_bfloat16 floating-point data type, with 8 bits of exponent and 7 explicit bits of mantissa.
  • Performance and accuracy improvements in single precision math functions: fmodf, expf, exp10f, sinhf, and coshf.
  • Resolved Issues:
  • Corrected documented maximum ulp error thresholds in erfcinvf and powf.
  • Improved cuda_fp16.h interoperability with Visual Studio C++ compiler.
  • Updated libdevice user guide and CUDA math API definitions for j1, j1f, fmod, fmodf, ilogb, and ilogbf math functions.
  • NVIDIA Performance Primitives (NPP):
  • NPP: Release 11.5:
  • New Features:
  • New APIs added to compute the signed anti-aliased distance transform using PBA, the anti-aliased Euclidean distance between pixel sites in images. This improves the accuracy of the distance transform.
  • nppiSignedDistanceTransformAbsPBA_xxxxx_C1R_Ctx() – where xxxxx specifies the input and output combination: 32f, 32f64f, 64f
  • New API for the absolute Manhattan distance transform, another method that improves distance-transform accuracy by using the Manhattan distance between pixels.
  • nppiDistanceTransformAbsPBA_xxxxx_C1R_Ctx() – where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f, 8u64f, 8s64f, 16u64f, 16s64f, 32f64f, 64f
  • Resolved Issues:
  • Fixed an issue in the FilterMedian() API by adding interpolation when the mask size is even.
  • Improved Contour function performance by parallelizing more of it, and also improved its quality.
  • Resolved an issue with alpha composition being used to accumulate output buffers multiple times.
  • Resolved an issue with nppiLabelMarkersUF_8u32u_C1R producing incorrect results during column processing.
  • NPP: Release 11.4:
  • New Features:
  • New API FindContours. A contour can be explained simply as a curve joining all the continuous points along a boundary that have the same color or intensity. Contours are a useful tool for shape analysis and object detection and recognition.
  • NPP: Release 11.3:
  • New Features:
  • Added nppiDistanceTransformPBA functions.
  • NPP: Release 11.2 Update 2:
  • New Features:
  • Added nppiDistanceTransformPBA functions.
  • NPP: Release 11.2 Update 1:
  • New Features:
  • New APIs added to compute Distance Transform using Parallel Banding Algorithm (PBA):
  • nppiDistanceTransformPBA_xxxxx_C1R_Ctx() – where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
  • nppiSignedDistanceTransformPBA_32f_C1R_Ctx()
  • Resolved Issues:
  • Fixed an issue in which Label Markers added the zero pixel as an object region.
  • NPP: Release 11.0:
  • New Features:
  • Batched Image Label Markers Compression that removes sparseness between marker label IDs output from the LabelMarkers call.
  • Image Flood Fill functionality fills a connected region of an image with a specified new value.
  • Stability and performance fixes to Image Label Markers and Image Label Markers Compression.
  • NPP: Release 11.0 RC:
  • New Features:
  • Batched Image Label Markers Compression that removes sparseness between marker label IDs output from the LabelMarkers call.
  • Image Flood Fill functionality fills a connected region of an image with a specified new value.
  • Added batching support for nppiLabelMarkersUF functions.
  • Added the nppiCompressMarkerLabelsUF_32u_C1IR function.
  • Added nppiSegmentWatershed functions.
  • Added sample apps on GitHub demonstrating the use of NPP application managed stream contexts along with watershed segmentation and batched and compressed UF image label markers functions.
  • Added support for non-blocking streams.
  • Resolved Issues:
  • Stability and performance fixes to Image Label Markers and Image Label Markers Compression.
  • Improved quality of nppiLabelMarkersUF functions.
  • nppiCompressMarkerLabelsUF_32u_C1IR can now handle a huge number of labels generated by the nppiLabelMarkersUF function.
  • Known Issues:
  • The nppiCopy API is limited by the CUDA thread count for large image sizes. The maximum supported size is 1,048,560 (16 * 65,535) horizontal pixels of any data type and number of channels, and 524,280 (8 * 65,535) vertical pixels, for a maximum total of 549,739,036,800 pixels.
  • NvJPEG Library:
  • NvJPEG: Release 11.5 Update 1:
  • Resolved Issues:
  • Fixed an issue in which nvcuvid released uncompressed frames, causing a memory leak.
  • NvJPEG: Release 11.4:
  • Resolved Issues:
  • Added additional subsampling support to handle NVJPEG_CSS_2x4.
  • NvJPEG: Release 11.2 Update 1:
  • New Features:
  • nvJPEG decoder added new APIs to support region of interest (ROI) based decoding for batched hardware decoder:
  • nvjpegDecodeBatchedEx()
  • nvjpegDecodeBatchedSupportedEx()
  • NvJPEG: Release 11.1 Update 1:
  • New Features:
  • Added error handling capabilities for nonstandard JPEG images.
  • NvJPEG: Release 11.0 Update 1:
  • Known Issues:
  • NVJPEG_BACKEND_GPU_HYBRID has an issue when handling bit-streams which have corruption in the scan.
  • NvJPEG: Release 11.0:
  • New Features:
  • nvJPEG allows the user to allocate separate memory pools for each chroma subsampling format. This helps avoid memory re-allocation overhead. This can be controlled by passing the newly added flag NVJPEG_FLAGS_ENABLE_MEMORY_POOLS to the nvjpegCreateEx API.
  • The nvJPEG encoder now allows the compressed bitstream to reside in GPU memory.
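  • A minimal sketch of enabling the memory pools described above, assuming the default backend and built-in allocators; error checks are omitted:

        #include <nvjpeg.h>

        nvjpegHandle_t create_handle_with_pools() {
            nvjpegHandle_t handle = nullptr;
            // Per-chroma-subsampling memory pools reduce re-allocation overhead.
            nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, /*dev_allocator=*/nullptr,
                           /*pinned_allocator=*/nullptr,
                           NVJPEG_FLAGS_ENABLE_MEMORY_POOLS, &handle);
            return handle;
        }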
  • NvJPEG: Release 11.0 RC:
  • New Features:
  • nvJPEG allows the user to allocate separate memory pools for each chroma subsampling format. This helps avoid memory re-allocation overhead. This can be controlled by passing the newly added flag NVJPEG_FLAGS_ENABLE_MEMORY_POOLS to the nvjpegCreateEx API.
  • The nvJPEG encoder now allows the compressed bitstream to reside in GPU memory.
  • Hardware accelerated decode is now supported on NVIDIA A100.
  • The nvJPEG decode API (nvjpegDecodeJpeg()) now has the flexibility to select the backend when creating nvjpegJpegDecoder_t object. The user has the option to call this API instead of making three separate calls to nvjpegDecodeJpegHost(), nvjpegDecodeJpegTransferToDevice(), and nvjpegDecodeJpegDevice().
  • Known Issues:
  • NVJPEG_BACKEND_GPU_HYBRID has an issue when handling bit-streams which have corruption in the scan.
  • Deprecated Features:
  • The following multiphase APIs have been removed:
  • nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseOne
  • nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseTwo
  • nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseThree
  • nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseOne
  • nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseTwo

New in NVIDIA CUDA Toolkit 11.5.0 (Oct 21, 2021)

  • Device-side caching behavior is now configurable with annotated pointers.
  • Prefix sums (scans) for cooperative groups: added four new functions for inclusive and exclusive scans (see the sketch after this list).
  • Support for NvSciBufGeneralAttrKey_EnableGpuCompression in CUDA will be available in r470TRD2.
  • Preview release of a new data type, __int128, usable with compatible host compilers. As it is a preview, there is no broad support for math operations, library support, dev tools, and so on.
  • Added native support for signed and unsigned normalized 8- and 16-bit types.
  • Improved interoperability with graphics frameworks: Added support for normalized integer and block-compressed data types.
  • For multi-process sharing of GPUs, CUDA now supports per-process memory access policies.
  • Linking is supported with cubins larger than 2 GB.
  • GSP-RM is enabled as opt-in for Turing+ Tesla GPUs.
  • Floating-point division is optimized when the divisor is known at compile time. This is disabled by default; enable it with nvcc -Xcicc -opt-fdiv=1.
  • GA release of CUDA Python.
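  • The cooperative-groups scan sketch referenced above, assuming the element count equals the launched thread count so that every thread of each 32-wide tile participates:

        #include <cooperative_groups.h>
        #include <cooperative_groups/scan.h>
        namespace cg = cooperative_groups;

        // Each thread contributes one value; the tile-wide prefix sums are
        // returned in registers to each thread.
        __global__ void tile_scan(const int *in, int *incl, int *excl) {
            auto tile = cg::tiled_partition<32>(cg::this_thread_block());
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            incl[i] = cg::inclusive_scan(tile, in[i]);  // sum over tile ranks 0..r
            excl[i] = cg::exclusive_scan(tile, in[i]);  // sum over tile ranks 0..r-1
        }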

New in NVIDIA CUDA Toolkit 11.3.0 (Apr 16, 2021)

  • CUDA Toolkit Major Component Versions:
  • CUDA Components:
  • Starting with CUDA 11, the various components in the toolkit are versioned independently.
  • CUDA Driver:
  • Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 2. For more information about GPU products that are CUDA capable, visit https://developer.nvidia.com/cuda-gpus.
  • Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of the CUDA will continue to work on subsequent (later) driver releases.
  • More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-runtime-and-driver-api-version.
  • General CUDA:
  • Stream ordered memory allocator enhancements. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-memory-allocator for more information.
  • CUDA Graph Enhancements:
  • Enhancements to make stream capture more flexible: Functionality to provide read-write access to the graph and the dependency information of a capturing stream, while the capture is in progress. See cudaStreamGetCaptureInfo_v2() and cudaStreamUpdateCaptureDependencies().
  • User object lifetime assistance: Functionality to assist user code in lifetime management for user-allocated resources referenced in graphs. Useful when graphs and their derivatives and asynchronous executions have an unknown/unbounded lifetime not under the control of the code that created the resource, such as libraries under stream capture. See cudaUserObjectCreate() and cudaGraphRetainUserObject() (a usage sketch follows this list), or http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-user-objects.
  • Graph Debug: New API to produce a DOT graph output from a given CUDA Graph.
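  • The user-object sketch referenced above; the resource, destructor, and reference counts are illustrative, and error checks are omitted:

        #include <cstdlib>
        #include <cuda_runtime.h>

        // Destructor callback, invoked when the last reference is released.
        static void CUDART_CB free_resource(void *ptr) { free(ptr); }

        void attach_resource(cudaGraph_t graph) {
            void *resource = malloc(1024);  // resource referenced by graph nodes
            cudaUserObject_t obj;
            cudaUserObjectCreate(&obj, resource, free_resource,
                                 /*initialRefcount=*/1, cudaUserObjectNoDestructorSync);
            // Transfer our reference to the graph; the resource now lives at
            // least as long as the graph and its instantiated executions.
            cudaGraphRetainUserObject(graph, obj, /*count=*/1, cudaGraphUserObjectMove);
        }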
  • New Stream Priorities:
  • The CUDA Driver API cuCtxGetStreamPriorityRange() now exposes a total of 6 stream priorities, up from the 3 exposed in prior releases.
  • Expose driver symbols in runtime API
  • New CUDA Driver API cuGetProcAddress() and CUDA Runtime API cudaDriverGetEntryPoint() to query the memory addresses for CUDA Driver API functions.
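  • A minimal sketch of the driver-side query as of CUDA 11.3; the symbol name and version number are illustrative:

        #include <cuda.h>

        void *query_entry_point() {
            void *pfn = nullptr;
            // Ask for the CUDA 11.3 (11030) version of cuMemAlloc.
            cuGetProcAddress("cuMemAlloc", &pfn, 11030, CU_GET_PROC_ADDRESS_DEFAULT);
            return pfn;  // nullptr if the symbol is not found
        }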
  • Support for virtual aliasing across kernel boundaries.
  • Added support for Ubuntu 20.04.2 on x86_64 and Arm sbsa platforms.
  • CUDA Tools:
  • CUDA Compilers:
  • cu++filt demangler tool
  • NVRTC versioning changes
  • Preview support for alloca().
  • Nsight Eclipse Plugin:
  • Eclipse versions 4.10 to 4.14 are currently supported in CUDA 11.3.
  • CUDA Libraries:
  • cuFFT Library:
  • cuFFT shared libraries are now linked statically against libstdc++ on Linux platforms.
  • Improved performance of certain sizes (multiples of large powers of 3, powers of 11) on SM86.
  • cuSPARSE Library:
  • Added the new routine cusparseSpSV for sparse triangular solve with better performance. The new Generic API supports:
  • CSR storage format
  • Non-transpose, transpose, and transpose-conjugate operations
  • Upper, lower fill mode
  • Unit, non-unit diagonal type
  • 32-bit and 64-bit indices
  • Uniform data type computation
  • NVIDIA Performance Primitives (NPP):
  • Added nppiDistanceTransformPBA functions.
  • Deprecated Features:
  • The following features are deprecated in the current release of the CUDA software. The features still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
  • CUDA Libraries:
  • cuSPARSE: cusparseScsrsv2_analysis, cusparseScsrsv2_solve, cusparseXcsrsv2_zeroPivot, and cusparseScsrsv2_bufferSize have been deprecated in favor of cusparseSpSV.
  • Tools:
  • Nsight Eclipse Plugin: Docker support is deprecated in Eclipse 4.14 and earlier versions as of CUDA 11.3, and Docker support will be dropped for Eclipse 4.14 and earlier in a future CUDA Toolkit release.
  • Resolved Issues:
  • General CUDA:
  • Historically, the CUDA driver has serialized most APIs operating on the same CUDA context between CPU threads. In CUDA 11.3, this has been relaxed for kernel launches such that the driver serialization may be reduced when multiple CPU threads are launching CUDA kernels into distinct streams within the same context.
  • cuRAND Library:
  • Fixed inconsistency between random numbers generated by GPU and host generators when CURAND_ORDERING_PSEUDO_LEGACY ordering is selected for certain generator types.
  • CUDA Math API:
  • Previous releases of CUDA were potentially delivering incorrect results in some Linux distributions for the following host Math APIs: sinpi, cospi, sincospi, sinpif, cospif, sincospif. If passed huge inputs like 7.3748776e+15 or 8258177.5 the results were not equal to 0 or 1. These have been corrected with this release.
  • Known Issues:
  • cuBLAS Library:
  • The planar complex matrix descriptor for batched matmul has inconsistent interpretation of batch offset.
  • Mixed precision operations with reduction scheme CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE (might be automatically selected based on problem size by cublasSgemmEx() or cublasGemmEx() too, unless CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION math mode bit is set) not only stores intermediate results in output type but also accumulates them internally in the same precision, which may result in lower than expected accuracy. Please use CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK or CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION if this results in numerical precision issues in your application.
  • cuFFT Library:
  • cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
  • Plans with strides, primes larger than 127 in FFT size decomposition and total size of transform including strides bigger than 32GB produce incorrect results.
  • cuSOLVER Library:
  • For values N<=16, cusolverDn[S|D|C|Z]syevjBatched hits an out-of-bounds access and may deliver the wrong result. The workaround is to pad the matrix A with a diagonal matrix D such that the dimension of [A 0 ; 0 D] is bigger than 16. The diagonal entry D(j,j) must be larger than the maximum eigenvalue of A, for example, norm(A, ‘fro’). After the syevj, W(0:n-1) contains the eigenvalues and A(0:n-1,0:n-1) contains the eigenvectors.

New in NVIDIA CUDA Toolkit 11.2.2 (Mar 11, 2021)

  • cuSPARSE:
  • Known Issues:
  • cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with NULL argument could cause segmentation fault on Windows.
  • Resolved Issues:
  • cusparseDestroy(NULL) no longer crashes on Windows.
  • NPP:
  • New Features:
  • Added nppiDistanceTransformPBA functions.
  • cuFFT:
  • Known Issues:
  • cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
  • Plans with strides, primes larger than 127 in FFT size decomposition and total size of transform including strides bigger than 32GB produce incorrect results.

New in NVIDIA CUDA Toolkit 11.2.1 (Feb 10, 2021)

  • CUDA Compiler:
  • Resolved Issues:
  • Previously, when using recent versions of VS 2019 host compiler, a call to pow(double, int) or pow(float, int) in host or device code sometimes caused build failures. This issue has been resolved.
  • CuSOLVER:
  • New Features:
  • A new singular value decomposition routine, GESVDR, was added. GESVDR computes a partial spectrum with random sampling, an order of magnitude faster than GESVD.
  • libcusolver.so no longer links against libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so but breaks backward compatibility: the user has to link libcusolver.so with the correct version of libcublas.so.
  • CuSPARSE:
  • New Features:
  • New Tensor Core-accelerated Block Sparse Matrix - Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
  • New algorithms for CSR/COO Sparse Matrix - Vector Multiplication (cusparseSpMV) with better performance.
  • New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix - Matrix Multiplication (cusparseSpMM) with better performance especially for small matrices.
  • New routine for Sampled Dense Matrix - Dense Matrix Multiplication (cusparseSDDMM) which deprecated cusparseConstrainedGeMM and provides better performance.
  • Better accuracy of cusparseAxpby, cusparseRot, cusparseSpVV for bfloat16 and half regular/complex data types.
  • All routines support NVTX annotation for enhancing the profiler timeline on complex applications.
  • Deprecations:
  • cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
  • cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
  • COO Array of Structure (CooAoS) format has been deprecated including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.
  • Known Issues:
  • cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with NULL argument could cause segmentation fault on Windows.
  • Resolved Issues:
  • cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, cusparseSpMV now support zero-size matrices.
  • cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
  • cusparseXcsr2csr_compress now uses 2-norm for the comparison of complex values instead of only the real part.
  • Extended functionalities for cusparseSpMV:
  • Support for the CSC format.
  • Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
  • Support for mixed regular-complex data type computation.
  • Support for deterministic and non-deterministic computation.
  • NPP:
  • New Features:
  • New APIs added to compute Distance Transform using the Parallel Banding Algorithm (PBA): nppiDistanceTransformPBA_xxxxx_C1R_Ctx() – where xxxxx specifies the input and output combination (8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f) – and nppiSignedDistanceTransformPBA_32f_C1R_Ctx().
  • Resolved Issues:
  • Fixed an issue in which Label Markers added the zero pixel as an object region.
  • NVJPEG:
  • New Features:
  • nvJPEG decoder added new APIs to support region of interest (ROI) based decoding for the batched hardware decoder: nvjpegDecodeBatchedEx() and nvjpegDecodeBatchedSupportedEx().
  • CuFFT:
  • Known Issues:
  • cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
  • Plans with strides, primes larger than 127 in FFT size decomposition and total size of transform including strides bigger than 32GB produce incorrect results.
  • Resolved Issues:
  • Previously, reduced performance of power-of-2 single precision FFTs was observed on GPUs with sm_86 architecture. This issue has been resolved.
  • Large prime factors in size decomposition and real to complex or complex to real FFT type no longer cause cuFFT plan functions to fail.
  • CUPTI:
  • Deprecations early notice:
  • The following functions are scheduled to be deprecated in 11.3 and will be removed in a future release:
  • NVPW_MetricsContext_RunScript and NVPW_MetricsContext_ExecScript_Begin from the header nvperf_host.h.
  • cuptiDeviceGetTimestamp from the header cupti_events.h.

New in NVIDIA CUDA Toolkit 10.0.130 (Sep 20, 2018)

  • CUDA TOOLKIT MAJOR COMPONENTS:
  • Compiler:
  • The CUDA-C and CUDA-C++ compiler, nvcc, is found in the bin/ directory. It is built on top of the NVVM optimizer, which is itself built on top of the LLVM compiler infrastructure. Developers who want to target NVVM directly can do so using the Compiler SDK, which is available in the nvvm/ directory.
  • Please note that the following files are compiler-internal and subject to change without any prior notice:
  • Any file in include/crt and bin/crt
  • include/common_functions.h, include/device_double_functions.h, include/device_functions.h, include/host_config.h, include/host_defines.h, and include/math_functions.h
  • nvvm/bin/cicc
  • bin/cudafe++, bin/bin2c, and bin/fatbinary
  • Tools:
  • The following development tools are available in the bin/ directory (except for Nsight Visual Studio Edition (VSE) which is installed as a plug-in to Microsoft Visual Studio).
  • IDEs: nsight (Linux, Mac), Nsight VSE (Windows)
  • Debuggers: cuda-memcheck, cuda-gdb (Linux), Nsight VSE (Windows)
  • Profilers: nvprof, nvvp, Nsight VSE (Windows)
  • Utilities: cuobjdump, nvdisasm, gpu-library-advisor
  • Libraries:
  • The scientific and utility libraries listed below are available in the lib/ directory (DLLs on Windows are in bin/), and their interfaces are available in the include/ directory:
  • cublas (BLAS)
  • cublas_device (BLAS Kernel Interface)
  • cuda_occupancy (Kernel Occupancy Calculation [header file implementation])
  • cudadevrt (CUDA Device Runtime)
  • cudart (CUDA Runtime)
  • cufft (Fast Fourier Transform [FFT])
  • cupti (Profiling Tools Interface)
  • curand (Random Number Generation)
  • cusolver (Dense and Sparse Direct Linear Solvers and Eigen Solvers)
  • cusparse (Sparse Matrix)
  • npp (NVIDIA Performance Primitives [image and signal processing])
  • nvblas ("Drop-in" BLAS)
  • nvcuvid (CUDA Video Decoder [Windows, Linux])
  • nvgraph (CUDA nvGRAPH [accelerated graph analytics])
  • nvml (NVIDIA Management Library)
  • nvrtc (CUDA Runtime Compilation)
  • nvtx (NVIDIA Tools Extension)
  • thrust (Parallel Algorithm Library [header file implementation])
  • CUDA Samples:
  • Code samples that illustrate how to use various CUDA and library APIs are available in the samples/ directory on Linux and Mac, and are installed to C:\ProgramData\NVIDIA Corporation\CUDA Samples on Windows. On Linux and Mac, the samples/ directory is read-only and the samples must be copied to another location if they are to be modified. Further instructions can be found in the Getting Started Guides for Linux and Mac.
  • Documentation:
  • The most current version of these release notes can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html. Also, the version.txt file in the root directory of the toolkit will contain the version and build number of the installed toolkit.
  • Documentation can be found in PDF form in the doc/pdf/ directory, or in HTML form at doc/html/index.html and online at http://docs.nvidia.com/cuda/index.html.
  • CUDA Driver:
  • Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. For more information about GPU products that are CUDA capable, visit https://developer.nvidia.com/cuda-gpus. Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases. More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-runtime-and-driver-api-version.
  • CUDA-GDB Sources:
  • For CUDA Toolkit 7.0 and newer, the sources are in the extras/ directory of the installation. The directory is created by default during the toolkit installation unless the .rpm or .deb package installer is used, in which case the cuda-gdb-src package must be manually installed.
  • For CUDA Toolkit 6.5, 6.0, and 5.5, at https://github.com/NVIDIA/cuda-gdb.
  • For CUDA Toolkit 5.0 and earlier, at ftp://download.nvidia.com/CUDAOpen64/.
  • Upon request by sending an e-mail to [email protected].
  • RELEASE NOTES:
  • General CUDA:
  • CUDA 10.0 adds support for the Turing architecture (compute_75 and sm_75).
  • CUDA 10.0 adds support for new programming constructs called CUDA Graphs, a new asynchronous task-graph programming model that enables more efficient launch and execution. See the API documentation for more information.
  • Warp matrix functions now support additional matrix shapes 32x8x16 and 8x32x16. Warp matrix functions also include the ability (experimental in CUDA 10.0) to perform sub-byte operations (4-bit unsigned, 4-bit signed, and 1-bit) using the Tensor Cores (see the sketch after this list).
  • Added support for CUDA-Vulkan and CUDA-DX12 interoperability APIs.
  • Added support for a new instruction nanosleep that suspends a thread for a specified duration.
  • Added version 6.3 of the Parallel Thread Execution instruction set architecture (ISA). For more details on new instructions (sm_75 target, wmma, nanosleep, FP16 atomics) and deprecated instructions, see this section in the PTX documentation.
  • Starting with CUDA 10.0, the CUDA runtime is compatible with specific older NVIDIA drivers. A new package called “cuda-compat-<version>” is included in the toolkit installer packages. For more information on compatibility, see the section in the Best Practices Guide.
  • The following new operating systems are supported by CUDA. See the System Requirements section in the NVIDIA CUDA Installation Guide for Linux for a full list of supported operating systems:
  • Ubuntu 18.04 LTS*
  • Ubuntu 14.04 LTS
  • SUSE SLES 15
  • OpenSUSE Leap 15
  • *Ubuntu 18.04 LTS support on POWER 9 is in technology preview for CUDA 10.0 and may have issues on POWER 9 systems. This configuration will be fully supported in a later release of the CUDA driver in Q4 2018.
  • Added support for peer-to-peer (P2P) with CUDA on Windows (WDDM 2.0+ only).
  • Added a new CUDA sample to demonstrate multi-device cooperative group APIs.
  • CUDA samples are now also available on GitHub: https://github.com/NVIDIA/cuda-samples.
  • Added APIs to retrieve the LUID of CUDA devices (cuDeviceGetLuid).
  • Added cudaLimitMaxL2FetchGranularity in the device management APIs (cudaDeviceSetLimit/cudaDeviceGetLimit) to set the maximum fetch granularity of L2 (in bytes).
  • The cudaDeviceProp struct now includes the device UUID.
  • Added support for synchronization across multiple devices with Cooperative Groups (cuLaunchCooperativeKernelMultiDevice) on Windows in TCC mode.
  • Added the ability to lock clocks in nvidia-smi and NVML (nvmlDeviceSetGpuLockedClocks and nvmlDeviceResetGpuLockedClocks APIs). The following commands can be used in nvidia-smi:
  • $ nvidia-smi -lgc/--lock-gpu-clock <minGpuClock, maxGpuClock>
  • $ nvidia-smi -rgc/--reset-gpu-clock
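  • The warp matrix sketch referenced above, using one of the newly added shapes (M=32, N=8, K=16) with half inputs and float accumulation; the leading dimensions assume densely packed tiles:

        #include <mma.h>
        using namespace nvcuda;

        // One warp multiplies a 32x16 A tile by a 16x8 B tile into a 32x8 C tile.
        __global__ void wmma_32x8x16(const half *a, const half *b, float *c) {
            wmma::fragment<wmma::matrix_a, 32, 8, 16, half, wmma::row_major> fa;
            wmma::fragment<wmma::matrix_b, 32, 8, 16, half, wmma::col_major> fb;
            wmma::fragment<wmma::accumulator, 32, 8, 16, float> fc;
            wmma::fill_fragment(fc, 0.0f);
            wmma::load_matrix_sync(fa, a, /*ldm=*/16);  // A: 32x16, row-major
            wmma::load_matrix_sync(fb, b, /*ldm=*/16);  // B: 16x8, column-major
            wmma::mma_sync(fc, fa, fb, fc);
            wmma::store_matrix_sync(c, fc, /*ldm=*/8, wmma::mem_row_major);
        }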
  • CUDA Tools. CUDA Compilers:
  • The following compilers are supported as host compilers in nvcc:
  • Clang 6.0
  • Microsoft Visual Studio 2017* (RTW, Update 8 and later)
  • Xcode 9.4
  • XLC 16.1.x
  • ICC 18
  • PGI 18.x (with -std=c++14 mode)
  • *Starting with CUDA 10.0, nvcc supports all versions of Visual Studio 2017 (past and upcoming updates)
  • A new libNVVM API function, nvvmLazyAddModuleToProgram, is introduced. This function should be used for adding the libdevice module (and other similar modules) to a program to make it more efficient. Refer to the libNVVM specification for more information about this function and its use.
  • nvcc has added the --extensible-whole-program (or -ewp) option. This can be used to do whole-program optimizations (as are done in default compiles) while still allowing calls to certain external functions, particularly the functions in libcudadevrt. With this option one can use CUDA device parallelism features without using separate compilation. However, for this option a link must happen. This link happens automatically when nvcc is used to create the executable, but if you are using the host linker directly, then you will need to perform an explicit nvcc -dlink step.
  • Warp matrix functions (wmma) were first introduced in PTX ISA version 6.0 as a preview feature, and are now fully supported retroactively from PTX ISA version 6.0 onwards.
  • CUDA Tools. CUDA Developer Tools:
  • CUDA 10.0 now includes Nsight Compute, a suite of new developer tools for profiling and debugging supported on Windows, Linux and Mac.
  • For new features (support for Turing compute capability, profiling of OpenMP applications, CUDA graphs) in CUPTI, see the Changelog section in the CUPTI documentation.
  • For new features in CUDA-MEMCHECK, see the Release Notes section in the CUDA-MEMCHECK documentation.
  • OpenMP tools interface is now supported in nvprof.
  • Profiler now supports version 3 of NVIDIA Tools Extension API (NVTX).
  • CUDA Libraries:
  • This release of the toolkit includes optimized libraries for Turing architecture and performance-tuned across single and multi-GPU environments. Also this release introduces a new library, nvJPEG, for GPU accelerated hybrid JPEG decoding.
  • CUDA Libraries. nvJPEG:
  • A new library for GPU-accelerated JPEG decoding. nvJPEG supports decoding of single and batched images, color space conversion, multiple phase decoding, and hybrid decoding using both CPU and GPU.
  • Provides low-latency decoder for common JPEG formats used in computer vision applications such as image classification, object detection and image segmentation.
  • nvJPEG can accelerate data loading and pre-processing for DL training with GPU-accelerated augmentation such as translation, zoom, scale, crop, flip, and others.
  • nvJPEG can perform JPEG decoding and resizing in real-time for applications that demand low-latency deep learning inference.
  • Key Features include:
  • Hybrid decoding using both the CPU and the GPU
  • Single image and batched image decoding
  • Color space conversion to RGB, BGR, RGBI, BGRI, and YUV, and
  • Single and multi-phase decoding
  • CUDA Libraries. cuFFT Library:
  • This release includes strong FFT performance scaling across 8- and 16-GPU systems.
  • This release also enables applications with large FFT models to be used on dense GPU systems.
  • Added support for R2C and C2R for multi-GPU, and improved performance of 1D/2D/3D R2C/C2R for both single and multi-GPU FFTs.
  • CUDA Libraries. cuBLAS Library:
  • Includes Turing architecture-optimized mixed-precision GEMMs for Deep Learning applications.
  • Added batched GEMV (General Matrix Vector Multiplication) support for mixed precision (FP16 input and output, and FP32 accumulation) to enable deep learning RNNs using attention models.
  • Several improvements made to API logging, including logging the following formats:
  • Append print,
  • Print local time,
  • Append printing version,
  • Append synchronization via mutex use.
  • Added different levels of logging where detailed information can be printed about gemm, such as:
  • Tensor-Core vs. Non-Tensor Core
  • Tile sizes and other performance options that are used internally, and
  • Grid dimensions and kernel name.
  • CUDA Libraries. NVIDIA Performance Primitives (NPP):
  • Added batched resize for 32F and U8 image formats.
  • This release also extends the image resize functionality to batches of variable sized regions of interest (ROI).
  • Improved performance for several image processing primitives.
  • CUDA Libraries. cuSOLVER Library:
  • Improved performance of the following dense linear algebra routines:
  • Cholesky factorization,
  • Symmetric eigensolver,
  • Generalized symmetric eigensolver, and
  • QR factorization.
  • DEPRECATED FEATURES:
  • The following features are deprecated in the current release of the CUDA software. The features will still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
  • General CUDA:
  • Support for RHEL 6.x is now deprecated starting with CUDA 10.0. It may be dropped in a future release of CUDA. Customers are encouraged to adopt RHEL 7.x to use new versions of CUDA.
  • Usage of the following attributes of the struct returned by cudaPointerGetAttributes is deprecated and will be removed in the next release of CUDA: memoryType and isManaged.
  • The following compilers are no longer supported as host compilers for nvcc:
  • PGI 17.x
  • Microsoft Visual Studio 2010
  • Clang versions lower than 3.7
  • Microsoft Visual Studio 2011 is now deprecated as a host compiler for nvcc. Support may be removed in a future release of CUDA.
  • 32-bit tools are no longer supported starting with CUDA 10.0.
  • Usage of non-sync versions of warp instructions (shfl and vote) is deprecated on Volta+ architectures.
  • Installation of the CUDA Toolkit using the .run package no longer supports in-execution permission elevation to admin privileges.
  • The following samples are no longer available in the CUDA toolkit:
  • simpleDevLibCUBLAS
  • cdpLUDecomposition
  • Nsight Eclipse Edition is deprecated and may be dropped in a future release of CUDA. The following files may no longer be available in a future release:
  • /usr/local/cuda/libnsight
  • /usr/local/cuda/bin/nsight
  • /usr/local/cuda/doc/html/nsight-eclipse-edition-getting-started-guide
  • /usr/local/cuda/doc/pdf/Nsight_Eclipse_Edition_Getting_Started.pdf
  • CUDA Libraries:
  • nvGRAPH library is deprecated. The library will no longer be shipped in the future releases of CUDA.
  • RESOLVED ISSUES:
  • General CUDA:
  • Fixed a bug in conjugateGradientMultiDeviceCG sample where it would fail to run on some Volta GPUs.
  • Fixed an issue with the GPU memory resource consumption of the MPS server on Volta V100.
  • Fixed an issue that caused a 6% performance difference in FAHBench (DP_average test) with OpenCL.
  • Fixed a memory allocation issue in the CUDA runtime (cudart) library that was reported as an error by AddressSanitizer.
  • Fixed an issue when NVML would not show all active NVLinks (via nvidia-smi nvlink -s) after a reboot of the system.
  • Fixed an issue with the CUDA sample reduction (under 6_Advanced/reduction), where the sample would fail when reducing less than 16384 elements.
  • CUDA Tools:
  • Fixed a bug in nvprof where --print-nvlink-topology would crash on POWER 8 systems.
  • Fixed an issue with nvprof where system memory related metrics would fail to be collected on Pascal GPUs.
  • Fixed an issue where nvprof would result in a crash when no kernel is specified in an nvtx range.
  • Fixed an issue where the --profile-child-processes option with nvprof would result in no profile data being collected.
  • Fixed a NullPointerException in some OpenMP codes when running the Visual Profiler.
  • Fixed an issue with Nsight Eclipse Edition that was preventing auto-complete from working with CUDA.
  • Fixed an issue in nvvp where the CPU profiling button should be disabled if tracing is enabled. This issue would result in a failure to generate timelines.
  • Fixed an issue in nvvp to show the correct topology and link information for GPUs in a POWER 9 system.
  • Fixed an issue where profiling of CUDA applications would result in invalid profiling results on Azure GPU instances.
  • Fixed an issue in ptxas where the vmax instruction in specific cases (vmax2.s32.s32.s32) would result in incorrect results.
  • Fixed an issue in nvcc where an error would be reported in some cases in relaxed_constexpr mode for references to host variables in device code.
  • Fixed an issue in the visual profiler where the CPU profiling does not default correctly for PGI.
  • Fixed an issue where cuda-gdb would assert when running some OpenACC codes with break_on_launch set.
  • Fixed an issue where nvprof was incorrectly calculating the percentage overhead for NVLink (see nvlink_overhead_* metrics).
  • Fixed an issue which prevented tracing and profiling (event and metric collection) for multidevice cooperative kernels, that is, kernels launched by using the API functions cudaLaunchCooperativeKernelMultiDevice or cuLaunchCooperativeKernelMultiDevice.
  • Fixed an issue in nvcc where an internal compiler error is encountered when generating debug info for code containing a pointer to data member or member function.
  • Fixed an issue where nvcc would return an error for a template that is defined inside an unnamed namespace.
  • Fixed an issue in the CUDA compiler where using __shfl_xor_sync() in some cases would cause kernels to hang on Volta GPUs.
  • CUDA Libraries:
  • Fixed the following performance issues with cuBLAS:
  • SGEMM (FP32) on V100/Ubuntu 16.04,
  • INT8 GEMM on P4 for small input m and n sizes, and
  • Batched CGEMM performance on Tesla V100.
  • Fixed performance issue in cuRAND for Philox one-at-a-time random number generation.
  • Improved cuBLAS performance and heuristics selection for large K matrices.
  • Added METIS reordering in the cuSOLVER sample for sparse linear solver.
  • Fixed an issue with cuSOLVER QR factorization (geqrf) that sometimes returned R factors containing a non numeric value.
  • Fixed overflow and underflow issues in cuSOLVER householder reflection and givens rotations.
  • Fixed an issue with cuBLAS SGEMM algorithm 102 that occasionally returned non-deterministic results.
  • Fixed an issue in cuSOLVER cusolverDnDsyevj that returned incorrectly sorted eigenvalues.
  • Improved dense SVD performance in cuSOLVER for tall skinny sizes by adding QR when m>n.
  • Improved performance of dense QR with domino scheme in cuSOLVER.
  • Fixed an issue in NPP nppiCountInRange routine that returned NPP_RANGE_ERROR when called from multiple threads.
  • Fixed missing check for uplo parameter in cuSOLVER routines cusolverDn<X>potrf and cusolverDn<X>potrf_bufferSize that resulted in incorrect return code.
  • Fixed an issue with cuBLAS batched GEMM where it selected wrong kernels for bandwidth bound GEMMs.
  • Fixed an issue in Thrust where merge_by_key uses the wrong element type in the default comparator.
  • KNOWN ISSUES:
  • General CUDA:
  • Ubuntu 18.04 LTS support on POWER 9 is in technology preview for CUDA 10.0 and may have issues on POWER 9 systems. This configuration will be fully supported in a later release of the CUDA driver in Q4 2018.
  • Using the CUDA Graphs stream capture APIs with CUDA libraries (e.g. cuBLAS) may result in errors. This issue will be fixed in a patch update to CUDA 10.0.
  • CUDA Samples:
  • cuda-memcheck may report uninitialized reads in CUB’s radix sort in the CUDA sample smokeParticles.
  • Some graphic CUDA samples may terminate with an error when switching to the desktop using the Windows+D key combination.
  • CUDA Tools:
  • For known issues in cuda-memcheck, see the Known Issues section in the cuda-memcheck documentation.

New in NVIDIA CUDA Toolkit 9.2.88 (Jul 9, 2018)

  • Speed up recurrent and convolutional neural networks through cuBLAS optimizations
  • Speed up FFT of prime size matrices through Bluestein kernels in cuFFT
  • Accelerate custom linear algebra algorithms with CUTLASS 1.0
  • Launch CUDA kernels up to 2X faster than CUDA 9 with new optimizations to the CUDA runtime

New in NVIDIA CUDA Toolkit 9.0.176 (Sep 28, 2017)

  • New features:
  • CUDA 9 now supports Multi-Process Service (MPS) on Volta GPUs. For information about enhancements to Volta MPS, see Multi-Process Service (http://docs.nvidia.com/deploy/mps/) in the NVIDIA GPU Deployment and Management Documentation.
  • A code sample for the new CUDA C++ warp matrix functions has been added.
  • CUDA Tools:
  • CUDA Compilers. Microsoft Visual Studio 2017 (starting with Update 1) support is beta. Only the RTM version (vc15.0) is fully supported.
  • CUDA Compilers. An attempt to define a __global__ function in a friend declaration now generates an NVCC diagnostic. Previously, compilation would fail at the host compilation step.
  • Unsupported features (general CUDA):
  • CUDA library. The built-in functions __float2half_rn() and __half2float() have been removed. Use equivalent functionality in the updated fp16 header file from the CUDA toolkit.
  • CUDA library. cuBLAS GemmEx routines, namely cublas<t>gemm extensions for mixed precision, are supported only on GPUs based on the Maxwell or later architectures. These routines are not supported on GPUs based on the Kepler architecture, namely Tesla K40 or K80.
  • The environment variable for disabling unified memory, CUDA_DISABLE_UNIFIED_MEMORY, is no longer supported.
  • Resolved issues (general CUDA):
  • MPS Server returns an exit status of 1 when it successfully exits.
  • The performance of cudaLaunchCooperativeKernelMultiDevice() APIs has been improved.
  • Strict alias warnings when using GCC to compile code that uses __half data types (cuda_fp16.h) have been disabled.
  • The Warp Intrinsics __shfl() function for FP16 data types now has a *_sync equivalent.
  • CUDA libraries:
  • The cufftXtMalloc() API now allocates the correct amount of memory for multi-GPU 2D and 3D plans.

New in NVIDIA CUDA Toolkit 8.x (Aug 10, 2017)

  • 8.0.61.2 NEW FEATURES:
  • cuBLAS Library:
  • This update contains performance enhancements and bug fixes to the cuBLAS library in CUDA Toolkit 8. Deep Learning applications based on Recurrent Neural Networks (RNNs) and Fully Connected Networks (FCNs) will benefit from new GEMM kernels and improved heuristics in this release.
  • This update supports the x86_64 architecture on Linux, Windows, and Mac OS operating systems, and the ppc64le architecture on Linux only.
  • The highlights of this update are as follows:
  • Performance enhancements for GEMM matrices used in speech and natural language processing
  • Integration of OpenAI GEMM kernels
  • Improved GEMM heuristics to select optimized algorithms for given input sizes
  • Heuristic fixes for batched GEMMs
  • GEMM performance bug fixes for Pascal and Kepler platforms
  • 8.0.61 NEW FEATURES:
  • CUDA Tools:
  • CUDA Compilers. The CUDA compiler now supports Xcode 8.2.1.
  • NVRTC. NVRTC is no longer considered a preview feature.
  • CUDA Libraries:
  • cuBLAS. The cuBLAS library added a new function cublasGemmEx(), which is an extension of cublas<t>gemm(). It allows the user to specify the algorithm, as well as the precision of the computation and of the input and output matrices. The function can be used to perform matrix-matrix multiplication at lower precision, as sketched below.
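  • A minimal sketch of a cublasGemmEx() call (the wrapper name gemm_ex_example is hypothetical; error handling omitted; assumes the CUDA 8-era identifiers CUDA_R_32F and CUBLAS_GEMM_DFALT):

      #include <cublas_v2.h>
      #include <library_types.h>

      // C = A * B in single precision, with the algorithm selected
      // explicitly (here the default) and the compute/input/output
      // precisions spelled out. Matrices are column-major.
      void gemm_ex_example(cublasHandle_t handle,
                           const float *dA, const float *dB, float *dC,
                           int m, int n, int k)
      {
          const float alpha = 1.0f, beta = 0.0f;
          cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                       &alpha,
                       dA, CUDA_R_32F, m,   // A: m x k
                       dB, CUDA_R_32F, k,   // B: k x n
                       &beta,
                       dC, CUDA_R_32F, m,   // C: m x n
                       CUDA_R_32F,          // computation precision
                       CUBLAS_GEMM_DFALT);  // let cuBLAS pick the kernel
      }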
  • RESOLVED ISSUES:
  • General CUDA:
  • CUDA Installer. On some SLES or openSUSE system configurations, the NVIDIA GL library package may need to be locked before the steps for a GL-less installation are performed. The NVIDIA GL library package can be locked with this command: sudo zypper addlock nvidia-glG04
  • Unified memory. On GP10x systems, applications that use cudaMallocManaged() and attempt to use cuda-gdb will incur random spurious MMU faults that will take down the application.
  • Unified memory. Functions cudaMallocHost() and cudaHostRegister() don't work correctly on multi-GPU systems with the IOMMU enabled on Linux. The only workaround is to disable unified memory support with the CUDA_DISABLE_UNIFIED_MEMORY=1 environment variable.
  • Unified memory. Fixed an issue where cuda-gdb or cuda-memcheck would crash when used on an application that calls cudaMemPrefetchAsync().
  • Unified memory. Fixed a potential issue that could cause an application to hang when using cudaMemPrefetchAsync().
  • CUDA Tools:
  • CUDA Compilers. Fixed an issue with wrong code generation for computing the address of an array when using a 64-bit index.
  • CUDA Compilers. When a program is compiled with whole program optimization, applying launch bounds to recursive functions or to indirect function calls may have unpredictable results.
  • CUDA Profiler. The PC sampling warp state counts were incorrect in some cases.
  • CUDA Profiler. Profiling applications using nvprof or Visual Profiler on systems without an NVIDIA driver resulted in an error. This is now reported as a warning.
  • cuSOLVER. Fixed an issue with the cuSOLVER library where some of its functions were not exposed, resulting in link errors.
  • NVTX. The NVIDIA Tools Extension SDK (NVTX) function nvtxGetExportTable() was missing from the export table list.
  • CUDA Libraries:
  • cuBLAS. Fixed GEMM performance issues on Kepler and Pascal for different matrix sizes, including small batches. Note that this fix is available only in the cuBLAS packages on the CUDA network repository.
  • cuBLAS. Updated the cuBLAS headers to use comments that are in compliance with ANSI C standards.
  • cuBLAS. Made optimizations for mixed-precision (FP16, INT8) matrix-matrix multiplication of matrices with a small number of columns (n).
  • cuBLAS. Fixed an issue with the trsm() function for large-sized matrices.
  • KNOWN ISSUES:
  • General CUDA:
  • CUDA library. Function cuDeviceGetP2PAttribute() was not published in the CUDA library stub (libcuda.so). Until a new build of the toolkit is issued, users can either use the runtime version, cudaDeviceGetP2PAttribute(), or link against libcuda directly instead of the stub (usually this can be done by adding -L/usr/lib64).
  • CUDA Tools:
  • CUDA Profiler. When a device is in the "exclusive process" compute mode, the profiler may fail to collect events or metrics in "application replay" mode. In this case, use "kernel replay" mode.
  • CUDA Profiler. In the Visual Profiler, the Run > Configure Metrics and Events... dialog does not work for devices with NVLink support. It is suggested to collect all metrics and events using nvprof and then import them into nvvp.
  • CUDA Profiler, CUPTI. Some devices with compute capability 6.1 don't support multi-context scope collection for metrics. This issue affects nvprof, Visual Profiler, and CUPTI.

New in NVIDIA CUDA Toolkit 5.5.20 (Aug 1, 2013)

  • General CUDA:
  • MPS (Multi-Process Service) is a runtime service designed to let multiple MPI (Message Passing Interface) processes using CUDA run concurrently on a single GPU in a way that's transparent to the MPI program. A CUDA program runs in MPS mode if the MPS control daemon is running on the system. When a CUDA program starts, it connects to the MPS control daemon (if possible), which then creates an MPS server for the connecting client if one does not already exist for the user (UID) that launched the client.
  • With the CUDA 5.5 Toolkit, there are some restrictions that are now enforced that may cause existing projects that were building on CUDA 5.0 to fail. For projects that use -Xlinker with nvcc, you need to ensure the arguments after -Xlinker are quoted. In CUDA 5.0, -Xlinker -rpath /usr/local/cuda/lib would succeed; in CUDA 5.5 -Xlinker "-rpath /usr/local/cuda/lib" is now necessary.
  • The Toolkit is using a new installer on Windows. The installer is able to install any selection of components and to customize the installation locations per user request.
  • The CUDA Sample projects have makefiles that are now more self-contained and robust.
  • The following documents are now available in the CUDA toolkit documentation portal:
  • Programming guides: CUDA Video Encoder, CUDA Video Decoder, Developer Guide to Optimus, Parallel Thread Execution (PTX) ISA, Using Inline PTX Assembly in CUDA, NPP Library Programming Guide
  • Tools manuals: CUDA Binary Utilities
  • White papers: Floating-Point and IEEE 754 Compliance, Incomplete-LU and Cholesky Preconditioned Iterative Methods
  • Compiler SDK: libNVVM API, libdevice User's Guide, NVVM IR Specification
  • General: CUDA Toolkit Release Notes, End-User License Agreements
  • CUDA Libraries:
  • CUBLAS:
  • The routines cublas{S,D,C,Z}getriBatched() and cublas{S,D,C,Z}matinvBatched() have been added to the CUBLAS Library.
  • Routine cublas{S,D,C,Z}getriBatched() must be called after the batched LU factorization routine, cublas{S,D,C,Z}getrfBatched(), to obtain the inverse matrices (a minimal call sequence is sketched below). The routine cublas{S,D,C,Z}matinvBatched() does a direct inversion with pivoting based on the Gauss-Jordan algorithm, but is limited to matrices of small dimension.
  • The limitation on the dimension n of the routine cublas{S,D,C,Z}getrfBatched() has been removed. However, for performance reasons it is still recommended to use this routine for small values of n, typically n < 256.
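  • A minimal sketch of the batched factorize-then-invert sequence (the wrapper name invert_batch is hypothetical; d_Aarray and d_Carray are device arrays of device pointers, one per matrix; error handling omitted):

      #include <cublas_v2.h>

      void invert_batch(cublasHandle_t handle,
                        float **d_Aarray,   // in: n x n matrices, overwritten by LU
                        float **d_Carray,   // out: inverses
                        int n, int batch,
                        int *d_pivots, int *d_info)
      {
          // Batched LU factorization with partial pivoting, in place.
          cublasSgetrfBatched(handle, n, d_Aarray, n, d_pivots, d_info, batch);
          // Form the inverses from the LU factors.
          cublasSgetriBatched(handle, n, (const float **)d_Aarray, n,
                              d_pivots, d_Carray, n, d_info, batch);
      }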
  • CUFFT:
  • CUFFT 5.5 extends the existing API. The new calls allow creation of a CUFFT plan handle separate from the actual creation of the plan, allow insertion of new calls to set plan attributes before the work of plan creation is done, and allow advanced users more control over memory space allocation. Details can be found in the CUFFT Library User's Guide.
  • CUFFT 5.5 provides FFTW3 interfaces that enable applications using FFTW to gain performance with NVIDIA CUFFT with minimal changes to program source code. The CUFFT Library User's Guide documents which FFTW3 API features are supported.
  • CUPTI:
  • The CUPTI (CUDA Profiling Tools Interface) release notes are now part of this document. On Windows, compiling and running the CUPTI samples using the included makefiles requires the Cygwin environment.
  • Changes Incompatible with CUPTI 4.0:
  • A number of non-backward compatible API changes were made in CUPTI 4.1. These changes require minor source modifications to existing code compiled against CUPTI 4.0. In addition, some previously incorrect and undefined behavior is now prevented by improved error checking. Your code may need to be modified to handle these new error cases.
  • Multiple CUPTI subscribers are not allowed. In CUPTI 4.0, cuptiSubscribe() could be used to enable multiple subscriber callback functions to be active at the same time. When multiple callback functions were subscribed, invocation of those callbacks did not respect the domain registration for those callback functions. In CUPTI 4.1 and later, cuptiSubscribe() returns CUPTI_ERROR_MAX_LIMIT_REACHED if there is already an active subscriber.
  • The CUpti_EventID values for Tesla devices have changed in CUPTI 4.1 to make all CUpti_EventID values unique across all devices. Going forward, CUpti_EventID values will be added for new devices and events, but existing values will not be changed. If your application has stored CUpti_EventID values (for example, as part of the data collected for a profiling session), those CUpti_EventIDs must be translated to the new ID values before being used in CUPTI 4.1 and later APIs.
  • In enumeration CUpti_EventDomainAttribute, CUPTI_EVENT_DOMAIN_MAX_EVENTS has been removed. The number of events in an event domain can be retrieved with cuptiEventDomainGetNumEvents().
  • Routines cuptiDeviceGetAttribute(), cuptiEventGroupGetAttribute(), and cuptiEventGroupSetAttribute() now take a size parameter, and the value parameter now has type void * (see the sketch after this list).
  • Routine cuptiEventDomainGetAttribute() no longer takes a CUdevice parameter. This function is now used to get event domain attributes that are device independent. A new function cuptiDeviceGetEventDomainAttribute() has been added to get event domain attributes that are device dependent.
  • Routines cuptiEventDomainGetNumEvents(), cuptiEventDomainEnumEvents(), and cuptiEventGetAttribute() no longer take a CUdevice parameter.
  • The contextUid field of the CUpti_CallbackData structure has been changed from type uint64_t to type uint32_t.
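  • A minimal sketch of the new size/void* calling convention (the function name print_max_event_id is hypothetical):

      #include <cupti.h>
      #include <stdint.h>
      #include <stdio.h>

      // Query a device attribute: the value parameter is a void *
      // accompanied by an explicit size, per the CUPTI 4.1 API.
      void print_max_event_id(CUdevice device)
      {
          uint32_t maxEventId = 0;
          size_t size = sizeof(maxEventId);
          cuptiDeviceGetAttribute(device, CUPTI_DEVICE_ATTR_MAX_EVENT_ID,
                                  &size, &maxEventId);
          printf("max event id: %u\n", maxEventId);
      }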
  • CURAND:
  • CURAND 5.5 introduces support for the Philox4x32-10 random number generator.
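  • A minimal host-API sketch using the new generator (the wrapper name fill_uniform is hypothetical; error handling omitted):

      #include <curand.h>

      // Fill a device buffer with uniform floats from Philox4x32-10.
      void fill_uniform(float *d_out, size_t n, unsigned long long seed)
      {
          curandGenerator_t gen;
          curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
          curandSetPseudoRandomGeneratorSeed(gen, seed);
          curandGenerateUniform(gen, d_out, n);
          curandDestroyGenerator(gen);
      }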
  • CUSPARSE:
  • The routine cusparse{S,D,C,Z}csrmm2() is an API extension of cusparse{S,D,C,Z}csrmm() which allows the matrix B to be passed in a transposed form. This can bring up to a 2× speedup in performance due to the better memory access efficiency of a transposed matrix B.
  • The cusparse{S,D,C,Z}gtsv() tridiagonal solver routines have been replaced with a version that supports pivoting; the previous version has been renamed cusparse{S,D,C,Z}gtsv_nopivot() to better reflect that it does not support pivoting (a minimal call is sketched after this list). The new algorithm has been developed by Liwen Wang from the Impact Group of the University of Illinois.
  • The routine cusparse{S,D,C,Z}bsrxmv() is an extension of the routine cusparse{S,D,C,Z}bsrmv() that allows the matrix-vector product to be performed on a submatrix. This routine also works for blocks of dimension 1 (CSR format).
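  • A minimal sketch of the pivoting tridiagonal solve (the wrapper name solve_tridiag is hypothetical; assumes the CUDA 5.5-era cusparseSgtsv() signature; all pointers are device pointers):

      #include <cusparse_v2.h>

      // Solve A * X = B for a tridiagonal A; dl, d, du are the lower,
      // main, and upper diagonals, and B (m x nrhs) is overwritten
      // with the solution.
      void solve_tridiag(cusparseHandle_t handle, int m, int nrhs,
                         const float *dl, const float *d, const float *du,
                         float *B, int ldb)
      {
          cusparseSgtsv(handle, m, nrhs, dl, d, du, B, ldb);
      }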
  • CUDA Tools:
  • CUDA Compiler:
  • The following changes have been made to the CUDA Compiler SDK:
  • An optimizing compiler library (libnvvm.so, nvvm.dll/nvvm.lib, libnvvm.dylib) and its header file nvvm.h are provided for compiler developers who want to generate PTX from a program written in NVVM IR, which is a compiler internal representation based on LLVM.
  • A set of libraries, libdevice.*.bc, that implement the common math functions for devices in the LLVM bitcode format are provided.
  • A set of samples that illustrate the use of the compiler SDK are provided.
  • Documents for the CUDA Compiler SDK (including the specification for LLVM IR, an API document for libnvvm, and an API document for libdevice) are provided.
  • The default nvcc.profile no longer includes -lcudart (on Linux and Mac OS X) and cudart.lib (on Windows), and the use of the CUDA runtime is now controlled by the option --cudart (-cudart). Consequently, the option --dont-use-profile (-noprof) no longer prevents nvcc from linking the object files against the CUDA runtime when the default nvcc.profile is used, and the option --cudart=none (-cudart=none) needs to be used instead. If the option --cudart=none (-cudart=none) is not specified, --cudart=static (-cudart=static) is assumed, and nvcc links the object files against the static CUDA runtime.
  • CUDA 5.5 adds support for JIT linking. This can be done explicitly by using the driver API (see the cuLink* routines in the CUDA Driver API documentation); alternatively, runtime apps that use separate compilation will automatically JIT to a newer architecture if needed (see the Separate Compilation chapter in the CUDA Compiler Driver NVCC document). JIT linking requires rebuilding all objects with the 5.5 toolkit.
  • Clang is now supported as a host compiler on Mac OS. To use Clang as the host compiler, invoke nvcc with -ccbin=path-to-clang-executable. There are some features that are not yet supported: Clang language extensions (see http://clang.llvm.org/docs/LanguageExtensions.html), LLVM libc++ (only GNU libstdc++ is currently supported), language features introduced in C++11, and the __global__ function template explicit instantiation definition.
  • CUDA-GDB:
  • To represent the parent/child kernel information, two commands were added. The info cuda launch trace command shows the trace of kernel launches that leads to the kernel in focus by default. It is the equivalent of the backtrace command for function calls. The info cuda launch children command shows the list of kernels launched by the kernel in focus (by default).
  • Multiple CUDA-GDB instances can be now used for debugging ranks of an MPI application that uses a separate GPU for each rank. Each CUDA-GDB instance should be invoked with the option --cuda-use-lockfile=0, which allows multiple CUDA-GDB instances to exist simultaneously.
  • The list of threads returned by the info cuda threads command can now be narrowed to the threads currently at a breakpoint. To enable the filter, the keyword breakpoint can simply be added as an option to the info cuda threads command.
  • The info cuda contexts command was added. The command lists all the CUDA contexts the debugger is aware of and their respective status (active or not).
  • CUDA-MEMCHECK:
  • Return code cudaErrorNotReady can be returned by cudaStreamQuery() and cudaEventQuery() in the case where the stream/event being waited on is still busy. This return code is not an error condition and is used by user programs to poll until the stream/event is ready. CUDA-MEMCHECK will no longer report the following conditions as errors when CUDA API call checking is enabled:
  • cudaErrorNotReady returned by CUDA runtime API calls
  • CUDA_ERROR_NOT_READY returned by CUDA driver API calls
  • The racecheck tool in CUDA-MEMCHECK now has support for SM 3.5 devices. The racecheck-report mode option of the racecheck tool can be used to enable the generation of analysis records.
  • CUDA-MEMCHECK now supports displaying error information as errors occur during program execution instead of waiting for program termination to display output.
  • CUDA Profiler:
  • The NVIDIA Visual Profiler now supports applications that use CUDA Dynamic Parallelism. The application timeline includes both host-launched and device-launched kernels, and shows the parent-child relationship between kernels.
  • The application analysis performed by the NVIDIA Visual Profiler has been enhanced. A guided analysis mode has been added that provides step-by-step analysis and optimization guidance. Also, the analysis results now include graphical visualizations to more clearly indicate the optimization opportunities.
  • The NVIDIA Visual Profiler and the command-line profiler, nvprof, now support power, thermal, and clock profiling.
  • The NVIDIA Visual Profiler and the command-line profiler, nvprof, now support metrics that report the floating-point operations performed by a kernel. These metrics include both single-precision and double-precision counts for adds, multiplies, multiply-accumulates, and special floating-point operations.
  • The NVIDIA command-line profiler, nvprof, now supports collection of any number of events and metrics during a single run of a CUDA application. It uses kernel replay to execute each kernel as many times as necessary to collect all the requested profile data.
  • The NVIDIA command-line profiler, nvprof, now supports profiling of all CUDA processes executed on a system. In this "profile all processes" mode, a user starts nvprof on a system and all CUDA applications subsequently launched by that user are profiled.
  • Debugger API:
  • Two new symbols are introduced to control the behavior of the application: CUDBG_ENABLE_LAUNCH_BLOCKING and CUDBG_ENABLE_INTEGRATED_MEMCHECK. The two symbols, when set to 1, have the same effect as setting the environment variables CUDA_LAUNCH_BLOCKING and CUDA_MEMCHECK to 1. Both symbols also have the same restriction: the change takes effect on the next run of the application.
  • Software preemption is available as a BETA. The option is enabled by setting the symbol CUDBG_ENABLE_PREEMPTION_DEBUGGING to 1. The option is used to debug a CUDA application on the same GPU that is rendering the desktop GUI.
  • Software preemption (BETA) enables debugging of long-running or indefinite CUDA kernels that would otherwise encounter a launch timeout.
  • Software preemption (BETA) allows multiple debugger sessions to simultaneously debug CUDA applications on the same GPU. This feature is available on Linux with devices of compute capability 3.5.
  • The parent grid information for each kernel is now available as either a new field in the kernelReady event, or as a field in the newly created CUDBGGridInfo struct, which is retrievable via the new getGridInfo() call. Both models, push and pull, complement each other and should be used hand-in-hand to get the most accurate and recent information about the status of a kernel in the application.
  • To reduce the number of times the debugger stops and resumes the application, the debugger API can be made to defer non-essential host kernel launch notifications instead of producing events in the synchronous event queue. This behavior is controlled with the new setKernelLaunchNotificationMode() function call. When set to CUDBG_KNL_LAUNCH_NOTIFY_DEFER, the debugger will not receive kernelReady events for every kernel launch. Instead, the debugger must reconstruct this information by calling getGridInfo() for every previously unseen grid present on the device the next time it stops.
  • The gridId is now available as a 64-bit value. New fields and new API functions were added to cover the new type. The old 32-bit values are still accessible but are now deprecated. Whenever possible the 64-bit gridId should be used.
  • Nsight Eclipse Edition:
  • The Nsight Eclipse Edition debugger now provides a memory viewer for both host and device memory. The memory viewer supports a number of different data types, including floating point.
  • Nsight Eclipse Edition now provides CUDA Dynamic Parallelism support for both new and existing projects.
  • For applications that use CUDA Dynamic Parallelism, the Nsight Eclipse Edition debugger now shows the parent/child launch trace for device-launched kernels.
  • Nsight Eclipse Edition now includes the Remote System Explorer plug-in. This plug-in enables accessing of remote systems for file transfer, shell access, and listing running processes.
  • Nsight Eclipse Edition is updated to use Eclipse Platform 3.8.2 and Eclipse CDT 8.1.2, introducing a number of new features and enhancements to existing features.
  • Performance Improvements:
  • CUDA Libraries:
  • CUBLAS:
  • The cublastrsv() routines have been significantly optimized with the work of Jonathan Hogg from The Science and Technology Facilities Council (STFC). Subsequently, cublastrsm() was updated to use some of these optimizations in some cases.
  • Math:
  • The performance of the double-precision functions fmod(), remainder(), and remquo() has been significantly improved for sm_30.
  • Resolved issues:
  • General CUDA:
  • In CUDA 5.5 the library versioning has been changed on Mac and Windows. Please refer to section 15.4, Distributing the CUDA Runtime and Libraries, in the CUDA C Best Practices Guide.
  • When the default CUDA 5.0 Windows installer option to silently install the NVIDIA display driver is used, an error message like "display driver has failed to install" may be displayed for certain hardware configurations. If this error message occurs, the installation can be completed by installing the display driver separately using the setup.exe saved under C:\NVIDIA\DisplayDriver\....
  • In certain hardware configurations, the CUDA 5.0 installer on Windows may fail to install the display driver. This failure occurs when the user disables silent installation of the display driver and instead chooses to interactively select the components of the display driver from the installer UI that appears after the CUDA toolkit and samples are installed. If the UI for interactive selection of the display driver components fails to appear, please reinstall just the display driver by running setup.exe saved under C:\NVIDIA\DisplayDriver\....
  • CUDA Libraries:
  • NPP:
  • The NPP ColorTwist_32f_8u_P3R primitive does not work properly for line strides that are not 64-byte aligned. This issue can be worked around by using the image memory allocators provided by the NPP library.
  • CUDA Tools:
  • All user-loaded modules, as well as modules containing system calls, are exposed via the debug API to retain backwards compatibility with existing CUDA toolkits. Other driver internal modules are not exposed.
  • The hardware counter (event) values may be incorrect in some cases on GPUs with compute capability (SM type) 3.5. Incorrect event values also result in incorrect metric values. These errors are more likely to occur when the same GPU is used for display and compute, or when other graphics applications are running simultaneously on the GPU.
  • Old-style cubin support in cuobjdump has been deprecated by removing the -cubin and -fname options, and removing support for fatbin versions less than 4.
  • CUDA-GDB:
  • Conditional breakpoints can now be set before the device ELF image is loaded. The conditions may include built-in variables such as threadIdx and blockIdx. The conditional device breakpoints will be marked as pending until they can be resolved to a device code address.
  • A new error, CUDBG_ERROR_NO_DEVICE_AVAILABLE, will be returned at initialization time if no CUDA-capable device can be found.
  • Debugger API:
  • A new error, CUDBG_ERROR_NO_DEVICE_AVAILABLE, will be returned at initialization time if no CUDA-capable device can be found.

New in NVIDIA CUDA Toolkit 5.0.27 RC (Sep 12, 2012)

  • General CUDA:
  • CUDA 5.0 introduces support for Dynamic Parallelism, which is a significant enhancement to the CUDA programming model. Dynamic Parallelism allows a kernel to launch and synchronize with new grids directly from the GPU using CUDA's standard <<<...>>> syntax. A broad subset of the CUDA runtime API is now available on the device, allowing launch, synchronization, streams, events, and more.
  • The use of a character string to indicate a device symbol, which was possible with certain API functions, is no longer supported. Instead, the symbol should be used directly.
  • The cudaStreamAddCallback() routine introduces a mechanism to perform work on the CPU after work is finished on the GPU, without polling.
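  • For example, a minimal sketch of a stream callback (the names on_done and launch_with_callback are hypothetical):

      #include <cuda_runtime.h>
      #include <stdio.h>

      // Host function invoked once all preceding work in the stream
      // has completed; no polling of the stream is required.
      static void CUDART_CB on_done(cudaStream_t stream, cudaError_t status,
                                    void *userData)
      {
          printf("GPU work finished: %s\n", cudaGetErrorString(status));
      }

      void launch_with_callback(cudaStream_t stream)
      {
          // ... enqueue kernels / async copies on stream here ...
          cudaStreamAddCallback(stream, on_done, /*userData=*/NULL, /*flags=*/0);
      }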
  • CUDA Libraries:
  • CUBLAS:
  • In addition to the usual CUBLAS Library host interface that supports all architectures, the CUDA toolkit now delivers a static CUBLAS library (cublas_device.a) that provides the same interface but is callable from the device from within kernels. The device interface is only available on Kepler II because it uses the Dynamic Parallelism feature to launch kernels internally. More details can be found in the CUBLAS Documentation.
  • The CUBLAS library now supports routines cublas{S,D,C,Z}getrfBatched(), for batched LU factorization with partial pivoting, and cublas{S,D,C,Z}trsmBatched(), a batched triangular solver. Those two routines are restricted to matrices of dimension <= 32x32.
  • The cublasCsyr(), cublasZsyr(), cublasCsyr2(), and cublasZsyr2() routines were added to the CUBLAS library to compute complex and double-complex symmetric rank 1 updates and complex and double-complex symmetric rank 2 updates respectively. Note, cublasCher(), cublasZher(), cublasCher2(), and cublasZher2() were already supported in the library and are used for Hermitian matrices.
  • The cublasCsymv() and cublasZsymv() routines were added to the CUBLAS library to compute symmetric complex and double-complex matrix-vector multiplication. Note, cublasChemv() and cublasZhemv() were already supported in the library and are used for Hermitian matrices.
  • A pair of utilities were added to the CUBLAS API for all data types. The cublas{S,C,D,Z}geam() routines compute the weighted sum of two optionally transposed matrices. The cublas{S,C,D,Z}dgmm() routines compute the multiplication of a matrix by a purely diagonal matrix (represented as a full matrix or with a packed vector).
  • CUSPARSE:
  • Routines to achieve addition and multiplication of two sparse matrices in CSR format have been added to the CUSPARSE Library.
  • The combination of the routines cusparse{S,D,C,Z}csrgemmNnz() and cusparse{S,C,D,Z}csrgemm() computes the multiplication of two sparse matrices in CSR format. Although the transpose operations on the matrices are supported, only the multiplication of two non-transpose matrices has been optimized. For the other operations, an actual transpose of the corresponding matrices is done internally.
  • The combination of the routines cusparse{S,D,C,Z}csrgeamNnz() and cusparse{S,C,D,Z}csrgeam() computes the weighted sum of two sparse matrices in CSR format.
  • The location of the csrVal parameter in the cusparse{S,D,C,Z}csrilu0() and cusparse{S,D,C,Z}csric0() routines has changed. It now corresponds to the parameter ordering used in other CUSPARSE routines, which represent the matrix in CSR-storage format (csrVal, csrRowPtr, csrColInd).
  • The cusparseXhyb2csr() conversion routine was added to the CUSPARSE library. It allows the user to verify that the conversion to HYB format was done correctly.
  • The CUSPARSE library has added support for two preconditioners that perform incomplete factorizations: incomplete LU factorization with no fill in (ILU0), and incomplete Cholesky factorization with no fill in (IC0). These are supported by the new functions cusparse{S,C,D,Z}csrilu0() and cusparse{S,C,D,Z}csric0(),respectively.
  • The CUSPARSE library now supports a new sparse matrix storage format called Block Compressed Sparse Row (Block-CSR). In contrast to plain CSR which encodes all non-zero primitive elements, the Block-CSR format divides a matrix into a regular grid of small 2-dimensional sub-matrices, and fully encodes all sub-matrices that have any non-zero elements in them. The library supports conversion between the Block-CSR format and CSR via cusparse{S,C,D,Z}csr2bsr() and cusparse{S,C,D,Z}bsr2csr(), and matrix-vector multiplication of Block-CSR matrices via cusparse{S,C,D,Z}bsrmv().
  • Math:
  • Single-precision normcdff() and double-precision normcdf() functions were added. They calculate the standard normal cumulative distribution function.
  • Single-precision normcdfinvf() and double-precision normcdfinv() functions were also added. They calculate the inverse of the standard normal cumulative distribution function.
  • The sincospi(x) and sincospif(x) functions have been added to the math library to calculate the double- and single-precision results, respectively, for both sin(x * PI) and cos(x * PI) simultaneously. The performance of sincospi{f}(x) should generally be faster than calling sincos{f}(x * PI), and faster than calling sinpi{f}(x) and cospi{f}(x) separately.
  • Intrinsic __frsqrt_rn(x) has been added to compute the reciprocal square root of single-precision argument x, with the single-precision result rounded according to the IEEE-754 rounding mode "nearest or even".
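  • A minimal device-code sketch exercising the new math functions (the kernel name math_demo is hypothetical):

      // Uses normcdff(), sincospif(), and the __frsqrt_rn() intrinsic.
      __global__ void math_demo(const float *x, float *cdf, float *s,
                                float *c, float *rsq, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              cdf[i] = normcdff(x[i]);        // standard normal CDF
              sincospif(x[i], &s[i], &c[i]);  // sin(x*PI) and cos(x*PI) together
              rsq[i] = __frsqrt_rn(x[i]);     // 1/sqrt(x), round-to-nearest-even
          }
      }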
  • NPP:
  • The NPP library in the CUDA 5.0 release contains more than 1000 new basic image processing primitives, which include broad coverage for converting colors, copying and moving images, and calculating image statistics.
  • Added support for a new filtering-mode for Rotate primitives: NPPI_INTER_CUBIC2P_CATMULLROM
  • This filtering mode uses cubic Catmull-Rom splines to compute the weights for reconstruction. This and the other two CUBIC2P filtering modes are based on the 1988 SIGGRAPH paper "Reconstruction Filters in Computer Graphics" by Don P. Mitchell and Arun N. Netravali. At this point NPP only supports Catmull-Rom filtering for Rotate.
  • CUDA Tools:
  • (Windows) The file fatbinary.h has been released with the CUDA 5.0 Toolkit. The file, which replaces __cudaFatFormat.h, describes the format used for all fat binaries since CUDA 4.0.
  • CUDA Compiler:
  • Starting with this release, the compiler checks the execution space compatibility among multiple declarations of the same function and generates warnings or errors based on the three rules described below (illustrated in the sketch after the rules).
  • Generates a warning if a function that was previously declared as __host__ (either implicitly or explicitly) is redeclared with __device__ or with __host__ __device__. After the redeclaration the function is treated as __host__ __device__.
  • Generates a warning if a function that was previously declared as __device__ is redeclared with __host__ (either implicitly or explicitly) or with __host__ __device__. After the redeclaration the function is treated as __host__ __device__.
  • Generates an error if a function that was previously declared as __global__ is redeclared without __global__, or vice versa.
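  • A hypothetical illustration of the three rules (not meant to compile; the last pair is rejected):

      void f();             // implicitly __host__
      __device__ void f();  // warning: f() is now treated as __host__ __device__

      __device__ void g();
      __host__ void g();    // warning: g() is now treated as __host__ __device__

      __global__ void k();
      void k();             // error: __global__ mismatch between declarations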
  • With this release, NVCC allows more than one command-line switch that specifies a compilation phase, unless there is a conflict. Known conflicts are as follows:
  • --lib cannot be used with --link or --run.
  • --device-link and --generate-dependencies cannot be used with other options that specify final compilation phases.
  • When multiple compilation phases are specified, NVCC stops processing upon the completion of the compilation phase that is reached first. For example, "nvcc --compile --ptx" is equivalent to "nvcc --ptx", and "nvcc --preprocess --fatbin" is equivalent to "nvcc --preprocess".
  • Separate compilation and linking of device code is now supported.
  • CUDA-GDB:
  • (Linux and Mac OS) CUDA-GDB fully supports Dynamic Parallelism, a new feature introduced with the 5.0 Toolkit. The debugger is able to track kernels launched from another kernel and to inspect and modify their variables like any CPU-launched kernel.
  • When the environment variable CUDA_DEVICE_WAITS_ON_EXCEPTION is used, the application runs normally until a device exception occurs. The application then waits for the debugger to attach itself to it for further debugging.
  • Inlined subroutines are now accessible from the debugger on SM 2.0 and above. The user can inspect the local variables of those subroutines and visit the call frame stack as if the routines were not inlined.
  • Checking the error codes of all CUDA driver API and CUDA runtime API function calls is vital to ensure the correctness of a CUDA application. Now the debugger is able to report, and even stop, when any API call returns an error.
  • It is now possible to attach the debugger to a CUDA application that is already running. It is also possible to detach it from the application before letting it run to completion. When attached, all the usual features of the debugger are available to the user, just as if the application had been launched from the debugger.
  • CUDA-MEMCHECK:
  • CUDA-MEMCHECK, when used from within the debugger, now displays the address space and the address of the faulty memory access.
  • CUDA-MEMCHECK now displays the backtrace on the host and device when an error is discovered.
  • CUDA-MEMCHECK now detects double free() and invalid free() on the device.
  • The precision of the reported errors for local, shared, and global memory accesses has been improved.
  • CUDA-MEMCHECK now reports leaks originating from the device heap.
  • CUDA-MEMCHECK now reports error codes returned by the runtime API and the driver API in the user application.
  • CUDA-MEMCHECK now supports reporting data access hazards in shared memory. Use the "--tool racecheck" command-line option to activate.
  • NVIDIA Nsight Eclipse Edition:
  • (Linux and Mac OS) Nsight Eclipse Edition is an all-in-one development environment that allows developing, debugging, and optimizing CUDA code in an integrated UI environment.
  • NVIDIA Visual Profiler, Command Line Profiler:
  • A new tool, nvprof, is now available in release 5.0 for collecting profiling information from the command-line.

New in NVIDIA CUDA Toolkit 4.2.9 (Apr 19, 2012)

  • This release of the CUDA Toolkit enables development on GPUs based on the Kepler architecture, such as the GeForce GTX 680.

New in NVIDIA CUDA Toolkit 4.1.21 RC (Jan 12, 2012)

  • Try the new compiler:
  • New LLVM-based compiler delivers up to 10% faster performance for many applications
  • New & Improved “drop-in” acceleration with GPU-Accelerated Libraries:
  • Over 1000 new image processing functions in the NPP library
  • New cuSPARSE tri-diagonal solver up to 10x faster than MKL on a 6 core CPU
  • New support in cuRAND for MRG32k3a and Mersenne Twister (MTGP11213) RNG algorithms
  • Bessel functions now supported in the CUDA standard Math library
  • Up to 2x faster sparse matrix vector multiply using ELL hybrid format
  • Learn more about all the great GPU-Accelerated Libraries
  • Enhanced & Redesigned Developer Tools:
  • Redesigned Visual Profiler with automated performance analysis and expert guidance
  • CUDA-GDB support for multi-context debugging and assert() in device code
  • CUDA-MEMCHECK now detects out of bounds access for memory allocated in device code
  • Parallel Nsight 2.1 CUDA warp watch visualizes variables and expressions across an entire CUDA warp
  • Parallel Nsight 2.1 CUDA profiler now analyzes kernel memory activities, execution stalls and instruction throughput
  • Learn more about debugging and performance analysis tools for GPU developers on our CUDA Tools and Ecosystem Summary Page
  • Advanced Programming Features:
  • Access to 3D surfaces and cube maps from device code
  • Enhanced no-copy pinning of system memory; cudaHostRegister() alignment and size restrictions removed (see the sketch after this list)
  • Peer-to-peer communication between processes
  • Support for resetting a GPU without rebooting the system in nvidia-smi
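  • A minimal sketch of no-copy pinning with cudaHostRegister() (the wrapper name pinned_copy is hypothetical; error handling omitted):

      #include <cuda_runtime.h>
      #include <stdlib.h>

      // Pin an ordinary malloc'd buffer so it can feed fast async
      // copies; as of 4.1 the pointer and size need not satisfy the
      // old alignment/size restrictions.
      void pinned_copy(float *d_dst, size_t bytes, cudaStream_t stream)
      {
          float *h_buf = (float *)malloc(bytes);
          cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);
          // ... fill h_buf ...
          cudaMemcpyAsync(d_dst, h_buf, bytes, cudaMemcpyHostToDevice, stream);
          cudaStreamSynchronize(stream);
          cudaHostUnregister(h_buf);
          free(h_buf);
      }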
  • New & Improved SDK Code Samples:
  • simpleP2P sample now supports peer-to-peer communication with any Fermi GPU
  • New grabcutNPP sample demonstrates interactive foreground extraction using iterated graph cuts
  • New samples showing how to implement the Horn-Schunck Method for optical flow, perform volume filtering, and read cube map texture

New in NVIDIA CUDA Toolkit 4.0.17 (Jan 12, 2012)

  • Easier Application Porting:
  • Share GPUs across multiple threads
  • Use all GPUs in the system concurrently from a single host thread
  • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
  • C++ new/delete and support for virtual functions
  • Support for inline PTX assembly
  • Thrust library of templated performance primitives such as sort, reduce, etc
  • NVIDIA Performance Primitives (NPP) library for image/video processing
  • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming:
  • Unified Virtual Addressing
  • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools:
  • Automated Performance Analysis in Visual Profiler
  • C++ debugging in CUDA-GDB for Linux and MacOS
  • GPU binary disassembler for Fermi architecture (cuobjdump)
  • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features

New in NVIDIA CUDA Toolkit 4.0 RC (Jan 12, 2012)

  • Easier Application Porting
  • Share GPUs across multiple threads
  • Use all GPUs in the system concurrently from a single host thread
  • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
  • C++ new/delete and support for virtual functions
  • Support for inline PTX assembly
  • Thrust library of templated performance primitives such as sort, reduce, etc
  • NVIDIA Performance Primitives (NPP) library for image/video processing
  • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming
  • Unified Virtual Addressing
  • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
  • Automated Performance Analysis in Visual Profiler
  • C++ debugging in CUDA-GDB
  • GPU binary disassembler for Fermi architecture (cuobjdump)
  • Support for XP on notebooks is being phased out and is therefore not available for this release. Please refer to the Release Notes and Getting Started Guides for more information. Windows developers should be sure to check out the new debugging and profiling features in Parallel Nsight for Visual Studio. Also, cuda-gdb is now available for both Linux and MacOS.

New in NVIDIA CUDA Toolkit 3.2.16 (Nov 23, 2010)

  • New Features:
  • New CUDA Libraries
  • CUSPARSE, supporting sparse matrix computations.
  • CURAND, supporting random number generation for both host and device code with Sobol' quasi-random and XORWOW pseudo-random routines.
  • CUFFT performance-tuned radix-3, -5, and -7 transform sizes on Fermi architecture GPUs.
  • CUBLAS performance tuned for Fermi architecture GPUs, especially for matrix multiplication of all datatypes and transpose variations.
  • H.264 encode/decode libraries that were previously available in the GPU Computing SDK are now part of the CUDA Toolkit.
  • CUDA Driver and CUDA C Runtime
  • Support for new 6GB Quadro and Tesla products
  • Cross-stream synchronization
  • Development Tools
  • Support for debugging GPUs with more than 4GB device memory
  • Miscellaneous
  • Support for malloc() and free() in device code (see the sketch after this list)
  • Integrated Tesla Compute Cluster (TCC) support in standard Windows driver packages
  • NVIDIA System Management Interface (nvidia-smi) support for reporting % GPU busy and several GPU performance counters
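  • A minimal sketch of device-side allocation (the kernel name scratch_demo is hypothetical; requires compiling for the Fermi architecture, e.g. nvcc -arch=sm_20):

      // Each thread allocates a scratch buffer from the device heap,
      // uses it, and frees it.
      __global__ void scratch_demo(int n)
      {
          int *buf = (int *)malloc(n * sizeof(int));
          if (buf == NULL) return;  // device heap exhausted
          for (int i = 0; i < n; ++i)
              buf[i] = i * threadIdx.x;
          free(buf);
      }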
  • Performance Improvements:
  • CUFFT Related
  • Added Radix-7 CUFFT support for GROMACS in CUDA Toolkit 3.2. Added optimizations for transform sizes that contain prime factors of 3, 5, and 7. Transform sizes that can be expressed as 2^i * 3^j * 5^k * 7^l, where i, j, k, l are non-negative integers, execute much faster than transform sizes that contain other prime factors. Previous releases of CUFFT were optimized only for sizes of the form 2^i.
  • CUBLAS GEMM Related
  • Increased performance for GEMM kernels for non-block-multiple input sizes, achieved through MAGMA-licensed code. See the Acknowledgements section toward the end of this release notes document.
  • The performance of the CUBLAS routine CGEMM has been significantly improved on Fermi architecture for sizes larger than 300x300. Peak performance is reached when 'k' is a multiple of 16 and 'm' and 'n' are multiples of 64.
  • Performance for ZGEMM has been improved on the Fermi architecture for sizes greater than 256x256. Peak performance is reached when 'k' is a multiple of 8 and 'm' and 'n' are multiples of 32.
  • The performance of the CUBLAS routine DGEMM has significantly improved for the Tesla products based on the Fermi architecture (C20XX, S20XX, M20XX). The peak performance can be achieved for all transpose variations (NN, NT, TN, TT) when the following conditions are met: 'm' and 'n' dimensions are a multiple of 64, the 'k' dimension is a multiple of 16, ((m+n)*k) > (2*784*784). The performance of the CUBLAS routine SGEMM has also been significantly improved on Fermi architecture. The peak performance can be achieved for all transpose variations (NN, NT, TN, TT) when the following conditions are met: 'm' and 'n' dimensions are a multiple of 96, the 'k' dimension is a multiple of 16, ((m+n)*k) > (2*673*673).
  • CUBLAS Related
  • The performance of CUBLAS routines {S,D,C,Z}SYRK and {C,Z}HERK on the Fermi architecture has been significantly improved. These routines have been derived respectively from their {S,D,C,Z}GEMM counterparts and have the same requirements to achieve peak performance.
  • Improved the performance of many Level 1 BLAS functions in the CUDA CUBLAS library.
  • Note that functions that implement a reduction such as *dot, *min, and *max are not improved.
  • Other
  • The performance of round-to-nearest double precision reciprocals in device code has been improved by more than 50% for both Tesla and Fermi-class architectures.
  • Bug Fixes:
  • Fixed: CUDA_*_PATH environment variables get resolved properly when used within a Windows command prompt.
  • CUDA Driver Related
  • Excessive use of device printf can cause new Timeout Detection and Recovery (TDR) errors to be observed. If a kernel is close to exceeding its run time limit, adding printf may push the kernel over its limit causing it to fail. The more printfs that are added, the more likely this is to occur. If a kernel runs fine without calls to printf, but sees cudaErrorLaunchTimeout or CUDA_ERROR_LAUNCH_TIMEOUT errors when calls to printf are added, then the number of printfs should be reduced or parts of the kernel should be bypassed to bring the execution time below the run time limit.
  • This is due to the mechanism in WinVista/Win7's WDDM display driver model called "Timeout Detection and Recovery". See http://www.microsoft.com/whdc/device/displ...dm_timeout.mspx for details. The Microsoft webpage shows the registry keys you can use to change these settings. Note that you have to reboot for changes to the regkeys to take effect.
  • In previous versions, a 2D texture size of 65536 failed for compute capability 2.0. The limit has been fixed to be in line with the hardware capability.
  • CUDA Library Related
  • On Tesla-architecture GPUs, the cublasSgemm, cublasDgemm, and cublasZgemm routines would fail with an "Unspecified Launch Error" in some cases when the 'k' parameter was not a multiple of 16. This is now fixed.
  • Note: The cublasCgemm routine was not affected by this bug.
  • If Beta=0 for SSYRK and DSYRK, the 3.2 Release Candidate could produce incorrect results on Fermi in some cases. This has been fixed in the current 3.2 RC2 release.
  • Fixed: Performance of the CHERK function was reported as being slower in the CUDA 3.2 Toolkit Release Candidate compared to the 3.1 release.
  • Previous versions of CUFFT would possibly fail if the input or output data pointers were not aligned to a multiple of 256 bytes. Now, CUFFT requires 8-byte alignment for single-precision input data and 16-byte alignment for double-precision input data.
  • CUDA CUFFT: In the previous version, the CUFFT library would sometimes generate incorrect results on systems with multiple GPUs where the GPUs were of different types, e.g., a system with a GTX 275 and a GTX 480. This issue has been fixed; CUFFT detects the "active" device and correctly configures execution accordingly.
  • In some cases, the CUFFT library would cause the entire process to terminate when an internal or input error was encountered. This has been fixed so that the CUFFT APIs correctly catch the errors and return gracefully with an appropriate error code.
  • The previous version of CUFFT incorrectly created and destroyed planners such that a newly created planner could have a handle that was not unique from another existing active handle in certain situations. This has been fixed and now planners can be created and destroyed safely in any order.
  • In previous versions of CUFFT, C2C transforms of length 4 and C2R and R2C transforms of length 8 would produce incorrect results when the batch size was not a multiple of 64 for Tesla and not a multiple of 128 for Fermi. This has been fixed and transforms of length 4 and 8 will produce correct results for any batch size.
  • CUDA Runtime Related
  • In the previous version, the maximum allowed 1D texture size did not match the documentation. The maximum width for a 1D texture reference bound to linear memory has been fixed to be 2^27.

New in NVIDIA CUDA Toolkit 3.1.1 (Aug 12, 2010)

  • New Features:
  • Hardware Support
  • On Fermi hardware, CUDA 3.1 supports up to 16 concurrent kernels.
  • New Toolkit Features:
  • Device emulation has been removed.
  • cublas{S,D,C,Z}tpsv() and cublas{S,D,C,Z}tbsv() have been enhanced to remove all previous size limitations on the input vector.
  • Improved interoperability between the CUDA Driver API and the CUDA Runtime API. Includes support for sharing pointers, events, streams, arrays and graphics interop resources between the CUDA Driver API and the CUDA Runtime API. Introduces CUDA Runtime API compatibility with the CUDA Driver context migration API (cuCtxPushCurrent and cuCtxPopCurrent).
  • Added the ability to call printf() from kernels. This feature is supported only on the Fermi architecture (see the sketch after this list).
  • Added support for recursion in device functions. This feature is supported only on the Fermi architecture. Note that the default stack size limit is 1 KB per thread, so deeply recursive code can run out of stack; cuCtxSetLimit() can be used to change the default stack size.
  • Added support for function pointers. This feature is supported only on the Fermi architecture. Function pointers can only be used inside a single kernel; they cannot be passed to another kernel.
  • Specific GPUs can be made invisible with the CUDA_VISIBLE_DEVICES environment variable. Visible devices should be included as a comma-separated list in terms of the system-wide list of devices. For example, to use only devices 0 and 2 from the system-wide list of devices, set CUDA_VISIBLE_DEVICES equal to "0,2" before launching the application. The application will then enumerate these devices as device 0 and device 1.
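  • A minimal sketch of device-side printf() (the kernel name hello is hypothetical; Fermi only; cudaThreadSynchronize() flushes the device printf buffer in this toolkit generation):

      #include <cstdio>

      __global__ void hello()
      {
          printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
      }

      int main()
      {
          hello<<<2, 4>>>();
          cudaThreadSynchronize();  // wait for the kernel and flush printf output
          return 0;
      }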
  • New API Features:
  • In CUFFT-3.1, R2C and C2R transforms for power-of-2 sizes now experience a speedup similar to their C2C equivalents. However, CUFFT's internal data layout is different from that used by FFTW; by default CUFFT will match FFTW's data format, but at some performance penalty. To enable faster transforms, the user must use the cufftSetCompatibilityMode() API to disable FFTW-compatible behavior and enable the faster native mode (see the sketch after this list).
  • CUBLAS now supports CUDA streams via the cublasSetKernelStream() API.
  • Unformatted surface load/store (i.e. the ability to write to textures). This feature is supported only on the Fermi architecture.
  • New functions cuCtxSetLimit() and cuCtxGetLimit() have been added to control GPU thread stack size and the size of the printf() FIFO queue.
  • Device-to-device transfers in a non-NULL stream with asynchronous cudaMemcpy calls may overlap with kernels. Runtime documentation has been updated to reflect this.
  • New device attributes report the PCI bus and device identifiers of a particular GPU for better integration with system management tools.
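  • A minimal sketch of opting into the faster native CUFFT layout (the wrapper name make_native_plan is hypothetical; assumes the CUFFT 3.1-era cufftSetCompatibilityMode() API):

      #include <cufft.h>

      // Create a 1D real-to-complex plan and disable FFTW-compatible
      // padding in favor of CUFFT's faster native data layout.
      void make_native_plan(cufftHandle *plan, int nx)
      {
          cufftPlan1d(plan, nx, CUFFT_R2C, /*batch=*/1);
          cufftSetCompatibilityMode(*plan, CUFFT_COMPATIBILITY_NATIVE);
      }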
  • New Performance Improvements:
  • Double-precision and C2R/R2C performance of CUFFT has been improved significantly for many transform sizes since the CUFFT 3.0 release.
  • Double-precision divide and reciprocal on the Fermi architecture have been optimized, as has the performance of selected transcendental functions from the log, pow, erf, and gamma families.
  • Bug Fixes:
  • The CUBLAS SGEMM, CGEMM, and small-matrix DGEMM performance regressions introduced in v3.0 have been fixed, restoring the earlier performance in v3.1.