ViennaCL Changelog

What's new in ViennaCL 1.6.2 Beta

Dec 12, 2014

This bugfix release focuses on further improving the portability of the library and on some performance improvements:
sliced_ell_matrix: Using better default values for NVIDIA GPUs for better performance
pipelined CG/BiCGStab/GMRES: Improved parameters and one kernel for NVIDIA GPUs.
pipelined CG/BiCGStab/GMRES: Improved function overloads so that a user call does not accidentally use the non-pipelined implementation.
compressed_matrix: Fixed runtime error when switching the memory location at runtime.
compressed_compressed_matrix: Fixed a wrong buffer size in clear().
hyb_matrix: Using more portable and faster default settings with OpenCL.
coordinate_matrix: Fixed incorrect CUDA and OpenCL kernels for the extraction of diagonals and row-norms.
CUDA with Visual Studio: Fixed compilation errors and warnings (thanks to Andreas Rost for the hint).
direct solve benchmark: Removed unnecessary Boost.uBLAS dependency.
OpenMP: Fixed unspecified behavior for private and shared variables as well as reductions if the type is deduced from a template parameter (thanks to GitHub user aokomoriuta for precious input).
OpenMP: Ensured compatibility with version 2.0 (no unsigned integers as loop variables).
AMG: Fixed incorrect detection of coarse level operator dimensions (thanks again to Andreas Rost).
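
The OpenMP 2.0 compatibility entry above can be illustrated with a minimal sketch. This is not ViennaCL code: the function name sum_entries is hypothetical, and the snippet only shows the general pattern of converting a container size to a signed type before using it as the loop variable of a parallel loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ViennaCL): sums all entries of v.
// The loop index is a signed 'long' because OpenMP 2.0 does not accept
// unsigned loop variables in a parallel for construct.
double sum_entries(const std::vector<double>& v) {
  double sum = 0.0;
  const long n = static_cast<long>(v.size());  // signed copy of the size
#pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < n; ++i) {
    sum += v[static_cast<std::size_t>(i)];
  }
  return sum;
}
```

Without OpenMP enabled the pragma is ignored and the loop runs sequentially, so the same code builds either way.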

New in ViennaCL 1.6.1 Beta (Nov 21, 2014)

New in ViennaCL 1.6.0 Beta (Nov 12, 2014)

Major update of the internal OpenCL kernel generator, which is now used for all BLAS operations with device-specific parameters and greatly improves performance particularly on older AMD GPUs.
Iterative solvers: Added pipelined implementations of CG, BiCGStab and GMRES, which are up to three times faster than other GPU-enabled solver libraries.
sliced_ell_matrix: Added new sparse matrix type implementing the sliced ELLPACK format for all three compute backends as proposed in the paper "A unified sparse matrix data format for efficient general sparse matrix-vector multiply on modern processors with wide SIMD units" by Kreutzer et al.
CUDA: Added option for specifying the CUDA architecture through CMake in the CUDA_ARCH_FLAG variable.
Added viennacl/version.hpp for identifying the ViennaCL version (thanks to GitHub user vigsterkr for bringing this up).
Added viennacl::min() and viennacl::max() to obtain the minimum and maximum element in a vector (thanks to sourceforge user cpp1 for the suggestion).
Matrix-matrix products on CPU: Improved performance by about an order of magnitude.
matrix: Added constructor and example for wrapping user-provided CUDA buffers.
Eigenvalues: Implementation of bisection algorithm for tridiagonal symmetric matrices. API still experimental.
Eigenvalues: Extended current implementation of the QR method to OpenMP and CUDA backends. API remains experimental.
Scan: Implemented inclusive and exclusive scans for all three compute backends. API still experimental.
Triangular solvers: Improved performance particularly for multiple right hand sides (i.e. BLAS level 3 solves).
matrix: Improved performance for matrix transposition
FFT: Now also supports OpenMP and CUDA backends
Nonnegative matrix factorization: Now also supports OpenMP and CUDA backends
Nonnegative matrix factorization: Now using correct contexts and better robustness for initial guesses being all-zero.
Documentation: Integrated former LaTeX manual into Doxygen, resulting in an all-HTML documentation. Enables much better cross-referencing.
Benchmarks: Merged separate BLAS level 1/2/3 benchmarks into a single benchmark printing performance for all three levels.
matrix_range: Fixed incorrect copying of data back to the host
hyb_matrix: Fixed a bug in the constructor for the non-square case
OpenMP: Added linker flags when building with OpenMP on MinGW
OpenCL: Added optional caching of OpenCL kernels at a user-defined location in the file system if the environment variable VIENNACL_CACHE_PATH is defined.
CMake: Enclosing variables in double quotes where necessary
matrix: Better reuse of existing memory buffers to better support user-provided buffers.
Sparse matrices: Added .clear() member function for freeing internal memory buffers
Better support for passing host scalars of non-matching scalar type for operations on vectors and matrices.
Vector: Passing an empty vector as the destination of a two-parameter version of viennacl::copy() now resizes the vector automatically for higher convenience.
AMG: Fixed invalid decrement of auxiliary iterator
Tools: Added sparse matrix generation routine for matrices obtained from finite difference discretizations in 2D.
uBLAS: Including necessary header when working with compressed_matrix
Visual Studio 2012: Fixed compilation problems for some new submodules
Iterators: Fixed incorrect index calculations when using iterators on vectors
matrix_base: The storage layout of matrices is internally managed as a runtime parameter, which allows for a better internal dispatch in order to use faster compute kernels.
Random numbers: Removed experimental random number generation in viennacl/rand/
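
For readers unfamiliar with the terminology in the Scan entry above, the difference between an inclusive and an exclusive scan can be sketched in plain sequential C++. The function names below are hypothetical and do not reflect the (experimental) ViennaCL API; the sketch only shows what each variant computes.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan_sketch(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  int running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    running += in[i];   // include the current element first
    out[i] = running;
  }
  return out;
}

// Exclusive scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> exclusive_scan_sketch(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  int running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = running;   // store the sum of all previous elements
    running += in[i];
  }
  return out;
}
```

For the input (1, 2, 3, 4), the inclusive scan yields (1, 3, 6, 10) and the exclusive scan yields (0, 1, 3, 6).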

New in ViennaCL 1.5.2 Beta (May 16, 2014)

New in ViennaCL 1.5.1 Beta (Jan 22, 2014)

New in ViennaCL 1.5.0 Beta (Jan 22, 2014)

This minor release focuses on a more powerful API and on first steps toward making ViennaCL more accessible from languages other than C++.
In addition to many internal improvements both in terms of performance and flexibility, the following changes are visible to users:
API-change: User-provided OpenCL kernels extract their kernels automatically. A call to add_kernel() is now obsolete, hence the function was removed.
API-change: Device class has been extended and supports all information defined in the OpenCL 1.1 standard through member functions. Duplicate compute_units() and max_work_group_size() have been removed (thanks to Shantanu Agarwal for the input).
API-change: viennacl::copy() from a ViennaCL object to an object of non-ViennaCL type no longer tries to resize the object accordingly. An assertion is thrown if the sizes are incorrect in order to provide a consistent behavior across many different types.
Datastructure change: Vectors and matrices are now padded with zeros by default, resulting in higher performance particularly for matrix operations. This padding needs to be taken into account when using fast_copy(), particularly for matrices.
Fixed problems with CUDA and CMake+CUDA on Visual Studio.
coordinate_matrix now also behaves correctly for tiny matrix dimensions.
CMake 2.6 as new minimum requirement instead of CMake 2.8.
Vectors and matrices can be instantiated with integer template types (long, int, short, char).
Added support for element_prod() and element_div() for dense matrices.
Added element_pow() for vectors and matrices.
Added norm_frobenius() for computing the Frobenius norm of dense matrices.
Added unary element-wise operations for vectors and dense matrices: element_sin(), element_sqrt(), etc.
Multiple OpenCL contexts can now be used in a multi-threaded setting (one thread per context).
Multiple inner products with a common vector can now be computed efficiently via e.g. inner_prod(x, tie(y, z));
Added support for prod(A, B), where A is a sparse matrix type and B is a dense matrix (thanks to Albert Zaharovits for providing parts of the implementation).
Added diag() function for extracting the diagonal of a matrix to a vector, or for generating a square matrix from a vector with the vector elements on a diagonal (similar to MATLAB).
Added row() and column() functions for extracting a certain row or column of a matrix to a vector.
Sparse matrix-vector products now also work with vector strides and ranges.
Added async_copy() for vectors to allow for a better overlap of computation and communication.
Added compressed_compressed_matrix type for the efficient representation of CSR matrices with only few nonzero rows.
Added possibility to switch command queues in OpenCL contexts.
Improved performance of Block-ILU by removing one spurious conversion step.
Improved performance of Cuthill-McKee algorithm by about 40 percent.
Improved performance of power iteration by avoiding the creation of temporaries in each step.
Removed spurious status message to cout in matrix market reader and nonnegative matrix factorization.
The OpenCL kernel launch logic no longer attempts to re-launch the kernel with smaller work sizes if an error is encountered (thanks to Peter Burka for pointing this out).
Reduced overhead for lengthy expressions involving temporaries (at the cost of increased compilation times).
vector and matrix are now padded to dimensions being multiples of 128 by default. This greatly improves GEMM performance for arbitrary sizes.
Loop indices for OpenMP parallelization are now all signed, increasing compatibility with older OpenMP implementations (thanks to Mrinal Deo for the hint).
Complete rewrite of the generator. Now uses the scheduler for specifying the operation. Includes a full device database for portable high performance of GEMM kernels.
Added micro-scheduler for attaching the OpenCL kernel generator to the user API.
Certain BLAS functionality in ViennaCL is now also available through a shared library (libviennacl).
Removed the external kernel parameter tuning facility, which is to be replaced by an internal device database through the kernel generator.
Completely eliminated the OpenCL kernel conversion step in the developer repository and the source-release. One can now use the developer version without the need for a Boost installation.
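
The padding entries above matter whenever raw buffers are prepared by hand, for example for fast_copy(). The following is a minimal sketch of the idea in plain C++. It assumes a row-major layout with both dimensions rounded up to multiples of 128; the helper names padded_size and pad_row_major are hypothetical, and the actual internal layout should be queried from the ViennaCL objects rather than taken from this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round n up to the next multiple of 'alignment'.
std::size_t padded_size(std::size_t n, std::size_t alignment = 128) {
  assert(alignment > 0);
  return ((n + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper: copy a dense rows x cols row-major matrix into a
// zero-initialized buffer whose dimensions are padded to multiples of 128.
std::vector<double> pad_row_major(const std::vector<double>& a,
                                  std::size_t rows, std::size_t cols) {
  assert(a.size() == rows * cols);
  const std::size_t padded_rows = padded_size(rows);
  const std::size_t padded_cols = padded_size(cols);
  std::vector<double> buf(padded_rows * padded_cols, 0.0);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      // Entry (i, j) moves from stride 'cols' to stride 'padded_cols'.
      buf[i * padded_cols + j] = a[i * cols + j];
    }
  }
  return buf;
}
```

Under these assumptions a 2 x 3 matrix occupies a 128 x 128 buffer, so entry (1, 2) sits at index 128 + 2 instead of index 5, and all remaining entries are zero.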

New in ViennaCL 1.4.2 Beta (Apr 29, 2013)

New in ViennaCL 1.4.1 Beta (Feb 22, 2013)

New in ViennaCL 1.4.0 Beta (Dec 3, 2012)

Added host-based and CUDA-enabled operations on ViennaCL objects. The default is now a host-based execution for reasons of compatibility. Enable OpenCL- or CUDA-based execution by defining the preprocessor constant VIENNACL_WITH_OPENCL or VIENNACL_WITH_CUDA, respectively. Note that CUDA-based execution requires the use of nvcc.
Added mixed-precision CG solver (OpenCL-based).
Greatly improved performance of ILU0 and ILUT preconditioners (up to 10-fold). Also fixed a bug in ILUT.
Added initializer types from Boost.uBLAS (unit_vector, zero_vector, scalar_vector, identity_matrix, zero_matrix, scalar_matrix).
Added incomplete Cholesky factorization preconditioner.
Added element-wise operations for vectors as available in Boost.uBLAS (element_prod, element_div).
Added restart-after-N-cycles option to BiCGStab.
Added level-scheduling for ILU-preconditioners. Performance strongly depends on matrix pattern.
Added least-squares example including a function inplace_qr_apply_trans_Q() to compute the right hand side vector Q^T b without rebuilding Q.
Improved performance of LU-factorization of dense matrices.
Improved dense matrix-vector multiplication performance
Reduced overhead when copying to/from ublas::compressed_matrix.
ViennaCL objects (scalar, vector, etc.) can now be used as global variables
Refurbished OpenCL vector kernels backend. All operations of the type v1 = a v2 @ b v3 with vectors v1, v2, v3 and scalars a and b, including += and -= instead of =, are now temporary-free. Similarly for matrices.
matrix_range and matrix_slice as well as vector_range and vector_slice can now be used and mixed completely seamlessly with all standard operations except lu_factorize().
Fixed a bug when using copy() with iterators on vector proxy objects.
Final reduction step in inner_prod() and norms is now computed on CPU if the result is a CPU scalar.
Reduced kernel launch overhead of simple vector kernels by packing multiple kernel arguments together.
Updated SVD code and added routines for the computation of symmetric eigenvalues using OpenCL.
custom_operation’s constructor now supports multiple arguments, allowing multiple expressions to be packed into the same kernel for improved performance. However, all the data structures in the multiple operations must have the same size.
Further improvements to the OpenCL kernel generator: Added a repeat feature for generating loops inside a kernel, added element-wise products and division, added support for every one-argument OpenCL function.
The name of the operation is now a mandatory argument of the constructor of custom_operation.
Improved performance of the generated matrix-vector product code.
Updated interfacing code for the Eigen library, now working with Eigen 3.x.y.
Converter in source-release now depends on Boost.filesystem3 instead of Boost.filesystem2, thus requiring Boost 1.44 or above.
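
The remark above that level scheduling "strongly depends on matrix pattern" can be made concrete with a small sketch. It is not the ViennaCL implementation: the struct CsrPattern and the function compute_levels are hypothetical names, and the code only shows the general idea of assigning each row of a lower-triangular factor to a level such that all rows in one level can be processed independently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: sparsity pattern of a lower-triangular matrix in CSR
// form (the numerical values are not needed to compute the levels).
struct CsrPattern {
  std::vector<std::size_t> row_start;  // size: number of rows + 1
  std::vector<std::size_t> col_index;  // column index of each stored entry
};

// level[i] = 0 if row i depends on no earlier row; otherwise
// level[i] = 1 + max(level[j]) over all stored entries (i, j) with j < i.
std::vector<std::size_t> compute_levels(const CsrPattern& L) {
  assert(!L.row_start.empty());
  const std::size_t rows = L.row_start.size() - 1;
  std::vector<std::size_t> level(rows, 0);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t k = L.row_start[i]; k < L.row_start[i + 1]; ++k) {
      const std::size_t j = L.col_index[k];
      if (j < i) {
        level[i] = std::max(level[i], level[j] + 1);
      }
    }
  }
  return level;
}
```

A pattern with few off-diagonal couplings yields few levels with many rows each, which parallelizes well. A bidiagonal pattern yields one row per level, so the substitution remains essentially sequential.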

New in ViennaCL 1.3.1 Beta (Aug 10, 2012)

New in ViennaCL 1.3.0 Beta (May 14, 2012)

New in ViennaCL 1.2.1 Beta (Mar 23, 2012)

New in ViennaCL 1.1.2 Beta (Sep 15, 2011)

New in ViennaCL 1.1.1 Beta (Sep 15, 2011)

New in ViennaCL 1.1.0 Beta (Sep 15, 2011)

New in ViennaCL 1.0.5 Beta (Sep 15, 2011)

New in ViennaCL 1.0.4 Beta (Sep 15, 2011)

New in ViennaCL 1.0.3 Beta (Sep 15, 2011)

New in ViennaCL 1.0.2 Beta (Sep 15, 2011)

New in ViennaCL 1.0.1 Beta (Sep 15, 2011)