ATLAS Changelog

New in ATLAS 3.11.32 Dev (Mar 5, 2015)

  • Fixed bug in code generator where sometimes the pB ptr was not incremented after the first K iteration peel
  • Removed block-major gemm from ATLAS
  • Fixed error in f77getri wrapper
  • Changed ATLAS to default to using walltime
  • Added support for passing operand movement macros to amm kernel compiles
  • Fixed bugs in [u]ammsearch where ldc was passed as kb rather than mb.
  • Added new square NBs to always try in ammm search: 32 & 64 (because LAPACK defaults to them) and 24 (for small symmetric operations)
  • Added src/testing ATL_cmpmatBV, print1dBV, print2dBV
  • Sped up Corei2 ammm kernels, particularly single precision & double K-clean
  • Added special case for N=K=4 in real ammm for recursive LU: ATL_rk4n4.c
  • Now have atlas_simd.h for simple typeless SIMD; will extend as needed, and need to get rid of duplicate headers eventually
  • New Opteron/K10h8 ammm kernels (based on old block-major kernel)
  • Fixed bug in emit_*amm's GenMakefile, where kmajor rank-K kernels were being compiled with -DKB set to values that weren't a multiple of VLEN
  • Improved single precision Corei1 ammm kernel
  • Improved double precision Core2 ammm kernel based on CASES/ATL_dmm4x2x128_sse2.c
  • install_uamm now installs routines to copy between block-major storage and row/column-major storage in both directions
  • Added am2[rm,cm] copy routines for A/B in atlas_mmg.base
  • Added src/threads/cbc for use in cbc-LU
  • Removed blank line from samcases.idx to avoid ammsearch errors

New in ATLAS 3.11.31 Dev (Mar 5, 2015)

  • Added basic support for ARM64
  • Pre-production hardware provided by Applied Micro (www.apm.com)
  • Extensive patches handling almost all ARM64 functionality, including assembly kernels
  • Fixed bug in ammsearch where generated code often used wrong/slow kerns
  • Fixed bug in 'Right' case of tsymm, where recursion called 'Left'
  • Fixed bug where threadpool only starts a handful of cores when first call is made with small problem

New in ATLAS 3.11.30 Dev (Mar 5, 2015)

  • New timing/ directory; in BLDdir/timing:
  • issue "make all" to build scripts
  • ./tvgenmf_[lst,rng].sh with no args gives help
  • The following will time LLt,LU,QR and gemm for the 3 problem sizes: ./tvgenmf_lst.sh "3 1000 2000 4000" 5 -P t -b "gemm"
  • Addition of timing manipulation tools in bin/tvec:
  • manpages for them in ATLAS/man/tvec
  • build with "make tvec_all" in BLDdir/bin
  • New timers xl3time_[ab,sb] available in BLDdir/bin
  • Made it so latime, l3time & l3blastst stop manual cache flushing when matrix flushes itself due to size
  • Added fflush calls after prints in latime & l3time
  • Attempted to apply IBM patches for new ppc64le:

New in ATLAS 3.11.29 Dev (Mar 5, 2015)

  • Added option to have threadpool that is always polling, and where the master process is assigned affinity 0. Should provide best perf when ATLAS handles all threading (but will interfere with other threads).
  • Presently default, much faster than using mutex or cond vars.
  • Should work on Windows (untested)
  • Force turnoff with -D c -DATL_TP_FULLPOLL=0
  • Changed atlas_taffinity.h to only have #defines, so it is safe to include in any/every file.
  • Added ATL_setmyaffinity as standalone file, and made it and ATL_thread_start handle all #includes formerly done in atlas_taffinity.h
  • Changed PCA xover in ATL_getrfC to work better with tpool
  • Got OpenMP option working again, use: -Si omp 1 -Fa alg -fopenmp
  • Performance is horrific compared to pthreads
  • Removed launch & join option, and quite a bit of associated code
  • Changed xperctvecs so it can take a scalar divisor as well as vector

New in ATLAS 3.11.28 Dev (Jun 12, 2014)

  • Fixed numerous race conditions in PHI-specific SYRK
  • Fixed perf bug where LU failed to use PCA for high-core-count machines
  • Ensured parallel LU uses parallel swap, and serial LU uses serial swap
  • Added tgemm_amm case for tiny M,N, large K (inner product)
  • Prototyped access-major ttrsm, but it doesn't provide speedup over the old stuff except on PHI, where it is hugely faster.
  • Fixed bug in new goParallel so P > nthr is handled correctly
  • Added atlas_ttypes.h that can be included before thread tuning
  • Added optional thread pool that polls a bit before sleeping to improve performance where parallel jobs are performed right after each other
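
The "poll a bit before sleeping" idea can be sketched as follows. This is a minimal illustration using POSIX threads and C11 atomics, not ATLAS's actual thread-pool code: the names job_signal_t, job_signal_post and job_signal_wait are hypothetical, and the sketch assumes a single worker consumes each signal.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; not taken from the ATLAS sources. */
typedef struct {
    atomic_int      ready;  /* 1 when a job has been posted */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} job_signal_t;

void job_signal_init(job_signal_t *s)
{
    atomic_init(&s->ready, 0);
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
}

/* Master side: publish the job and wake a sleeping worker, if any. */
void job_signal_post(job_signal_t *s)
{
    pthread_mutex_lock(&s->lock);
    atomic_store(&s->ready, 1);
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/*
 * Worker side: poll the flag up to spin_limit times, then fall back to
 * sleeping on the condition variable.  Consumes the signal.
 * Returns 1 if the job was picked up while polling, 0 if the worker
 * had to take the slower mutex/cond path.
 */
int job_signal_wait(job_signal_t *s, long spin_limit)
{
    assert(spin_limit >= 0);
    for (long i = 0; i < spin_limit; i++) {
        if (atomic_exchange(&s->ready, 0))
            return 1;               /* cheap path: no lock taken */
    }
    pthread_mutex_lock(&s->lock);
    while (!atomic_load(&s->ready))
        pthread_cond_wait(&s->cond, &s->lock);
    atomic_store(&s->ready, 0);
    pthread_mutex_unlock(&s->lock);
    return 0;
}
```

When jobs arrive back to back, the worker usually finds the flag set during the polling phase and avoids the cost of sleeping and being woken; a suitable spin_limit is machine dependent and would need tuning.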

New in ATLAS 3.11.27 Dev (Apr 30, 2014)

  • Fixed error in k-major cm2am code generator for alpha=X
  • Made it so C compiler search prioritizes exact match 'gcc' for gcc
  • Some aliases like c99-gcc don't work with pthreads due to an include mismatch
  • Hacked ATLAS to use a thread pool rather than launch & join:
  • Only pthread version currently works
  • Wrote some real threaded kernels, using dynamic algorithms:
  • SYRK cases for large N and small N with large K
  • Specialized XeonPHI version for large N
  • GEMM case for tiny N & K, large M (QR & LU panel factorizations)
  • GEMM case for medium-sized squarish matrices (helps LU & QR). These changes provide significant speedup for high-core-count archs
  • Produced bitvector ops in ATLAS/src/auxil/ATL_bitvec.c
  • Fixed it so XeonPHI uses 4*(P-1) physical cores in affinity array, and starts affinity IDs at 1 rather than 0.

New in ATLAS 3.11.18 Dev (Oct 31, 2013)

  • Fixed bug in atlas-mmg.base causing shared library build to fail
  • Fixed bug in ammsearch caused by mu > nb in generated K-cleanup timings
  • Fixed bug in K=1 case of aliased ammm_rkK (TRMM)

New in ATLAS 3.11.11 Dev (Jul 8, 2013)

  • Commented out block-major dynamically scheduled threaded GEMM code since access-major is now usually faster
  • New kmaj=4 sse3 single precision kernels for old AMD k8/k10
  • New Kmaj=2 sse3 double precision kernel for old AMD k8/k10
  • Basic support for k-major access-major storage

New in ATLAS 3.10.1 (Jan 9, 2013)

  • Highlights of changes from 3.10.0:
  • Fixed bad SSE guard that prevented PIII archdefs from working
  • Added return to main of ATLAS/tune/sysinfo/matime.c
  • Added ability for archinfo_x86.c to recognize more Corei2 platforms
  • Fixed premature KillAllMMNodes in emit_mm.c

New in ATLAS 3.10.0 (Jul 11, 2012)

  • Rewrite of threading system, providing large parallel speedups, by:
  • Using affinity and master last
  • Ability to use Windows threads instead of pthreads
  • Ability to use OpenMP instead of pthreads (no affinity!)
  • Complete rewrite of L2BLAS support for increased empirical tuning and better performance
  • Addition of SSE generator and search for increased portable performance
  • Added autotuning of QR NB
  • Added native support for QR factorization and related routines
  • Add support for many new architectures and ISA extensions
  • ARM, POWER7, AMD DOZER, many new x86 targets
  • AVX, VSX, NEON, FMA4
  • Ability to build more generic libs, trading some performance for portability
  • Added ability to autotest lapack and full ATLAS tests
  • Improved reliability in building dynamic libs using --shared
  • Improved lapack integration using --with-netlib-lapack-tarfile=
  • Increased lapack performance through use of PCA panel factorizations
  • Vastly improved Windows support
  • 64-bit libraries supported with MinGW compilers
  • Native compiler interoperation enabled by using MinGW compilers
  • Ability to build .lib simply by using --shared
  • Much increased autobenchmarking options in ATLAS/results
  • Fixed errors in HARDFP ARM kernel

New in ATLAS 3.9.87 Dev (Jul 10, 2012)

  • Updated atlas_install to discuss dll build & ARM/HARDFP
  • Modified Windows dll build to generate .def files for LIB usage

New in ATLAS 3.9.84 Dev (Jul 7, 2012)

  • Wrote TRMV in terms of GEMV for speedup
  • Wrote TRSV in terms of GEMV for speedup
  • Wrote SYMV in terms of GEMV for speedup on most archs
  • Wrote HEMV in terms of GEMV for speedup on most archs
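
As an illustration of how a symmetric matrix-vector product can be expressed through GEMV-style kernels, here is a minimal sketch. It is not ATLAS's implementation: the names gemv_acc and symv_lower_via_gemv are hypothetical, the plain-loop gemv_acc stands in for a tuned GEMV kernel, and the sketch assumes column-major storage with only the lower triangle referenced.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for a tuned GEMV kernel (hypothetical name).
 * A is m x n, column-major, leading dimension lda.
 * trans == 0: y[0..m) += A   * x[0..n)
 * trans != 0: y[0..n) += A^T * x[0..m)
 */
static void gemv_acc(int trans, int m, int n, const double *A, int lda,
                     const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double a = A[i + (size_t)j * lda];
            if (trans) y[j] += a * x[i];
            else       y[i] += a * x[j];
        }
}

/* y = A*x for symmetric n x n A; only the lower triangle is read. */
void symv_lower_via_gemv(int n, int nb, const double *A, int lda,
                         const double *x, double *y)
{
    assert(nb > 0 && lda >= n);
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;
        const double *D = A + j + (size_t)j * lda;   /* diagonal block */

        /* Small diagonal block: handled with a simple triangular loop. */
        for (int c = 0; c < jb; c++) {
            y[j + c] += D[c + (size_t)c * lda] * x[j + c];
            for (int r = c + 1; r < jb; r++) {
                double a = D[r + (size_t)c * lda];
                y[j + r] += a * x[j + c];
                y[j + c] += a * x[j + r];
            }
        }

        /* Panel below the diagonal block: used once as-is, once transposed. */
        int m = n - j - jb;
        if (m > 0) {
            const double *P = D + jb;
            gemv_acc(0, m, jb, P, lda, x + j,      y + j + jb);
            gemv_acc(1, m, jb, P, lda, x + j + jb, y + j);
        }
    }
}
```

With this structure most of the work lands in the two GEMV calls per panel, so any tuning done for GEMV carries over to the symmetric case.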

New in ATLAS 3.9.50 Dev (Sep 3, 2011)

  • Fixed typo causing seg fault in l2 kernel searches
  • Fixed a bunch of warnings coming from clang

New in ATLAS 3.9.49 Dev (Sep 2, 2011)

  • Fixed uninitialized var in all l2 kernel searches
  • Fixed out-of-mem bugs in GERC and GER2C
  • Fixed a bunch of warnings coming from clang

New in ATLAS 3.9.48 Dev (Sep 1, 2011)

  • Architectural defaults for Atom64SSE3
  • Improved Real TRSM performance, particularly for small triangle, large RHS
  • Improves Inverse, Cholesky, LU (in perf order), particularly for SREAL on x8664
  • Fixed bug in gerk assembly reported by Blooox
  • Added Xeon E5645 detection to configure

New in ATLAS 3.9.47 (Sep 1, 2011)

  • Improved parallel performance for LU & QR.
  • Improved performance for serial LQ and RQ.
  • Architectural defaults for ARMv732
  • Made it so config recognizes Atom, and suggests good compiler flags
  • Added ability to chart all QR and Cholesky variants in results/
  • Added a lot of charting options, including charting more than 4 lines
  • Added ability to use -# in l3blastst

New in ATLAS 3.9.46 (Sep 1, 2011)

  • Bug fixes in qrtst.c
  • QR-related routines cleaned up
  • Better PCA crossover rules improve parallel QR performance
  • Fixed error in Core232SSE dMVTK.sum (missing \ from CFLAGS line)
  • Fixed bad return values in ATL_getf2

New in ATLAS 3.9.45 (Sep 1, 2011)

  • New chart creating targets (see ATLAS/doc/atlas_install.pdf)
  • Fix bug in all L2 kernel searches where lda was set < M sometimes in MU search.
  • Found workaround to ATL_dgemvT_2x8_sse3.c Windows compiler bug (-Os)
  • Removed goparallel_prank (unused) to avoid problems with dynamic linking
  • Architectural defaults for:
  • P4E32SSE3 (gcc 4.2.1)
  • AMD64K10h32SSE3 (gcc 4.4.5)
  • Corei132SSE3 (gcc 4.4.5)
  • Corei232AVX (gcc 4.4.5)

New in ATLAS 3.9.44 (Sep 1, 2011)

  • Fixed errors in ATL_tgemm_bigMN_Kp.c & ATL_tgemm_rkK.c where cleanup was called with K > KB (usually causing seg faults).
  • Several fixes for 32-bit Windows.

New in ATLAS 3.9.43 (Sep 1, 2011)

  • Fixed errors in threaded GEMV and GER
  • Bunch of fixes to make it possible to build 64-bit lib on Win64 (can build, but executables don't work, probably a lib issue)
  • Changed Windows MHz probe to look in cygwin-provided cpuinfo rather than use QueryPerformanceFrequency, which is not always set to clock rate
  • Fixed lutst to print "fail" on failure.
  • Updated full tester to call QR as well
  • Updated sanity_checks to call QR
  • Increased size of sanity checks for threaded code
  • Added GEMM NaN tester to EXtest
  • Improved charting functions in results/

New in ATLAS 3.9.42 (Sep 1, 2011)

  • Added ability to autobuild performance charts in results/
  • Added EXtest/ and all-alignment testing for GER and GEMV
  • Fixed bug in BETA=0 case of ATL_cgemvN_8x4_sse3.c
  • Added results/ directory that can autobuild performance charts
  • Numerous fixes to qrtest and some fixes for the QR fact routines
  • Added missing $(F77SYSLIB) in Make.lib's dylib and ptdylib targets
  • Added chapter in atlas_install explaining how to use mmflagsearch
  • Fixed uninitialized memory read caused by copying data I don't reference in parallel GEMM.
  • Fixed uninitialized memory read in gemvT
  • Changed extendedmodel=2, model=5 from Corei2 to Corei1 in archinfo_x86

New in ATLAS 3.9.41 (Sep 1, 2011)

  • Bug fix in EmitMakefile for L2 that should fix some dynamic lib errors
  • Fixed yet another C/Z GEMM JITcp bug where C was read when BETA=0
  • Fixed BETA=0, KB=1 bug in: ATL_mm4x4x2_1_prefCU.c & ATL_mm4x4x2US.c
  • Configure support, kernels, and architectural defaults for ARMv7.
  • Tom Wallace supplied a comprehensive patch for configure support
  • Added single & double precision ARM kernels (single not very good)

New in ATLAS 3.9.40 (Sep 1, 2011)

  • Added beta versions of simple threaded GEMV & GER
  • Added threaded L2 testing to tester
  • Fixed bug in axpby where it called SCAL with alpha=0, which fixes GEMM error for BETA=0 case.
  • Fixed several simple buffer overruns in full tester
  • Added dynamically scheduled tgemm that is used whenever all dimensions are large.
  • Added support for complex types for both dynamic cases (rank-K, large)
  • Fixed several errors in GEMM that occur when K dim is cut
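
The BETA=0 items above relate to a general BLAS convention: when the scale factor on the output is zero, the output must be overwritten without being read, because multiplying leftover NaN or Inf values by zero yields NaN. A minimal sketch of an axpby that follows this convention is given below. It is an illustration only, not the ATLAS source, and the name axpby_sketch is hypothetical.

```c
#include <assert.h>
#include <math.h>

/*
 * y = alpha*x + beta*y (hypothetical illustration, unit stride).
 * A zero scalar means the corresponding vector is never read, so
 * NaN/Inf in an unreferenced operand cannot leak into the result.
 */
void axpby_sketch(int n, double alpha, const double *x,
                  double beta, double *y)
{
    assert(n >= 0);
    if (beta == 0.0) {
        if (alpha == 0.0) {
            for (int i = 0; i < n; i++) y[i] = 0.0;          /* plain zero fill */
        } else {
            for (int i = 0; i < n; i++) y[i] = alpha * x[i]; /* y is overwritten */
        }
    } else if (alpha == 0.0) {
        for (int i = 0; i < n; i++) y[i] *= beta;            /* x is not read */
    } else {
        for (int i = 0; i < n; i++) y[i] = alpha * x[i] + beta * y[i];
    }
}
```

Scaling y by zero instead of assigning zero would leave any NaN in y in place, which is the kind of error a GEMM built on such a routine would show in its BETA=0 case.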

New in ATLAS 3.9.39 (Sep 1, 2011)

  • Basic AVX GEMM kernels and new Corei264AVX arch defs.
  • Now use dynamically scheduled parallel rank-K updates for real types
  • Complete rewrite of all threaded routines to use goparallel, and thus dynamic spawn.
  • OpenMP now uses same codebase as windows & pthreads for all threading.
  • Thread tune now creates atlas_tsumm.h for summation of threaded tuning
  • Added ATL_thread_yield function
  • If affinity is not set, dynamic funcs now yield thread execution when waiting for their peers to signal completion of a stage (otherwise, an active poller prevents a thread running on the same core from executing)
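
The yield-while-waiting behaviour described above can be sketched roughly as follows. This is an assumption-laden illustration, not the ATL_thread_yield implementation: the function name wait_for_stage and its parameters are hypothetical, and POSIX sched_yield is used as the yield primitive.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper: wait until a peer sets *stage_done to nonzero.
 * If threads are not pinned to cores (affinity_set == 0), give up the
 * CPU between polls so a peer sharing this core can make progress.
 * Returns the number of polls that found the stage still unfinished.
 */
long wait_for_stage(atomic_int *stage_done, int affinity_set)
{
    long polls = 0;
    assert(stage_done != NULL);
    while (!atomic_load_explicit(stage_done, memory_order_acquire)) {
        polls++;
        if (!affinity_set)
            sched_yield();   /* let another runnable thread use this core */
    }
    return polls;
}
```

With affinity set, each thread has its own core and pure polling costs nothing to its peers; without it, the waiting thread may be occupying the very core the signalling thread needs.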