Hadoop-BAM Changelog

New in Hadoop-BAM 7.0 (Sep 11, 2014)

  • Switching from Picard/Samtools to HTSJDK
  • First release to OSS Sonatype
  • Renaming of packages
  • Changes to VariantContextCodec: encoding of genotypes generated by means other than VCF import (thanks to Joel Thibault)
  • Change of JDK version requirements

New in Hadoop-BAM 6.2 (Apr 4, 2014)

  • Bugfix: the update of Picard/Tribble introduced a bug into the VCFInputFormat that affects input files loaded from HDFS that are larger than one split
  • Update to pom.xml:
      • SNAPSHOT versions are now two digits, as releases
      • Added two missing dependencies for CLI tool

New in Hadoop-BAM 6.1 (Mar 26, 2014)

  • Update of Picard from 1.93 to 1.107
  • Moving from ant to maven build
  • Hadoop version 2 compatibility
  • Bugfixes for BAM reading

New in Hadoop-BAM 6.0 (Jul 8, 2013)

  • Input and output formats for VCF and BCF. There are no format-specific InputFormat or OutputFormat classes; instead, AnySAM-like support for both in one class is provided.
  • Both compressed and uncompressed BCF can be read, but only compressed BCF can be output.
  • This required adding three .jar files to the Hadoop-BAM distribution, two from Picard and one from Apache Commons: variant-.jar, tribble-.jar, and commons-jexl-.jar. All of these need to be provided in the Hadoop classpath or in '-libjars' when using CLI plugins that require the VCF/BCF functionality.
  • Added new 'fixmate' plugin, akin to the samtools fixmate command or Picard's FixMateInformation, i.e. recomputing mate information in the input SAM/BAM files. Like 'sort', it can also merge multiple SAM/BAM files together.
  • Added new 'vcf-sort' plugin, which sorts a single VCF or BCF input file while possibly performing format conversion, as it can also output either VCF or BCF.
  • Updated provided Picard from 1.76 to 1.93. Note that a breaking change concerning the SeekableStream class occurred in Picard 1.84, so a version older than that may not be used together with this version of Hadoop-BAM.
  • The FASTQ and QSEQ input formats can now skip records that have failed filtering: use the CONF_FILTER_FAILED_QC and CONF_INPUT_FILTER_FAILED_QC properties.
  • The FASTQ input format now accepts Illumina identifiers with a blank index sequence.
  • Fixed BAM records sometimes confusing the reference and mate reference indices, and not always updating the reference names appropriately.
  • Fixed various small misdecodings and misencodings in QSEQ I/O.
  • Fixed 'premature EOF' crashes on some BAM inputs.
  • Fixed crash on headerless SAM inputs.
  • Fixed CLI crash on startup in newer Hadoop versions (at least CDH 4.2.0).
  • Other minor fixes.
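
The two FASTQ items above (skipping records that failed filtering, and accepting a blank index sequence) can be illustrated with a standalone sketch. It assumes the Casava 1.8 style identifier layout, `instrument:run:flowcell:lane:tile:x:y read:filter:control:index`, in which a filter flag of `Y` marks a read that failed filtering. The class `IlluminaId` and its regular expression are hypothetical and are not Hadoop-BAM's own parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not Hadoop-BAM's parser. */
final class IlluminaId {
    // instrument:run:flowcell:lane:tile:x:y read:filter:control:index
    // The final group uses '*' so that a blank index sequence still matches.
    private static final Pattern ID = Pattern.compile(
        "@?([^:\\s]+):(\\d+):([^:\\s]+):(\\d+):(\\d+):(-?\\d+):(-?\\d+)"
        + "\\s+([123]):([YN]):(\\d+):(\\S*)");

    final String flowcell;
    final boolean failedFilter;
    final String index;

    private IlluminaId(String flowcell, boolean failedFilter, String index) {
        this.flowcell = flowcell;
        this.failedFilter = failedFilter;
        this.index = index;
    }

    /** Returns the parsed identifier, or null if the line does not match. */
    static IlluminaId parse(String line) {
        Matcher m = ID.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new IlluminaId(m.group(3), m.group(9).equals("Y"), m.group(11));
    }

    /** Drops identifiers that parse and are flagged as failed; keeps the rest. */
    static List<String> dropFailed(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            IlluminaId id = parse(line);
            if (id == null || !id.failedFilter) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Lines that do not parse are kept here because their filter status is unknown; a stricter reader could reject them instead.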

New in Hadoop-BAM 5.1 (Nov 26, 2012)

  • Removed the fi.tkk.ics.hadoop.bam.util.hadoop.BAMReader and fi.tkk.ics.hadoop.bam.util.hadoop.BAMSort classes, which were deprecated back in 3.0.
  • MAJOR CHANGE: The command line plugins 'sort', 'summarize', and 'summarysort' now default to 1 reduce task. The number can be customized with the -r/--reducers command line argument. This bumps up the versions of the plugins to 4.0, 3.0, and 2.0 respectively.
  • Fix: BAMRecordReader.getKey now hashes unmapped keys instead of randomizing them, to ensure consistent results.
  • For compatibility with Hadoop 2.0 and any future Hadoop releases, custom Hadoop classes are now only built and used when using a Hadoop release that does not provide them. This means that bugs MAPREDUCE-1987 and MAPREDUCE-2538, which were previously fixed internally, may cause problems when using the MapReduce-using command line plugins with certain reducer counts.
  • Fixed crash on some BAM inputs caused by a bug in fi.tkk.ics.hadoop.bam.BAMSplitGuesser.
  • Fixed some Illumina identifier scanning issues in the FASTQ input format.
  • Added FASTA input format.
  • The command line plugins 'sort' and 'summarize' now use RandomSamplers for input partitioning, as they probably should have all along.
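
The BAMRecordReader.getKey fix above replaces random keys for unmapped reads with hashed ones. A minimal sketch of why that gives consistent results follows. The key layout (reference index in the upper 32 bits, position in the lower 32) and the use of CRC32 over the read name are assumptions made for illustration; the class `ReadKeys` is hypothetical and the hash Hadoop-BAM actually uses may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch; the key layout and hash function are assumptions. */
final class ReadKeys {
    /** Packs a reference index and a 32-bit position into one sortable long. */
    static long key(int refIdx, int pos) {
        return ((long) refIdx << 32) | (pos & 0xFFFFFFFFL);
    }

    /** Key for an unmapped read: the same read name always yields the same key. */
    static long unmappedKey(String readName) {
        CRC32 crc = new CRC32();
        crc.update(readName.getBytes(StandardCharsets.UTF_8));
        // Integer.MAX_VALUE as the reference index sorts after every real reference.
        return key(Integer.MAX_VALUE, (int) crc.getValue());
    }
}
```

Because the lower half of the key depends only on the read name, rerunning a job assigns each unmapped read the same key, while different reads still spread over the key range and can go to different reduce tasks.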

New in Hadoop-BAM 5.0 (Oct 18, 2012)

  • MAJOR CHANGE: Hadoop-BAM no longer depends on, or even provides, fi.tkk.ics.hadoop.bam.custom.samtools. In other words, users should now import Picard classes from Picard itself, i.e. net.sf.samtools.
  • Fix data loss/duplication and crash-on-valid issues in SAM input.
  • Fix FASTQ record writer to also write the flow cell ID and to emit null fields correctly.
  • Fix crash on some inputs caused by a bug in fi.tkk.ics.hadoop.bam.custom.hadoop.InputSampler. (Not the same bug fixed in 4.0.)
  • Updated provided Picard from 1.56 to 1.76.
  • BAMRecordReader.getKey now randomizes the order of unmapped reads instead of giving them all the same key, improving performance since they can now be sent to different reduce tasks.
  • AnySAMInputFormat now has a nullary constructor, allowing it to be used directly in Job.setInputFormatClass.
  • FASTQ and QSEQ input formats now report isSplitable correctly for compressed files.
  • QSEQ output format and record writer now use a Text instead of a NullWritable key.

New in Hadoop-BAM 4.0 (Oct 18, 2012)

  • SAM input and output support. AnySAMInputFormat handles transparent support of both SAM and BAM inputs even in the same Hadoop job. For output, there is no SAMOutputFormat; only AnySAMOutputFormat, which can be used to output either SAM or BAM. BAMOutputFormat will be deprecated in the future.
  • Fix longstanding regression in the embedded Picard library causing end-of-file markers to be written into BAM files by every reduce task. For this reason e.g. 'samtools view' refused to show the contents of BAM files output by Hadoop-BAM.
  • Fix crash on some inputs caused by a bug in fi.tkk.ics.hadoop.bam.custom.hadoop.InputSampler.
  • Fix possible crash-on-valid situations in heuristic BAM splitting.
  • Various I/O classes from the Seal project are now incorporated. This includes input formats for FASTQ and QSEQ and an output format for QSEQ.
  • Unmapped reads are now ordered after, not before, all other reads.
  • Allow using Hadoop's "-libjars" command line argument instead of HADOOP_CLASSPATH to specify the Picard .jars. This ended up being fiendishly complicated and somewhat fragile.
  • Partitioning files are now saved in the output, not input, directory.
  • 'sort' plugin version 3.0:
      • Important bug fix for merging: conflicting IDs from different files weren't being properly corrected.
      • SAM input and output support. Can input SAM and BAM files at the same time and output to either format.
      • When not using -o, each reducer now outputs headers into the BAM files.
  • 'view' plugin version 1.1, with SAM input support.
  • Add new 'cat' plugin version 1.0, for concatenating SAM/BAM files. The main intended use case is joining the output of 'sort' when it is used without -o.
  • 'summarize' plugin version 2.0, with SAM input support.
  • SplittingBAMIndexer can now be used from within the library as well as a command line tool and can index files directly in HDFS.
  • Various minor bug fixes.
  • Lots of documentation updates.
  • Various clarifications in the README.
  • Much quieter error messages when plugin loading fails.
  • build.xml now looks in the HADOOP_HOME environment variable for Hadoop jars. As a result, the required minimum version of Ant is now 1.7.1.
  • fi.tkk.ics.hadoop.bam.custom is now compiled with warnings off, for less noisy builds.
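
The end-of-file marker fix above concerns the 28-byte empty BGZF block that the SAM specification defines as the BAM end-of-file marker. The sketch below shows how such a marker can be detected at the end of a part, and how parts can be joined so that only the last one keeps it. The class `BgzfEof` is hypothetical and is not how the 'cat' plugin is implemented; joining real BAM files also requires handling their headers, which this sketch ignores.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch; not Hadoop-BAM's 'cat' implementation. */
final class BgzfEof {
    /** The 28-byte empty BGZF block used as the end-of-file marker. */
    static final byte[] MARKER = {
        0x1f, (byte) 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00,
        0x00, (byte) 0xff, 0x06, 0x00, 0x42, 0x43, 0x02, 0x00,
        0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00
    };

    /** True if the data ends with the end-of-file marker. */
    static boolean endsWithMarker(byte[] data) {
        if (data.length < MARKER.length) {
            return false;
        }
        byte[] tail = Arrays.copyOfRange(
            data, data.length - MARKER.length, data.length);
        return Arrays.equals(tail, MARKER);
    }

    /** Joins parts, dropping a trailing marker from every part except the last. */
    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < parts.length; i++) {
            int len = parts[i].length;
            if (i < parts.length - 1 && endsWithMarker(parts[i])) {
                len -= MARKER.length;
            }
            out.write(parts[i], 0, len);
        }
        return out.toByteArray();
    }
}
```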

New in Hadoop-BAM 3.3 (Feb 23, 2012)

  • Fix embedded Picard to not have an accidentally leftover dependency on Picard 1.47.
  • Clarify some .jar dependencies in the README.

New in Hadoop-BAM 3.2 (Feb 23, 2012)

  • Important bug fix to avoid looping infinitely on some BAM files.

New in Hadoop-BAM 3.1 (Feb 23, 2012)

  • Important data loss bug fixes!
  • 'sort' plugin updated to 2.0: it can now take multiple input files, merging them together.
  • New 'summary' and 'summarysort' command line plugins, respectively for creating and sorting Chipster summary files. Not very generically useful; intended more as example code.
  • Some minor command line argument handling bug fixes.
  • Updated embedded and provided Picard from 1.47 to 1.56.
  • As Hadoop-BAM now depends on Picard proper as well as the SAM-JDK, a compatible JAR, currently picard-1.56.jar, is distributed together with Hadoop-BAM.

New in Hadoop-BAM 3.0 (Feb 23, 2012)

  • Plugin-extensible command line interface.
  • The 'view', 'sort', and 'index' command line plugins. These supersede the fi.tkk.ics.hadoop.bam.util.hadoop.BAMReader and fi.tkk.ics.hadoop.bam.util.hadoop.BAMSort classes, which have much less functionality than the new plugins and are considered deprecated.
  • Embedded Picard SAM-JDK parts updated from version 1.27 to 1.47.
  • A compatible Picard SAM-JDK JAR, currently sam-1.47.jar, is now distributed together with Hadoop-BAM.

New in Hadoop-BAM 2.0 (Feb 23, 2012)

  • Heuristic splitting of BAM and BGZF files: indexing is no longer required.
  • build.xml now defaults to making a .jar file, no need for explicit 'ant jar'.
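
The heuristic splitting item above can be pictured as scanning a chunk of bytes for places where a BGZF block could begin. The sketch below checks the fixed gzip header bytes and the "BC" extra subfield described in the SAM specification. It assumes "BC" is the first extra subfield, and the class `BgzfScan` is hypothetical; it is not the library's own splitting code. A header match is only a candidate, so a real splitter needs further validation before trusting an offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not Hadoop-BAM's splitting implementation. */
final class BgzfScan {
    /** Size of the BGZF header up to and including the BSIZE field. */
    private static final int HEADER_LEN = 18;

    /** True if a gzip header carrying the BGZF "BC" subfield starts at off. */
    static boolean looksLikeBlockStart(byte[] buf, int off) {
        if (off < 0 || off + HEADER_LEN > buf.length) {
            return false;
        }
        return (buf[off] & 0xff) == 0x1f
            && (buf[off + 1] & 0xff) == 0x8b
            && buf[off + 2] == 8            // compression method: deflate
            && buf[off + 3] == 4            // flags: extra field present
            && buf[off + 12] == 'B'         // assumes BC is the first subfield
            && buf[off + 13] == 'C'
            && buf[off + 14] == 2           // subfield length, little-endian
            && buf[off + 15] == 0;
    }

    /** Returns every offset in buf where a BGZF block could start. */
    static List<Integer> candidates(byte[] buf) {
        List<Integer> found = new ArrayList<>();
        for (int off = 0; off + HEADER_LEN <= buf.length; off++) {
            if (looksLikeBlockStart(buf, off)) {
                found.add(off);
            }
        }
        return found;
    }
}
```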