New in Hadoop-BAM 7.0 (Sep 11, 2014)
- Switching from Picard/Samtools to HTSJDK
- First release to OSS Sonatype
- Renaming of packages
- Changes to VariantContextCodec: encoding of genotypes generated by means other than VCF import (thanks to Joel Thibault)
- Change of JDK version requirements
New in Hadoop-BAM 6.2 (Apr 4, 2014)
- Bugfix: the Picard/Tribble update introduced a bug in VCFInputFormat that affects input files loaded from HDFS that are larger than one split.
- Update to pom.xml:
  - SNAPSHOT versions now use two digits, like releases
  - Added two missing dependencies for the CLI tool
New in Hadoop-BAM 6.1 (Mar 26, 2014)
- Update of Picard from 1.93 to 1.107
- Moving from ant to maven build
- Hadoop version 2 compatibility
- Bugfixes for BAM reading
New in Hadoop-BAM 6.0 (Jul 8, 2013)
- Input and output formats for VCF and BCF. There are no format-specific InputFormat or OutputFormat classes; instead, AnySAM-like support for both in one class is provided.
- Both compressed and uncompressed BCF can be read, but only compressed BCF can be output.
- This required adding three .jar files to the Hadoop-BAM distribution, two from Picard and one from Apache Commons: variant-.jar, tribble-.jar, and commons-jexl-.jar. All of these need to be provided in the Hadoop classpath or in '-libjars' when using CLI plugins that require the VCF/BCF functionality.
- Added new 'fixmate' plugin, akin to the samtools fixmate command or Picard's FixMateInformation, i.e. recomputing mate information in the input SAM/BAM files. Like 'sort', it can also merge multiple SAM/BAM files together.
- Added new 'vcf-sort' plugin, which sorts a single VCF or BCF input file while possibly performing format conversion, as it can also output either VCF or BCF.
- Updated provided Picard from 1.76 to 1.93. Note that a breaking change concerning the SeekableStream class occurred in Picard 1.84, so a version older than that may not be used together with this version of Hadoop-BAM.
- The FASTQ and QSEQ input formats can now skip records that have failed filtering: use the CONF_FILTER_FAILED_QC and CONF_INPUT_FILTER_FAILED_QC properties.
- The FASTQ input format now accepts Illumina identifiers with a blank index sequence.
- Fixed BAM records sometimes confusing the reference and mate reference indices, and not always updating the reference names appropriately.
- Fixed various small misdecodings and misencodings in QSEQ I/O.
- Fixed 'premature EOF' crashes on some BAM inputs.
- Fixed crash on headerless SAM inputs.
- Fixed CLI crash on startup in newer Hadoop versions (at least CDH 4.2.0).
- Other minor fixes.
New in Hadoop-BAM 5.1 (Nov 26, 2012)
- Removed the fi.tkk.ics.hadoop.bam.util.hadoop.BAMReader and fi.tkk.ics.hadoop.bam.util.hadoop.BAMSort classes, which were deprecated back in 3.0.
- MAJOR CHANGE: The command line plugins 'sort', 'summarize', and 'summarysort' now default to 1 reduce task. The number can be customized with the -r/--reducers command line argument. This bumps up the versions of the plugins to 4.0, 3.0, and 2.0 respectively.
- Fix: BAMRecordReader.getKey now hashes unmapped keys instead of randomizing them, to ensure consistent results.
- For compatibility with Hadoop 2.0 and any future Hadoop releases, custom Hadoop classes are now only built and used when using a Hadoop release that does not provide them. This means that bugs MAPREDUCE-1987 and MAPREDUCE-2538, which were previously fixed internally, may cause problems when using the MapReduce-using command line plugins with certain reducer counts.
- Fixed crash on some BAM inputs caused by a bug in fi.tkk.ics.hadoop.bam.BAMSplitGuesser.
- Fixed some Illumina identifier scanning issues in the FASTQ input format.
- Added FASTA input format.
- The command line plugins 'sort' and 'summarize' now use RandomSamplers for input partitioning, as they probably should have all along.
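
The getKey change above trades random keys for hashed ones. The following is a minimal sketch of that idea, not Hadoop-BAM's actual code: the class name, the method names, the 64-bit key layout, and the use of String.hashCode on a read name are all assumptions made for illustration. It shows why a hash keeps unmapped reads spread across key space while still giving the same read the same key on every run.

```java
import java.util.Random;

// Illustrative sketch only. The names and the key layout below are
// assumptions for this example, not Hadoop-BAM's real implementation.
final class UnmappedKeySketch {
    // Assumed layout: high 32 bits hold a reference index,
    // low 32 bits hold a position-like value.
    static long key(int refIdx, int low) {
        return ((long) refIdx << 32) | (low & 0xFFFFFFFFL);
    }

    // Randomized key: spreads unmapped reads over many keys, but the
    // key for a given read changes between runs and task retries.
    static long randomizedKey(Random rng) {
        return key(Integer.MAX_VALUE, rng.nextInt());
    }

    // Hashed key: still spreads reads over many keys, but the same
    // read name always maps to the same key.
    static long hashedKey(String readName) {
        return key(Integer.MAX_VALUE, readName.hashCode());
    }

    public static void main(String[] args) {
        // Same input, same key: repeated runs agree.
        System.out.println(hashedKey("read/1") == hashedKey("read/1")); // true
        // Two random keys need not agree, even for the same read.
        Random rng = new Random();
        System.out.println(randomizedKey(rng) == randomizedKey(rng));
    }
}
```

Placing Integer.MAX_VALUE in the high bits is one way to make such keys sort after every mapped read; whether the library does exactly this is not stated in these notes.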
New in Hadoop-BAM 5.0 (Oct 18, 2012)
- MAJOR CHANGE: Hadoop-BAM no longer depends on, or even provides, fi.tkk.ics.hadoop.bam.custom.samtools. In other words, users should now import Picard classes from Picard itself, i.e. net.sf.samtools.
- Fix data loss/duplication and crash-on-valid issues in SAM input.
- Fix FASTQ record writer to also write the flow cell ID and to emit null fields correctly.
- Fix crash on some inputs caused by a bug in fi.tkk.ics.hadoop.bam.custom.hadoop.InputSampler. (Not the same bug fixed in 4.0.)
- Updated provided Picard from 1.56 to 1.76.
- BAMRecordReader.getKey now randomizes the order of unmapped reads instead of giving them all the same key, improving performance since they can now be sent to different reduce tasks.
- AnySAMInputFormat now has a nullary constructor, allowing it to be used directly in Job.setInputFormatClass.
- FASTQ and QSEQ input formats now report isSplitable correctly for compressed files.
- QSEQ output format and record writer now use a Text instead of a NullWritable key.
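
The nullary constructor matters because Job.setInputFormatClass receives only a Class object, so the framework has to create the instance itself by reflection. The sketch below imitates that step with plain Java reflection. WithNullary, WithoutNullary, and instantiate are hypothetical stand-ins, not Hadoop or Hadoop-BAM classes.

```java
import java.lang.reflect.Constructor;

// Illustrative sketch only; every class name here is a hypothetical stand-in.
final class NullaryCtorSketch {
    static class WithNullary {
        WithNullary() {}
    }

    static class WithoutNullary {
        WithoutNullary(String required) {}
    }

    // Mimics a framework that is handed a Class and must build the object itself.
    static <T> T instantiate(Class<T> cls) throws ReflectiveOperationException {
        // Throws NoSuchMethodException when there is no zero-argument constructor.
        Constructor<T> ctor = cls.getDeclaredConstructor();
        ctor.setAccessible(true);
        return ctor.newInstance();
    }

    static boolean canInstantiate(Class<?> cls) {
        try {
            instantiate(cls);
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(WithNullary.class));    // true
        System.out.println(canInstantiate(WithoutNullary.class)); // false
    }
}
```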
New in Hadoop-BAM 4.0 (Oct 18, 2012)
- SAM input and output support. AnySAMInputFormat handles transparent support of both SAM and BAM inputs even in the same Hadoop job. For output, there is no SAMOutputFormat; only AnySAMOutputFormat, which can be used to output either SAM or BAM. BAMOutputFormat will be deprecated in the future.
- Fix longstanding regression in the embedded Picard library causing end-of-file markers to be written into BAM files by every reduce task. For this reason e.g. 'samtools view' refused to show the contents of BAM files output by Hadoop-BAM.
- Fix crash on some inputs caused by a bug in fi.tkk.ics.hadoop.bam.custom.hadoop.InputSampler.
- Fix possible crash-on-valid situations in heuristic BAM splitting.
- Various I/O classes from the Seal project are now incorporated. This includes input formats for FASTQ and QSEQ and an output format for QSEQ.
- Unmapped reads are now ordered after, not before, all other reads.
- Allow using Hadoop's "-libjars" command line argument instead of HADOOP_CLASSPATH to specify the Picard .jars. This ended up being fiendishly complicated and somewhat fragile.
- Partitioning files are now saved in the output, not input, directory.
- 'sort' plugin version 3.0:
  - Important bug fix for merging: conflicting IDs from different files weren't being properly corrected.
  - SAM input and output support. Can input SAM and BAM files at the same time and output to either format.
  - When not using -o, each reducer now outputs headers into the BAM files.
- 'view' plugin version 1.1, with SAM input support.
- Add new 'cat' plugin version 1.0, for concatenating SAM/BAM files. The main intended use case is joining the output of 'sort' when it is used without -o.
- 'summarize' plugin version 2.0, with SAM input support.
- SplittingBAMIndexer can now be used from within the library as well as a command line tool and can index files directly in HDFS.
- Various minor bug fixes.
- Lots of documentation updates.
- Various clarifications in the README.
- Much quieter error messages when plugin loading fails.
- build.xml now looks in the HADOOP_HOME environment variable for Hadoop jars. As a result, the required minimum version of Ant is now 1.7.1.
- fi.tkk.ics.hadoop.bam.custom is now compiled with warnings off, for less noisy builds.
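
The end-of-file marker fix in this release concerns the 28-byte empty BGZF block that the SAM/BAM specification defines as the EOF marker. When each reduce task appends one, a concatenated file contains markers in the middle as well as at the end. The sketch below is a rough way to spot that, assuming the whole file fits in memory. The class and method names are hypothetical, and the naive byte scan is a stand-in for properly walking BGZF blocks.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical. Uses a naive byte scan
// rather than parsing BGZF block headers.
final class BgzfEofSketch {
    // The 28-byte empty BGZF block used as the end-of-file marker.
    static final byte[] EOF_MARKER = {
        0x1f, (byte) 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00,
        0x00, (byte) 0xff, 0x06, 0x00, 0x42, 0x43, 0x02, 0x00,
        0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00
    };

    static boolean endsWithEofMarker(byte[] data) {
        int n = EOF_MARKER.length;
        if (data.length < n) return false;
        byte[] tail = Arrays.copyOfRange(data, data.length - n, data.length);
        return Arrays.equals(tail, EOF_MARKER);
    }

    // Counts non-overlapping occurrences of the marker anywhere in the data.
    static int countEofMarkers(byte[] data) {
        int n = EOF_MARKER.length;
        int count = 0;
        int i = 0;
        while (i + n <= data.length) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + n), EOF_MARKER)) {
                count++;
                i += n;
            } else {
                i++;
            }
        }
        return count;
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        // Two fake "part files", each ending in its own marker.
        byte[] part = concat(new byte[] {1, 2, 3}, EOF_MARKER);
        byte[] joined = concat(part, part);
        System.out.println(endsWithEofMarker(joined)); // true
        System.out.println(countEofMarkers(joined));   // 2
    }
}
```

A count above one means at least one marker sits before the end of the file.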
New in Hadoop-BAM 3.3 (Feb 23, 2012)
- Fix embedded Picard to not have an accidentally leftover dependency on Picard 1.47.
- Clarify some .jar dependencies in the README.
New in Hadoop-BAM 3.2 (Feb 23, 2012)
- Important bug fix to avoid looping infinitely on some BAM files.
New in Hadoop-BAM 3.1 (Feb 23, 2012)
- Important data loss bug fixes!
- 'sort' plugin updated to 2.0: it can now take multiple input files, merging them together.
- New 'summary' and 'summarysort' command line plugins, respectively for creating and sorting Chipster summary files. Not very generically useful; intended more as example code.
- Some minor command line argument handling bug fixes.
- Updated embedded and provided Picard from 1.47 to 1.56.
- As Hadoop-BAM now depends on Picard proper as well as the SAM-JDK, a compatible JAR, currently picard-1.56.jar, is distributed together with Hadoop-BAM.
New in Hadoop-BAM 3.0 (Feb 23, 2012)
- Plugin-extensible command line interface.
- The 'view', 'sort', and 'index' command line plugins. These supersede the fi.tkk.ics.hadoop.bam.util.hadoop.BAMReader and fi.tkk.ics.hadoop.bam.util.hadoop.BAMSort classes, which have much less functionality than the new plugins and are considered deprecated.
- Embedded Picard SAM-JDK parts updated from version 1.27 to 1.47.
- A compatible Picard SAM-JDK JAR, currently sam-1.47.jar, is now distributed together with Hadoop-BAM.
New in Hadoop-BAM 2.0 (Feb 23, 2012)
- Heuristic splitting of BAM and BGZF files: indexing is no longer required.
- build.xml now defaults to making a .jar file; there is no need for an explicit 'ant jar'.