Terrier Changelog

What's new in Terrier 3.0

Jun 17, 2011
  • Major update:
  • Support for indexing WARC collections; improved index structure layout; improved MapReduce mode indexing; refined, scalable structure access at retrieval time; moved all code to terrier.org namespace; added field-based and proximity term dependence models; added HTTP-based retrieval interface; added many JUnit tests. All indices must be rebuilt.
  • Indexing:
  • TR-14, TR-42, TR-56, TR-102: Various changes to the format of the index, to promote reuse, scalability and speed.
  • TR-17, TR-50, TR-54, TR-77: Added MetaIndex for document metadata. DOCNOs etc. need not be in lexicographical order.
  • TR-43, TR-48, TR-69, TR-70: Fields should contain frequency information.
  • TR-39, TR-40, TR-41, TR-46, TR-50, TR-83, TR-88: Various improvements and bug fixes to MapReduce indexing.
  • TR-44, TR-55: Improve robustness of single-pass indexing.
  • TR-71, TR-98: Allow Bit posting structures to be split across multiple files.
  • TR-28, TR-91: Index WARC collections (UK-2006, ClueWeb09).
  • TR-34: Documentation update: Property values for single-pass indexing are not scaled.
  • TR-37, TR-38, TR-47, TR-57, TR-78, TR-79, TR-93, TR-94: Generate the direct file from an inverted index as a MapReduce job.
  • Retrieval:
  • TR-20, TR-42, TR-64: Access the posting list for one term as a stream - see Posting and IterablePosting.
  • TR-86: Matching should be an interface.
  • TR-87: PorterStemmer output doesn't match the expected output given by Porter himself.
  • TR-81: Implements proximity term dependence models. For more information, see Configuring Retrieval.
  • TR-19: Support relevance feedback as well as pseudo-relevance feedback.
  • TR-68, TR-73, TR-74, TR-94: Implement field-based weighting models. For more information, see Configuring Retrieval.
  • TR-99: Provide way to integrate static doc prior easily. For more information, see Configuring Retrieval.
  • TR-90: MatchingQueryTerms does not retain query term order.
  • TR-26: Parse Million Query track topic files.
  • TR-49: Let TRECQuerying filename be predetermined by property.
  • TR-75: Allow the runtag to be set in runs.
  • TR-60: Removed PonteCroft language modelling.
  • TR-66, TR-84: Refactor TRECQuery.
  • TR-67: Request object should contain the Index.
  • TermScoreModifiers have been deprecated, and no longer work. You should use WeightingModel instead.
  • Testing:
  • Added a considerable number of end-to-end and unit tests.
  • TR-59: Fixed reset problem in Terrier evaluation tool.
  • TR-76: Bump JUnit version.
  • Desktop:
  • TR-61: Desktop example app should use MetaIndex.
  • Other:
  • TR-89: Check all .java and .sh files have Terrier license header.
  • TR-82: Have a simple webapps search results interface.
  • TR-80: Move code to terrier.org Java package namespaces.
  • TR-45: Add (read|write)(Delta|Golomb) etc to BitIn/BitOut.
  • TR-52: FSOrderedMapFile causes seek(-1) when searching for an entry less than the first.
  • TR-72: FSOrderedMapFile.EntryIterator.skip() breaks FSOrderedMapFile.EntryIterator.hasNext().
  • TR-95: FSArrayFile.ArrayFileIterator.skip() does not update entry index correctly.
  • TR-92: utility.io.CountingInputStream does not count single bytes correctly.
  • TR-53: Rounding.toString() doesn't work for 10dp.
  • TR-62: Files layer can transparently cache files.
  • TR-2, TR-65: Replace Terrier's Makefile with Ant build.xml. Makefile, compile.sh, compile.bat have now been removed. See Developing with Terrier to see how to compile Terrier.
  • TR-63, TR-101: Documentation updates.
  • TR-100: Update default and sample terrier.properties files.
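
The variable-length integer codes named in TR-45 can be illustrated with a small sketch. The following is not Terrier's BitIn/BitOut code: the class EliasCodeSketch and its methods are hypothetical names, it writes bits as a string of '0' and '1' characters for readability, and it shows the textbook Elias gamma and delta codes only (Golomb coding is not shown). The bit layout used by Terrier itself may differ.

```java
// Illustrative sketch only: textbook Elias gamma and delta codes.
// Class and method names are hypothetical, not part of Terrier.
final class EliasCodeSketch {

    /** Elias gamma code of n (n >= 1), as a string of '0'/'1' characters. */
    static String gamma(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        String binary = Integer.toBinaryString(n); // always starts with '1'
        StringBuilder out = new StringBuilder();
        // Prefix: one '0' for every bit after the leading '1'.
        for (int i = 1; i < binary.length(); i++) {
            out.append('0');
        }
        return out.append(binary).toString();
    }

    /** Elias delta code of n (n >= 1): gamma(bit length) + bits after the leading '1'. */
    static String delta(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        String binary = Integer.toBinaryString(n);
        return gamma(binary.length()) + binary.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(gamma(5));  // prints 00101
        System.out.println(delta(10)); // prints 00100010
    }
}
```

Small values get short codes, which is why such codes suit the small gaps between document ids in a compressed posting list.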

New in Terrier 2.2.1 (Jun 17, 2011)

  • Minor update - fixes some small bugs in 2.2 relating to MapReduce mode indexing, clarifies some documentation, and includes a missing source file.
  • Added missing source file for SkipTermPipeline.
  • Clarified java documentation for single-pass indexing memory control.
  • TR-8: Delay index path checking during indexing till HDFS is loaded.
  • TR-7: Files.list() does not work for HDFS paths.
  • TR-4: Update the year for the copyright to 2009.
  • TR-3: Partitioned Mode fails unexpectedly due to missing run status files.

New in Terrier 2.2 (Jun 17, 2011)

  • Substantial update, consisting of new support for Hadoop, a Hadoop MapReduce indexing system, and various minor improvements and bug fixes. This is intended to be the final release in the 2.x series.
  • Indexing:
  • Added new Hadoop MapReduce indexing system, and corresponding support for Hadoop MapReduce jobs.
  • Refactoring of various indexing (in particular single pass indexing) to support MapReduce indexing.
  • Block indexing can now use marker tokens to designate block boundaries. See Configuring Indexing for more details.
  • Indexing supports named tokens which should not be passed through the term pipeline. See Configuring Indexing and SkipTermPipeline for more details.
  • TRECCollection and TRECUTFCollection now index CDATA sections, to support indexing of CDIP1 collection, as used by the TREC Legal track.
  • SimpleXMLCollection now indexes CDATA sections. Thanks to Giovanni Stilo (University of Roma "Tor Vergata").
  • LexiconBuilder no longer uses java.io.File to create temporary directories.
  • Double the memory reserve threshold for single-pass indexing on 64-bit Sun JVMs.
  • Record docno.byte.length setting in index properties file, so that indices with different docno lengths can be loaded at once.
  • CollectionFactory now supports non-default Collection constructors.
  • Retrieval:
  • BUG: Lexicon and sub-classes could give incorrect results for very large lexicons (15M terms+) due to overflow of file offsets. Thanks to Giovanni Stilo (University of Roma "Tor Vergata") and Gianni Amati (Fondazione Ugo Bordoni).
  • BUG: Lexicon hashing would produce NPE when no term in lexicon had same initial character as query term. Thanks to Gianni Amati (Fondazione Ugo Bordoni).
  • BUG: .res.setting files not created for query expansion runs.
  • BUG: rocchio_beta was property name while rocchio.beta was documented. rocchio.beta is now the property, while rocchio_beta is supported but deprecated.
  • WeightingModels are now cloneable.
  • Other:
  • Added HadoopPlugin, allowing the HDFS filesystem paths (e.g. hdfs://namenode:9000/path/to/file) to be accessed directly from Terrier. Hadoop MapReduce jobs can also be created, using Hadoop on Demand if so-configured.
  • Filesystem layer can now support deleteOnExit() semantics for filesystems.
  • Worked to reduce javac generics and other compiler warnings.
  • BUG: Adding a Terrier Application plugin no longer causes an NPE.
  • Added method to reconfigure ApplicationSetup during MapReduce jobs.
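
The lexicon offset overflow listed under Retrieval above is the kind of bug that appears when a file offset is computed in 32-bit arithmetic. A minimal sketch follows, assuming a fixed-length lexicon entry; the entry length of 150 bytes and the class OffsetOverflowSketch are hypothetical and are not taken from Terrier's source.

```java
// Illustrative sketch only: why int arithmetic breaks file offsets in large files.
// ENTRY_LENGTH is a hypothetical value chosen for the example.
final class OffsetOverflowSketch {

    static final int ENTRY_LENGTH = 150; // bytes per lexicon entry (assumption)

    /** Buggy variant: the multiplication is done in int and silently wraps. */
    static int offsetAsInt(int entryIndex) {
        return entryIndex * ENTRY_LENGTH;
    }

    /** Fixed variant: widen to long before multiplying. */
    static long offsetAsLong(int entryIndex) {
        return (long) entryIndex * ENTRY_LENGTH;
    }

    public static void main(String[] args) {
        int entry = 15_000_000; // roughly the lexicon size mentioned in the bug report
        System.out.println(offsetAsInt(entry));  // negative: the int has wrapped
        System.out.println(offsetAsLong(entry)); // prints 2250000000
    }
}
```

With these assumed numbers the true offset exceeds Integer.MAX_VALUE (2,147,483,647), so the int version yields a negative value, while the long version is correct.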

New in Terrier 2.1 (Jun 17, 2011)

  • Minor update, containing mostly bug fixes, and a FileSystem abstraction layer.
  • Indexing:
  • BUG: trec_terrier -i -d then trec_terrier -i -v doesn't work.
  • BUG: Indexing on Windows doesn't work as the index properties output streams aren't closed.
  • BUG: DocumentExtraction doesn't work (thanks Brantman).
  • BUG: LM Indexing throws NPE if Index doesn't exist (thanks Brantman).
  • BUG: Mixed-case TrecDocTag specifications don't work (thanks Marco Bianchi and Giovanni Stilo, University of Rome "Tor Vergata").
  • BUG: Corner-case HTML terms from HTMLDocument and TRECDocument may contain whitespace. trim() terms in both classes to prevent indexing problems (thanks to Carlos M Lorenzetti).
  • Allow case-sensitive and case-insensitive TrecDocTag specifications. Default is case-sensitive, change using TrecDocTags.casesensitive property.
  • Add a useful error message when DOCNO is longer than docno.byte.length.
  • Retrofit all indexing file IO to Files class.
  • Retrieval:
  • Retrofit all retrieval file IO to Files class.
  • Added a new TREC topics tokeniser for non-English topics. This is automatically used by TRECQuery if string.use_utf is set.
  • Allow case-sensitive and case-insensitive TrecQueryTag specifications. Default is case-insensitive, change using TrecQueryTags.casesensitive property.
  • Interactive Querying: Added properties to allow model, manager and matching to be controlled: interactive.manager, interactive.matching, and interactive.model.
  • Try harder to allow weighting models to use their default parameter value, if no parameter value is specified.
  • Desktop:
  • Added a property desktop.indexing.singlepass to the DesktopTerrier to allow the use of single-pass indexing.
  • Added properties to allow model, manager and matching to be controlled: desktop.manager, desktop.matching, and desktop.model.
  • Testing:
  • Test the trec_terrier -i -d then trec_terrier -i -v indexing strategy.
  • Other:
  • Trec_setup.bat should sort collection.spec file: change to FindFiles.
  • Added missing environment options to trec_terrier.bat and anyclass.bat.
  • Fix the documentation concerning specifying TREC topic file tags in Configuring Retrieval.
  • Add API to TRECQrelsInMemory for getting docnos of pooled documents with graded/non-graded relevance assessments.
  • Add the FileSystem abstraction layer, which allows various types of files to be accessed through a uniform API. For example, indexing an HTTP Web page is as straightforward as indexing a local document. The FileSystem abstraction layer is described in detail in Developing Terrier.
  • Removed deprecated code in CollectionStatistics.
  • Removed entire uk.ac.gla.terrier.structures.trees package which had been deprecated since 1.1.0. Indexing code based on the 1.0.x release series may break.
  • FindBugs: Some minor changes to the source-code reflect issues identified by FindBugs. We continue to work to address issues raised by FindBugs.

New in Terrier 2.0 (Jun 17, 2011)

  • Major update, integrating a new (alternative) single-pass indexing architecture, and a new index format. Some bug fixes and some change of APIs.
  • Indexing:
  • New single-pass indexing, thanks to Roi Blanco (University of A Coruña). This indexing method is faster, can build larger indices, and is more robust to memory usage during indexing. To use single-pass indexing, add the -j option to TrecTerrier: bin/trec_terrier.sh -i -j. For more information on single-pass indexing, see Configuring Indexing, or TREC Example for examples of its use.
  • New index on disk format: prefix.log has been replaced by prefix.properties, which records much more metadata about an index, including the Terrier version used to create it, the classes (and their constructor parameters) to read the Index, and the statistics of the index. The compressed data structures (DirectIndex and InvertedIndex) have also changed (the endian-ness of the index format changed, for efficiency reasons). However, Terrier 2.0 can read all Terrier 1.x indices, so indices do not have to be upgraded or recreated.
  • The format of blocks (term positions) has been altered slightly, so that a term can exist in 0 blocks in a document. Again, this change does not require older indices to be upgraded or recreated.
  • Retrieval:
  • The new DFRee DFR weighting model is included, which provides robust performance on a range of test collections.
  • Query parsing is now performed by Manager.newSearchRequest(String,String). Direct access to the query parser is no longer recommended, as the query parsing is likely to evolve in future versions of Terrier. See Extending Retrieval for the recommended querying code for applications.
  • Desktop:
  • BUG: When no terrier.properties present, the Desktop cannot perform query expansion using qe:on control.
  • BUG: Set reasonable defaults when no terrier.properties file is open.
  • Testing:
  • Merchant of Venice test harness now tests indexing using TRECUTFCollection.
  • Merchant of Venice test harness now tests indexing using merging.
  • Merchant of Venice test harness now tests single-pass indexing.
  • Merchant of Venice test harness now tests required (+/-) term queries.
  • Other:
  • Documentation: We provide indexing times and recommended parameter settings with corresponding MAP performances in Example TREC Experiments.
  • Deprecation: Methods which do not return a LexiconEntry have been deprecated in the Lexicon.
  • FindBugs: Many minor changes to the source-code reflect issues identified by FindBugs. We continue to work to address issues raised by FindBugs.
  • Upgraded trove4j to latest version.

New in Terrier 1.1.1 (Jun 17, 2011)

  • Minor update. Mostly bug fixes. Some minor code enhancements, plus the inclusion of a test harness. Snowball stemmers were added to boost support for languages other than English. This will likely be the last release in the 1.x.x series.
  • Indexing:
  • BUG: When merging block indices, ensure that the resulting inverted index has blocks.
  • BUG: Field indexing not working properly.
  • BUG: Block ids recorded incorrectly when fields are enabled.
  • BUG Resilience: Don't throw NPE in SimpleFileCollection if no files are processed.
  • BUG Resilience: Don't throw exceptions if index has no terms/documents - fail more gracefully (LexiconBuilder, Indexer).
  • When parsing a TREC-like document collection, use Streams at TRECCollection level, and Reader at Document level. This allows easier change of encoding, etc.
  • Retrieval:
  • BUG: When retrieving phrases, prevent Exception from debugging code in Manager.
  • BUG: Regression when retrieving phrases, some documents not matched.
  • BUG: DFRWeightingModel breaks when first normalisation or tf normalisation is not specified.
  • BUG Resilience: Do not throw NPE in ExpansionTerms if original query terms are not set by client code.
  • Create a .settings file for each TREC results file, so that it is easy to determine the setting for a run.
  • Added an alternative batch query parser, known as SingleLineTRECQuery, mostly to support the test harness.
  • Desktop:
  • BUG: PDF parsing too noisy through log4j, indexing may never finish. Turned down default logging level to info.
  • BUG: Logging may not appear for indexing Terrier's own documentation. Indexing is now run in a new Thread, not via SwingUtilities.invokeLater().
  • Other:
  • Tokenisation: Added Snowball stemmers. For more information, see documentation on Non English language support.
  • Java: Various Java Generics changed.
  • Testing: Added test harness, which checks that the correct documents are retrieved for various queries and index formats. Uses Shakespeare's Merchant of Venice play for the test document collection.
  • Shell scripts: Take notice of TERRIER_ETC environment variable and pass to Terrier.
  • Shell scripts: added anyclass.bat.

New in Terrier 1.1.0 (Jun 17, 2011)

  • Major update. Many changes to the source code, including more robust indexing and index structure merging.
  • Indexing:
  • The indexing architecture has been updated for Terrier 1.1.0; however, indices created with 1.1.0 are completely compatible with those created with 1.0.x, and vice versa.
  • Separated string.byte.length property into two properties: max.term.length and docno.byte.length.
  • Allow UTF characters in indexing, and use a compatible method for saving these in the Lexicon. This enables Terrier to be used for non-English languages. Set string.use_utf to true when indexing, and use TRECUTFCollection to parse the collection.
  • Merge multiple temporary lexicons at once in LexiconBuilders. 16 seems to be a good default setting.
  • Don't use tree structures for indexing, as they are slower and larger (20% indexing speed improvement). New classes DocumentPostingList and LexiconMap.
  • Writing structures (direct and inverted) flush normally, to reduce memory consumption.
  • Add lexicon hashing to reduce size of binary search.
  • LookAheadStream and LookAheadReader are now case-sensitive, as the String.toUpperCase() affects indexing speed.
  • When the current indexing hits a threshold, finish it, then start a new index. Merge indices at completion. See Indexer.
  • Added code for merging indices - see StructureMerger and BlockStructureMerger.
  • Added a CollectionFactory, to allow Collections to wrap other Collections.
  • TRECCollection no longer throws exception when used for re-indexing and docPointers.col exists.
  • Retrieval:
  • CollectionStatistics is now non-static.
  • Added Hiemstra LM and Lemur TF_IDF weighting models.
  • BUG: Lexicon would match prefixes of terms when the desired term does not exist in the Lexicon.
  • Use a LexiconEntry, to support easier thread-safety with the Lexicon.
  • Added generic DFRWeightingModel, which can generate many DFR document weighting models. More information in Extending Retrieval.
  • Other:
  • Improved documentation.
  • Java: Move to Java 1.5 source, and upgrade GNU Trove jar.
  • Logging: Use log4j throughout source. Log4j config can be read from etc/terrier-log.xml.
  • Java: Various source code changes, to allow easier extension and re-use.
  • Compiling: Included compile.bat, by Jurrie Overgoor (Univ of Twente).
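
The lexicon hashing item above narrows the range that a binary search has to cover. A minimal sketch of that idea follows, assuming the hash is keyed on a term's initial character (as the 2.2 bug note about the "same initial character" suggests) and that terms are non-empty and sorted in natural String order. LexiconHashSketch is a hypothetical class, not Terrier's Lexicon.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: narrow a binary search using the term's first character.
final class LexiconHashSketch {

    private final String[] terms; // sorted, non-empty strings (assumption)
    // Maps an initial character to the half-open index range [from, to) of its terms.
    private final Map<Character, int[]> ranges = new HashMap<Character, int[]>();

    LexiconHashSketch(String[] sortedTerms) {
        this.terms = sortedTerms.clone();
        for (int i = 0; i < terms.length; i++) {
            char initial = terms[i].charAt(0);
            int[] range = ranges.get(initial);
            if (range == null) {
                ranges.put(initial, new int[] {i, i + 1});
            } else {
                range[1] = i + 1; // extend the range to include this term
            }
        }
    }

    /** Returns the index of the term, or -1 if it is not in the lexicon. */
    int find(String term) {
        if (term.isEmpty()) {
            return -1;
        }
        int[] range = ranges.get(term.charAt(0));
        if (range == null) {
            return -1; // no term shares this initial character
        }
        int pos = Arrays.binarySearch(terms, range[0], range[1], term);
        return pos >= 0 ? pos : -1;
    }
}
```

The explicit null check covers a query term whose initial character does not occur in the lexicon at all, which is the case that must not raise an exception.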

New in Terrier 1.0.2 (Jun 17, 2011)

  • BUG: Language modelling didn't index properly when block indexing was enabled.
  • BUG: Lexicon merging now compares strings the same way as the LexiconTree outputs them, to ensure sorting is correct.
  • BUG: Block ids are correctly recorded in the inverted index for large collections.
  • BUG: Block ids are correctly read from the direct index.
  • BUG: The phrase score modifier has been rewritten to a more correct implementation.
  • BUG: HTML Stack only lives for one document.
  • BUG: Cropping the resultset did not function properly with metadata.
  • BUG: If more than one control mapped to a post(process/filter) then only the last one would be noted. This is now fixed, and simpler data structures are used for the controls and the post(process/filter).
  • TREC: During indexing, start indexing from the beginning of a new file, not from the previous state.
  • TREC: Added trec.collection.class property to allow TRECIndexing to determine the TREC class to be used during indexing.
  • Added DLH Divergence From Randomness model - this hyper-geometric weighting model is completely parameter free and is very robust over many test collections.
  • Query Parser: Allow characters in the extended character set to be in terms.
  • LookAheadReader: Corrected implementation of Reader interface to give better support wrt EOF and subsequent method calls.
  • Added more TermPipeline classes: CropTerm, DumpTerm.
  • Updated and organised documentation and Javadoc.

New in Terrier 1.0.1 (Jun 17, 2011)

  • BUG 1: bin/interactive_terrier.bat doesn't run the correct class.
  • BUG 2: bin/compile.sh compiles the ANTLR parser correctly.
  • BUG: Lexicon binary search failed when searching for the last entry. Binary search has been updated.
  • Document Index binary search made more robust for different types of documentIds.
  • Desktop Terrier: starts new threads using correct Swing utility API.
  • Desktop Terrier: close PDF documents correctly.
  • Desktop Terrier: search text logging is slightly more robust.
  • Desktop Terrier: always disable search tab while indexing.
  • Desktop Terrier: temporary lexicon folders are deleted if they exist in the index folder before indexing.
  • Desktop Terrier: process only 25,000 terms at a time during block inverted index building, as a heap space of only 120MB is restrictive.
  • TREC: Model, QEModel & C value are displayed correctly in TREC querying and results file.
  • Documentation: Removed Known Issue 1 from doc/todo.html.
  • Documentation: Updated javadoc in ApplicationSetup.java.
  • Documentation: Added more details about compiling in doc/terrier_develop.html.
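
The lexicon binary search fix above concerns a boundary case. As a minimal sketch, assuming a sorted in-memory array rather than Terrier's on-disk lexicon, the search below uses inclusive bounds so that the first and last entries are both reachable and a missing key returns -1. BinarySearchSketch is a hypothetical name.

```java
// Illustrative sketch only: binary search with inclusive bounds over a sorted array.
final class BinarySearchSketch {

    /** Returns the index of key in sorted, or -1 if it is absent. */
    static int find(String[] sorted, String key) {
        int lo = 0;
        int hi = sorted.length - 1; // inclusive upper bound, so the last entry is checked
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow of lo + hi
            int cmp = sorted[mid].compareTo(key);
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1; // also reached for an empty array or a key before the first entry
    }

    public static void main(String[] args) {
        String[] lexicon = {"antonio", "bassanio", "portia", "shylock"};
        System.out.println(find(lexicon, "shylock")); // prints 3
        System.out.println(find(lexicon, "aaron"));   // prints -1
    }
}
```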

New in Terrier 1.0.0 (Jun 17, 2011)

  • New Indexing APIs, that allow more diverse forms of collections to be easily indexed.
  • New Querying API and languages (eg fields, phrases, proximity, requirements).
  • More Statistical IR Models: tf-idf, BM25, Divergence From Randomness models, and Ponte-Croft language model.
  • More example applications, including a Desktop Search application.