Heritrix Changelog

What's new in Heritrix 3.1.1

Aug 7, 2012
  • Nicer code editor for crawl config and script console (HER-2001)
  • The crawl configuration cxml editor and the scripting console editor now use CodeMirror, which adds syntax highlighting, line numbers, and other features.
  • Fixed occasional mangling of DNS records in ARCs and WARCs (HER-1983)
  • Fixed a longstanding bug in which unsafe use of a variable shared among threads caused some DNS records in ARCs and WARCs to be mangled.
  • Remember all surts across checkpoint/resume (HER-1985)
  • Surts derived from seeds, listed as surts in the seeds source, or added using a .seeds file in the action directory can now be remembered across checkpoint/resume. For this to work, the relevant SurtPrefixedDecideRule must be a top-level bean. The default cxml distributed with Heritrix now includes the key decide rule as a top-level bean with id "acceptSurts".
  • Support for saving script state (HER-1984)
  • Added a shared map for arbitrary use during a crawl. It can hold state that persists for the duration of the crawl and is shared among ScriptedProcessor, the scripting console, and other scripts, or it can serve other purposes. In scripts it can be obtained with appCtx.getData().
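
As a rough illustration of the shared-map idea, here is a minimal Java sketch that keeps a per-host counter in a map shared by several callers. It does not use Heritrix itself: AppCtxStandIn is a hypothetical stand-in for the real application context, its getData() return type is an assumption, and the "visits:" key prefix and recordVisit helper are invented for the example. Only the appCtx.getData() call name comes from the note above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SharedCrawlStateSketch {

    /** Hypothetical stand-in for the crawl's application context (appCtx). */
    static class AppCtxStandIn {
        private final ConcurrentMap<Object, Object> data = new ConcurrentHashMap<>();

        // Mirrors the appCtx.getData() call named in the changelog.
        // The concrete map type returned here is an assumption.
        ConcurrentMap<Object, Object> getData() {
            return data;
        }
    }

    /**
     * Atomically increments a per-host counter stored in the shared map
     * and returns the new count. The "visits:" key prefix is illustrative.
     */
    static int recordVisit(AppCtxStandIn appCtx, String host) {
        Object updated = appCtx.getData().merge(
                "visits:" + host,
                1,
                (oldVal, one) -> (Integer) oldVal + (Integer) one);
        return (Integer) updated;
    }

    public static void main(String[] args) {
        AppCtxStandIn appCtx = new AppCtxStandIn();
        // Different scripts or processors would each call this with the
        // same context, so they all see and update the same state.
        recordVisit(appCtx, "example.org");
        recordVisit(appCtx, "example.org");
        recordVisit(appCtx, "example.com");
        System.out.println(appCtx.getData());
    }
}
```

Using an atomic update such as merge, rather than a separate get followed by put, matters when several crawler threads touch the same key at once. In an actual script the stand-in would be replaced by the appCtx object that Heritrix provides.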

New in Heritrix 3.0.0 (Aug 7, 2012)

  • List of classes is not present in select menu for DecideRules
  • WARC metadata records should declare MIME-type 'application/warc-fields' (rather than 'text/anvl')
  • Bottleneck in StatisticsTracker.saveSourceStats?
  • META http-equiv refresh content containing only a number misinterpreted as a URI