Web-Harvest Changelog

What's new in Web-Harvest 2.001

Mar 31, 2010
  • Plug-in mechanism enabled - users may develop their own processors and seamlessly integrate them into Web-Harvest.
  • New processors developed:
      • database: perform select/insert/update/delete operations against a specified database (a JDBC driver is required on the classpath).
      • mail: send emails with optional attachments.
      • zip: create ZIP archives with specified content.
      • ftp: access an FTP server and perform common operations: list, get, put, del, etc.
      • tokenize: split text into a list of elements.
      • json-to-xml: convert a JSON-formatted value to XML.
      • xml-to-json: convert XML to a JSON-formatted value.
  • file processor updated with an action to list files matching a specified name filter.
  • http processor updated to support multipart forms (enabling file uploads).
  • charset and delimiter attributes added to text processor.
  • empty attribute added to loop and while processors to prevent large results from accumulating and causing memory leaks. This replaces the need to put an empty processor inside the loop body.
  • Several new attributes added to regexp processor to enable regular expression fine-tuning.
  • Complete access to HTTP response headers.
  • GUI improvements:
      • Simple debugging added: user may define breakpoints where execution pauses and runtime values can be seen.
      • Charset selection enabled in settings dialog for configuration files.
      • Editor auto-completion improvements - auto-completion is available for attribute values wherever possible.
      • Editor improvements: copying lines/selection, deleting lines, (un)commenting XML fragments.
      • List of recently opened files added to File menu.
  • Dependency libraries updated:
      • HtmlCleaner updated to version 2.1.
      • Saxon updated to version 9 (XSLT 2.0, XQuery 1.0, XPath 2.0).
  • A number of new attributes supported in html-to-xml processor.
  • A number of bug fixes.
  • Java 1.4 is no longer supported - JRE 1.5 or higher is required.

New in Web-Harvest 1.0 (Mar 31, 2010)

  • A graphical user interface is introduced, providing an environment for easier configuration development and testing.
  • The html-to-xml processor, which is based on HtmlCleaner, now exposes attributes for controlling the cleaner's behaviour.
  • Besides the BeanShell scripting engine, two others are added: Groovy and JavaScript. It is now possible to choose a preferred scripting engine or even mix them in a single Web-Harvest configuration. This is supported by new attributes on the config, script and template processors.
  • Access to the HTTP client is supported through the new implicit context variable http. It is now possible to check important HTTP response values, such as http.mimeType, http.headers and http.statusCode, or even to obtain the instance of the org.apache.commons.httpclient.HttpClient class with http.client and manipulate it at runtime.
  • New attribute cookie-policy added to the http processor, specifying the way HttpClient manages cookies.
  • Command-line use is improved by adding several new parameters.
  • To make Web-Harvest context variables more comfortable to use in the script engines' runtime scopes, several handy methods are added to the class org.webharvest.runtime.variables.Variable (the interface IVariable in previous versions of Web-Harvest).
  • Several useful methods added to the implicit Web-Harvest context variable sys, such as sys.xpath(expression, xml), sys.isVariableDefined(varname) and sys.defineVariable(varName, varValue, [overwrite]).
  • Attribute overwrite added to the var-def processor, making it possible to specify whether an existing variable with the specified name will be overwritten or not.
  • A new processor is introduced to support a conditional execution break.
  • Encoding selection in the http processor is changed - if not explicitly specified with the charset attribute, the encoding given in the HTTP response is used to read the downloaded text content.
  • NTLM proxy authentication scheme is supported.
  • Performance improvements and bug fixes.

New in Web-Harvest 0.5 (Mar 31, 2010)

  • html-to-xml parser is changed - HtmlCleaner is used instead of TagSoup. The downside is that some existing Web-Harvest configurations may need corrections to their XPath or XQuery processors. On the other hand, many previously existing problems are now solved.
  • The script processor is introduced. It adds scripting support based on the BeanShell scripting language. See the User manual for a more detailed description, and see the example illustrating its power.
  • The template processor is now also based on BeanShell instead of OGNL, making it possible to share the same variables and methods with script processing.
  • Optional attribute type is now added to xq-param, defaulting to node()*. It specifies the type of an external XQuery parameter. Before Web-Harvest 0.5 this parameter was implicitly declared at the beginning of the XQuery expression and was always of node()* type. From now on, each parameter defined with xq-param requires a matching explicit declaration inside xq-expression (declare variable $var_name as var_type external;). For more details see the User manual and the example showing the usage of XQuery inside Web-Harvest.
  • A couple of new constructors are added to the ScraperConfiguration class, allowing a configuration to be loaded from a URL or from an arbitrary input stream.
  • file and include processors now support both absolute and relative paths. File paths are regarded as absolute if they begin with X:, /, or \, where X is a letter.
  • To avoid ambiguity when exchanging values with script and template processing, Web-Harvest variables are case-sensitive as of this version.
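
The absolute-path rule noted above for the file and include processors (a path is absolute if it begins with X:, /, or \, where X is a letter) can be illustrated with a small Java sketch. This is not Web-Harvest's actual implementation: the class PathRule and the method isAbsolute are hypothetical names introduced here, and treating "letter" as an ASCII letter is an assumption.

```java
// Hypothetical illustration of the path rule stated in the changelog.
// PathRule and isAbsolute are not part of the Web-Harvest API.
final class PathRule {

    private PathRule() {
    }

    // Returns true if the path starts with '/', '\', or a drive prefix
    // such as "C:". Anything else is treated as a relative path.
    static boolean isAbsolute(String path) {
        if (path == null || path.isEmpty()) {
            return false;
        }
        char first = path.charAt(0);
        if (first == '/' || first == '\\') {
            return true;
        }
        // Assumption: "letter" means an ASCII letter A-Z or a-z.
        boolean asciiLetter = (first >= 'A' && first <= 'Z')
                || (first >= 'a' && first <= 'z');
        return asciiLetter && path.length() >= 2 && path.charAt(1) == ':';
    }

    public static void main(String[] args) {
        String[] samples = {
            "C:\\data\\input.xml", "/etc/hosts", "\\share\\input.xml", "configs/input.xml"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> absolute: " + isAbsolute(sample));
        }
    }
}
```

Under this reading, a path such as configs/input.xml is relative, while C:\data\input.xml, /etc/hosts and \share\input.xml are absolute.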