GATE Changelog

What's new in GATE 7.1 Build 4485

Mar 12, 2013
  • New plugins:
  • The TermRaider plugin provides a toolkit and sample application for term extraction.
  • Two new plugins, Tagger_Zemanta and Tagger_Lupedia, provide PRs that wrap online annotation services provided by Zemanta and Ontotext.
  • A new plugin named Coref_Tools includes a framework for fast co-reference processing, and one PR that performs orthographical co-reference in the style of the ANNIE Orthomatcher.
  • A new Configurable Exporter PR in the Tools plugin allows annotations and features to be exported in formats specified by the user (e.g. for use with external machine learning tools).
  • Support for reading a number of new document formats has also been added:
  • PubMed and the Cochrane Library formats.
  • CoNLL “IOB” format.
  • MediaWiki markup, both plain text and XML dump files such as those from Wikipedia.
  • In addition, “ready-made applications” have been added to many existing plugins (notably the Lang_* non-English language plugins) to make it easier to experiment with their PRs.
  • Library updates:
  • Updated the Stanford Parser plugin to version 2.0.4 of the parser itself, and added run-time parameters to the PR to control the parser’s dependency options.
  • The Measurement and Number taggers have been upgraded to use JAPE+ instead of JAPE. This should result in faster processing, and also allows for more memory-efficient duplication of PR instances, e.g. when a pool of applications is created.
  • The OpenNLP plugin has been completely revised to use Apache OpenNLP 1.5.2 and the corresponding set of models.
  • GATE Embedded API changes:
  • Some of the most significant changes in this version are “under the bonnet” in GATE Embedded:
  • The class loading architecture underlying the loading of plugins and the generation of code from JAPE grammars has been re-worked. The new version allows for the complete unloading of plugins and for better memory handling of generated classes. Different plugins can now also use different versions of the same 3rd party libraries. There have also been a number of changes to the way plugins are (un)loaded which should provide for more consistent behavior.
  • The GATE XML format has been updated to handle more value types (essentially every data type supported by XStream should be usable as a feature name or value). Files in the new format can be opened without error by older GATE versions, but the data for the previously-unsupported types will be interpreted as a String containing an XML fragment.
  • The PRs defined in the ANNIE plugin are now described by annotations on the Java classes rather than explicitly inside creole.xml. The main reason for this change is to enable the definitions to be inherited by any subclasses of these PRs. Creating an empty subclass is a common way of providing a PR with a different set of default parameters (this is used extensively in the language plugins to provide custom gazetteers and named entity transducers). This has the added benefit of ensuring that new features also automatically percolate down to these subclasses. If you have developed your own PR that extends one of the ANNIE ones, you may find it has acquired new parameters that were not there previously; you may need to use the @HiddenCreoleParameter annotation to suppress them.
  • The corpus parameter of LanguageAnalyser (an interface most, if not all, PRs implement) is now annotated as @Optional as most implementations do not actually require the parameter to be set.
  • When saving an application the plugins are now saved in the same order in which they were originally loaded into GATE. This ensures that dependencies between plugins are correctly maintained when applications are restored.
  • API support for working with relations between annotations was added.
  • The method of populating a corpus from a single file has been updated to allow any mime type to be used when creating the new documents.
  • And numerous smaller bug fixes and performance improvements…
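
The CoNLL "IOB" format listed among the new document formats marks each token as beginning an entity (B-TYPE), continuing one (I-TYPE), or lying outside any entity (O). As a rough illustration only, the sketch below groups such tags into token spans. It is not GATE's reader: the class name IobSpans, the "TYPE:start-end" output strings, and the lenient handling of a stray I- tag are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IobSpans {

    /**
     * Groups IOB tags into spans written as "TYPE:start-end", where start is
     * inclusive and end is exclusive (token indices). An I- tag that does not
     * continue the currently open span is treated as starting a new span
     * (an assumption; stricter readers may reject such input).
     */
    public static List<String> decode(List<String> tags) {
        List<String> spans = new ArrayList<>();
        int start = -1;
        String type = null; // type of the currently open span, or null
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            boolean continues = tag.startsWith("I-") && tag.substring(2).equals(type);
            if (!continues) {
                if (type != null) {
                    spans.add(type + ":" + start + "-" + i); // close the open span
                    type = null;
                }
                if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    start = i;
                    type = tag.substring(2);
                }
            }
        }
        if (type != null) {
            spans.add(type + ":" + start + "-" + tags.size());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Tokens: John Smith visited London
        System.out.println(decode(Arrays.asList("B-PER", "I-PER", "O", "B-LOC")));
    }
}
```

Each resulting span would correspond to one annotation over the covered tokens; mapping token indices to character offsets is left out of the sketch.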

New in GATE 7.0 (Mar 12, 2013)

  • Major new features:
  • The CREOLE Plugin Manager has been completely re-written and now includes support for installing new plugins from remote update sites.
  • In addition, plugins can now contribute additional “ready-made applications” to the GATE Developer menus alongside the standard applications (ANNIE, etc.).
  • A new plugin named JAPE_Plus has been added. It contains a new JAPE execution engine that includes various optimizations and should be significantly faster than the standard engine. JAPE_Plus has not yet been comprehensively tested, so it should be considered beta software, and used with caution.
  • A new Java-based launcher has been implemented, replacing the use of Apache ANT for starting up GATE Developer. The GATE Developer application now behaves in a more natural way in dock-based desktop environments such as Mac OS X and Ubuntu Unity.
  • Improved the support for processing biomedical text by adding new PRs to incorporate the following tools: AbGene, the NormaGene tagger, the GENIA sentence splitter, MutationFinder and the Penn BioTagger (contains a tokenizer and three taggers for gene, malignancy and variation).
  • The Flexible Gazetteer PR has been rewritten to provide a better and faster implementation. The two parameters inputAnnotationSetName and outputAnnotationSetName have been renamed to inputASName and outputASName; however, old applications with the old parameters should still work.
  • Removal of deprecated functionality:
  • Various components were removed in this release as they have been unsupported and deprecated in previous releases:
  • The GATE Unicode Kit (GUK), which has been superseded by improved native support for localisation in the various target operating systems.
  • The database-backed datastore implementation.
  • The plugins Jape_Compiler (superseded by JAPE_Plus) and Ontology_OWLIM2.
  • In addition the Web_Search_Google, Web_Search_Yahoo and Web_Translate_Google plugins have been removed as the underlying web services on which they depend are no longer available. Documentation for obsolete plugins can be found in appendix C, and if you require any of them for your application please see plugins/Obsolete/README.TXT in the GATE Developer distribution.
  • Other enhancements and bug fixes:
  • CREOLE plugins can now use Apache Ivy to include third-party dependencies.
  • The Default ANNIE Gazetteer now allows a user to specify different annotation types to be used for annotating entries from different lists. For example, a user may want to find city names mentioned in a gazetteer list (e.g. city.lst) and annotate the matching strings as City.
  • The Segment Processing PR has two additional run-time parameters called segmentAnnotationFeatureName and segmentAnnotationFeatureValue. These parameters allow users to specify a constraint on feature name and feature value. If the user has provided values for these parameters, only the annotations with the specified feature name and feature value are processed with the Segment Processing PR. Also, the parameter controller has been renamed to analyser, which means the Segment Processing PR can now also run an individual PR on the specified segments.
  • The Hash Gazetteer now properly supports the caseSensitive parameter (previously the parameter could be set but had no effect).
  • The Document Reset PR now defaults to keeping the Key set as well as Original markups. This makes working with pre-annotated gold standard documents less dangerous (assuming you put the gold standard annotations in a set called Key).
  • Updated Stanford Parser plugin to version 1.6.8.
  • The TextCat based Language Identification PR now supports generating new language fingerprints.
  • Added support for reading XCAS and XMI-format documents created by UIMA.
  • Various improvements to the GATE Developer GUI:
  • Added support in the document editor to switch the principal text orientation, to better support documents written in right-to-left languages such as Arabic, Hebrew or Urdu.
  • Added new mouse shortcuts to the Annotation Stack view in the document editor to speed up the curation process.
  • The document editor layout is now saved to the user preferences file, gate.xml. This means that you can give this file to a new user so s/he will have a pre-configured document editor.
  • The script behind an instance of the Groovy Scripting PR can now be edited from within GATE Developer through a new visual resource which supports syntax highlighting.
  • The rule and phase names are now accessible in a JAPE Java RHS by the ruleName() and phaseName() methods and the name of the JAPE processing resource executing the JAPE transducer is accessible through the action context getPRName() method.
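
The feature constraint added to the Segment Processing PR can be pictured as a simple filter over the candidate segments. The following is a minimal sketch under stated assumptions, not the PR's actual code: segments are modelled as plain feature maps, the class SegmentFilter and its select method are hypothetical, and a null feature name is taken to mean "no constraint", which is how the description above reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class SegmentFilter {

    /**
     * Returns the segments that should be processed. Each segment is
     * represented only by its feature map. When featureName is null the
     * constraint is considered unset and every segment is kept (assumption).
     */
    public static List<Map<String, Object>> select(
            List<Map<String, Object>> segments, String featureName, Object featureValue) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> features : segments) {
            boolean matches = featureName == null
                    || (features.containsKey(featureName)
                        && Objects.equals(features.get(featureName), featureValue));
            if (matches) {
                kept.add(features);
            }
        }
        return kept;
    }
}
```

With segments carrying a hypothetical lang feature, select(segments, "lang", "en") would keep only the English segments, while select(segments, null, null) keeps them all.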

New in GATE 6.1 (Mar 12, 2013)

  • New CREOLE Plugins:
  • Tagger_Numbers to annotate many kinds of numbers in documents and determine their numeric values. The tagger can annotate numbers expressed in many forms including Arabic and Roman numerals, words (in English, French, German and Spanish) and scientific notation (4.3e6 = 4300000).
  • Tagger_Measurements to annotate many different forms of measurement expressions (“5.5 metres”, “1 minute 30 seconds”, “10 to 15 pounds”, etc.) along with their normalized values in SI units.
  • Tagger_Boilerpipe, which contains a boilerpipe-based PR for performing content detection.
  • Tagger_DateNormalizer to annotate and normalize dates within a document.
  • Schema_Tools providing a “Schema Enforcer” PR that can be used to create a clean output annotation set based on a set of annotation schemas.
  • Teamware_Tools providing a new PR called QA Summariser for Teamware. When documents are annotated using GATE Teamware, this PR can be used for generating a summary of agreements among annotators.
  • Tagger_MetaMap has been rewritten to make use of the new MetaMap Java API features. There are numerous performance enhancements and bug fixes detailed in section 16.1.2. Note that this version of the plugin is not compatible with the version provided in GATE 6.0, though this earlier version is still available in the Obsolete directory if required.
  • Other new features and improvements:
  • Added support for handling controller events to JAPE by making it possible to define ControllerStarted, ControllerFinished, and ControllerAborted code blocks in a JAPE file.
  • JAPE Java right-hand-side code can now access an ActionContext object through the predefined field ctx which allows access to the corpus LR and the transducer PR and their features.
  • Three new optional attributes can be specified in element of gate.xml or local configuration file:
  • addNamespaceFeatures - set to “true” to deserialize namespace prefix and URI information as features.
  • namespaceURI - The feature name to use that will hold the namespace URI of the element, e.g. “namespace”.
  • namespacePrefix - The feature name to use that will hold the namespace prefix of the element, e.g. “prefix”.
  • Setting these attributes will alter GATE’s default namespace deserialization behaviour to remove the namespace prefix and add it as a feature, along with the namespace URI. This allows namespace-prefixed elements in the Original markups annotation set to be matched with JAPE expressions, and also allows namespace scope to be added to new annotations when serialized to XML.
  • Searchable Serial Datastores (Lucene-based) are now portable and can be moved across different systems. Also, several GUI improvements have been made to ease the creation of Lucene datastores.
  • The populate method that allowed populating a corpus from a trecweb file has been made more generic to accept a tag. The method extracts content between the start and end of this tag to create new documents. In GATE Developer, right-clicking on an instance of the Corpus and choosing the option “Populate from Single Concatenated File” allows users to populate the corpus using this functionality.
  • Fixed a regression in the JAPE parser that prevented the use of RHS macros that refer to a LHS label (named blocks :label { ... } and assignments :label.Type = {}).
  • Enhanced the Groovy scriptable controller with some features inspired by the realtime controller, in particular the ability to ignore exceptions thrown by PRs and the ability to limit the running time of certain PRs.
  • The Ontology and Gazetteer_LKB plugins have been upgraded to use Sesame 3.2.3 and OWLIM 3.5.
  • The Websphinx Crawler PR (section 21.17) has new runtime parameters for controlling the maximum page size and spoofing the user-agent.
  • A few bug fixes and improvements to the “recover” logic of the packagegapp Ant task.
  • …and many other smaller bugfixes.
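
The generalised populate method described above creates one document per occurrence of a given tag in a single concatenated file. The sketch below shows that idea with a regular expression. It is only an illustration: the class ConcatenatedFileSplitter is hypothetical, and the choices to match the tag case-insensitively, to tolerate attributes on the opening tag, and to drop the tag itself from the extracted text are assumptions rather than documented GATE behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConcatenatedFileSplitter {

    /**
     * Returns the trimmed text found between each <tag ...> and </tag> pair,
     * in document order. Nested occurrences of the same tag are not handled.
     */
    public static List<String> split(String content, String tag) {
        String quoted = Pattern.quote(tag);
        Pattern pattern = Pattern.compile(
                "<" + quoted + "(?:\\s[^>]*)?>(.*?)</" + quoted + ">",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(content);
        List<String> documents = new ArrayList<>();
        while (matcher.find()) {
            documents.add(matcher.group(1).trim());
        }
        return documents;
    }

    public static void main(String[] args) {
        String file = "<DOC>first text</DOC>\n<DOC id=\"2\">second text</DOC>";
        System.out.println(split(file, "DOC")); // one entry per DOC element
    }
}
```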

New in GATE 6.0 (Mar 12, 2013)

  • Major new features:
  • Added an annotation tool for the document editor: the Relation Annotation Tool (RAT). It is designed to annotate a document with ontology instances and to create relations between annotations with ontology object properties. It is similar to and compatible with the Ontology Annotation Tool (OAT) but focuses on relations between annotations.
  • Added a new scriptable controller to the Groovy plugin, whose execution strategy is controlled by a simple Groovy DSL. This supports more powerful conditional execution than is possible with the standard conditional controllers (for example, based on the presence or absence of a particular annotation, or a combination of several document feature values), rich flow control using Groovy loops, etc.
  • A new version of the Alignment Editor has been added to the GATE distribution. It includes several new features such as a new alignment viewer, the ability to create alignment tasks and store them in XML files, three different views for aligning text (links view and matrix view, suitable for character, word and phrase alignments; parallel view, suitable for sentence or long text alignment), an alignment exporter and many more.
  • MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus. The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features.
  • A new plugin called Web_Translate_Google has been added with a PR called Google Translator PR in it. It allows users to translate text using the Google translation services.
  • New Gazetteer Editor for the ANNIE Gazetteer that can be used instead of Gaze. It uses tables instead of a text area to display the gazetteer definition and lists, and allows sorting on any column, filtering of the lists, reloading a list, etc.
  • Breaking changes:
  • This release contains a few small changes that are not backwards-compatible:
  • Changed the semantics of the ontology-aware matching mode in JAPE to take account of the default namespace in an ontology. Now class feature values that are not complete URIs will be treated as naming classes within the default namespace of the target ontology only, and not (as previously) any class whose URI ends with the specified name. This is more consistent with the way OWL normally works, as well as being much more efficient to execute.
  • Updated the WordNet plugin to support more recent releases of WordNet than 1.6. The format of the configuration file has changed; if you are using the previous WordNet 1.6 support you will need to update your configuration.
  • The deprecated Tagger_TreeTagger plugin has been removed; applications that used it will need to be updated to use the Tagger_Framework plugin instead.
  • Other new features and bugfixes:
  • The concept of templates has been introduced to JAPE. This is a way to declare named “variables” in a JAPE grammar that can contain placeholders that are filled in when the template is referenced.
  • Added a JAPE operator to get the string covered by a left-hand-side label and assign it to a feature of a new annotation on the right hand side.
  • Added a new API to the CREOLE registry to permit plugins that live entirely on the classpath. CreoleRegister.registerComponent instructs the registry to scan a single Java class for annotations, adding it to the set of registered plugins.
  • Maven artifacts for GATE are now published to the central Maven repository.
  • Bugfix: DocumentImpl no longer changes its stringContent parameter value whenever the document’s content changes. Among other things, this means that saved application states will no longer contain the full text of the documents in their corpus, and documents containing XML or HTML tags that were originally created from string content (rather than a URL) can now safely be stored in saved application states and the GATE Developer saved session.
  • A processing resource called Quality Assurance PR has been added in the Tools plugin. The PR wraps the functionality of the Quality Assurance Tool.
  • A new section for using the Corpus Quality Assurance from GATE Embedded has been written.
  • The Generic Tagger PR (in the Tagger_Framework plugin) now allows more flexible specification of the input to the tagger, and is no longer limited to passing just the “string” feature from the input annotations. See section 21.3 for details.
  • Added new parameters and options to the LingPipe Language Identifier PR (section 21.23.5), and corrected the documentation for the LingPipe POS Tagger (section 21.23.3).
  • In the document editor, fixed several exceptions so that text can be edited while annotations are highlighted. The annotations should now behave correctly as you edit, that is, move, expand or disappear according to the text insertions and deletions.
  • The document editor options read-only and insert append/prepend have been moved from the options dialogue to the document editor toolbar: click the triangle icon at the top right to display a menu with these options.
  • Added new parameters and options to the Crawl PR and document features to its output.
  • Fixed a bug where ontology-aware JAPE rules worked correctly when the target annotation’s class was a subclass of the class specified in the rule, but failed when the two class names matched exactly.
  • Improved support for conditional pipelines containing non-LanguageAnalyser processing resources.
  • Added the current Corpus to the script binding for the Groovy Script PR, allowing a Groovy script to access and set corpus-level features. Also added callbacks that a Groovy script can implement to do additional pre- or post-processing before the first and after the last document in a corpus.
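
The breaking change to ontology-aware matching in JAPE is easiest to see by comparing the old and new rules side by side. The sketch below is a simplified model, not GATE's matcher: the class ClassNameResolver and its methods are hypothetical, the "contains a colon" test for a complete URI is a crude assumption, and subclass matching (which the real matcher also performs) is deliberately left out.

```java
public class ClassNameResolver {

    /**
     * New behaviour: a value that is not a complete URI is taken to name a
     * class in the ontology's default namespace. The colon test below is a
     * crude stand-in for "is a complete URI" (assumption).
     */
    public static String resolve(String classFeatureValue, String defaultNamespace) {
        boolean completeUri = classFeatureValue.indexOf(':') > 0;
        return completeUri ? classFeatureValue : defaultNamespace + classFeatureValue;
    }

    /** New rule: exact match against the resolved URI (subclasses ignored here). */
    public static boolean matchesNew(String ruleValue, String annotationClassUri,
                                     String defaultNamespace) {
        return resolve(ruleValue, defaultNamespace).equals(annotationClassUri);
    }

    /** Old rule: any class whose URI ends with the specified name. */
    public static boolean matchesOld(String ruleValue, String annotationClassUri) {
        return annotationClassUri.endsWith(ruleValue);
    }

    public static void main(String[] args) {
        String ns = "http://example.org/onto#";          // hypothetical default namespace
        String other = "http://other.example/vocab#Person";
        System.out.println(matchesOld("Person", other));     // suffix match succeeds
        System.out.println(matchesNew("Person", other, ns)); // different namespace, no match
    }
}
```

Grammars that relied on the old suffix behaviour to match classes outside the default namespace would need to spell out the complete URI instead.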

New in GATE 5.2.1 (Mar 12, 2013)

  • Fixed some bugs with the automatic “create instance” feature in OAT (the ontology annotation tool) when used with the new Ontology plugin.
  • Added validation to datatype property values of the date, time and datetime types.
  • Fixed a bug with Gazetteer_LKB that prevented it working when the dictionaryPath contained spaces.
  • Added a utility class to handle common cases of encoding URIs for use in ontologies, and fixed the example code to show how to make use of this. See chapter 14 for details.
  • The annotation set transfer PR now copies the feature map of each annotation it transfers, rather than re-using the same FeatureMap (this means that when used to copy annotations rather than move them, the copied annotation is independent from the original and modifying the features of one does not modify the other). See section 21.14 for details.
  • The Log4J log files are now created by default in the .gate directory under the user’s home directory, rather than being created in the current directory when GATE starts, to be more friendly when GATE is installed in a shared location where the user does not have write permission.
  • This release also fixes some shortcomings in the Groovy support added by 5.2, in particular:
  • The corpora variable in the console now includes persistent corpora (loaded from a datastore) as well as transient corpora.
  • The subscript notation for annotation sets works with long values as well as ints, so someAS[annotation.start()..annotation.end()] works as expected.
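
The change to the annotation set transfer PR comes down to the difference between sharing a map and copying it. The sketch below shows that difference with plain java.util maps standing in for GATE's FeatureMap; the class FeatureMapCopy and its two methods are hypothetical, and the copy shown is shallow (the feature values themselves are still shared), which is an assumption about the level of copying rather than something the notes above state.

```java
import java.util.HashMap;
import java.util.Map;

public class FeatureMapCopy {

    /** Old behaviour: the transferred annotation re-uses the same map object. */
    public static Map<String, Object> shareFeatures(Map<String, Object> original) {
        return original;
    }

    /** New behaviour: the transferred annotation gets its own (shallow) copy. */
    public static Map<String, Object> copyFeatures(Map<String, Object> original) {
        return new HashMap<>(original);
    }

    public static void main(String[] args) {
        Map<String, Object> original = new HashMap<>();
        original.put("kind", "person");

        Map<String, Object> copied = copyFeatures(original);
        copied.put("kind", "location");
        System.out.println(original.get("kind")); // still "person"

        Map<String, Object> shared = shareFeatures(original);
        shared.put("kind", "date");
        System.out.println(original.get("kind")); // now "date"
    }
}
```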