Jericho HTML Parser Changelog

New in Jericho HTML Parser 3.2 (Mar 5, 2011)

  • Bug Fixes:
  • IllegalCharsetNameException thrown when illegal encoding specified in the document.
  • Potential multithreading bug in Source.getNewLine()
  • NullPointerException when run with stringent java.policy
  • TextExtractor did not include any attribute values.
  • All unterminated character references were decoded regardless of the configuration settings (bug introduced in 3.1).
  • Renderer class - under resulted in new line.
  • SourceFormatter did not handle TEXTAREA elements correctly.
  • No exceptions thrown if invalid charset is specified by server or in source document.
  • Byte order mark character was included in the source document.
  • HTML5 elements added to HTMLElementName and HTMLElements classes.
  • Detects HTML5 character encoding declaration.
  • Uses Windows-1252 as the default 8-bit encoding when available instead of the subset encoding ISO-8859-1.
  • Added Renderer.setIncludeAlternateText(boolean) method.
  • Added Renderer.renderAlternateText(StartTag) method.
  • Added Renderer.setIncludeFirstElementTopMargin(boolean) method.
  • Added Renderer.setDefaultTopMargin(String,int) static method.
  • Added Renderer.setDefaultBottomMargin(String,int) static method.
  • Added Renderer.setDefaultIndent(String,boolean) static method.
  • Renderer now evaluates inline styles for top, bottom and left margins.
  • Added Attribute.getStartTag() method.
  • Added Segment.getURIAttributes() method.
  • Added Segment.getStyleURISegments() method.
  • Added deregister() methods to the extended tag type classes.
  • Added MicrosoftConditionalCommentTagTypes class.
  • Added StartTagType.SERVER_COMMON_COMMENT tag type.
  • SourceFormatter now inlines DOCTYPE tags.
  • Added Segment.getMaxDepthIndicator() method.
  • Added static Config.IsHTMLEmptyElementTagRecognised parameter.
  • Deprecated MicrosoftTagTypes class.
  • Upgraded to the following logger APIs: slf4j-api-1.6.1, log4j-1.2.16
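
The encoding items above (illegal charset names no longer surfacing as exceptions, and Windows-1252 preferred over ISO-8859-1 as the default 8-bit encoding) can be illustrated with a minimal sketch that uses only java.nio.charset. The class EncodingFallback and its methods are hypothetical names chosen for this example and are not part of the Jericho API; the library's own logic may differ. Windows-1252 differs from ISO-8859-1 only in the byte range 0x80 to 0x9F, where it assigns printable characters instead of control codes.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration only; not part of the Jericho API.
final class EncodingFallback {
    private EncodingFallback() {}

    // Prefer Windows-1252 when this JVM provides it, otherwise use ISO-8859-1.
    static Charset defaultEightBitCharset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    // Resolve an encoding name declared by a server or document.
    // Illegal or unsupported names fall back to the default instead of throwing.
    static Charset resolve(String declaredName) {
        if (declaredName == null) {
            return defaultEightBitCharset();
        }
        try {
            return Charset.forName(declaredName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return defaultEightBitCharset();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("utf-8"));          // a supported name
        System.out.println(resolve("not a charset!")); // an illegal name, falls back
    }
}
```

Falling back silently is only one possible policy; a caller that needs to know about a bad declaration could log it before returning the default.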

New in Jericho HTML Parser 3.1 (Mar 31, 2010)

  • Bug Fixes:
  • [2793556] Infinite loop on Segment.getAllStartTags()
  • Infinite loop on Segment.getAllElements()
  • Segment.getFirst* methods returned segments outside the bounding segment.
  • Segment.getAllElements methods did not return all enclosed elements in some circumstances.
  • Fixed documentation errors in Segment.getAllElements methods.
  • Added StreamedSource class.
  • CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
  • Changed ParseText from class to interface.
  • Segment.getNodeIterator() now returns character references as separate nodes.
  • Added tag search methods based on attribute value regular expressions.
  • Added tag search methods based on HTML class attribute.
  • Added static Source.LegacyNodeIteratorCompatabilityMode property temporarily to restore Segment.getNodeIterator() functionality to that of previous versions.
  • Removed char[] based search methods in ParseText.
  • Added CharacterReference.appendCharTo(Appendable) method.
  • Added OutputDocument(Segment) constructor.
  • Added StreamedSourceCopy sample program.

New in Jericho HTML Parser 3.0 (Mar 31, 2010)

  • Requires a Java 5 or later runtime.
  • Bug Fixes:
  • Character references representing Unicode supplementary characters were not decoded correctly to UTF-16 code unit pairs.
  • [2188446] Element.getDepth() and Element.getParentElement() returned incorrect results if called in parse on demand mode.
  • Comments are now recognised inside elements.
  • API CHANGES THAT ARE NOT BACKWARD COMPATIBLE:
  • Changed package name to net.htmlparser.jericho
  • Attribute values must now be String rather than CharSequence.
  • Removed all deprecated methods/classes from previous versions.
  • All find* methods deprecated in favour of get* methods in order to apply a consistent naming convention across all tag search methods.
  • Tag, Element and HTMLElements classes no longer implement the HTMLElementName interface (use static import instead).
  • All collections now strongly typed using generics.
  • Changed FormControlOutputStyle class to enum.
  • Changed FormControlType class to enum.
  • Added CharStreamSource.appendTo(Appendable) method.
  • Added Source.iterator() method.
  • Source now implements Iterable.
  • Internally uses StringBuilder for better performance.
  • Added Source.getNextStartTag(StartTagType) method.
  • Added Source.getNextEndTag(EndTagType) method.
  • Added Source.getPreviousStartTag(StartTagType) method.
  • Added Source.getPreviousEndTag(EndTagType) method.
  • Added Segment.getAllStartTags(StartTagType) method.
  • Added all Segment.getFirst* methods.
  • Added Renderer.renderHyperlinkURL(StartTag) method.
  • Added HTMLSanitiser sample program.
  • Upgraded to slf4j-api-1.5.6
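
The supplementary character fix listed above concerns code points beyond U+FFFF, which need two UTF-16 code units in a Java String. A minimal sketch of decoding a numeric character reference correctly, using only the standard library, follows. NumericCharRef is a hypothetical class written for this example and is not the library's decoder; it handles only the strict "&#NNN;" and "&#xHHH;" forms.

```java
// Hypothetical helper for illustration only; not part of the Jericho API.
final class NumericCharRef {
    private NumericCharRef() {}

    // Decodes "&#NNN;" (decimal) or "&#xHHH;" (hexadecimal).
    // Returns null if the reference is malformed or names an invalid code point.
    static String decode(String ref) {
        if (ref == null || !ref.startsWith("&#") || !ref.endsWith(";")) {
            return null;
        }
        String digits = ref.substring(2, ref.length() - 1);
        int radix = 10;
        if (digits.startsWith("x") || digits.startsWith("X")) {
            radix = 16;
            digits = digits.substring(1);
        }
        // Reject empty input and explicit signs, which Integer.parseInt would accept.
        if (digits.isEmpty() || digits.charAt(0) == '+' || digits.charAt(0) == '-') {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(digits, radix);
            if (!Character.isValidCodePoint(codePoint)) {
                return null;
            }
            // Character.toChars yields one char for the BMP and a surrogate pair above U+FFFF.
            // A plain (char) cast would silently truncate supplementary code points.
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String clef = decode("&#x1D11E;"); // MUSICAL SYMBOL G CLEF
        System.out.println(clef.length()); // number of UTF-16 code units
        System.out.println(Integer.toHexString(clef.codePointAt(0)));
    }
}
```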

New in Jericho HTML Parser 2.6.1 (Mar 31, 2010)

  • Maven-only release to fix a corrupt MANIFEST.MF file.

New in Jericho HTML Parser 2.6 (Mar 31, 2010)

  • Bug Fixes:
  • [1906051] Exponential recursion when non-server tags are present inside attribute values during full sequential parse (introduced in v2.5).
  • [1927391] Renderer had indenting problems.
  • [1991529] Wrong encoding with DISPLAY_VALUE and select Tags.
  • An element whose start tag and end tag have different names, such as a Mason component called with content, had no end tag.
  • SourceFormatter did not preserve original indentation inside server tags as specified in documentation.
  • A start tag containing a server tag immediately before its closing delimiter was not parsed correctly.
  • StartTag.tidy() removed server tags outside of attribute values.
  • Nested elements formed from non-normal tag types were not parsed correctly.
  • CharStreamSourceUtil.toString(charStreamSource) broke if charStreamSource.getEstimatedMaximumOutputLength()

New in Jericho HTML Parser 2.5 (Mar 31, 2010)

  • Bug Fixes:
  • [1747493] RenderToText does not handle multiple correctly.
  • RenderToText does not handle whitespace after correctly.
  • Resetting to invalid mark exception during encoding detection.
  • INPUT elements of type "button" and "reset" incorrectly interpreted as form controls of type FormControlType.TEXT.
  • Valid end tags containing white space rejected.
  • Elements inside elements are now ignored, up until the first occurrence of the character sequence "

New in Jericho HTML Parser 2.4 (Aug 22, 2007)

  • Released under dual EPL/LGPL licence.
  • Bug Fixes:
  • [1583814] Indent method outputs multiple tags
  • [1576991] Bug in ConvertStyleSheets sample program
  • [1597587] various NPEs in findFormFields()
  • [1599700] Segment.findAllStartTags(attributeName...) infinite loop
  • Overlapping elements resulted in some elements being listed as a child of more than one parent element.
  • OutputDocument.writeTo(Writer) closed the writer.
  • Server tags no longer interfere with parsing of start tag attributes.
  • Added Renderer class and Segment.getRenderer() method.
  • Added TextExtractor class and Segment.getTextExtractor() method.
  • Deprecated Segment.extractText methods.
  • Added SourceFormatter class and Source.getSourceFormatter() method.
  • Deprecated Source.indent method.
  • Added Logger interface along with the related LoggerProvider interface and BasicLoggerProvider and WriterLogger classes.
  • Added Source.setLogger(Logger) and Source.getLogger() methods.
  • Deprecated Source.setLogWriter(Writer) and Source.getLogWriter() methods.
  • Added Source.findNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive) method.
  • Added Segment.findAllElements(String attributeName, String value, boolean valueCaseSensitive) method.
  • Calling the ignoreWhenParsing methods on overlapping segments no longer results in an OverlappingOutputSegmentsException.
  • Added CharacterReference.getEncodingFilterWriter(Writer) method.
  • Added CharacterReference.encode(char) method.
  • Added Source.getNewLine() method.
  • Added static Config.NewLine parameter.
  • All text output now uses Config.NewLine instead of hard-coded '\n'.
  • Source.fullSequentialParse() method no longer parses the source again if it has already been called.
  • Some methods that require the parsing of the entire source now call Source.fullSequentialParse() automatically.
  • Some changes to the output of various getDebugInfo() methods.
  • Added categorised class list in javadoc.
  • Removed all methods/constants deprecated in 2.0.
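
The idea behind an encoding filter writer, as added in this release through CharacterReference.getEncodingFilterWriter(Writer), can be sketched with java.io.FilterWriter. EscapingWriter below is a hypothetical, much simplified stand-in written for this example: it escapes only four characters, whereas the library's own encoding rules are not reproduced here. It also only flushes and never closes the underlying writer on its own, in the spirit of the OutputDocument.writeTo(Writer) fix listed above.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical, simplified sketch; not the Jericho implementation.
final class EscapingWriter extends FilterWriter {
    EscapingWriter(Writer out) {
        super(out);
    }

    // Escape the four characters that are significant in HTML text and attribute values.
    @Override
    public void write(int c) throws IOException {
        switch (c) {
            case '&': out.write("&amp;"); break;
            case '<': out.write("&lt;"); break;
            case '>': out.write("&gt;"); break;
            case '"': out.write("&quot;"); break;
            default: out.write(c);
        }
    }

    // Route bulk writes through the single-character method so nothing bypasses escaping.
    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(buf[i]);
        }
    }

    @Override
    public void write(String s, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(s.charAt(i));
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter target = new StringWriter();
        Writer w = new EscapingWriter(target);
        w.write("a<b & \"c\"");
        w.flush(); // flush only; the caller decides when to close the target
        System.out.println(target);
    }
}
```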