New in Jericho HTML Parser 3.2 (Mar 5, 2011)
- Bug Fixes:
- IllegalCharsetNameException thrown when illegal encoding specified in the document.
- Potential multithreading bug in Source.getNewLine()
- NullPointerException when run with stringent java.policy
- TextExtractor did not include any attribute values.
- All unterminated character references were decoded regardless of the configuration settings (bug introduced in 3.1).
- Renderer class - under resulted in new line.
- SourceFormatter did not handle TEXTAREA elements correctly.
- No exceptions thrown if invalid charset is specified by server or in source document.
- Byte order mark character was included in the source document.
- HTML5 elements added to HTMLElementName and HTMLElements classes.
- Detects HTML5 character encoding declaration.
- Uses Windows-1252 as the default 8-bit encoding when available instead of the subset encoding ISO-8859-1.
- Added Renderer.setIncludeAlternateText(boolean) method.
- Added Renderer.renderAlternateText(StartTag) method.
- Added Renderer.setIncludeFirstElementTopMargin(boolean) method.
- Added Renderer.setDefaultTopMargin(String,int) static method.
- Added Renderer.setDefaultBottomMargin(String,int) static method.
- Added Renderer.setDefaultIndent(String,boolean) static method.
- Renderer now evaluates inline styles for top, bottom and left margins.
- Added Attribute.getStartTag() method.
- Added Segment.getURIAttributes() method.
- Added Segment.getStyleURISegments() method.
- Added deregister() methods to the extended tag type classes.
- Added MicrosoftConditionalCommentTagTypes class.
- Added StartTagType.SERVER_COMMON_COMMENT tag type.
- SourceFormatter now inlines DOCTYPE tags.
- Added Segment.getMaxDepthIndicator() method.
- Added static Config.IsHTMLEmptyElementTagRecognised parameter.
- Deprecated MicrosoftTagTypes class.
- Upgraded to the following logger APIs: slf4j-api-1.6.1, log4j-1.2.16
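The switch to Windows-1252 as the default 8-bit encoding can be illustrated with the JDK alone. The following is a minimal sketch, not the parser's own detection code: the class and method names (DefaultEncodingSketch, defaultEightBitCharset, decode) are hypothetical, and it assumes only that the fallback works as the entry above describes. It shows why the choice matters: bytes in the range 0x80-0x9F are printable characters in Windows-1252 but control characters in ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultEncodingSketch {
    // Prefer Windows-1252 when this JVM supports it, otherwise fall back to
    // ISO-8859-1, which every Java platform is required to provide.
    static Charset defaultEightBitCharset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] euroInWindows1252 = {(byte) 0x80};
        Charset charset = defaultEightBitCharset();
        String decoded = decode(euroInWindows1252, charset);
        // Print the charset name and the decoded code point in hex, so the
        // result is readable regardless of the console encoding.
        System.out.println(charset.name() + " -> U+"
                + Integer.toHexString(decoded.codePointAt(0)).toUpperCase());
    }
}
```

Documents that declare ISO-8859-1 but were actually authored in Windows-1252 are common, which is presumably why the larger encoding is the safer default for undeclared 8-bit content.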
New in Jericho HTML Parser 3.1 (Mar 31, 2010)
- Bug Fixes:
- [2793556] Infinite loop on Segment.getAllStartTags()
- Infinite loop on Segment.getAllElements()
- Segment.getFirst* methods returned segments outside the bounding segment.
- Segment.getAllElements methods did not return all enclosed elements in some circumstances.
- Fixed documentation errors in Segment.getAllElements methods.
- Added StreamedSource class.
- CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
- Changed ParseText from class to interface.
- Segment.getNodeIterator() now returns character references as separate nodes.
- Added tag search methods based on attribute value regular expressions.
- Added tag search methods based on HTML class attribute.
- Added static Source.LegacyNodeIteratorCompatabilityMode property temporarily to restore Segment.getNodeIterator() functionality to that of previous versions.
- Removed char[] based search methods in ParseText.
- Added CharacterReference.appendCharTo(Appendable) method.
- Added OutputDocument(Segment) constructor.
- Added StreamedSourceCopy sample program.
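The two kinds of tag search added in this release (by HTML class attribute and by attribute value regular expression) rest on simple matching rules that can be sketched with the JDK alone. This is only an illustration under stated assumptions, not the library's implementation: the class and method names (TagSearchSketch, hasClass, valueMatches) are hypothetical, the class attribute is assumed to be a whitespace-separated list of case-sensitive tokens, and the regular expression is assumed to be applied to the whole attribute value.

```java
import java.util.regex.Pattern;

final class TagSearchSketch {
    // Returns true if the class attribute value contains className as one of
    // its whitespace-separated tokens (a substring match is not enough).
    static boolean hasClass(String classAttributeValue, String className) {
        if (classAttributeValue == null || className.isEmpty()) {
            return false;
        }
        for (String token : classAttributeValue.trim().split("\\s+")) {
            if (token.equals(className)) {
                return true;
            }
        }
        return false;
    }

    // Returns true if the entire attribute value matches the pattern.
    // Whole-value matching is an assumption made for this sketch.
    static boolean valueMatches(String attributeValue, Pattern pattern) {
        return attributeValue != null && pattern.matcher(attributeValue).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasClass("nav main", "main"));
        System.out.println(hasClass("navigation", "nav"));
        System.out.println(valueMatches("item-42", Pattern.compile("item-\\d+")));
    }
}
```

Token-based matching is the reason a dedicated class search is useful: a plain attribute value comparison would miss elements that carry more than one class.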
New in Jericho HTML Parser 3.0 (Mar 31, 2010)
- Requires runtime Java 5 or later
- Bug Fixes:
- Character references representing unicode supplementary characters were not decoded correctly to UTF-16 code unit pairs.
- [2188446] Element.getDepth() and Element.getParentElement() returned incorrect results if called in parse on demand mode.
- Comments are now recognised inside elements.
- API CHANGES THAT ARE NOT BACKWARD COMPATIBLE:
- Changed package name to net.htmlparser.jericho
- Attribute values must now be String rather than CharSequence.
- Removed all deprecated methods/classes from previous versions.
- All find* methods deprecated in favour of get* methods in order to apply a consistent naming convention across all tag search methods.
- Tag, Element and HTMLElements classes no longer implement the HTMLElementName interface. (use static import instead)
- All collections now strongly typed using generics.
- Changed FormControlOutputStyle class to enum.
- Changed FormControlType class to enum.
- Added CharStreamSource.appendTo(Appendable) method.
- Added Source.iterator() method.
- Source now implements Iterable.
- Internally uses StringBuilder for better performance.
- Added Source.getNextStartTag(StartTagType) method.
- Added Source.getNextEndTag(EndTagType) method.
- Added Source.getPreviousStartTag(StartTagType) method.
- Added Source.getPreviousEndTag(EndTagType) method.
- Added Segment.getAllStartTags(StartTagType) method.
- Added all Segment.getFirst* methods.
- Added Renderer.renderHyperlinkURL(StartTag) method.
- Added HTMLSanitiser sample program.
- Upgraded to slf4j-api-1.5.6
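The supplementary character fix listed above concerns code points beyond U+FFFF, which Java strings store as a pair of UTF-16 code units (a surrogate pair). The following is a minimal sketch of decoding a numeric character reference correctly using only the JDK. It is not the parser's own decoder: the class and method names (SupplementaryCharSketch, decodeNumericReference) are hypothetical, and it handles only well-formed decimal and hexadecimal references.

```java
final class SupplementaryCharSketch {
    // Decodes a numeric character reference such as "&#65;" or "&#x1D11E;".
    // StringBuilder.appendCodePoint emits a surrogate pair automatically for
    // code points above U+FFFF, and rejects invalid code points.
    static String decodeNumericReference(String reference) {
        if (!reference.startsWith("&#") || !reference.endsWith(";")) {
            throw new IllegalArgumentException("Not a numeric character reference: " + reference);
        }
        String body = reference.substring(2, reference.length() - 1);
        boolean hex = body.startsWith("x") || body.startsWith("X");
        int codePoint = hex
                ? Integer.parseInt(body.substring(1), 16)
                : Integer.parseInt(body);
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
        String clef = decodeNumericReference("&#x1D11E;");
        System.out.println("code units: " + clef.length());
        System.out.println("code points: " + clef.codePointCount(0, clef.length()));
    }
}
```

Casting such a code point to a single char would silently truncate it, which is the kind of error the bug fix describes.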
New in Jericho HTML Parser 2.6.1 (Mar 31, 2010)
- MAVEN release only to fix corrupt MANIFEST.MF file.
New in Jericho HTML Parser 2.6 (Mar 31, 2010)
- Bug Fixes:
- [1906051] Exponential recursion when non-server tags are present inside attribute values during full sequential parse (introduced in v2.5).
- [1927391] Renderer had indenting problems.
- [1991529] Wrong encoding with DISPLAY_VALUE and select Tags.
- An element whose start tag and end tag have different names, such as a Mason component called with content, had no end tag.
- SourceFormatter did not preserve original indentation inside server tags as specified in documentation.
- A start tag containing a server tag immediately before its closing delimiter was not parsed correctly.
- StartTag.tidy() removed server tags outside of attribute values.
- Nested elements formed from non-normal tag types were not parsed correctly.
- CharStreamSourceUtil.toString(charStreamSource) broke if charStreamSource.getEstimatedMaximumOutputLength()
New in Jericho HTML Parser 2.5 (Mar 31, 2010)
- Bug Fixes:
- [1747493] RenderToText does not handle multiple correctly.
- RenderToText does not handle whitespace after correctly.
- Resetting to invalid mark exception during encoding detection.
- INPUT elements of type "button" and "reset" incorrectly interpreted as form controls of type FormControlType.TEXT.
- Valid end tags containing white space rejected.
- Elements inside elements are now ignored, up until the first occurrence of the character sequence "
New in Jericho HTML Parser 2.4 (Aug 22, 2007)
- Released under dual EPL/LGPL licence.
- Bug Fixes:
- [1583814] Indent method outputs multiple tags
- [1576991] Bug in ConvertStyleSheets sample program
- [1597587] various NPEs in findFormFields()
- [1599700] Segment.findAllStartTags(attributeName...) infinite loop
- Overlapping elements resulted in some elements being listed as a child of more than one parent element.
- OutputDocument.writeTo(Writer) closed the writer.
- Server tags no longer interfere with parsing of start tag attributes.
- Added Renderer class and Segment.getRenderer() method.
- Added TextExtractor class and Segment.getTextExtractor() method.
- Deprecated Segment.extractText methods.
- Added SourceFormatter class and Source.getSourceFormatter() method.
- Deprecated Source.indent method.
- Added Logger interface along with the related LoggerProvider interface and BasicLoggerProvider and WriterLogger classes.
- Added Source.setLogger(Logger) and Source.getLogger() methods.
- Deprecated Source.setLogWriter(Writer) and Source.getLogWriter() methods.
- Added Source.findNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive) method.
- Added Segment.findAllElements(String attributeName, String value, boolean valueCaseSensitive) method.
- Calling the ignoreWhenParsing methods on overlapping segments no longer results in an OverlappingOutputSegmentsException.
- Added CharacterReference.getEncodingFilterWriter(Writer) method.
- Added CharacterReference.encode(char) method.
- Added Source.getNewLine() method.
- Added static Config.NewLine parameter.
- All text output now uses Config.NewLine instead of hard-coded '\n'.
- Source.fullSequentialParse() method no longer parses the source again if it has already been called.
- Some methods that require the parsing of the entire source now call Source.fullSequentialParse() automatically.
- Some changes to the output of various getDebugInfo() methods.
- Added categorised class list in javadoc.
- Removed all methods/constants deprecated in 2.0.
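The newline-related additions in this release (Source.getNewLine() and Config.NewLine) separate two concerns: detecting which line terminator a document uses, and choosing which one to write. The following JDK-only sketch illustrates both under stated assumptions. It is not the library's code: the class and method names (NewLineSketch, detectNewLine, joinLines) are hypothetical, and returning null when no terminator is found is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;

final class NewLineSketch {
    // Returns the first line terminator found in the text ("\r\n", "\n" or "\r"),
    // or null if the text contains no line terminator at all.
    static String detectNewLine(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean followedByLineFeed = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return followedByLineFeed ? "\r\n" : "\r";
            }
        }
        return null;
    }

    // Joins lines with a configurable terminator instead of a hard-coded '\n'.
    static String joinLines(List<String> lines, String newLine) {
        return String.join(newLine, lines);
    }

    public static void main(String[] args) {
        String source = "<p>one</p>\r\n<p>two</p>\r\n";
        String detected = detectNewLine(source);
        // Fall back to the platform terminator when the source has none.
        String newLine = detected != null ? detected : System.lineSeparator();
        String output = joinLines(Arrays.asList("<p>one</p>", "<p>two</p>"), newLine);
        System.out.println(output.contains("\r\n"));
    }
}
```

Reusing the detected terminator keeps generated output consistent with the original document, while a single configurable default covers documents that contain no line breaks.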