HTML Cleaner Changelog

What's new in HTML Cleaner 2.22

Jul 19, 2018

202 Test suite fails with Java 9: Transformer changed behaviour
200 Adds null end of the DOCTYPE when there is no DOCTYPE
199 "white-space: pre" CSS property not taken into account
198 Make XPath methods protected to allow extension
193 MathML equations without a namespace prefix are breaking paragraphs
192 Support HTML5 inline content without namespaces
191 Infinite loop on time b li
190 NullPointerException in HtmlCleaner.makeTree
189 Unclosed CDATA sections result in odd behaviour with large documents

New in HTML Cleaner 2.21 (Aug 7, 2017)

New in HTML Cleaner 2.20 (Aug 7, 2017)

New in HTML Cleaner 2.19 (Aug 7, 2017)

New in HTML Cleaner 2.18 (Nov 8, 2016)

New in HTML Cleaner 2.17 (Nov 8, 2016)

New in HTML Cleaner 2.16 (Dec 6, 2015)

New in HTML Cleaner 2.15 (Nov 27, 2015)

New in HTML Cleaner 2.14 (Aug 30, 2015)

New in HTML Cleaner 2.13 (Aug 30, 2015)

New in HTML Cleaner 2.12 (May 20, 2015)

New in HTML Cleaner 2.11 (May 20, 2015)

New in HTML Cleaner 2.10 (Nov 4, 2014)

New in HTML Cleaner 2.9 (Aug 26, 2014)

New in HTML Cleaner 2.8 (Aug 26, 2014)

New in HTML Cleaner 2.7 (Dec 11, 2013)

New in HTML Cleaner 2.6.1 (Sep 5, 2013)

New in HTML Cleaner 2.6 (Sep 5, 2013)

New in HTML Cleaner 2.5 (May 16, 2013)

New in HTML Cleaner 2.4 (May 16, 2013)

New in HTML Cleaner 2.2.1 (May 16, 2013)

New in HTML Cleaner 2.2 (Mar 14, 2013)

HtmlCleaner is now thread-safe. A single instance can be used from multiple threads to parse multiple HTML sources safely. All serializers that come with the package are thread-safe as well.
HTML-based serializers are introduced, intended to produce browser-friendly HTML. There are now two serializer flavors: XML (simple, pretty, compact) and HTML (simple, pretty, compact). The HTML serializers do not strictly produce well-formed XML, but rather HTML for further browser consumption (for example, special entities like &Alpha; are preserved, not escaped as &amp;Alpha;, and empty tags like script are not serialized as <script /> but rather as <script></script>).
New parameter transResCharsToNCR is introduced, telling whether reserved XML characters (&, <, >, ", ') are serialized to their Numeric Character Representations (&#dd;).
New parameter transSpecialEntitiesToNCR is introduced, telling whether special HTML entities (&Alpha; for example) are serialized to their Numeric Character Representations (&#dd;).
Parameter omitHtmlEnvelope is deprecated. It is replaced by the new parameter omitEnvelope in command line/Ant and by an optional parameter in the methods XXXSerializer.writeToXXX(), which moves this logic to the right place. This way the whole body without enclosing tags is serialized, not only the first inner node as before.
The list of special HTML entities is extended with a number of new ones. The class SpecialEntity that holds them has a public method addEntity(entityName, entityCode) to define new ones if some are still missing.
TagNode has a number of new methods for easier node manipulation (see API docs).
A visitor concept is implemented in TagNode to make it easy to traverse the DOM tree and collect data or update the document.
Pretty XML/HTML serializers have an optional constructor parameter specifying the indentation string (default is the TAB character).
Tag definitions updated (col, legend...) to be consistent with browsers.
Invalid XML characters are skipped during parsing/serialization.
DOM/JDom serialization bug fixes.
Namespaces found in source HTML are now handled properly (depending on omitXmlnsAttributes parameter).
Method HtmlCleaner.getAllTags() is removed, since this approach is incompatible with the newly introduced thread safety.
A few classes are renamed: ContentToken -> ContentNode, CommentToken -> CommentNode.
Parameter ignoreQuestAndExclam now defaults to true.
Source code now has the standard Maven structure.
HtmlCleaner now depends on Java runtime 1.5+.
For the list of fixed bugs
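
To make the transResCharsToNCR note above more concrete, here is a minimal sketch of what serializing reserved XML characters to decimal numeric character references (&#dd;) can look like. It is an illustration only and not HtmlCleaner's own code: the class NcrSketch and the method toNcr are hypothetical names, and the sketch assumes the reserved set is exactly &, <, >, " and '.

```java
// Illustrative sketch only; NcrSketch and toNcr are hypothetical names,
// not part of the HtmlCleaner API.
final class NcrSketch {

    /**
     * Replaces each reserved XML character (&, <, >, ", ') with its
     * decimal numeric character reference, e.g. '&' becomes "&#38;".
     * All other characters are copied through unchanged.
     */
    static String toNcr(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                case '<':
                case '>':
                case '"':
                case '\'':
                    // (int) c is the character's code point in decimal.
                    out.append("&#").append((int) c).append(';');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the sample markup with every reserved character as an NCR.
        System.out.println(toNcr("<a href=\"x\">Tom & Jerry</a>"));
    }
}
```

With the parameter switched off, a serializer would presumably emit named entities such as &amp; instead; the sketch covers only the numeric form described in the note.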

New in HTML Cleaner 2.1 (Mar 14, 2013)

New in HTML Cleaner 2.0 (Mar 14, 2013)

New in HTML Cleaner 1.6 (Mar 14, 2013)

New in HTML Cleaner 1.55 (Mar 14, 2013)