HTML Cleaner Changelog

New in HTML Cleaner 2.22 (Jul 19, 2018)

  • 202 Test suite fails with Java 9: Transformer changed behaviour
  • 200 Adds null end of the DOCTYPE when there is no DOCTYPE
  • 199 "whitespace: pre" CSS property not taken into account
  • 198 Make XPath methods protected to allow extension
  • 193 MathML equations without a namespace prefix are breaking paragraphs
  • 192 Support HTML5 inline content without namespaces
  • 191 Infinite loop on time b li
  • 190 NullPointerException in HtmlCleaner.makeTree
  • 189 Unclosed CDATA sections result in odd behaviour with large documents

New in HTML Cleaner 2.21 (Aug 7, 2017)

  • Bug fix #188 Regression in 2.20, "prune"-tags do not get removed anymore

New in HTML Cleaner 2.20 (Aug 7, 2017)

  • Enhancement - add ability to output to an Ant property from HtmlCleaner.
  • Enhancement 186 Add strict error checking flag as optional DomSerializer constructor
  • Bug fix 185 Unclosed CDATA can cause ArrayIndexOutOfBoundsException
  • Bug fix 175 HTMLCleaner generates invalid attribute names from bad HTML
  • Bug fix 125 Html elements are doubled sometimes
  • Bug fix 57 Use first attribute if duplicated
  • Bug fix 51 setUseCdataForScriptAndStyle should apply on HtmlSerializer

New in HTML Cleaner 2.19 (Aug 7, 2017)

  • Bug fix 183 Real world html causes clean() to eat all available memory
  • Bug fix 180 duplicate tags added at the end (script,body,html)
  • Bug fix 173 Infinite loop and OOM if uppercase P tag with xmlns
  • Bug fix 172 Infinite loop occurs when MathML tags are present
  • Bug fix 169 Several issues with CDATA blocks
  • Bug fix 168 DomSerializer doesn't seem to take into account the namespacesAware configuration
  • Enhancement 167 Make it easier to extend DomSerializer
  • Bug fix 166 Bad handling of <p> inside <ul> in HTML 5
  • Bug fix 164 Font tag is not known anymore
  • Enhancement 159 Add back in Utils.fullUrl()
  • Bug fix 158 NullPointerException in HtmlCleaner.saveToLastOpenTag

New in HTML Cleaner 2.18 (Nov 8, 2016)

  • Bug fix 179 java -jar option does not work

New in HTML Cleaner 2.17 (Nov 8, 2016)

  • Bug fix 178 java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.htmlcleaner.TagNode
  • Bug fix 176 Crash: IllegalArgumentException in convertToUnicode
  • Bug fix 165 Invalid HTML characters are not removed

New in HTML Cleaner 2.16 (Dec 6, 2015)

  • Bug fix 157 Infinite loop occurs
  • Bug fix 156 style element should not always be moved to head in HTML5
  • Bug fix 155 Memory and resource blowup on particular documents
  • Bug fix 154 single quote character must not get serialized as "&apos;" by html serializers
  • Bug fix 153 NullPointerException when DOCTYPE doesn't contain a qualifiedName
  • Bug fix 130 Apply recognizeUnicodeChars property when cleaning
  • Bug fix 118 HTML always translating special entities

New in HTML Cleaner 2.15 (Nov 27, 2015)

  • New Feature FR20 Added useCdataFor parameter
  • Bug Fix 152 Destruction of Unicode characters above 65535
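
As background for bug 152: characters above 65535 occupy two Java chars (a surrogate pair), so code that processes text one char at a time can split them. The sketch below is a generic plain-JDK illustration of code-point-aware handling. It is an assumption about the general class of problem, not a description of the actual HtmlCleaner fix, and the names CodePointSketch and toNumericReferences are hypothetical.

```java
// Illustration only: plain JDK code, not taken from HtmlCleaner.
class CodePointSketch {

    // Converts every non-ASCII character to a decimal numeric character
    // reference, walking the string by code point so that surrogate pairs
    // are kept together as a single character.
    static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies above 65535 and needs two chars.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // UTF-16 chars
        System.out.println(clef.codePointCount(0, clef.length())); // code points
        System.out.println(toNumericReferences("G " + clef));
    }
}
```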

New in HTML Cleaner 2.14 (Aug 30, 2015)

  • 149 StackOverflowError
  • 148 Giving mixed-case filenames doesn't work on case-sensitive filesystems
  • 147 Correction of ul structure
  • 146 2.13 does not correct table structure
  • 144 schema.org elements such as meta and link are removed
  • 140 CRITICAL: endless loop in some tags (ref #129, #126)
  • 139 option tag displayed after optgroup
  • 136 ClassCastException

New in HTML Cleaner 2.13 (Aug 30, 2015)

  • Fixed issue 129 Defining required parent for <legend> element causes out-of-memory error
  • Fixed issue 126 Infinite loop on HTML parsing
  • Fixed issue 138 label tags are removed (fatalTag problem 2)
  • Fixed issue 141 OutOfMemory error

New in HTML Cleaner 2.12 (May 20, 2015)

  • Fixed issue 137 Options tags are removed inside select

New in HTML Cleaner 2.11 (May 20, 2015)

  • Feature 19: Support use of stdin and stdout for pipes on command line
  • Feature 10: Make OSGI-compatible bundle
  • Feature 15: Improved HTML5 support
  • Fixed issue 135: Some pages cause two different NullPointerExceptions
  • Fixed issue 134: Some pages cause IndexOutOfBoundsException
  • Fixed issue 133: Some pages cause NullPointerException
  • Fixed issue 132: ClassCastException: ArrayList cannot be cast to org.htmlcleaner.BaseToken

New in HTML Cleaner 2.10 (Nov 4, 2014)

  • Feature 16: Make DefaultTagProvider extendable
  • Fixed issue 128: Regression: legend tag is stripped
  • Fixed issue 127: DomSerializer loses all attributes of root node
  • Fixed issue 126: Infinite loop on HTML parsing

New in HTML Cleaner 2.9 (Aug 26, 2014)

  • Feature 14: Added "silent mode" feature. Use --quiet to turn off output.
  • Fixed Issue 124: Class cast exception
  • Fixed Issue 123: Endless loop in meta tags
  • Fixed Issue 121: Should be possible to exclude "meta" tag
  • Fixed Issue 119: Tag combination causes internal loop
  • Fixed Issue 117: Parsing of CSS content property incorrect
  • Fixed Issue 116: Result XML different between DomSerializer and XmlSerializer
  • Fixed Issue 115: Recognise and remove HTML namespaces
  • Fixed Issue 114: Odd behaviour when using namespaces
  • Fixed Issue 113: PATCH - limit the number of times identical tags can be copied forward
  • Fixed Issue 112: HTML5 tags missing from DefaultTagProvider
  • Fixed Issue 111: STRONG in DefaultTagProvider isn't used correctly when constructing CLOSE_BEFORE_COPY_INSIDE_TAGS
  • Fixed Issue 103: Attributes of HTML element are stripped under some circumstances
  • Applied Patch 16: Patch for deserializing entities when reading HTML

New in HTML Cleaner 2.8 (Aug 26, 2014)

  • Fixed Issue 110: Performance problem takes CPU to 100% [org.htmlcleaner.XPather]
  • Fixed Issue 109: DomSerializer does not properly tag html ID attribute
  • Fixed Issue 107: Remove redundant escaping code from HtmlSerializer
  • Fixed Issue 106: JDOMSerializer fails unless you setUseCdataForScriptAndStyle=false
  • Fixed Issue 105: Element names in other namespaces than HTML should not be lowercased
  • Fixed Issue 104: svg:style rules should not be aggregated in the html:head section
  • Fixed Issue 82: Remove block-level restriction for tags

New in HTML Cleaner 2.7 (Dec 11, 2013)

  • Added a new desktop app for use with HtmlCleaner (thanks to Marton Szeles)
  • Fixed Issue 99: SVG elements in HTML are incorrectly modified
  • Fixed Issue 98: char sequence &; will be treated as SpecialEntity
  • Fixed Issue 97: BrowserCompactXmlSerializer erroneous whitespace handling for inline tags
  • Fixed Issue 95: NPE when trying to use JDomSerializer
  • Fixed Issue 93: Invalid cleaned HTML when empty DIV
  • Fixed Issue 89: Raw List type in tagnode.getElementList(condition, recursive)
  • Fixed Issue 88: Illegal character escaping in attributes values
  • Fixed Issue 87: Reinstate the HtmlSerializers
  • Fixed Issue 67: New line after XML declaration is wrongly taken into account
  • Fixed Issue 33: CDATA blocks are not recognized

New in HTML Cleaner 2.6.1 (Sep 5, 2013)

  • Fixed Issue 90: Re-instating the HtmlCleaner's public instance method clean(Reader)

New in HTML Cleaner 2.6 (Sep 5, 2013)

  • Fixed Issue 86: Thread safety
  • Fixed Issue 85: String.isEmpty not supported on Android 2.2 -> java.lang.NoSuchMethodError
  • Fixed Issue 84: HTMLCleaner 2.5 no longer ignores CDATA not in script/style elements
  • Fixed Issue 76: Make Ant dependency optional
  • Fixed Issue 27: DomSerializer ignores the doctype
  • Fixed Issue 81: ConfigFileTagProvider, DefaultTagProvider out of sync

New in HTML Cleaner 2.5 (May 16, 2013)

  • Fixed Issue 77: HeadlessTagNode Constructor Does Not Correctly Copy Wrapped TagNode's Children
  • Fixed Issue 69: leaking resources - connection not closed
  • Fixed Issue 67: New line after XML declaration is wrongly taken into account
  • Fixed Issue 58/62: xml: namespace error on DomSerializer
  • Fixed Issue 55: Doctype upper case, name and validation
  • Fixed Issue 52: Bad serialization of HTML5 DOCTYPE clauses
  • Fixed Issue 48: Multiple requests for the HTML page in 2.2
  • DocType handling has been significantly enhanced with support for parsing all currently valid DocTypes and providing additional information via the Java API.

New in HTML Cleaner 2.4 (May 16, 2013)

  • This is a major merge of the Github fork into the core HtmlCleaner code.

New in HTML Cleaner 2.2.1 (May 16, 2013)

  • An issue with Hex-based character encoding was fixed.

New in HTML Cleaner 2.2 (Mar 14, 2013)

  • HtmlCleaner is now thread-safe. A single instance can be used from multiple threads to parse multiple HTML sources safely. All serializers shipped in the package are thread-safe as well.
  • HTML-based serializers are introduced, intended to produce browser-friendly HTML. There are now two serializer flavors: XML (simple, pretty, compact) and HTML (simple, pretty, compact). The HTML serializers don't strictly produce well-formed XML, but rather HTML for further browser consumption (for example, special entities like &Alpha; are preserved, not escaped as &amp;Alpha;, and empty tags like script are not serialized as <script/> but rather as <script></script>).
  • New parameter transResCharsToNCR is introduced, telling whether reserved XML characters (&, ", ', <, >) are serialized to their Numeric Character Representations (&#dd;)
  • New parameter transSpecialEntitiesToNCR is introduced, telling whether special HTML entities (&Alpha; for example) are serialized to their Numeric Character Representations (&#dd;)
  • Parameter omitHtmlEnvelope is deprecated. A new parameter omitEnvelope in the command line/Ant, and an optional parameter in the methods XXXSerializer.writeToXXX(), is introduced instead, moving this logic to the right place. This way the whole body without enclosing tags is serialized, not only the first inner node as before.
  • List of special HTML entities is extended with a number of new ones. Class SpecialEntity holding them has a public method addEntity(entityName, entityCode) to define new ones if some are still missing.
  • TagNode has a number of new methods for easier node manipulation (see API docs)
  • Visitor concept is implemented in TagNode in order to easily traverse DOM tree and collect some data/update the document.
  • Pretty XML/HTML serializers have optional parameter in constructors specifying indentation string (default is TAB character).
  • Tag definitions updated (col, legend...) to be consistent with the browsers.
  • Invalid XML characters are skipped during parsing/serialization.
  • DOM/JDom serialization bug fixes.
  • Namespaces found in source HTML are now handled properly (depending on omitXmlnsAttributes parameter).
  • Method HtmlCleaner.getAllTags() is removed, since this approach doesn't go with the introduced thread-safety.
  • A few classes are renamed: ContentToken -> ContentNode, CommentToken -> CommentNode.
  • Parameter ignoreQuestAndExclam now has the default value true.
  • Source code now has standard MAVEN structure.
  • HtmlCleaner now depends on Java runtime 1.5+.
  • For the list of fixed bugs
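
The transResCharsToNCR entry above describes turning reserved XML characters into numeric character references. The following is a minimal plain-JDK sketch of that idea only. The class and method names (ReservedCharSketch, escapeReserved) are hypothetical and are not part of HtmlCleaner, and the sketch does not claim to reproduce how the library implements the option.

```java
// Illustration only: plain JDK code, not taken from HtmlCleaner.
class ReservedCharSketch {

    // Replaces each of the five reserved XML characters with a decimal
    // numeric character reference of the form &#dd;.
    // Note: every '&' is escaped, including one that already starts an entity.
    static String escapeReserved(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                case '"':
                case '\'':
                case '<':
                case '>':
                    out.append("&#").append((int) c).append(';');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeReserved("a < b & c"));
    }
}
```

A real serializer would normally also have to decide what to do with ampersands that already begin an entity, which this sketch deliberately ignores.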

New in HTML Cleaner 2.1 (Mar 14, 2013)

  • Parsing transformations are developed in order to easily skip or change specified tags or attributes during the cleanup process.
  • A few more constructors are added to class HtmlCleaner, making it possible to reuse the same cleaner properties across multiple cleaner instances.
  • Code cleanup.

New in HTML Cleaner 2.0 (Mar 14, 2013)

  • A complete code refactoring has been done to better separate the roles of cleaner, cleaner properties, object model nodes and serializers. The API is not compatible with previous versions, though it is still very simple to use.
  • Post-cleaning node manipulation is enabled with a rich set of methods in the TagNode class. There is now no need to create DOM or JDom out of the HtmlCleaner object model in order to select, add or remove nodes or attributes.
  • Basic XPath is supported on the HtmlCleaner object model. Despite the partial implementation, it should be powerful enough to find or collect nodes/attributes/text even with fairly complex criteria.
  • Modifying an already cleaned HtmlCleaner object model is enabled with HtmlCleaner.setInnerHtml(node, html), similar to the DHTML feature for setting the inner HTML of an object.
  • Creating a custom tag rule set is now much easier by defining an XML configuration file.
  • New properties booleanAttributeValues and nodeByXPath for setting cleaner's behavior are introduced.
  • Test cases added to source code.
  • Memory leak problem in Java 1.4 fixed.
  • A number of bug fixes.

New in HTML Cleaner 1.6 (Mar 14, 2013)

  • New flag parameter ignoreQuestAndExclam is introduced offering control over special tags -

New in HTML Cleaner 1.55 (Mar 14, 2013)

  • Added Reader based HtmlCleaner constructors.
  • New parameter pruneTags is introduced, offering a way to remove undesired tags together with all their children from the XML tree after parsing and cleaning.
  • Bug fixes.
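
The pruneTags entry above describes removing unwanted tags together with all of their children. A minimal sketch of that idea on a standard JDK DOM tree follows. It assumes well-formed input and uses hypothetical names (PruneSketch, prune, parse). It does not use or reproduce HtmlCleaner's own object model or implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustration only: plain JDK code, not taken from HtmlCleaner.
class PruneSketch {

    // Removes every element whose name is in tagNames, including its subtree.
    static void prune(Node parent, Set<String> tagNames) {
        // Copy the children first: the live NodeList shrinks during removal.
        NodeList live = parent.getChildNodes();
        List<Node> children = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            children.add(live.item(i));
        }
        for (Node child : children) {
            if (child instanceof Element && tagNames.contains(child.getNodeName())) {
                parent.removeChild(child);   // drops the element and all descendants
            } else {
                prune(child, tagNames);      // keep the node, inspect its children
            }
        }
    }

    // Parses a well-formed XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><p>keep</p><script><b>drop</b></script></body></html>");
        prune(doc, new HashSet<>(Arrays.asList("script", "style")));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Pruning after parsing, as the entry describes, means the removal works on an already balanced tree, so dropping a node cannot leave dangling close tags behind.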