PDFTextStream Changelog

New in PDFTextStream 2.6.0 (Aug 10, 2012)

  • New OutputHandler: com.snowtide.pdf.SelectionOutputTarget, implementing text extraction based on "selection coordinates", as commonly found in user-facing PDF viewer UIs.
  • PDFTextStream is now free for use in single-threaded applications; all previous "evaluation" limitations no longer apply when PDFTextStream is operated without a license file.

New in PDFTextStream 2.3.2 (Sep 28, 2011)

  • Fixed issue where PDFTextStream would fail to initialize when the default system locale was set to Shift_JIS (i.e. SJIS, MS932, Windows-31J)
  • Fixed an issue where certain Chinese, Japanese, and Korean fonts were not being loaded properly when specific encoding config data was missing.
  • Fixed an octal string parsing bug that could lead to a PDF parsing failure.
  • Added crop box attribute to com.snowtide.pdf.Page interface
  • An expanded set of control characters is now treated as whitespace.
  • Added support for non-compliant PDF documents produced by TXT2PDF for OS/390.
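
For context on the octal string parsing fix above: in a PDF literal string, a backslash may be followed by one to three octal digits that encode a single byte. The sketch below is a minimal, stdlib-only illustration of that rule as the PDF specification describes it. It is not PDFTextStream's parser, and the class and method names (PdfLiteralStrings, decode) are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative sketch only; names are hypothetical, not part of PDFTextStream. */
final class PdfLiteralStrings {

    /**
     * Decodes the body of a PDF literal string (the text between the outer
     * parentheses) into raw bytes, resolving backslash escapes.
     * Assumes each char of the input represents one byte of the file.
     */
    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\' || i == body.length()) {
                out.write(c); // ordinary byte (or a trailing lone backslash)
                continue;
            }
            char e = body.charAt(i++);
            if (e >= '0' && e <= '7') {
                // Octal escape: consume at most three octal digits in total.
                int value = e - '0';
                int digits = 1;
                while (digits < 3 && i < body.length()
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i++) - '0');
                    digits++;
                }
                out.write(value & 0xFF); // high-order overflow is dropped
                continue;
            }
            switch (e) {
                case 'n': out.write('\n'); break;
                case 'r': out.write('\r'); break;
                case 't': out.write('\t'); break;
                case 'b': out.write('\b'); break;
                case 'f': out.write('\f'); break;
                case '\n': break; // backslash + newline: line continuation
                case '\r': // backslash + CR or CRLF: line continuation
                    if (i < body.length() && body.charAt(i) == '\n') i++;
                    break;
                default: out.write(e); // covers \\, \( and \)
            }
        }
        return out.toByteArray();
    }
}
```

Note that an escape stops after three digits, so the body `\0053` decodes to two bytes: the byte 5 followed by the digit 3.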

New in PDFTextStream 2.3.1 (Sep 28, 2011)

  • Added methods to VisualOutputTarget to enable the optional exclusion of rotated content from its output (523)
  • Fixed a bug where rotated characters were reporting a rotation angle (theta) of 0 when presented to VisualOutputTarget. (519)
  • Fixed a bug where use of PDFTextStream.NET in a multithreaded environment could produce garbled or missing text extracts in very limited circumstances. (512)
  • Added support for PDFs that contain malformed arrays in their graphics output streams (509)
  • Fixed a bug where text rendered using a Type3 font that has a proper unicode mapping was being omitted from extracts (507)
  • Significantly improved the emission of whitespace between words on lines with large amounts of tracking (506)
  • Fixed character mapping for 'ã' and '·' ("middle dot") (502)
  • Fixed a bug affecting VisualOutputTarget and RegionOutputTarget where smaller characters would not be included in resulting text extracts. (499)
  • Fixed an issue where string values held in compressed object streams were being re-encrypted (primarily affecting key/value PDF attributes) (495)
  • Fixed an issue where PDF documents generated by PDFSharp were improperly handled, leading to significant degradation of extraction accuracy. (490)
  • Fixed an issue where CFF font encodings were being applied inappropriately, potentially leading to garbled extracts. (479)
  • Fixed a bug related to zero-length cross-reference entry codes that was resulting in an improper FaultyPDFException being thrown (450)

New in PDFTextStream 2.3.0 (Sep 28, 2011)

  • Added an .isStruckThrough() method to com.snowtide.pdf.TextUnit, indicating whether a character has a strikethrough drawn through it.
  • Improved PDFTextStream's support for embedded character mappings.
  • The calculation of whitespace between words has been fixed to properly account for whitespace that is explicitly encoded in the source PDF documents.
  • Improved PDFTextStream's handling of composite content encodings, which previously could fail resulting in some ranges of PDF content being 'ignored' during extraction.
  • Fixed a bug in VisualOutputTarget where text from a single line would be split over multiple lines
  • Improved vertical alignment of text extracted using VisualOutputTarget
  • Improved VisualOutputTarget-produced extracts to eliminate spurious additional whitespace between closely-adjacent words

New in PDFTextStream 2.2.5 (Sep 28, 2011)

  • Added support for extracting XFA forms data as XML
  • Significantly improved the performance of text extraction using VisualOutputTarget
  • Added support for PDF documents larger than 2GB
  • Fixed a bug where the encodings from embedded Type1 fonts were previously not being applied properly in some circumstances.
  • Fixed a problem where newer content in updated PDF documents was sometimes being ignored.
  • Fixed a problem where PDFDocEncoding-encoded bookmarks and metadata were not being decoded properly
  • Added .getDestinationName() method to com.snowtide.pdf.Bookmark

New in PDFTextStream 2.2.1 (Sep 28, 2011)

  • PDFTextStream.NET now ships with ikvm v0.3.4, which fixes a number of problems that prevented PDFTextStream from functioning properly across multiple AppDomains (598)
  • Added PDFTextStream.loadLicense(URL) function (475)
  • Added a 'spacing scale' property to VisualOutputTarget which allows applications to control the amount of horizontal whitespace that should be emitted per physical amount of whitespace found in the source document (528)
  • PDFTextStream will now attempt to load a license file from the host application's current directory before checking the current classpath / AppDomain (661)
  • Fixed a problem where pathological embedded Unicode character encodings were causing PDFTextStream to emit strings of control characters rather than reasonable extracted content. (428)
  • Fixed a bug in PDFTextStream's handling of cross reference entries that caused fatal errors in some documents (620)
  • Fixed a problem where UTF-16 encoded bookmark titles were not being decoded properly (618)
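
As background for the UTF-16 bookmark title fix above: the PDF specification marks a UTF-16 text string with a leading byte order mark (bytes 0xFE 0xFF), and text strings without that marker use PDFDocEncoding. The following is a minimal sketch of that detection step, not PDFTextStream's implementation. The class name PdfTextStrings is hypothetical, and ISO-8859-1 is used here only as a rough stand-in for PDFDocEncoding, which differs from it for some byte values.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; names are hypothetical, not part of PDFTextStream. */
final class PdfTextStrings {

    /** Decodes the raw bytes of a PDF text string such as a bookmark title. */
    static String decode(byte[] raw) {
        boolean hasUtf16Bom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasUtf16Bom) {
            // Skip the two-byte marker and decode the rest as big-endian UTF-16.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // ASSUMPTION: ISO-8859-1 approximates PDFDocEncoding for this sketch;
        // a complete decoder would use the PDFDocEncoding table instead.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```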

New in PDFTextStream 2.2 (Sep 28, 2011)

  • Added support for Apache Lucene v2.1 and v2.2 to PDFTextStream's Lucene integration module (com.snowtide.pdf.lucene.PDFDocumentFactory)
  • Added com.snowtide.pdf.PDFTextStreamConfig, which enables simple static and runtime configuration of PDFTextStream
  • Added new PDFTextStream constructors that accept customized PDFTextStreamConfig instances, and a setConfiguration(PDFTextStreamConfig) function to set a PDFTextStream instance's configuration at runtime
  • PDFTextStream now joins adjacent rectangles that have similar stroke and fill colors, which improves various page segmentation results
  • Improved table detection processes to adaptively recognize very small "variant" table cells
  • Improved pdfts.examples.XMLOutputTarget to build an XML DOM Document instead of constructing XML using a StringBuffer; block elements now include a type attribute of "table" if the block is a table
  • Significantly improved the quality of PDF documents generated when merging PDF files (com.snowtide.pdf.util.MergeUtil) and when saving updated PDF forms (com.snowtide.pdf.forms.AcroForm#writeUpdatedDocument(OutputStream))
  • Rotated text blocks are now properly grouped within bounded regions
  • Changed pdfts.cjk.disable and pdfts.mmap.disable system properties to pdfts.cjk.enable and pdfts.mmap.enable, respectively
  • Fixed an overflow bug in PDFTextStream's PDF data parser
  • Fixed a bug where the ascent and descent characteristics of some fonts were defaulting to improper values
  • Fixed a bug where lines and rectangles drawn with a Separation color space were not being recognized properly
  • Fixed a bug where an error would result when reading a PDF file with a non-conforming linebreak sequence after the `stream' tag
  • Fixed a bug where tables containing underlined text would not be recognized properly
  • Fixed a bug where edges of rectangles were improperly recognized as text underlines
  • Fixed a bug where PDFTextStream wouldn't recognize PDF data stream filter name abbreviations
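
To illustrate the filter name abbreviations mentioned in the last item: the PDF specification defines short forms of the standard filter names (originally for inline image data), and some documents use them elsewhere. A reader can normalize them with a simple lookup, sketched below. This is an assumption about one reasonable approach, not a description of how PDFTextStream handles it, and the FilterNames class is hypothetical.

```java
/** Illustrative sketch only; names are hypothetical, not part of PDFTextStream. */
final class FilterNames {

    /**
     * Maps an abbreviated PDF filter name to its full form.
     * Names that are not known abbreviations are returned unchanged.
     */
    static String canonical(String name) {
        switch (name) {
            case "AHx": return "ASCIIHexDecode";
            case "A85": return "ASCII85Decode";
            case "LZW": return "LZWDecode";
            case "Fl":  return "FlateDecode";
            case "RL":  return "RunLengthDecode";
            case "CCF": return "CCITTFaxDecode";
            case "DCT": return "DCTDecode";
            default:    return name; // already a full name, or unknown
        }
    }
}
```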

New in PDFTextStream 2.1.6 (Sep 28, 2011)

  • Added com.snowtide.pdf.util.TableUtils, which provides a set of CSV conversion functions for exporting the contents of tables
  • Added options to specify path to load PDFTextStream license file via pdfts_license_path environment variable or system property
  • Added com.snowtide.pdf.PDFTextStream.loadLicense(String) - programmatic way to specify path from which to load PDFTextStream license file
  • Changed PDFTextStream's default page segmentation algorithms to not eliminate empty table cells, making it simpler to export tabular content to Excel, etc.
  • Fixed bug in VisualOutputTarget where vertically-adjacent lines of text were being inappropriately combined
  • Fixed text encoding bug where text extracted from PDF documents generated by Adobe InDesign v4.0 - v5.0 would be "scrambled", or appear to be series of Chinese glyphs
  • Fixed bug where AFM font mappings were sometimes applied in an incorrect order, leading to spot errors in text extracts
  • Fixed bug where certain embedded Type1 font encodings were not being loaded correctly, resulting in single-character extraction errors
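
The license path options above can be pictured with the following sketch, which looks up pdfts_license_path as a system property and as an environment variable. It is a stdlib-only illustration written from the application's side. The LicensePath class is hypothetical, and the precedence shown (system property first, then environment variable) is an assumption, since this changelog does not state which source wins.

```java
import java.util.Map;
import java.util.Properties;

/** Illustrative sketch only; names are hypothetical, not part of PDFTextStream. */
final class LicensePath {

    static final String KEY = "pdfts_license_path";

    /**
     * Returns the configured license path, or null if none is set.
     * ASSUMPTION: the system property takes precedence over the
     * environment variable; the changelog does not specify the order.
     */
    static String resolve(Map<String, String> env, Properties props) {
        String fromProperty = props.getProperty(KEY);
        if (fromProperty != null && !fromProperty.isEmpty()) {
            return fromProperty;
        }
        String fromEnv = env.get(KEY);
        return (fromEnv != null && !fromEnv.isEmpty()) ? fromEnv : null;
    }

    /** Convenience overload that reads the real process environment. */
    static String resolve() {
        return resolve(System.getenv(), System.getProperties());
    }
}
```

An application that prefers to set the path in code could pass the resolved value to PDFTextStream.loadLicense(String), the method listed above.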

New in PDFTextStream 2.1.5 (Sep 28, 2011)

  • Significant improvements in the handling and standard output of rotated content
  • Added com.snowtide.pdf.layout.TextUnit.getTheta()

New in PDFTextStream 2.1.3 (Sep 28, 2011)

  • Added com.snowtide.pdf.Font.isItalic() -- indicates whether a font is italicized
  • Added com.snowtide.pdf.layout.TextUnit.isUnderlined() -- indicates whether a character is underlined
  • Added tagging of italic text regions in pdfts.examples.XMLOutputTarget