New in Bytescout PDF Extractor SDK 13.4.1 Build 4780 (Jul 14, 2023)

  • Enhanced text parsing
  • Improved image file rendering
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 13.4.0 Build 4659 (Apr 10, 2023)

  • Added support for WEBP image format in 'RasterRenderer' and 'HTMLExtractor'
  • Added Variant methods to extractors
  • Improved font rendering
  • Fixed crash on text object where contentLength
  • Performance improvements
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 13.3.0 Build 4514 (Sep 27, 2022)

  • DocumentSplitter: added support for "**" split range that splits document into pairs of pages.
  • Added methods to all extractors that support the Variant datatype for input and output. They allow in-memory processing when using the SDK as a COM/ActiveX object from Delphi, VC++, VBScript, etc.
  • Fixed text search for RTL languages.
  • Input photo images are now rotated according to EXIF information.
  • Improved parsing of PDF documents.
  • Other minor fixes and improvements.
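
The "**" split range mentioned above can be pictured with a short Python sketch. It is only an illustration and not the SDK's code: the function name split_into_pairs is hypothetical, and the handling of an odd page count (a final single-page part) is an assumption that the release note does not confirm.

```python
def split_into_pairs(page_count: int) -> list:
    """Group 1-based page numbers into consecutive pairs.

    Assumption: with an odd page count the last group holds a single page.
    """
    pages = list(range(1, page_count + 1))
    # Step through the page list two pages at a time.
    return [pages[i:i + 2] for i in range(0, len(pages), 2)]


if __name__ == "__main__":
    print(split_into_pairs(5))  # [[1, 2], [3, 4], [5]]
```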

New in Bytescout PDF Extractor SDK 13.2.0 Build 4485 (Jun 7, 2022)

  • 'DocumentRotator' now can automatically fix rotation of PDF files using OCR.
  • Improved line removal algorithm.
  • Improved loading of embedded fonts.
  • Performance improvements.
  • Rotated text objects were combined with unrotated ones. Fixed now.
  • Fixed parsing of names of file attachments.
  • 'SearchablePDFMaker': fixed coordinates of transparent text in the output document when the input is an image.
  • Suppressed junk console message.
  • Improved parsing of PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 13.1.0 Build 4386 (Jan 25, 2022)

  • DocumentMerger: Added property 'MergedDocumentTitle' that allows overriding the title of the merged document
  • XLSExtractor: Added property 'CustomColumnWidths' that allows specifying exact column widths in the generated Excel spreadsheet
  • JSONExtractor: The mode 'OutputStructure.Full' is renamed to 'OutputStructure.LegacyFixed' and made maximally compatible in field names with the mode 'OutputStructure.Legacy'
  • Added support for UniKS-UCS2-H text encoding
  • InfoExtractor: Added method 'GetFormFields()' that returns information about form fields in a PDF document
  • Improved COM/ActiveX interfaces for in-memory processing without file operations
  • Extractors and SearchablePDFMaker: Added property 'OCRDisableAutoSegmentation' to solve OCR engine's segmentation issues
  • .NET Core minimum required version is now 2.1 (was 2.0)
  • Line grouping was not affected by 'ConsiderFontSizes' and 'ConsiderFontColors' properties. Fixed now
  • Fixed disposing issue in 'SearchablePDFMaker'
  • Improved parsing of PDF documents
  • Other minor fixes and improvements

New in Bytescout PDF Extractor SDK 13.0.0 Build 4253 (Jan 25, 2022)

  • New column detection mode 'ColumnDetectionMode.ContentGroupsAI' that works better on tables without borders and on pages with multiple tables
  • Greatly improved table detection in 'TableDetector2'
  • Improved filtering of shadow-like text ('ExtractShadowLikeText' option)
  • Improved the 'LineGroupingMode.JoinOrphanedRows'
  • 'DocumentMerger': Improved merging of PDF forms. Now it can link fields with matching names or rename them to avoid unwanted linking. See the property 'RenameMatchingFieldsDuringMerge'
  • 'JSONExtractor' and 'XMLExtractor' now output the page size for each page
  • All extractor classes now support extraction of page ranges
  • Added properties 'DetectUnderlineTextStyle' and 'DetectStrikeoutTextStyle' to 'CSVExtractor' and 'XLSExtractor'. They help to prevent underlined text affecting the line grouping in table cells
  • Improved background color detection for the option 'ConsiderBackgroundColors'
  • Added property 'NormalizeText' to all extractors. It replaces Unicode spaces and hyphens in the extracted text with normal ' ' and '-' characters
  • 'Remover2': fixed handling of PDF page rotation
  • 'Remover2': making unsearchable is now performed only for edited pages
  • 'XMLExtractor': Added property 'IndentedXML' to control indentation
  • 'JSONExtractor': Added property 'IndentedJSON' to control indentation
  • 'Stamper': fixed stamping of rotated pages
  • Added new OCR mode - 'OCRMode.AutoRepairFonts'. It automatically tries to detect PDF documents with corrupted text and forces OCR font repair for them. Works only for English texts
  • Added property 'PageSeparator' to CSV and XLS extractors
  • 'XLSExtractor': improved negative numbers detection
  • 'TextExtractor.FindAll()' method was ignoring the case sensitivity option. Fixed now
  • Added property 'OCRDetectLines' that helps to detect table structure in scanned documents
  • 'JSONExtractor' and 'XMLExtractor' now output the number of pages in the result and the number of pages for which OCR was performed
  • Added property 'OCRPageCount' to extractors that contains the number of pages for which OCR was performed during the last extraction
  • 'JSONExtractor': Added property 'OutputStructure' that allows selecting the structure of the output JSON
  • 'JSONExtractor': Added property 'OutputTransformation' that allows applying a JSONPath expression to the output JSON
  • Performance improvements
  • Improved parsing of PDF documents
  • Other minor fixes and improvements
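
The 'NormalizeText' behavior described above (Unicode spaces and hyphens replaced with plain ' ' and '-') can be approximated in a few lines of Python. This is a sketch under assumptions, not the SDK's implementation: the function name normalize_text is hypothetical, and treating every character in the Unicode categories "Zs" (space separators) and "Pd" (dash punctuation) as a candidate is a guess at the scope; the SDK may cover a different set of characters.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Replace Unicode space separators with ' ' and dash punctuation with '-'.

    Assumption: categories "Zs" and "Pd" stand in for "unicode spaces and hyphens".
    """
    result = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Zs":      # e.g. NO-BREAK SPACE, THIN SPACE
            result.append(" ")
        elif category == "Pd":    # e.g. EN DASH, NON-BREAKING HYPHEN
            result.append("-")
        else:
            result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    sample = "10\u00a0000\u201320\u2009000"
    print(normalize_text(sample))  # 10 000-20 000
```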

New in Bytescout PDF Extractor SDK 12.1.0 Build 4136 (May 18, 2021)

  • Added property 'TextExtractor.FuzzySearch' that enables a 'fuzzy' text search algorithm. It allows finding 'approximately equal' strings.
  • Added 'DocumentSplitter2' class that splits a document by found text.
  • Added 'CSVExtractor.NormalizeCSV' property. It makes CSV data produced from different document pages contain the same number of columns.
  • Added property 'JSONExtractor.OutputStructure' that allows changing the structure of the generated JSON to one of the predefined variants for easier postprocessing.
  • Added property 'JSONExtractor.OutputTransformation' that allows applying a JSONPath expression to the generated JSON.
  • Added property 'OCRPageCount' to extractor classes that contains the number of pages for which OCR was performed.
  • 'JSONExtractor' and 'XMLExtractor' now add to the generated JSON and XML result the number of processed pages and the number of pages for which OCR was performed.
  • Added property 'OCRDetectLines' to extractor classes that improves column detection in scanned documents.
  • Added property 'ConsiderBackgroundColors' to extractor classes that enables detection of background color under text objects. It may help to improve row and column detection in tables without borders but with color stripes.
  • Added properties 'DocumentMerger.GenerateBookmarks' and 'DocumentMerger.BookmarkTitles' to enable automatic generation of bookmarks pointing to the merged parts.
  • Improved PDF optimization in 'DocumentSplitter'.
  • 'DocumentMerger' now uses the first input document as the base for the merged document. This allows keeping document information properties and outlines.
  • DocumentMerger: added support for profiles.
  • MultimediaExtractor: added support for more media types.
  • Fixed: 'TextExtractor.FindAll()' method was ignoring the case sensitivity option.
  • Fixed issue with junk empty temporary files generated during OCR.
  • Improved parsing of PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 11.3.0 Build 3983 (Oct 26, 2020)

  • DocumentSplitter: Added support for regions with inverted page numbers. For example, "!1" means "the last page", "!1-!3" or "!3-" means "last three pages"
  • DocumentSplitter: Added support for "*" split range that means "split every single page"
  • Added 'InfoExtractor.Metadata' property that gets XMP metadata from the document
  • Improved joining of multi-line cells in tables without borders ('LineGroupingMode.JoinOrphanedRows' mode)
  • Improved detection of OCR language file versions
  • Improved .NET Core 2.0 compatibility
  • Improved unwrapping of multi-line cell text
  • Fixed issue when invisible vector drawings were causing unwanted separation of text objects
  • Fixed extraction from area when running OCR against image file (not PDF!)
  • Improved parsing of PDF documents
  • Other minor fixes and improvements
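
The inverted page numbers and the "*" range described for DocumentSplitter above can be read as in the following Python sketch. It is an illustration only, not the SDK's parser: the names resolve_page and resolve_range are hypothetical, and details such as error handling and treating an open-ended start ("-5") as "from page 1" are assumptions.

```python
from __future__ import annotations


def resolve_page(token: str, page_count: int) -> int:
    """Turn "5" or "!2" into a 1-based page number ("!1" is the last page)."""
    token = token.strip()
    if token.startswith("!"):
        page = page_count - int(token[1:]) + 1
    else:
        page = int(token)
    if not 1 <= page <= page_count:
        raise ValueError(f"page {token!r} is outside 1..{page_count}")
    return page


def resolve_range(spec: str, page_count: int) -> list[list[int]]:
    """Return the page groups described by one range expression."""
    spec = spec.strip()
    if spec == "*":
        # "*" means: every page becomes its own part.
        return [[p] for p in range(1, page_count + 1)]
    if "-" in spec:
        start, _, end = spec.partition("-")
        first = resolve_page(start, page_count) if start.strip() else 1
        last = resolve_page(end, page_count) if end.strip() else page_count
        low, high = sorted((first, last))  # "!1-!3" runs backwards, so order it
        return [list(range(low, high + 1))]
    return [[resolve_page(spec, page_count)]]


if __name__ == "__main__":
    for spec in ("!1", "!1-!3", "!3-", "*"):
        print(spec, resolve_range(spec, 5))
```

For a five-page document, both "!1-!3" and "!3-" resolve to pages 3, 4 and 5, which matches the "last three pages" reading given in the release note.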

New in Bytescout PDF Extractor SDK 11.2.0 Build 3919 (Jun 30, 2020)

  • 'MultimediaExtractor' now supports extraction of 3D-animation objects
  • 'TextExtractor.Find()' now keeps original font names in found object information
  • Improved column detection in `ColumnDetectionMode.Borders` mode
  • 'SearchablePDFMaker' did not process vector-only pages. Fixed now
  • Improved regex text search in 'TextExtractor'
  • Added 'DetectUnderlineTextStyle' and 'DetectStrikeoutTextStyle' properties to 'JSONExtractor' and 'XMLExtractor'
  • Added 'OCRWhiteList' and 'OCRBlackList' properties to extractors
  • Added 'Invert' OCR preprocessing filter
  • Added 'Scale' OCR preprocessing filter
  • Improved joining of multi-line cells in tables without borders (`LineGroupingMode.JoinOrphanedRows` mode)
  • Improved performance of 'ImageExtractor'
  • Added page rectangles to 'InfoExtractor'
  • Improved 'OCRAnalyzer'
  • Improved automatic deletion of duplicated text objects during the extraction
  • Fixed extraction issues in .NET Core version
  • Improved parsing of PDF documents
  • Other minor fixes and improvements

New in Bytescout PDF Extractor SDK 11.1.0 Build 3845 (Mar 19, 2020)

  • Added 'OCROverallConfidence' property to all extractors.
  • SearchablePDFMaker: Added 'KeepOriginalRotation' property.
  • SearchablePDFMaker: fixed crash on mixed English-Arabic text recognition.
  • PDF Multitool: Added "Developer Tools" sub-menu to the context menu.
  • Improved parsing of PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 11.0.0 Build 3805 (Feb 12, 2020)

  • Added support for new revision of PDF encryption (ISO 32000-2:2017 compliance).
  • Added 'LicenseInfo' property providing detailed information about your license.
  • Added 'Grayscale' filter to OCRImagePreprocessingFilters.
  • Dramatically improved column extraction for multiple tables on a page. Works only in `ColumnDetectionMode.Borders` mode for tables with borders between columns and rows.
  • Greatly improved `ColumnDetectionMode.BorderedTables`. As in the table detection, it now uses optical recognition to detect bordered tables and their columns on scanned documents.
  • Improved 'InfoExtractor' to return the encrypted and password-protected states without asking for a password or throwing an exception.
  • Added document permissions information to 'InfoExtractor'.
  • DocumentSplitter: added zero-padding to page numbers in generated file names.
  • Improved extraction of duplicated text (shadow-like effect).
  • Improved 'MultimediaExtractor'.
  • Fixed text search issues on some documents.
  • Fixed bug that damaged extracted text only during multi-thread processing.
  • Fixed crash on subsequent extractions with different OCR modes.
  • Fixed .NET Core compatibility issue.
  • Improved parsing of PDF documents.
  • Other minor fixes and improvements
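
The zero-padding of page numbers in DocumentSplitter file names noted above matters mostly for sorting. The Python sketch below shows the general idea; the function name part_file_names and the "name.pageNN.pdf" pattern are hypothetical and are not taken from the SDK, whose actual naming scheme is not given here.

```python
def part_file_names(base_name: str, page_count: int) -> list:
    """Build one file name per page, padding page numbers to equal width."""
    width = len(str(page_count))  # 12 pages -> 2 digits, 150 pages -> 3 digits
    return [
        f"{base_name}.page{number:0{width}d}.pdf"
        for number in range(1, page_count + 1)
    ]


if __name__ == "__main__":
    names = part_file_names("report", 12)
    print(names[0], names[-1])  # report.page01.pdf report.page12.pdf
```

With padded numbers, a plain alphabetical sort of the generated files keeps them in page order, which is not the case for "page2" versus "page10".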

New in Bytescout PDF Extractor SDK 10.8.0 Build 3732 (Dec 4, 2019)

  • Remover2: Added 'MaskColor' property that allows changing the color of the masking rectangle.
  • Remover & Remover2: Fixed incomplete removal of the text in some cases.
  • XMLExtractor and XFDFExtractor: fixed missing control types.
  • Fixed parsing of combobox items that consist of value+label pairs.
  • Improved handling of Arabic fonts and charsets.
  • Improved handling of CJK fonts and charsets.
  • Improved parsing of PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 10.7.0 Build 3697 (Nov 2, 2019)

  • Improved extraction of embedded images.
  • Improved table columns detection.
  • Remover2: fixed crash on sequential Add*() method calls.
  • PDF Multitool: fixed crash on multimedia extraction.
  • Improved parsing and processing of PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 10.6.0 Build 3659 (Oct 1, 2019)

  • Added methods to remove vector objects to `Remover` and `Remover2` classes.
  • Added experimental `TableDetector2` class demonstrating new table detection method.
  • Improved replacement of non-embedded PDF fonts.
  • Improved splitting of text objects when using CustomExtractionColumns.
  • Fixed text search on some documents.
  • Added `CreateProfile()` method to all extractors that creates profile from current object.
  • PDF Multitool: Added tools to remove text, image, and vector objects.
  • PDF Multitool: Fixed "Save Vectors" option in XML extraction.
  • Improved parsing and processing of PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 10.5.0 Build 3637 (Sep 2, 2019)

  • DocumentMerger: Added "MergeFolder()" method that merges all PDF files in a folder
  • Improved extraction by CustomExtractionColumns
  • Remover: Improved appearance of partially removed text objects
  • Renderer and Viewer: Improved rendering of small fonts with stroke
  • PDF Multitool: Added Full Screen mode
  • PDF Multitool: Added "Night Mode"
  • PDF Multitool: Fixed selection reset on switching a tool
  • Improved parsing and processing of PDF documents
  • Other minor fixes and improvements

New in Bytescout PDF Extractor SDK 10.4.0 Build 3600 (Aug 7, 2019)

  • Added extracted text analysis. See "EnableTextAnalysis" property.
  • Improved columns detection.
  • Implemented replacement filters allowing to replace extracted text before analysis of table structure.
  • Added "SensitiveDataDetector" class allowing to detect sensitive data in PDF documents.
  • Added new "Remover2" class: improved version of "Remover" with better interface.
  • PDF Multitool: Added "Save vector objects" option to XML and JSON converters.
  • PDF Multitool: Improved "Detect Tables" dialog.
  • PDF Multitool: Improved conversion to HTML format.
  • PDF Multitool: Added set of tools "Sensitive Data Suite" allowing to detect and remove sensitive data in PDF documents.
  • PDF Multitool: Reduced memory consumption on extraction from very large documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 10.3.0 Build 3566 (Jul 2, 2019)

  • Added property `OCRMaximizeCPUUtilization` that improves OCR performance at the cost of maximized CPU utilization.
  • Improved OCR rotation detection.
  • Fixed OCR crash on systems with CPU without AVX and AVX2 extensions.
  • Fixed OCR crash when working under limited system accounts.
  • Improved the detection of the visibility of text objects when they are hidden by an overlying opaque vector object.
  • Improved extraction from cropped PDF pages.
  • Fixed `OutOfMemoryException` on tiling patterns with very large step or bounding box.
  • Improved extraction of embedded images.
  • Improved extraction of multimedia files.
  • Fixed decoding of UTF-8 encoded text objects.
  • Improved Japanese fonts decoding.
  • Fixed `LineGroupingMode.JoinOrphanedRows` mode for multiple single-cell lines.
  • PDF Multitool: Replaced legacy `FolderBrowserDialog` with modern `FolderSelectDialog` everywhere.
  • PDF Multitool: Added Ctrl-Shift-O hot key to open recent document.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 10.2.0 Build 3512 (May 29, 2019)

  • Improved OCR engine stability when working in strict environments.
  • Improved columns separation by 'CustomExtractionColumns'.
  • Added parameter for 'TextExtractor.Find()' method that allows specifying RegexOptions.
  • Added support for streams to 'DocumentSplitter' and 'DocumentMerger'.
  • Added property 'TableDetector.EnhanceTableBorders' affecting the table detection in 'Bordered Tables' mode.
  • Improved parsing and processing of PDF documents.
  • PDF Multitool: Visited pages are now displayed much faster.
  • PDF Multitool: Improved keyboard navigation.
  • PDF Multitool: Improved CSV preview.
  • PDF Multitool: All tools now show elapsed time in the status bar.
  • PDF Multitool: Changed default OCR grade to 'Best'.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 10.0.0 Build 3420 (Mar 22, 2019)

  • Greatly improved OCR quality and performance.
  • PDF Multitool: New option to select OCR grade.
  • PDF Multitool: Fixed behavior of "Remove" button in "Merge documents" tool.
  • PDF Multitool: Reduced excessive painting in selection mode.
  • Improved parsing and rendering of PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 9.4.0 Build 3398 (Mar 12, 2019)

  • Added TextExtractor.FindAll() and TextExtractor.FindAllToJSON() methods.
  • Added `AnnotationExtractor` class.
  • Improved handling of embedded PDF fonts.
  • Improved parsing of PDF documents.
  • PDF Multitool can now be set as default PDF viewer application in Windows.
  • PDF Multitool: Added the ability to preview the conversion.
  • PDF Multitool: Reworked converters' options dialogs. Removed weird options, added actual ones.
  • PDF Multitool: Now Ctrl-PageUp and Ctrl-PageDown keys switch pages even if PDFViewerControl is not focused.
  • PDF Multitool: Improved handling of PDF extraction permissions.
  • Fixed unwanted byte order mark (BOM) when writing extracted text to MemoryStream.
  • Fixed line grouping in table cells.
  • Fixed crash in XMLExtractor when input document is image.
  • Improved parsing of XFA forms.
  • Improved Deskew image preprocessing filter.
  • Added `ShrinkMultipleSpaces` property improving column detection if text in a table contains multiple spaces between words.
  • Fixed column detection in rotated pages.
  • Improved support of Microsoft Excel formats.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 9.3.0 Build 3352 (Feb 4, 2019)

  • Added OCRCorrections property to all extractors that implement OCR.
  • Added .NET Core compatible assemblies.
  • Improved support of Korean fonts.
  • Improved parsing of PDF documents.
  • Improved columns detection.
  • XMLExtractor, JSONExtractor: Added `SaveVectors` property.
  • OCRExtension: Suppressed unwanted console messages.
  • Removed C++ runtime dependencies.
  • Fixed merging of PDF forms containing fields with the same name.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 9.2.0 Build 3254 (Oct 24, 2018)

  • Changed font rendering engine to improve text rendering and to circumvent Windows GDI font processing issues.
  • Improved extraction of embedded media files.
  • Improved detection of columns when extracting tabular data.
  • PDF Renderer SDK: Property `RenderingOptions.PreferSystemFonts` made obsolete due to change of font rendering engine.
  • XLSExtractor: improved Excel format support.
  • Embedded default fonts to fallback to if a font is missing in Windows.
  • Improved support of cropped PDF documents.
  • Improved extraction of text from rotated pages.
  • PDF Multitool: Added "OCR Analyzer" tool.
  • Performance improvements.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 9.1.0 Build 3163 (Jul 19, 2018)

  • Added new line grouping mode `LineGroupingMode.JoinOrphanedRows`.
  • Added new OCRAnalyzer class that can help to find optimal combination of OCR image preprocessing filters. See source code examples.
  • Added new LineDetector class allowing to find all vertical and horizontal lines in document.
  • Added public methods GetPreprocessedPagePreview() and SavePreprocessedPagePreview() allowing to preview the result of OCR image preprocessing filters work.
  • Greatly improved the line removing OCR image preprocessing filters.
  • SearchablePDFMaker: fixed hanging on processing PDF documents with large count of vector objects.
  • Fixed bug in RotationAngle property when processing already rotated PDF documents.
  • ImageExtractor now correctly handles the rotation of embedded images.
  • PDF Multitool: added new feature "Optimize PDF document".
  • PDF Multitool: fixed resolution selection in "Make PDF unsearchable".
  • PDF Multitool: fixed rotation angle selection in "Rotate Document".
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 9.0.0 Build 3079 (Apr 12, 2018)

  • Added RotationAngle property to rotate document pages before the extraction.
  • TextExtractor: Improved plain text columns alignment.
  • XLSExtractor: Improved numbers detection.
  • DocumentOptimizer: Greatly improved optimization effectiveness.
  • Greatly improved Deskew algorithm for OCR of rotated scans.
  • Remover: more accurate deletion of text objects.
  • SearchablePDFMaker: Fixed processing of rotated scans.
  • SearchablePDFMaker: Fixed resolution issues when the input is image.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 8.8.1.3025 (Jan 29, 2018)

  • Improved formatting of extracted plain text (TextExtractor). Now columns look better.

New in Bytescout PDF Extractor SDK 8.8.0.3015 (Jan 23, 2018)

  • Fixed: OCR preprocessing filters were not applied if input document is image.
  • PDF Multitool: Fixed image preprocessing filters in "Find Text" dialog.
  • TableDetector now provides detected cells information for ColumnDetectionMode.BorderedTables (see `FoundTableCells` property).
  • XMLExtractor: Added annotations extraction.
  • XMLExtractor: Object coordinates in XML are fractional now for better precision (were integer).
  • Improved support of encrypted PDF documents.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 8.7.0.2980 (Nov 8, 2017)

  • DocumentOptimizer: added automatic resampling of high resolution images.
  • Added `ParsingError` event allowing to handle parsing errors and interrupt or continue the processing.
  • SearchablePDFMaker: Added DiscardExistingDocumentText property allowing to overwrite previous OCR.
  • Added AllowStandalonePunctuation property to tabular extractors (CSV, XML, JSON, XLS).
  • Performance improvements.
  • SearchablePDFMaker: Invisible text dimensions now match recognized text pieces.
  • DocumentSplitter: Fixed `outputFolder` parameter in SplitCOM() method.
  • Made IBaseTextExtractor interface public.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 8.6.0.2911 (Aug 6, 2017)

  • XMLExtractor, JSONExtractor, HTMLExtractor: Added KeepOriginalFontNames property.
  • TextComparer: Added GetChanges() method to get comparison results in form convenient for programmatic analysis.
  • DocumentRotator: It is now possible to specify pages to rotate.
  • TextExtractor.ExtractColumnByColumn property now affects Find() method.
  • Fixed font names in SearchResult elements.
  • Fixed Contrast preprocessing filter.
  • Extraction: subscript and superscript text objects were merged with normal text. Fixed now.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 8.5.0.2855 (Jun 2, 2017)

  • Improved Japanese text extraction.
  • Removed obsolete ClientProfile builds.
  • Improved multimedia files extraction.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 8.4.0.2820 (Jun 2, 2017)

  • New event ProgressChanged in all time-consuming classes. The event reports the progress in percent and also allows interrupting the processing.
  • SearchablePDFMaker now supports single and multi-page images as the input and produces a PDF document at the output.
  • Performance improvements.
  • Fixed crash when the input document is an image loaded from a stream.
  • Other minor fixes and improvements.

New in Bytescout PDF Extractor SDK 8.3.0.2792 (Jun 2, 2017)

  • Added new Remover class allowing to remove text from PDF documents.
  • InfoExtractor is now able to read custom document properties (see CustomProperties property).
  • XMLExtractor and JSONExtractor are now able to extract document images and put them to outer files or embed them as Base64 strings.
  • Text extraction: Unwrap property now affects the text in table cells.
  • Text extraction: Improved lines grouping in table cells.
  • AttachmentExtractor: Fixed extraction of attachments and portfolio created with Microsoft Outlook.
  • DocumentSplitter: Fixed document optimization (OptimizeSplittedDocuments property).
  • Performance improvements.
  • Other minor improvements and bug fixes.

New in Bytescout PDF Extractor SDK 8.2.0.2697 (Feb 1, 2017)

  • Fixed Unwrap option.
  • Improved bordered tables detection.
  • Improved attachments extraction.
  • Added support for profiles - quick way to apply multiple settings at once.
  • OCR: Implemented rotation detection of wrongly oriented scanned PDF pages.
  • SearchablePDFMaker is now able to automatically rotate wrongly oriented scanned PDF pages.
  • Fixed exception in SearchablePDFMaker when loading document from stream.
  • Fixed memory leaks in OCR.
  • TextExtractor and CSVExtractor: Added Save* methods overrides allowing to specify the character encoding.
  • Improved media files extraction.
  • Improved Vertical Line Remover OCR preprocessing filter.
  • Other minor improvements and bug fixes.

New in Bytescout PDF Extractor SDK 8.1.1.2606 (Nov 18, 2016)

  • Fixed OCR preprocessing filters in SearchablePDFMaker.
  • Fixed OCR preprocessing filters in the PDF Multitool demo app.
  • Added Gamma Correction preprocessing filter.
  • Added Horizontal Lines Remover preprocessing filter.
  • Improved Dilate preprocessing filter.

New in Bytescout PDF Extractor SDK 8.1.0.2600 (Nov 18, 2016)

  • Added OCR preprocessing filters to improve the recognition quality on low-quality scanned documents.
  • Added new DocumentOptimizer class able to recompress all document images with JPEG or CCITT compression.
  • Added text removal filters.
  • All extraction classes (TextExtractor, XMLExtractor, etc.) are now able to load image files and extract text from them using OCR.
  • PDF Multitool demo app is now able to load image files and extract text from them using OCR.
  • Fixed extraction of text in Korean charset (KSCms-UHC-H / Code Page 949).
  • Improved text extraction from specified rectangular area.
  • Improved extraction of invisible text.
  • Fixed transparent color representation in XML extraction.
  • Other minor improvements and bug fixes.

New in Bytescout PDF Extractor SDK 8.0.0.2523 (Nov 18, 2016)

  • Added filtering of extracted content by font name, font size and color.
  • Updated OCR engine to the latest version. Update language files from "tessdata" folder.
  • Improved text extraction.
  • Improved lines grouping in tabular data.
  • Improved performance.
  • Improved XFA forms extraction.
  • Improved TableDetector.
  • Fixed PDF parsing issues.
  • Fixed JBIG images decoding.
  • ImageExtractor: fixed per-page image extraction.
  • MultimediaExtractor: fixed extraction on embedded MPEG audio.
  • TextExtractor: fixed non-working RemoveHyphenation property.
  • Other minor improvements and bug fixes.

New in Bytescout PDF Extractor SDK 7.00.0.2474 (Jul 7, 2016)

  • Added new DocumentPrinter utility class allowing to print PDF documents silently (without any user dialogs).
  • Added new JSONExtractor class.
  • Added override for DocumentSplitter.Split() method allowing to specify the output folder for generated files.
  • Fixed multi-threading bug in DocumentSplitter.
  • TableDetector now respects extraction area set by SetExtractionArea() method.
  • New properties in extraction classes: ExtractionColumns (contains coordinates of detected columns) and CustomExtractionColumns (allows overriding the column detection).
  • Fixed: GetPageRect* methods did not take the page rotation into account.
  • Fixed bug in installer that caused some files from a previous installation to interfere with updates.
  • Reworked the registration checking. Now the library will not throw an exception, but will work in demo mode if you omitted or entered a wrong RegistrationName and RegistrationKey.
  • PDF Multitool: Added recent document list to "Open PDF Document" button.
  • PDF Multitool: Selection can be resized now.
  • PDF Multitool: Added Extract JSON feature.
  • PDF Multitool: Improved Table Detector UI.
  • PDF Multitool: Greatly improved font rendering quality.
  • PDF Multitool: Added debug option "Show Detected Extraction Columns" to the context menu to display the detected columns on the current page. Becomes visible only after running any extraction against the currently displayed page.
  • PDF Multitool: Fixed font rendering issue on 32-bit Windows.
  • Other minor improvements and bug fixes.

New in Bytescout PDF Extractor SDK 6.00.2071 (May 20, 2015)

  • PDF to XML, PDF To CSV, PDF To Text functionality improved
  • PDF To XLS command line sample added (based on vbscript)
  • PDF To HTML SDK adds new .DetectHyperLinks property (TRUE by default) to enable/disable automated links detection in the text
  • New SearchablePDFMaker (available for PRO licenses) to convert PDF into searchable PDF files
  • new properties in extractor: ConsiderFontNames, ConsiderFontSizes, ConsiderFontColors, ConsiderVerticalBorders in CFG files
  • header columns detection (when AutoAlighHeaderToColumns = true) improved
  • .DetectLinesInsteadOfParagraphs replaced with new .LineGroupingMode to control how lines are merged into paragraphs
  • IMPORTANT: PDF To XML fixes a long-standing issue with incorrect Y coordinate for text objects (it pointed to the bottom left instead of the top left)
  • .TableXMinIntersectionRequiredInPercents and .TableYMinIntersectionRequiredInPercents properties added
  • C++ source code sample added
  • XML Extractor fixes missing empty columns in PreserveFormatting=true mode
  • Minor fixes in colors in some PDF files
  • support for multiple OCR languages added
  • PDF Multitool GUI: adds Copy to Clipboard button to TXT, CSV, XML and raster renderer dialogs
  • XLSExtractor: adds PageToWorksheet property to enable/disable generation of separate worksheets per page.
  • new .TextEncodingCodePage property
  • PDFViewerControl: adds ValidateContextMenu allowing user to add custom items to context menu
  • PDF Viewer control: adds properties ShowTextObjects, ShowImageObjects, ShowVectorObjects.
  • XMLExtractor now adds "OCRConfidence" attribute for recognized text
  • PDF/A checking functionality (in beta)
  • improving controls and text checking and alignment according to the original layout. The issue was caused by the shift of Y coordinates in controls while parsing: that was incorrect. The correct way is to shif...
  • XML Extractor updated: now produces tag for checkboxes and text fields
  • changed using of current directory to temp directory.
  • checkboxes, radioboxes, editboxes, comboboxes are better supported
  • now allows partial trust callers.

New in Bytescout PDF Extractor SDK 5.80.1781 (Jan 29, 2015)

  • PDF to XML, PDF to CSV, PDF to Text functionality improved
  • OCRMode now provides 9 modes
  • .DetectLineInsteadOfParagraph now works much better. Set it to False to capture multiline text in table cells!
  • PDF controls support improved
  • FDF and XFDF data extraction added
  • Table detection improved to support multiline text in cells and tables with absent rows
  • beta version of PDF/A validator added
  • minor fixes and improvements

New in Bytescout PDF Extractor SDK 4.00.1487 (Jun 3, 2014)

  • improved pdf to text, pdf to csv, pdf to xml
  • issue with extraction area fixed
  • Improved Unicode handling
  • new .ContentType to check if PDF is PDF, Portfolio or XFAForm
  • new properties: Unwrap, ExtractionAreaUsageMode
  • new AttachmentInfo class to obtain details about attachment
  • new XFA Form XML extraction support (see XFAFormExtractor and XFAFormToXML samples)
  • new ZuGFeRD PDF support added
  • Multithreading performance improved
  • Licensing updated: Now Licensing is per developer
  • new "match whole word" parameter to TextExtractor.Find()
  • improved XLS and XLSX output
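
The "match whole word" option can be illustrated outside the SDK. The following is a minimal Python sketch, assuming that a whole-word match means the keyword must be bounded by non-word characters. The helper `find_all` is hypothetical and is not the signature of TextExtractor.Find().

```python
import re

def find_all(text, keyword, whole_word=False):
    """Return the start offsets of keyword in text (case-sensitive)."""
    pattern = re.escape(keyword)
    if whole_word:
        # \b matches only at a boundary between word and non-word characters
        pattern = r"\b" + pattern + r"\b"
    return [m.start() for m in re.finditer(pattern, text)]

text = "form, formula, reform, form"
print(find_all(text, "form"))                   # [0, 6, 17, 23]
print(find_all(text, "form", whole_word=True))  # [0, 23]
```

Without the option, the keyword also matches inside "formula" and "reform"; with it, only the two standalone occurrences are reported.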

New in Bytescout PDF Extractor SDK 3.30.1240 (Dec 11, 2013)

  • improved support for PDF files in older formats
  • image flipping issue in some PDF files fixed
  • improved text rendering in PDF files
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 3.20.1209 (Dec 11, 2013)

  • table detection was not returning proper coordinates for 2nd and further tables, fixed
  • minor source code samples updates
  • DocumentSplitter now works with multipage TIF files
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 3.20.1200 (Dec 11, 2013)

  • minor rotated text issues fixed
  • table detection was not returning proper coordinates, fixed
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 3.20.1179 (Oct 23, 2013)

  • pdf to text and pdf data extraction improved
  • new .AutoAlignColumns property (true by default) controls whether cells are automatically aligned to the header column (changing this setting may help if you are getting shifted cells)
  • new DocumentRotator class to rotate pages in PDF documents
  • new ExtractRawImages property in Images Extractor to define if we are extracting raw images or images with rotation and transformation applied
  • improved support of PDF files with rotated objects and pages
  • new source code sample "Find Keyword And Extract Page" showing how to extract a page found by a keyword
  • Images Extractor: SetExtractionArea() method added to define a rectangle area to extract images from
  • improved Splitting Pages example
  • improved pages extraction from PDF
  • new RemoveUnusedResources method to remove unused resources from PDF to reduce file size
  • minor bug-fixes and improvements

New in Bytescout PDF Extractor SDK 3.20.1100 (Oct 23, 2013)

  • new method: DocumentSplitter.Split(sourcefile, splitPages) to extract multiple ranges of pages from the same PDF file
  • minor bug-fixes in pdf to text engine

New in Bytescout PDF Extractor SDK 3.20.1092 (Aug 14, 2013)

  • pdf to text minor functionality fixes
  • x64 installer improvements
  • minor fixes for error messages
  • PDFDocument.Dispose() no longer disposes the source PDF stream if that stream was supplied by the user (the user should dispose it)
  • improved PDF format support
  • minor bug-fixes
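
The Dispose() change above follows a common ownership rule: an object closes only the streams it opened itself. The following is a minimal Python sketch of that rule. The `Document` class and its members are hypothetical and do not reproduce the SDK's PDFDocument class.

```python
import io

class Document:
    """Hypothetical wrapper used only to illustrate stream ownership."""

    def __init__(self, source):
        if isinstance(source, str):
            # A path was given: we open the file, so we own the stream.
            self._stream = open(source, "rb")
            self._owns_stream = True
        else:
            # A stream was supplied by the caller: the caller keeps ownership.
            self._stream = source
            self._owns_stream = False

    def dispose(self):
        # Close the stream only if this object opened it.
        if self._owns_stream and self._stream is not None:
            self._stream.close()
        self._stream = None

user_stream = io.BytesIO(b"%PDF-1.4")
doc = Document(user_stream)
doc.dispose()
print(user_stream.closed)  # False: the caller must still close it
user_stream.close()
```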

New in Bytescout PDF Extractor SDK 3.20.1075 (Jul 15, 2013)

  • improved PDF To CSV, PDF To XLS, PDF To XML extraction
  • improved PDF reading speed and stability
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 3.10.899 (May 17, 2013)

  • improved pdf to text conversion
  • improved PDF reading support
  • more source Visual Basic .NET, C# and VBScript code samples added
  • documentation updated

New in Bytescout PDF Extractor SDK 3.00.864 (Apr 13, 2013)

  • improved PDF extraction support
  • improved PDF handling
  • PDF splitting and merging: new DocumentSplitter.OptimizeSplittedDocuments property optimizes PDF files after splitting, which may decrease file size
  • improved PDF fonts handling
  • demo utility updated
  • source code samples updated to run on any .NET framework by default
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 3.00.825 (Mar 14, 2013)

  • improved pdf to text, pdf to csv
  • demo utility PDF Viewer reworked and updated for better UI experience
  • minor improvements and fixes in PDF support
  • improved PDF stability while working with PDF files with high density vector graphics inside
  • improved support for indexed color palettes
  • improved embedded fonts rendering
  • better support for Unicode fonts
  • new .Version property to read exact version of the dll
  • minor updates and improvements

New in Bytescout PDF Extractor SDK 2.50.708 (Dec 13, 2012)

  • PDF data extraction speed improved
  • Windows 8 support improved
  • PDF images and colors support improved
  • PDF to CSV, PDF to XML, PDF to XLS/XLSX now skip leading rows if they are empty
  • pdf text search now works better and provides more intelligent support for regular expressions
  • ActiveX support and installation improved and now provides single batches to run on Windows x86/x64 for Windows XP to 8 Pro
  • new property: .ExtractShadowLikeText to enable/disable extraction of shadowed text (where it is used as effect to create visual shadows)
  • minor bug-fixes and improvements
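
The leading-row behavior noted above can be pictured with a short Python sketch. It assumes that "empty" means every cell in the row is blank or whitespace. The function `skip_leading_empty_rows` is a hypothetical illustration, not SDK code.

```python
def skip_leading_empty_rows(rows):
    """Drop rows from the top until the first row with a non-blank cell."""
    for i, row in enumerate(rows):
        if any(cell.strip() for cell in row):
            return rows[i:]
    return []  # every row was empty

rows = [["", ""], [" ", ""], ["Name", "Qty"], ["", ""], ["Bolt", "4"]]
print(skip_leading_empty_rows(rows))
# [['Name', 'Qty'], ['', ''], ['Bolt', '4']]
```

Only the rows before the first non-empty row are dropped; an empty row in the middle of the table is kept.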

New in Bytescout PDF Extractor SDK 2.40.650 (Nov 2, 2012)

  • improved support for Unicode text extraction
  • improved support for PDF/A pdf files
  • fixed white stripes appearing when multiple images are combined
  • data extraction internal optimizations
  • improved support for 8 bit images inside PDF
  • vector drawings improved to provide better support for multiple small objects
  • Color representation in images with indexed colors fixed
  • Type2 fonts support improved
  • Improved support for embedded fonts in PDF produced by Ghostscript engine
  • CCITT image compression issues fixed
  • LZW compressed PDF support improved
  • improved support for shading objects
  • improved PDF fonts support
  • improved support for PDF with 4 bit images

New in Bytescout PDF Extractor SDK 2.30.594 (Sep 24, 2012)

  • PDF data extraction improved
  • memory and speed optimizations
  • fixing issue with empty data while extracting data from some PDF files
  • improved images extraction support (more image encoding variations are supported)
  • minor updates in examples
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 2.30.568 (Sep 24, 2012)

  • pdf to text conversion quality improved
  • multithreading usage stability has been improved
  • hanging issue on some PDF fixed
  • PDF Extractor SDK: updated sample for StructuredExtractor (previously known as TableExtractor interface)
  • minor fixes and improvements

New in Bytescout PDF Extractor SDK 2.20.539 (May 23, 2012)

  • improved stability
  • demo utility improved
  • important security fixes

New in Bytescout PDF Extractor SDK 2.20.525 (May 23, 2012)

  • improved speed (up to 2x faster on some documents)
  • Tables detection improved
  • updated PDF Viewer utility
  • improved support for structured text extraction (CSV and XML data extraction)
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 2.20.458 (Feb 9, 2012)

  • Minor fixes in TableDetector class (.TableDetectionMinNumberOfColumns and .TableDetectionMinNumberOfRows were working incorrectly)
  • Improved text extraction for PDF files generated from text files
  • Improved support for PDF files produced by Adobe Acrobat
  • PDF Viewer: CSV, XML and Text extractor forms updated to show .PreserveFormattingOnTextExtraction option
  • Minor fixes in .NET 4.0 assemblies
  • Renderer SDK adds /Visual Basic/PDF To BMP using streams/ sample
  • Improved support for PDF with forms objects
  • Improved leading spaces format detection in text extraction
  • .SetExtractionArea() added to define area on a page to work with in PDF Renderer SDK
  • Improved fonts information reading support in PDF files
  • New .PageSeparator property in TextExtractor allowing to define a separator String for pages if you need one
  • Fixing issue with indexed colorspaces in PDF
  • Improved PDF format support

New in Bytescout PDF Extractor SDK 2.20.415 (Dec 29, 2011)

  • PDF Extractor SDK: minor update for PDF to XLS sample
  • rendering: improved fonts support
  • text extraction with formatting improved
  • new source code sample to show how to save extracted text to a stream
  • performance optimized and pdf processing speed improved
  • improved support for PDF format

New in Bytescout PDF Extractor SDK 2.20.396 (Dec 12, 2011)

  • fixing issues with CSV, XML and XLS extraction on long tables
  • PDF Viewer now provides ability to turn on/off text formatting support on extraction
  • PDF support improved
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 2.20.392 (Dec 12, 2011)

  • NEW table detection implemented, see new Bytescout.PDFExtractor.TableDetector interface and source code samples in /Find Table And Extract As CSV/ sub-folder in examples
  • NEW regular expressions support for text search in TextExtractor (see .RegexSearch property)
  • Text search functionality improved
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 2.10.303 (Dec 12, 2011)

  • NEW: DocumentMerger and DocumentSplitter interfaces and classes to merge and split PDF documents
  • improved support for PDF documents
  • PDF processing speed increased
  • minor bug-fixes

New in Bytescout PDF Extractor SDK 2.10.276 (Dec 12, 2011)

  • NEW: AttachmentExtractor interface to extract file attachments and embedded files from PDF (see /Examples/Extract Attachments/ for sample source code)
  • NEW: XLSExtractor interface to extract tables from PDF as XLS and XLSX Excel files (including font formatting)
  • improved text extraction functionality
  • improved output image quality
  • improved support of Unicode text
  • improved support of damaged PDF files (not hanging on damaged files anymore)

New in Bytescout PDF Extractor SDK 2.00.228 (Dec 12, 2011)

  • CSVExtractor: SeparationSymbol and QuotationSymbol properties were added
  • TrimValues property for CSVExtractor and XMLExtractor: turned on by default to trim detected cell values automatically
  • Default properties for CSV extraction improved
  • fixed the incorrect default space ratio in the text extractor: it is now 0.4; the previous value of 1.2 caused some words to be joined into a single one
  • TextExtractor.detectNewColumnBySpacesRatio renamed into .SpaceRatioBetweenWords property
  • PDFViewer now shows options dialog to adjust SpaceRatioBetweenWords if needed
  • minor bug-fixes
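
To see why a lower space ratio separates words that a higher one joins, consider the Python sketch below. It rests on an assumption not confirmed by these notes: that the ratio is compared against the gap between adjacent glyphs divided by the width of a space character. The function `join_glyphs` and the sample coordinates are hypothetical.

```python
def join_glyphs(glyphs, space_width, space_ratio):
    """Rebuild text from glyphs given as (char, x_start, x_end), left to right.

    Assumption: a gap of at least space_ratio * space_width is a word break.
    """
    out = []
    prev_end = None
    for ch, x0, x1 in glyphs:
        if prev_end is not None and (x0 - prev_end) >= space_ratio * space_width:
            out.append(" ")
        out.append(ch)
        prev_end = x1
    return "".join(out)

# "to be" with a gap of about 2.0 units between "o" and "b"; space width is 2.5
glyphs = [("t", 0.0, 3.0), ("o", 3.2, 6.2), ("b", 8.2, 11.2), ("e", 11.4, 14.4)]
print(join_glyphs(glyphs, 2.5, 1.2))  # tobe
print(join_glyphs(glyphs, 2.5, 0.4))  # to be
```

With a ratio of 1.2 the gap must reach 3.0 units to count as a space, so the two words merge; with 0.4 the threshold drops to 1.0 unit and the gap is detected.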

New in Bytescout PDF Extractor SDK 2.00.217 (Dec 12, 2011)

  • CSV and XML extraction speed greatly improved
  • CSVExtractor and XMLExtractor classes add new .DetectNewColumnBySpacesRatio property: use this property to control space between detected columns of text
  • XML and CSV Extractor adds .SkipCellsWithEmptyValues property (true by default to skip cells with empty values)
  • PDF Viewer now shows extraction options dialog for XML and CSV export functions
  • PDF To CSV to XLS source code sample added
  • PDF To CSV\Delphi\ source code sample added
  • minor bug-fixes and improvements

New in Bytescout PDF Extractor SDK 2.00.206 (Jun 8, 2011)

  • Support for .NET 3.5, .NET 4.00 added
  • Delphi source code sample has been added
  • Minor bug-fixes and improvements

New in Bytescout PDF Extractor SDK 2.00.186 (Jun 8, 2011)

  • PDF processing speed increased up to 10 times
  • Minor bug-fixes and improvements

New in Bytescout PDF Extractor SDK 1.10.168 (Jun 8, 2011)

  • Support for password protected PDF documents improved (was not working properly in previous release)
  • Minor bug-fixes and improvements

New in Bytescout PDF Extractor SDK 1.10.160 (Jun 8, 2011)

  • XML comments are available now to show hints for methods, classes and properties in Visual Studio
  • New property: .ExtractColumnByColumn (false default), set to True to extract text column by column instead of line by line
  • PDF Viewer freeware utility updated to feature "Extract Text (line by line)" and "Extract Text (column by column)" buttons
  • Improved support for single paged PDF documents produced by Acrobat Distiller software
  • Clipping issues were fixed
  • Fixed hanging on some broken PDF documents
  • Improved text decoding support
  • Minor bug-fixes
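
The difference between line-by-line and column-by-column extraction comes down to reading order. The Python sketch below is a simplified illustration that assumes fixed-width columns and y coordinates growing downward. The functions and the `column_width` parameter are hypothetical and do not describe how .ExtractColumnByColumn is implemented.

```python
def line_by_line(chunks):
    """Order text chunks (x, y, text) top to bottom, then left to right."""
    return [t for x, y, t in sorted(chunks, key=lambda c: (c[1], c[0]))]

def column_by_column(chunks, column_width):
    """Assign each chunk to a column by x, then read each column top to bottom."""
    ordered = sorted(chunks, key=lambda c: (c[0] // column_width, c[1], c[0]))
    return [t for x, y, t in ordered]

# Two columns (x = 0 and x = 300), two lines (y = 0 and y = 20)
chunks = [(0, 0, "A1"), (300, 0, "B1"), (0, 20, "A2"), (300, 20, "B2")]
print(line_by_line(chunks))           # ['A1', 'B1', 'A2', 'B2']
print(column_by_column(chunks, 300))  # ['A1', 'A2', 'B1', 'B2']
```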

New in Bytescout PDF Extractor SDK 1.10.150 (Jun 8, 2011)

  • PDF files support improved
  • Now handles PDF files from Google Docs without errors
  • Minor bug-fixes

New in Bytescout PDF Extractor SDK 1.10.144 (Jun 8, 2011)

  • Now works with secured documents (provide password if needed in .Password property)
  • Minor bug-fixes and improvements
  • Updated GUI demo application

New in Bytescout PDF Extractor SDK 1.10.121 (Jun 8, 2011)

  • PDF to CSV extractor added
  • PDF to XML extractor added
  • Support for invisible text extraction added
  • Minor bug-fixes and improvements