New in Bytescout PDF Extractor SDK 13.4.1 Build 4780 (Jul 14, 2023)
- Enhanced text parsing
- Improved image file rendering
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 13.4.0 Build 4659 (Apr 10, 2023)
- Added support for WEBP image format in 'RasterRenderer' and 'HTMLExtractor'
- Added Variant methods to extractors
- Improved font rendering
- Fixed crash on text object where contentLength
- Performance improvements
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 13.3.0 Build 4514 (Sep 27, 2022)
- DocumentSplitter: added support for "**" split range that splits the document into pairs of pages.
- Added methods to all extractors that support the Variant datatype for input and output. They allow in-memory processing when using the SDK as a COM/ActiveX object from Delphi, VC++, VBScript, etc.
- Fixed text search for RTL languages.
- Input photo images are now rotated according to EXIF information.
- Improved parsing of PDF documents.
- Other minor fixes and improvements.
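The "**" split range above groups pages two at a time. A minimal Python sketch of that grouping follows; it is an illustration only, not the SDK's code, and the function name `split_into_pairs` is hypothetical. How the SDK treats a trailing odd page is not stated in these notes, so the sketch assumes it forms a group of its own.

```python
def split_into_pairs(page_count):
    """Group 1-based page numbers into consecutive pairs.

    Assumption: a trailing odd page becomes a single-page group.
    """
    pages = list(range(1, page_count + 1))
    return [pages[i:i + 2] for i in range(0, len(pages), 2)]


if __name__ == "__main__":
    print(split_into_pairs(5))
```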
New in Bytescout PDF Extractor SDK 13.2.0 Build 4485 (Jun 7, 2022)
- 'DocumentRotator' can now automatically fix rotation of PDF files using OCR.
- Improved line removal algorithm.
- Improved loading of embedded fonts.
- Performance improvements.
- Rotated text objects were combined with unrotated ones. Fixed now.
- Fixed parsing of names of file attachments.
- 'SearchablePDFMaker': fixed coordinates of transparent text in the output document when the input is an image.
- Suppressed junk console message.
- Improved parsing of PDF documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 13.1.0 Build 4386 (Jan 25, 2022)
- DocumentMerger: Added property 'MergedDocumentTitle' that overrides the title of the merged document
- XLSExtractor: Added property 'CustomColumnWidths' that specifies exact column widths in the generated Excel spreadsheet
- JSONExtractor: The mode 'OutputStructure.Full' is renamed to 'OutputStructure.LegacyFixed' and made maximally compatible in field names with the mode 'OutputStructure.Legacy'
- Added support for UniKS-UCS2-H text encoding
- InfoExtractor: Added method 'GetFormFields()' returning information about form fields in a PDF document
- Improved COM/ActiveX interfaces for in-memory processing without file operations
- Extractors and SearchablePDFMaker: Added property 'OCRDisableAutoSegmentation' to solve OCR engine's segmentation issues
- .NET Core minimum required version is now 2.1 (was 2.0)
- Line grouping was not affected by 'ConsiderFontSizes' and 'ConsiderFontColors' properties. Fixed now
- Fixed disposing issue in 'SearchablePDFMaker'
- Improved parsing of PDF documents
- Other minor fixes and improvements
New in Bytescout PDF Extractor SDK 13.0.0 Build 4253 (Jan 25, 2022)
- New column detection mode 'ColumnDetectionMode.ContentGroupsAI' that works better on tables without borders and on pages with multiple tables
- Greatly improved tables detection in 'TableDetector2'
- Improved filtering of shadow-like text ('ExtractShadowLikeText' option)
- Improved the 'LineGroupingMode.JoinOrphanedRows'
- 'DocumentMerger': Improved merging of PDF forms. Now it can link fields with matching names or rename them to avoid unwanted linking. See the property 'RenameMatchingFieldsDuringMerge'
- 'JSONExtractor' and 'XMLExtractor' now output the page size for each page
- All extractor classes now support extraction of page ranges
- Added properties 'DetectUnderlineTextStyle' and 'DetectStrikeoutTextStyle' to 'CSVExtractor' and 'XLSExtractor'. They help to prevent underlined text affecting the line grouping in table cells
- Improved background color detection for the option 'ConsiderBackgroundColors'
- Added property 'NormalizeText' to all extractors. It replaces Unicode spaces and hyphens in the extracted text with normal ' ' and '-' characters
- 'Remover2': fixed handling of PDF page rotation
- 'Remover2': making pages unsearchable is now performed only for edited pages
- 'XMLExtractor': Added property 'IndentedXML' to control indentation
- 'JSONExtractor': Added property 'IndentedJSON' to control indentation
- 'Stamper': fixed stamping of rotated pages
- Added new OCR mode - 'OCRMode.AutoRepairFonts'. It automatically tries to detect PDF documents with corrupted text and forces OCR font repair for them. Works only for English texts
- Added property 'PageSeparator' to CSV and XLS extractors
- 'XLSExtractor': improved negative numbers detection
- 'TextExtractor.FindAll()' method was ignoring the case sensitivity option. Fixed now
- Added property 'OCRDetectLines' that helps to detect table structure in scanned documents
- 'JSONExtractor' and 'XMLExtractor' now output the number of pages in the result and the number of pages for which OCR was performed
- Added property 'OCRPageCount' to extractors that contains the number of pages for which OCR was performed during the last extraction
- 'JSONExtractor': Added property 'OutputStructure' that selects the structure of the output JSON
- 'JSONExtractor': Added property 'OutputTransformation' that applies a JSONPath expression to the output JSON
- Performance improvements
- Improved parsing of PDF documents
- Other minor fixes and improvements
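The 'NormalizeText' behavior described above (Unicode spaces and hyphens replaced with plain ' ' and '-') can be pictured with a short Python sketch. This is not the SDK's implementation. The exact set of characters the SDK treats as "Unicode spaces and hyphens" is not given in these notes, so the two character lists below are assumptions chosen for illustration, and `normalize_text` is a hypothetical name.

```python
# Assumed character sets: the SDK's actual list is not documented here.
UNICODE_SPACES = "\u00a0\u2002\u2003\u2009\u202f\u3000"
UNICODE_HYPHENS = "\u2010\u2011\u2012\u2013\u2014\u2212"

# Map every assumed space variant to ' ' and every hyphen variant to '-'.
_TABLE = str.maketrans(
    {**{c: " " for c in UNICODE_SPACES}, **{c: "-" for c in UNICODE_HYPHENS}}
)


def normalize_text(text):
    """Return text with the assumed Unicode spaces/hyphens made plain ASCII."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_text("2019\u20132020\u00a0report"))
```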
New in Bytescout PDF Extractor SDK 12.1.0 Build 4136 (May 18, 2021)
- Added property 'TextExtractor.FuzzySearch' that enables 'fuzzy' text search algorithm. It finds 'approximately equal' strings.
- Added 'DocumentSplitter2' class that splits document by found text.
- Added 'CSVExtractor.NormalizeCSV' property. It makes CSV data produced from different document pages contain the same number of columns.
- Added property 'JSONExtractor.OutputStructure' that changes the structure of the generated JSON to one of the predefined variants for easier postprocessing.
- Added property 'JSONExtractor.OutputTransformation' that applies a JSONPath expression to the generated JSON.
- Added property 'OCRPageCount' to extractor classes that contains the number of pages for which OCR was performed.
- 'JSONExtractor' and 'XMLExtractor' now add to the generated JSON and XML result the number of processed pages and the number of pages for which OCR was performed.
- Added property 'OCRDetectLines' to extractor classes that improves column detection in scanned documents.
- Added property 'ConsiderBackgroundColors' to extractor classes that enables detection of background color under text objects. It may help to improve row and column detection in tables without borders but with color stripes.
- Added properties 'DocumentMerger.GenerateBookmarks' and 'DocumentMerger.BookmarkTitles' to enable automatic generation of bookmarks pointing to the merged parts.
- Improved PDF optimization in 'DocumentSplitter'.
- 'DocumentMerger' now uses the first input document as the base for the merged document. This keeps document information properties and outlines.
- DocumentMerger: added support for profiles.
- MultimediaExtractor: added support for more media types.
- 'TextExtractor.FindAll()' method was ignoring the case sensitivity option. Fixed now.
- Fixed issue with junk empty temporary files generated during OCR.
- Improved parsing of PDF documents.
- Other minor fixes and improvements.
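The 'CSVExtractor.NormalizeCSV' item above says that CSV data from different pages ends up with the same number of columns. One plausible way to picture this is padding shorter rows with empty cells, as in the Python sketch below. Whether the SDK pads on the right, or aligns columns in some other way, is not stated in these notes; right-padding and the name `normalize_rows` are assumptions made for illustration.

```python
def normalize_rows(pages):
    """Pad every row on every page to the widest row found.

    pages: list of pages, each page a list of rows (lists of strings).
    Assumption: missing cells are added as empty strings on the right.
    """
    width = max((len(row) for page in pages for row in page), default=0)
    return [[row + [""] * (width - len(row)) for row in page] for page in pages]


if __name__ == "__main__":
    pages = [[["a", "b"]], [["c", "d", "e"]]]
    print(normalize_rows(pages))
```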
New in Bytescout PDF Extractor SDK 11.3.0 Build 3983 (Oct 26, 2020)
- DocumentSplitter: Added support for ranges with inverted page numbers. For example, "!1" means "the last page", "!1-!3" or "!3-" means "last three pages".
- DocumentSplitter: Added support for "*" split range that means "split every single page"
- Added 'InfoExtractor.Metadata' property that gets XMP metadata from the document
- Improved joining of multi-line cells in tables without borders ('LineGroupingMode.JoinOrphanedRows' mode)
- Improved detection of OCR language file versions
- Improved .NET Core 2.0 compatibility
- Improved unwrapping of multi-line cell text
- Fixed issue when invisible vector drawings were causing unwanted separation of text objects
- Fixed extraction from area when running OCR against image file (not PDF!)
- Improved parsing of PDF documents
- Other minor fixes and improvements
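The split range examples given above for DocumentSplitter ("!1", "!1-!3", "!3-", "*") can be read as follows: "!N" counts from the end of the document, an open-ended range runs to the last page, and "*" puts every page in its own part. The Python sketch below resolves only those forms into 1-based page numbers. It is an interpretation of the examples, not the SDK's parser, and `resolve_page` and `resolve_range` are hypothetical names.

```python
def resolve_page(token, page_count):
    """Turn "3" or "!3" into a 1-based page number ("!1" is the last page)."""
    if token.startswith("!"):
        return page_count - int(token[1:]) + 1
    return int(token)


def resolve_range(spec, page_count):
    """Return the page groups selected by one split range."""
    if spec == "*":
        # Every page becomes its own group.
        return [[p] for p in range(1, page_count + 1)]
    if "-" not in spec:
        return [[resolve_page(spec, page_count)]]
    start, end = spec.split("-", 1)
    first = resolve_page(start, page_count)
    # An open-ended range such as "!3-" runs to the last page.
    last = resolve_page(end, page_count) if end else page_count
    low, high = sorted((first, last))
    return [list(range(low, high + 1))]


if __name__ == "__main__":
    for spec in ("!1", "!1-!3", "!3-", "*"):
        print(spec, resolve_range(spec, 10))
```

The sketch does no bounds checking and ignores any range forms not shown in the examples.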
New in Bytescout PDF Extractor SDK 11.2.0 Build 3919 (Jun 30, 2020)
- 'MultimediaExtractor' now supports extraction of 3D-animation objects
- 'TextExtractor.Find()' now keeps original font names in found object information
- Improved column detection in `ColumnDetectionMode.Borders` mode
- 'SearchablePDFMaker' did not process vector-only pages. Fixed now
- Improved regex text search in 'TextExtractor'
- Added 'DetectUnderlineTextStyle' and 'DetectStrikeoutTextStyle' properties to 'JSONExtractor' and 'XMLExtractor'
- Added 'OCRWhiteList' and 'OCRBlackList' properties to extractors
- Added 'Invert' OCR preprocessing filter
- Added 'Scale' OCR preprocessing filter
- Improved joining of multi-line cells in tables without borders (`LineGroupingMode.JoinOrphanedRows` mode)
- Improved performance of 'ImageExtractor'
- Added page rectangles to 'InfoExtractor'
- Improved 'OCRAnalyzer'
- Improved automatic deletion of duplicated text objects during the extraction
- Fixed extraction issues in .NET Core version
- Improved parsing of PDF documents
- Other minor fixes and improvements
New in Bytescout PDF Extractor SDK 11.1.0 Build 3845 (Mar 19, 2020)
- Added 'OCROverallConfidence' property to all extractors.
- SearchablePDFMaker: Added 'KeepOriginalRotation' property.
- SearchablePDFMaker: fixed crash on mixed English-Arabic text recognition.
- PDF Multitool: Added "Developer Tools" sub-menu to the context menu.
- Improved parsing of PDF documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 11.0.0 Build 3805 (Feb 12, 2020)
- Added support for new revision of PDF encryption (ISO 32000-2:2017 compliance).
- Added 'LicenseInfo' property providing detailed information about your license.
- Added 'Grayscale' filter to OCRImagePreprocessingFilters.
- Dramatically improved column extraction for multiple tables on a page. Works only in `ColumnDetectionMode.Borders` mode for tables with borders between columns and rows.
- Greatly improved `ColumnDetectionMode.BorderedTables`. As in the table detection, it now uses optical recognition to detect bordered tables and their columns on scanned documents.
- Improved 'InfoExtractor' to return the encrypted and password-protected states without asking a password or throwing an exception.
- Added document permissions information to 'InfoExtractor'.
- DocumentSplitter: added zero-padding to page numbers in generated file names.
- Improved extraction of duplicated text (shadow-like effect).
- Improved 'MultimediaExtractor'.
- Fixed text search issues on some documents.
- Fixed bug that damaged extracted text only during multi-thread processing.
- Fixed crash on subsequent extractions with different OCR modes.
- Fixed .NET Core compatibility issue.
- Improved parsing of PDF documents.
- Other minor fixes and improvements
New in Bytescout PDF Extractor SDK 10.8.0 Build 3732 (Dec 4, 2019)
- Remover2: Added 'MaskColor' property that changes the color of the masking rectangle.
- Remover & Remover2: Fixed incomplete removal of the text in some cases.
- XMLExtractor and XFDFExtractor: fixed missing control types.
- Fixed parsing of combobox items that consist of value+label pairs.
- Improved handling of Arabic fonts and charsets.
- Improved handling of CJK fonts and charsets.
- Improved parsing of PDF documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 10.7.0 Build 3697 (Nov 2, 2019)
- Improved extraction of embedded images.
- Improved table columns detection.
- Remover2: fixed crash on sequential Add*() method calls.
- PDF Multitool: fixed crash on multimedia extraction.
- Improved parsing and processing of PDF documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 10.6.0 Build 3659 (Oct 1, 2019)
- Added methods to remove vector objects to `Remover` and `Remover2` classes.
- Added experimental `TableDetector2` class demonstrating new table detection method.
- Improved replacement of not embedded PDF fonts.
- Improved splitting of text objects when using CustomExtractionColumns.
- Fixed text search on some documents.
- Added `CreateProfile()` method to all extractors that creates profile from current object.
- PDF Multitool: Added tools to remove text, image, and vector objects.
- PDF Multitool: Fixed "Save Vectors" option in XML extraction.
- Improved parsing and processing of PDF documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 10.5.0 Build 3637 (Sep 2, 2019)
- DocumentMerger: Added "MergeFolder()" method that merges all PDF files in a folder
- Improved extraction by CustomExtractionColumns
- Remover: Improved appearance of partially removed text objects
- Renderer and Viewer: Improved rendering of small fonts with stroke
- PDF Multitool: Added Full Screen mode
- PDF Multitool: Added "Night Mode"
- PDF Multitool: Fixed selection reset on switching a tool
- Improved parsing and processing of PDF documents
- Other minor fixes and improvements
New in Bytescout PDF Extractor SDK 10.4.0 Build 3600 (Aug 7, 2019)
- Added extracted text analysis. See "EnableTextAnalysis" property.
- Improved columns detection.
- Implemented replacement filters that replace extracted text before the table structure is analyzed.
- Added "SensitiveDataDetector" class allowing to detect sensitive data in PDF documents.
- Added new "Remover2" class: improved version of "Remover" with better interface.
- PDF Multitool: Added "Save vector objects" option to XML and JSON converters.
- PDF Multitool: Improved "Detect Tables" dialog.
- PDF Multitool: Improved conversion to HTML format.
- PDF Multitool: Added a set of tools, "Sensitive Data Suite", to detect and remove sensitive data in PDF documents.
- PDF Multitool: Reduced memory consumption on extraction from very large documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 10.3.0 Build 3566 (Jul 2, 2019)
- Added property `OCRMaximizeCPUUtilization` that improves OCR performance at the cost of maximized CPU utilization.
- Improved OCR rotation detection.
- Fixed OCR crash on systems with CPU without AVX and AVX2 extensions.
- Fixed OCR crash when working under limited system accounts.
- Improved the detection of the visibility of text objects when they are hidden by an overlying opaque vector object.
- Improved extraction from cropped PDF pages.
- Fixed `OutOfMemoryException` on tiling patterns with very large step or bounding box.
- Improved extraction of embedded images.
- Improved extraction of multimedia files.
- Fixed decoding of UTF-8 encoded text objects.
- Improved Japanese fonts decoding.
- Fixed `LineGroupingMode.JoinOrphanedRows` mode for multiple single-cell lines.
- PDF Multitool: Replaced legacy `FolderBrowserDialog` with modern `FolderSelectDialog` everywhere.
- PDF Multitool: Added Ctrl-Shift-O hot key to open recent document.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 10.2.0 Build 3512 (May 29, 2019)
- Improved OCR engine stability when working in strict environments.
- Improved columns separation by 'CustomExtractionColumns'.
- Added a parameter to the 'TextExtractor.Find()' method for specifying RegexOptions.
- Added support for streams to 'DocumentSplitter' and 'DocumentMerger'
- Added property 'TableDetector.EnhanceTableBorders' affecting the table detection in 'Bordered Tables' mode.
- Improved parsing and processing of PDF documents.
- PDF Multitool: Visited pages are now displayed much faster.
- PDF Multitool: Improved keyboard navigation.
- PDF Multitool: Improved CSV preview.
- PDF Multitool: All tools now show elapsed time in the status bar.
- PDF Multitool: Changed default OCR grade to 'Best'.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 10.0.0 Build 3420 (Mar 22, 2019)
- Greatly improved OCR quality and performance.
- PDF Multitool: New option to select OCR grade.
- PDF Multitool: Fixed behavior of "Remove" button in "Merge documents" tool.
- PDF Multitool: Reduced excessive painting in selection mode.
- Improved parsing and rendering of PDF documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 9.4.0 Build 3398 (Mar 12, 2019)
- Added TextExtractor.FindAll() and TextExtractor.FindAllToJSON() methods.
- Added `AnnotationExtractor` class.
- Improved handling of embedded PDF fonts.
- Improved parsing of PDF documents.
- PDF Multitool can now be set as default PDF viewer application in Windows.
- PDF Multitool: Added the ability to preview the conversion.
- PDF Multitool: Reworked converters' options dialogs. Removed weird options, added actual ones.
- PDF Multitool: Now Ctrl-PageUp and Ctrl-PageDown keys switch pages even if PDFViewerControl is not focused.
- PDF Multitool: Improved handling of PDF extraction permissions.
- Fixed unwanted byte order mark (BOM) when writing extracted text to MemoryStream.
- Fixed line grouping in table cells.
- Fixed crash in XMLExtractor when input document is image.
- Improved parsing of XFA forms.
- Improved Deskew image preprocessing filter.
- Added `ShrinkMultipleSpaces` property improving column detection if text in a table contains multiple spaces between words.
- Fixed column detection in rotated pages.
- Improved support of Microsoft Excel formats.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 9.3.0 Build 3352 (Feb 4, 2019)
- Added OCRCorrections property to all extractors that implement OCR.
- Added .NET Core compatible assemblies.
- Improved support of Korean fonts.
- Improved parsing of PDF documents.
- Improved columns detection.
- XMLExtractor, JSONExtractor: Added `SaveVectors` property.
- OCRExtension: Suppressed unwanted console messages.
- Removed C++ runtime dependencies.
- Fixed merging of PDF forms containing fields with the same name.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 9.2.0 Build 3254 (Oct 24, 2018)
- Changed font rendering engine to improve text rendering and to circumvent Windows GDI font processing issues.
- Improved extraction of embedded media files.
- Improved detection of columns when extracting tabular data.
- PDF Renderer SDK: Property `RenderingOptions.PreferSystemFonts` made obsolete due to change of font rendering engine.
- XLSExtractor: improved Excel format support.
- Embedded default fonts to fallback to if a font is missing in Windows.
- Improved support of cropped PDF documents.
- Improved extraction of text from rotated pages.
- PDF Multitool: Added "OCR Analyzer" tool.
- Performance improvements.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 9.1.0 Build 3163 (Jul 19, 2018)
- Added new line grouping mode `LineGroupingMode.JoinOrphanedRows`.
- Added new OCRAnalyzer class that can help to find optimal combination of OCR image preprocessing filters. See source code examples.
- Added new LineDetector class that finds all vertical and horizontal lines in a document.
- Added public methods GetPreprocessedPagePreview() and SavePreprocessedPagePreview() for previewing the result of the OCR image preprocessing filters.
- Greatly improved the line removing OCR image preprocessing filters.
- SearchablePDFMaker: fixed hanging on processing PDF documents with large count of vector objects.
- Fixed bug in RotationAngle property when processing already rotated PDF documents.
- ImageExtractor now correctly handles the rotation of embedded images.
- PDF Multitool: added new feature "Optimize PDF document".
- PDF Multitool: fixed resolution selection in "Make PDF unsearchable".
- PDF Multitool: fixed rotation angle selection in "Rotate Document".
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 9.0.0 Build 3079 (Apr 12, 2018)
- Added RotationAngle property to rotate document pages before the extraction.
- TextExtractor: Improved plain text columns alignment.
- XLSExtractor: Improved numbers detection.
- DocumentOptimizer: Greatly improved optimization effectiveness.
- Greatly improved Deskew algorithm for OCR of rotated scans.
- Remover: more accurate deletion of text objects.
- SearchablePDFMaker: Fixed processing of rotated scans.
- SearchablePDFMaker: Fixed resolution issues when the input is image.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 8.8.1.3025 (Jan 29, 2018)
- Improved formatting of extracted plain text (TextExtractor). Now columns look better.
New in Bytescout PDF Extractor SDK 8.8.0.3015 (Jan 23, 2018)
- Fixed: OCR preprocessing filters were not applied if input document is image.
- PDF Multitool: Fixed image preprocessing filters in "Find Text" dialog.
- TableDetector now provides detected cells information for ColumnDetectionMode.BorderedTables (see `FoundTableCells` property).
- XMLExtractor: Added annotations extraction.
- XMLExtractor: Object coordinates in XML are fractional now for better precision (were integer).
- Improved support of encrypted PDF documents.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 8.7.0.2980 (Nov 8, 2017)
- DocumentOptimizer: added automatic resampling of high resolution images.
- Added `ParsingError` event for handling parsing errors and interrupting or continuing the processing.
- SearchablePDFMaker: Added DiscardExistingDocumentText property that overwrites previous OCR.
- Added AllowStandalonePunctuation property to tabular extractors (CSV, XML, JSON, XLS).
- Performance improvements.
- SearchablePDFMaker: Invisible text dimensions now match recognized text pieces.
- DocumentSplitter: Fixed `outputFolder` parameter in SplitCOM() method.
- Made IBaseTextExtractor interface public.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 8.6.0.2911 (Aug 6, 2017)
- XMLExtractor, JSONExtractor, HTMLExtractor: Added KeepOriginalFontNames property.
- TextComparer: Added GetChanges() method to get comparison results in form convenient for programmatic analysis.
- DocumentRotator: It is now possible to specify pages to rotate.
- TextExtractor.ExtractColumnByColumn property now affects Find() method.
- Fixed font names in SearchResult elements.
- Fixed Contrast preprocessing filter.
- Extraction: subscript and superscript text objects were merged with normal text. Fixed now.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 8.5.0.2855 (Jun 2, 2017)
- Improved Japanese text extraction.
- Removed obsolete ClientProfile builds.
- Improved multimedia files extraction.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 8.4.0.2820 (Jun 2, 2017)
- New event ProgressChanged in all time-consuming classes. The event reports the progress in percent and can also be used to interrupt the processing.
- SearchablePDFMaker now supports single and multi-page images as the input and produces a PDF document at the output.
- Performance improvements.
- Fixed crash when the input document is an image loaded from a stream.
- Other minor fixes and improvements.
New in Bytescout PDF Extractor SDK 8.3.0.2792 (Jun 2, 2017)
- Added new Remover class for removing text from PDF documents.
- InfoExtractor is now able to read custom document properties (see CustomProperties property).
- XMLExtractor and JSONExtractor are now able to extract document images and put them in external files or embed them as Base64 strings.
- Text extraction: Unwrap property now affects the text in table cells.
- Text extraction: Improved lines grouping in table cells.
- AttachmentExtractor: Fixed extraction of attachments and portfolio created with Microsoft Outlook.
- DocumentSplitter: Fixed document optimization (OptimizeSplittedDocuments property).
- Performance improvements.
- Other minor improvements and bug fixes.
New in Bytescout PDF Extractor SDK 8.2.0.2697 (Feb 1, 2017)
- Fixed Unwrap option.
- Improved bordered tables detection.
- Improved attachments extraction.
- Added support for profiles - quick way to apply multiple settings at once.
- OCR: Implemented rotation detection of wrongly oriented scanned PDF pages.
- SearchablePDFMaker is now able to automatically rotate wrongly oriented scanned PDF pages.
- Fixed exception in SearchablePDFMaker when loading document from stream.
- Fixed memory leaks in OCR.
- TextExtractor and CSVExtractor: Added Save* method overloads for specifying the character encoding.
- Improved media files extraction.
- Improved Vertical Line Remover OCR preprocessing filter.
- Other minor improvements and bug fixes.
New in Bytescout PDF Extractor SDK 8.1.1.2606 (Nov 18, 2016)
- Fixed OCR preprocessing filters in SearchablePDFMaker.
- Fixed OCR preprocessing filters in the PDF Multitool demo app.
- Added Gamma Correction preprocessing filter.
- Added Horizontal Lines Remover preprocessing filter.
- Improved Dilate preprocessing filter.
New in Bytescout PDF Extractor SDK 8.1.0.2600 (Nov 18, 2016)
- Added OCR preprocessing filters to improve the recognition quality on low-quality scanned documents.
- Added new DocumentOptimizer class able to recompress all document images with JPEG or CCITT compression.
- Added text removal filters.
- All extraction classes (TextExtractor, XMLExtractor, etc.) are now able to load image files and extract text from them using OCR.
- PDF Multitool demo app is now able to load image files and extract text from them using OCR.
- Fixed extraction of text in Korean charset (KSCms-UHC-H / Code Page 949).
- Improved text extraction from specified rectangular area.
- Improved extraction of invisible text.
- Fixed transparent color representation in XML extraction.
- Other minor improvements and bug fixes.
New in Bytescout PDF Extractor SDK 8.0.0.2523 (Nov 18, 2016)
- Added filtering of extracted content by font name, font size and color.
- Updated OCR engine to the latest version. Update language files from "tessdata" folder.
- Improved text extraction.
- Improved lines grouping in tabular data.
- Improved performance.
- Improved XFA forms extraction.
- Improved TableDetector.
- Fixed PDF parsing issues.
- Fixed JBIG images decoding.
- ImageExtractor: fixed per-page image extraction.
- MultimediaExtractor: fixed extraction of embedded MPEG audio.
- TextExtractor: fixed non-working RemoveHyphenation property.
- Other minor improvements and bug fixes.
New in Bytescout PDF Extractor SDK 7.00.0.2474 (Jul 7, 2016)
- Added new DocumentPrinter utility class that prints PDF documents silently (without any user dialogs).
- Added new JSONExtractor class.
- Added override for DocumentSplitter.Split() method allowing to specify the output folder for generated files.
- Fixed multi-threading bug in DocumentSplitter.
- TableDetector now respects extraction area set by SetExtractionArea() method.
- New properties in extraction classes:
- ExtractionColumns - contains coordinates of detected columns;
- CustomExtractionColumns - allows to override the column detection.
- GetPageRect* methods did not take the page rotation into account. Fixed now.
- Fixed bug in installer that caused some files from a previous installation to interfere with updates.
- Reworked the registration checking. Now the library will not throw an exception, but will work in demo mode if RegistrationName and RegistrationKey are missing or incorrect.
- PDF Multitool: Added recent document list to "Open PDF Document" button.
- PDF Multitool: Selection can be resized now.
- PDF Multitool: Added Extract JSON feature.
- PDF Multitool: Improved Table Detector UI.
- PDF Multitool: Greatly improved font rendering quality.
- PDF Multitool: Added debug option "Show Detected Extraction Columns" to the context menu to display the detected columns on the current page. It becomes visible only after running an extraction against the currently displayed page.
- PDF Multitool: Fixed font rendering issue on 32-bit Windows.
- Other minor improvements and bug fixes.
New in Bytescout PDF Extractor SDK 6.00.2071 (May 20, 2015)
- PDF to XML, PDF To CSV, PDF To Text functionality improved
- PDF To XLS command line sample added (based on vbscript)
- PDF To HTML SDK adds new .DetectHyperLinks property (TRUE by default) to enable/disable automated links detection in the text
- New SearchablePDFMaker (available for PRO licenses) to convert PDF into searchable PDF files
- new properties in extractor: ConsiderFontNames, ConsiderFontSizes, ConsiderFontColors, ConsiderVerticalBorders in CFG files
- header columns detection (when AutoAlighHeaderToColumns = true) improved
- .DetectLinesInsteadOfParagraphs replaced with new .LineGroupingMode to control how lines are merged into paragraphs
- IMPORTANT: PDF To XML fixes a long-standing issue with an incorrect Y coordinate for text objects (it was pointing to the bottom left instead of the top left)
- .TableXMinIntersectionRequiredInPercents and .TableYMinIntersectionRequiredInPercents properties added
- C++ source code sample added
- XML Extractor fixes missing empty columns in PreserveFormatting=true mode
- Minor fixes in colors in some PDF files
- support for multiple OCR languages added
- PDF Multitool GUI: adds Copy to Clipboard button to TXT, CSV, XML and raster renderer dialogs
- XLSExtractor: adds PageToWorksheet property to enable/disable generation of separate worksheets per page.
- new .TextEncodingCodePage property
- PDFViewerControl: adds ValidateContextMenu allowing user to add custom items to context menu
- PDF Viewer control: adds properties ShowTextObjects, ShowImageObjects, ShowVectorObjects.
- XMLExtractor now adds "OCRConfidence" attribute for recognized text
- PDF/A checking functionality (in beta)
- improving controls and text checking and alignment according to the original layout. The issue was caused by the shift of Y coordinates in controls while parsing: that was incorrect. The correct way is to shif...
- XML Extractor updated: now produces tag for checkboxes and text fields
- changed using of current directory to temp directory.
- checkboxes, radioboxes, editboxes, comboboxes are better supported
- now allows partial trust callers.
New in Bytescout PDF Extractor SDK 5.80.1781 (Jan 29, 2015)
- PDF to XML, PDF to CSV, PDF to Text functionality improved
- OCRMode now provides 9 modes
- .DetectLineInsteadOfParagraph now works much better. Set it to False to capture multiline text in table cells!
- PDF controls support improved
- FDF and XFDF data extraction added
- Table detection improved to support multiline text in cells and tables with absent rows
- beta version of PDF/A validator added
- minor fixes and improvements
New in Bytescout PDF Extractor SDK 4.00.1487 (Jun 3, 2014)
- improved pdf to text, pdf to csv, pdf to xml
- issue with extraction area fixed
- Improved Unicode handling
- new .ContentType to check if PDF is PDF, Portfolio or XFAForm
- new properties: Unwrap, ExtractionAreaUsageMode
- new AttachmentInfo class to obtain details about attachment
- new XFA Form XML extraction support (see XFAFormExtractor and XFAFormToXML samples)
- new ZuGFeRD PDF support added
- Multithreading performance improved
- Licensing updated: Now Licensing is per developer
- new "match whole word" parameter to TextExtractor.Find()
- improved XLS and XLSX output
New in Bytescout PDF Extractor SDK 3.30.1240 (Dec 11, 2013)
- improved support for old formats PDF files
- image flipping issue in some PDF files fixed
- improved text rendering in PDF files
- minor bug-fixes
New in Bytescout PDF Extractor SDK 3.20.1209 (Dec 11, 2013)
- table detection was not returning proper coordinates for 2nd and further tables, fixed
- minor source code samples updates
- DocumentSplitter now works with multipage TIF files
- minor bug-fixes
New in Bytescout PDF Extractor SDK 3.20.1200 (Dec 11, 2013)
- minor rotated text issues fixed
- table detection was not returning proper coordinates, fixed
- minor bug-fixes
New in Bytescout PDF Extractor SDK 3.20.1179 (Oct 23, 2013)
- pdf to text and pdf data extraction improved
- new .AutoAlignColumns (true by default) property to automatically align cells to the header column or not (switching this setting will help if you are getting some shifted cells)
- new DocumentRotator class to rotate pages in PDF documents
- new ExtractRawImages property in Images Extractor to choose between extracting raw images and images with rotation and transformation applied
- improved support of PDF files with rotated objects and pages
- new source code sample "Find Keyword And Extract Page" showing how to extract a page found by a keyword
- Images Extractor: SetExtractionArea() method added to define a rectangle area to extract images from
- improved Splitting Pages example
- improved pages extraction from PDF
- new RemoveUnusedResources method to remove unused resources from PDF to reduce file size
- minor bug-fixes and improvements
New in Bytescout PDF Extractor SDK 3.20.1100 (Oct 23, 2013)
- new method: DocumentSplitter.Split(sourcefile, splitPages) to extract multiple ranges of pages from the same PDF file
- minor bug-fixes in pdf to text engine
New in Bytescout PDF Extractor SDK 3.20.1092 (Aug 14, 2013)
- pdf to text minor functionality fixes
- x64 installer improvements
- minor fixes for error messages
- PDFDocument.Dispose() no longer disposes the source PDF stream if that stream was supplied by the user (the user should dispose it)
- improved PDF format support
- minor bug-fixes
New in Bytescout PDF Extractor SDK 3.20.1075 (Jul 15, 2013)
- improved PDF To CSV, PDF To XLS, PDF To XML extraction
- improved PDF reading speed and stability
- minor bug-fixes
New in Bytescout PDF Extractor SDK 3.10.899 (May 17, 2013)
- improved pdf to text conversion
- improved PDF reading support
- more Visual Basic .NET, C# and VBScript source code samples added
- documentation updated
New in Bytescout PDF Extractor SDK 3.00.864 (Apr 13, 2013)
- improved PDF extraction support
- improved PDF handling
- pdf splitting and merging: new property DocumentSplitter.OptimizeSplittedDocuments to optimize PDF files after splitting (may decrease file size when needed)
- improved PDF fonts handling
- demo utility updated
- source code samples updated to run on any .NET framework by default
- minor bug-fixes
New in Bytescout PDF Extractor SDK 3.00.825 (Mar 14, 2013)
- improved pdf to text, pdf to csv
- demo utility PDF Viewer reworked and updated for better UI experience
- minor improvements and fixes in PDF support
- improved stability when working with PDF files that contain high-density vector graphics
- improved support for indexed color palettes
- improved embedded fonts rendering
- better support for Unicode fonts
- new .Version property to read exact version of the dll
- minor updates and improvements
New in Bytescout PDF Extractor SDK 2.50.708 (Dec 13, 2012)
- PDF data extraction speed improved
- Windows 8 support improved
- PDF images and colors support improved
- PDF to csv, PDF to xml, PDF to xls/xlsx now skip leading rows if they are empty
- pdf text search now works better and provides more intelligent support for regular expressions
- ActiveX support and installation improved and now provides single batches to run on Windows x86/x64 for Windows XP to 8 Pro
- new property: .ExtractShadowLikeText to enable/disable extraction of shadowed text (text used as an effect to create visual shadows)
- minor bug-fixes and improvements
New in Bytescout PDF Extractor SDK 2.40.650 (Nov 2, 2012)
- improved support for Unicode text extraction
- improved support for PDF/A pdf files
- fixed white stripes appearing when multiple images are combined
- data extraction internal optimizations
- improved support for 8 bit images inside PDF
- vector drawings improved to provide better support for multiple small objects
- Color representation in images with indexed colors fixed
- Type2 fonts support improved
- Improved support for embedded fonts in PDF produced by Ghostscript engine
- CCITT image compression related issues fixed
- LZW compressed PDF support improved
- improved support for shading objects
- improved PDF fonts support
- improved support for PDF with 4 bit images
New in Bytescout PDF Extractor SDK 2.30.594 (Sep 24, 2012)
- PDF data extraction improved
- memory and speed optimizations
- fixed issue with empty data returned when extracting from some PDF files
- improved images extraction support (more image encoding variations are supported)
- minor updates in examples
- minor bug-fixes
New in Bytescout PDF Extractor SDK 2.30.568 (Sep 24, 2012)
- pdf to text conversion quality improved
- multithreading usage stability has been improved
- hanging issue on some PDF files fixed
- PDF Extractor SDK: updated sample for StructuredExtractor (previously known as TableExtractor interface)
- minor fixes and improvements
New in Bytescout PDF Extractor SDK 2.20.539 (May 23, 2012)
- improved stability
- demo utility improved
- important security fixes
New in Bytescout PDF Extractor SDK 2.20.525 (May 23, 2012)
- improved speed (up to 2x faster on some documents)
- Tables detection improved
- updated PDF Viewer utility
- improved support for structured text extraction (CSV and XML data extraction)
- minor bug-fixes
New in Bytescout PDF Extractor SDK 2.20.458 (Feb 9, 2012)
- Minor fixes in TableDetector class (.TableDetectionMinNumberOfColumns and .TableDetectionMinNumberOfRows were working incorrectly)
- Improved text extraction for PDF files generated from text files
- Improved support for PDF files produced by Adobe Acrobat
- PDF Viewer: CSV, XML and Text extractor forms updated to show .PreserveFormattingOnTextExtraction option
- Minor fixes in .NET 4.0 assemblies
- Renderer SDK adds /Visual Basic/PDF To BMP using streams/ sample
- Improved support for PDF with forms objects
- Improved leading spaces format detection in text extraction
- .SetExtractionArea() added to define an area on a page to work with in PDF Renderer SDK
- Improved fonts information reading support in PDF files
- New .PageSeparator property in TextExtractor allowing to define a separator String for pages if you need one
- Fixed issue with indexed colorspaces in PDF
- Improved PDF format support
New in Bytescout PDF Extractor SDK 2.20.415 (Dec 29, 2011)
- PDF Extractor SDK: minor update for PDF to XLS sample
- rendering: improved fonts support
- text extraction with formatting improved
- new source code sample to show how to save extracted text to a stream
- performance optimized and pdf processing speed improved
- improved support for PDF format
New in Bytescout PDF Extractor SDK 2.20.396 (Dec 12, 2011)
- fixed issues with CSV, XML and XLS extraction on long tables
- PDF Viewer now provides ability to turn on/off text formatting support on extraction
- PDF support improved
- minor bug-fixes
New in Bytescout PDF Extractor SDK 2.20.392 (Dec 12, 2011)
- NEW table detection implemented, see new Bytescout.PDFExtractor.TableDetector interface and source code samples in /Find Table And Extract As CSV/ sub-folder in examples
- NEW regular expressions support for text search in TextExtractor (see .RegexSearch property)
- Text search functionality improved
- minor bug-fixes
New in Bytescout PDF Extractor SDK 2.10.303 (Dec 12, 2011)
- NEW: DocumentMerger and DocumentSplitter interfaces and classes to merge and split PDF documents
- improved support for PDF documents
- PDF processing speed increased
- minor bug-fixes
New in Bytescout PDF Extractor SDK 2.10.276 (Dec 12, 2011)
- NEW: AttachmentExtractor interface to extract file attachments and embedded files from PDF (see /Examples/Extract Attachments/ for sample source code)
- NEW: XLSExtractor interface to extract tables from PDF as XLS and XLSX Excel files (including font formatting)
- improved text extraction functionality
- improved output image quality
- improved support of Unicode text
- improved support of damaged PDF files (not hanging on damaged files anymore)
New in Bytescout PDF Extractor SDK 2.00.228 (Dec 12, 2011)
- CSVExtractor: SeparationSymbol and QuotationSymbol properties were added
- TrimValues property for CSVExtractor and XMLExtractor: turned on by default to trim detected cell values automatically
- Default properties for CSV extraction improved
- fixed incorrect default space ratio in text extractor: it is now 0.4; the previous value of 1.2 caused some words to be joined into a single one
- TextExtractor.detectNewColumnBySpacesRatio renamed to .SpaceRatioBetweenWords property
- PDFViewer now shows options dialog to adjust SpaceRatioBetweenWords if needed
- minor bug-fixes
New in Bytescout PDF Extractor SDK 2.00.217 (Dec 12, 2011)
- CSV and XML extraction speed greatly improved
- CSVExtractor and XMLExtractor classes add new .DetectNewColumnBySpacesRatio property: use this property to control space between detected columns of text
- XML and CSV Extractor adds .SkipCellsWithEmptyValues property (true by default to skip cells with empty values)
- PDF Viewer now shows extraction options dialog for XML and CSV export functions
- PDF To CSV to XLS source code sample added
- PDF To CSV\Delphi\ source code sample added
- minor bug-fixes and improvements
New in Bytescout PDF Extractor SDK 2.00.206 (Jun 8, 2011)
- Support for .NET 3.5 and .NET 4.0 added
- Delphi source code sample has been added
- Minor bug-fixes and improvements
New in Bytescout PDF Extractor SDK 2.00.186 (Jun 8, 2011)
- PDF processing speed increased up to 10 times
- Minor bug-fixes and improvements
New in Bytescout PDF Extractor SDK 1.10.168 (Jun 8, 2011)
- Support for password protected PDF documents improved (was not working properly in previous release)
- Minor bug-fixes and improvements
New in Bytescout PDF Extractor SDK 1.10.160 (Jun 8, 2011)
- XML comments are available now to show hints for methods, classes and properties in Visual Studio
- New property: .ExtractColumnByColumn (false by default), set to True to extract text column by column instead of line by line
- PDF Viewer freeware utility updated to feature "Extract Text (line by line)" and "Extract Text (column by column)" buttons
- Improved support for single-page PDF documents produced by Acrobat Distiller software
- Clipping issues were fixed
- Fixed hanging on some broken PDF documents
- Improved text decoding support
- Minor bug-fixes
New in Bytescout PDF Extractor SDK 1.10.150 (Jun 8, 2011)
- PDF files support improved
- Now handles PDF files from Google Docs without errors
- Minor bug-fixes
New in Bytescout PDF Extractor SDK 1.10.144 (Jun 8, 2011)
- Now works with secured documents (provide password if needed in .Password property)
- Minor bug-fixes and improvements
- Updated GUI demo application
New in Bytescout PDF Extractor SDK 1.10.121 (Jun 8, 2011)
- PDF to CSV extractor added
- PDF to XML extractor added
- Support for invisible text extraction added
- Minor bug-fixes and improvements