k2pdfopt Changelog

What's new in k2pdfopt 2.55

Dec 27, 2023

ENHANCEMENTS:
Windows version compiled with MinGW, gcc v13.2.0 on Windows 11 core i9-9900 CPU.
Mac arm-64 version compiled with clang 17 on macOS 14 Sonoma.
Mac x64 version cross-compiled on M1-CPU Mac mini with clang 17 on macOS 14 Sonoma.
Linux version compiled with gcc 13.2.1 on Fedora 39 (virtual box).
Library updates: Tesseract 5.3.3, Leptonica 1.83, MuPDF 1.23.7, FreeType 2.13.2, libpng 1.6.40, jpeg-turbo 2.1.5.1, openjpeg 2.5.0, jbig2dec 0.20, zlib 1.3.
BUG FIXES:
Bookmarks / TOC page numbering fixed in wmupdf.c (broke in v2.54 due to a change in the MuPDF library).

New in k2pdfopt 2.53 (Jul 19, 2020)

New in k2pdfopt 2.52 (Jun 12, 2020)

NEW FEATURES:
Automatically downloads Tesseract v4.0 training files from github.
Use -y for no prompting. See -ocrlang command usage for full details.
Added new -go (grid order) option to specify the order in which
areas in gridded pages are processed. Grid area processing also
defaults to right-to-left if -r is specified.
ENHANCEMENTS:
Prints Tesseract init info at the beginning of each file conversion
now--in the GUI also.
Redirects Tesseract library debug statements to a log file. Prints
them out if it cannot initialize correctly.
Re-edited a lot of the -mode usage descriptions.
CMakeLists.txt files updated.
Added Kobo Clara HD, Kobo Forma, and Kobo Libra H2O to device list
(using preliminary resolution numbers).
Windows version now compiled with gcc 9.3.1 on a core i9-9900.
MS WINDOWS GUI
Tesseract language selection and download fully supported.
Added three more preset buttons for a total of seven.
Added "View" button to GUI to view PDF source file.
BUG FIXES:
Fixed issue where OCR layer placement from an already generated
OCR layer was not correct if a document scale factor not equal
to 1 was used (e.g. -ds 2).
Hopefully improved detection of out-of-family small rows.
Modified textrows_remove_small_rows() in textrows.c and also added
rhmin option to remove any row w/text height less than a certain value.
https://www.mobileread.com/forums/showthread.php?p=3693210#post3693210
https://www.mobileread.com/forums/showthread.php?p=2781815#post2781815
Fixed some issues with -fr option. Added -ddr option (-ddr- to turn
off double-row detect) to prevent tall figures from potentially
getting broken up inadvertently. Modified k2proc.c to
"rotate the trow" structure near the end of bmpregion_add() function.
https://www.mobileread.com/forums/showthread.php?p=3549163#post3549163
Fixed issue where the output margins weren't exactly right in the
case of -mode trim. See k2settings_adjust_devdims_for_om() function
in k2settings.c. (PM from Tex2002ans on MR.)
Added textrows_remove_defects() function to bmpregion_find_textrows()
in bmpregion.c. This helped fix an issue where rows were not
getting selected properly in a document due to small scanning
defects.
Fixed implementation of -bp <gap>. This had been broken for some
time. It was getting overridden by other features. See mods to
k2proc.c and k2master.c.
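
The minimum-row-height filtering mentioned in the bug fixes above can be pictured with a minimal sketch. This is not the code in textrows.c; the TextRow class, the pixel units, and the remove_small_rows() name are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextRow:
    """Hypothetical stand-in for one detected text row (pixel coordinates)."""
    top: int     # first pixel line of the row
    bottom: int  # last pixel line of the row (inclusive)

    @property
    def height(self) -> int:
        return self.bottom - self.top + 1


def remove_small_rows(rows, min_height):
    """Keep only rows whose height is at least min_height pixels."""
    return [r for r in rows if r.height >= min_height]


rows = [TextRow(0, 19), TextRow(25, 27), TextRow(30, 50)]
kept = remove_small_rows(rows, 5)  # the 3-pixel row is dropped
```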

New in k2pdfopt 2.51 (Jan 5, 2019)

New in k2pdfopt 2.50 (Dec 28, 2018)

New in k2pdfopt 2.42 (May 22, 2017)

New in k2pdfopt 2.41 (May 22, 2017)

New in k2pdfopt 2.40 (Jan 30, 2017)

New in k2pdfopt 2.36 (Dec 26, 2016)

New in k2pdfopt 2.35 (Oct 26, 2016)

ENHANCEMENTS:
Also, the binaries are compressed with a newer version of UPX which
is compatible with Mac OSX 10.12 Sierra.
Windows binaries compiled with GCC 6.2.0 (MinGW).
Compiled with the latest versions of libpng (1.6.25), freetype (2.7),
turbo JPEG (1.5.1).
I tried newer versions of MuPDF (v1.9a and v1.10 pre-release), but
they broke more things than I was comfortable with, so I've stayed
with MuPDF v1.8 for this release.
NEW FEATURES:
Added -jfc- option to prevent trying to join figure captions to the
figures.
http://www.mobileread.com/forums/showthread.php?p=3342105#post3342105
Added new conversion mode, -mode concat, which keeps the output at
the same dimensions as the source file and concatenates crop-boxes
(red boxes) together--as many as can fit on each page without breaking
them apart.
Added option -f2p -3 to support -mode concat.
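
As a rough illustration of the -mode concat idea (fit as many crop boxes on a page as possible without splitting any of them), here is a minimal greedy sketch. It is an assumption about the general approach, not the actual k2pdfopt logic; pack_boxes() and the use of box heights in arbitrary units are hypothetical.

```python
def pack_boxes(box_heights, page_height):
    """Greedily group box heights into pages without splitting a box.

    A box taller than the page still gets a page of its own
    (an assumption made for this sketch).
    """
    pages = []
    current = []
    used = 0
    for h in box_heights:
        # Start a new page when the next box would not fit on this one.
        if current and used + h > page_height:
            pages.append(current)
            current, used = [], 0
        current.append(h)
        used += h
    if current:
        pages.append(current)
    return pages


pages = pack_boxes([300, 400, 500, 200, 900], 1000)
```

With these example values the boxes are grouped as [300, 400], [500, 200], and [900].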
BUG FIXES:
No longer crashes in native PDF output mode if there is no output.
Writes more informative message to screen if output file not written.
Makes sure output file can be opened for writing before proceeding
with the conversion. Warns user if file cannot be opened.
http://www.mobileread.com/forums/showthread.php?p=3343367#post3343367
Correctly processes blank/empty pages in .djvu files.
http://www.mobileread.com/forums/showthread.php?p=3350691#post3350691
The -title option now substitutes the file name for %s or %b, like
the -o option.
http://www.mobileread.com/forums/showthread.php?p=3389292#post3389292
The -grid overlap percentage is more precise now.
Warning message to use -fc- with -odpi, -fs, or -mag. Not sure this
is the best way--should I just turn off -fc if those are specified?
Reported in 22 April 2016 e-mail.
Also: http://www.mobileread.com/forums/showthread.php?p=3354548#post3354548
Wide-char (UTF-8) DJVU file names now work.
http://www.mobileread.com/forums/showthread.php?p=3351085#post3351085
Blank pages no longer cause an improper conversion with -mode trim.
E-mail from 17 Mar 2016.
Clarified usage of -mode copy, explaining about gamma and contrast
settings (how they are not reset to 1 with -mode copy).
In the function where the 2-column divider is detected, a special
call is made to find_textrows() so that figure caption joining is
disabled unless -jfc+ is specified. This helps 2-column detection
work more reliably.
http://www.mobileread.com/forums/showthread.php?p=3351808#post3351808
http://www.mobileread.com/forums/showthread.php?p=3342105#post3342105
MS WINDOWS GUI BUG FIXES:
The GUI now correctly selects the "crop" conversion mode.

New in k2pdfopt 2.34b (Mar 28, 2016)

New in k2pdfopt 2.33a (Oct 6, 2015)

New in k2pdfopt 2.33 (Oct 6, 2015)

NEW FEATURES:
Compiled with GCC v5.2.0 and MuPDF v1.7a (released May 7, 2015).
The MuPDF upgrade involved modifying a significant amount of the MuPDF interface code in the willus library since Artifex changed the APIs on several functions, but the bulk of the logic did not change.
I uncovered a bug in the pdf_dict_del() function as well (reported).
The -i option displays information about the source PDF file. Added to MS Windows GUI also.
Added -fr option to rotate wide-aspect-ratio figures to landscape.
Added Kindle Paperwhite 3 (2015 release) and Pocketbook Basic 2 to dev list (from http://www.mobileread.com/forums/showthread.php?t=253579)
Smarter sorting of red regions on a multiple-column page. See pageregion_sort() function in pageregions.c.
New -ibox option has same format as -cbox, but these boxes are ignored by k2pdfopt--they are "whited out" in the source file. For native output, the contents may still be visible in the output.
The -neg option now attempts to only negate text passages to white on black and to leave figures alone. Use -neg+ to negate everything.
Added option -ehl to erase horizontal lines in the document. Works exactly like the -evl option.
Added -author and -title options to specify the author and title of the output PDF.
Added -px option to exclude a set of pages, e.g. -px 4,7,10-20.
User can use color markings to tell k2pdfopt where to apply page breaks to the output file.
The -? option can now be followed by a (wildcard) matching string to show the usage of a particular option, e.g. -? -ws.
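
To make the -px example above concrete, the sketch below expands a page list such as 4,7,10-20 into a set of page numbers. It only handles the two forms shown in that example (single pages and closed ranges); parse_page_list() is a hypothetical helper, not the parser k2pdfopt uses.

```python
def parse_page_list(spec):
    """Expand a list like '4,7,10-20' into a set of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            pages.update(range(int(first), int(last) + 1))
        else:
            pages.add(int(part))
    return pages


excluded = parse_page_list("4,7,10-20")
# Pages of a 12-page document that would still be processed.
kept = [p for p in range(1, 13) if p not in excluded]
```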
NEW MS WINDOWS GUI FEATURES:
Crop margins replaced by three crop-box entries.
These entries can act as either -cbox or -ibox.
Added an "INFO" button to show information on the selected PDF file.
If a wildcard is specified for a PDF source file on the command line (e.g. *.pdf), and the GUI is launched, all matching entries are placed into the GUI file list.
BUG FIXES:
With notes options turned on (-nl / -nr), k2pdfopt will still search for multiple columns if no notes are found on the page. In addition, the -crgh option now more directly affects column divider finding.
See textrows_remove_small_rows() call in bmpregion_find_multicolumn_divider().
Fixed multiple file select (broke when I converted to wide chars in v2.30).
Modified bmpregion_hyphen_detect() to be less strict about rejecting hyphens that aren't exactly centered. Also modified calculation of lcheight in bmpregion_calc_bbox()--see the function.
The k2pdfopt web site and help pages work again from the help menu.
Turned off some debugging text from the bmp_autocrop2 function in k2bmp.c.
Not really a bug fix, but the command-line help is now shown in Courier New in MS Windows (a mono-spaced font).
In info_update() in wmupdf.c in the willus library, I check to see if I can resolve the Info dictionary. This checks to see if it can be parsed correctly. If not, I discard the dictionary. This was causing a bug that a user submitted to me in an e-mail on 15 April 2015. The user had a PDF file with a corrupt "Info" dictionary.
IWPDFOUTLINE structures correctly freed.
MuPDF v1.7 stores ligatured characters differently than previous versions in its internal character arrays, so I had to compensate for this.

New in k2pdfopt 2.32 (Mar 7, 2015)

New in k2pdfopt 2.31 (Dec 29, 2014)

New in k2pdfopt 2.30 (Dec 5, 2014)

New in k2pdfopt 2.21 (Jul 30, 2014)

New in k2pdfopt 2.20 (Jul 26, 2014)

NEW FEATURES:
Added k2ocr_ocrwords_get_from_ocrlayer() and supporting functions to more intelligently extract all of the OCR-layer text from a major ("red box") region (rather than parsing for the words graphically, the OCR layer is queried for any words which are within the box). This should eliminate the need to use the -ocrcol option on PDF files which already have their own text layer.
There is a new option optimized for PDFs that have notes in the left or right margins. This option (-nl for notes in the left margin or -nr for notes in the right margin) tells k2pdfopt to look for notes and intersperse them with the main text. The notes can even alternate, e.g. left margin for odd pages and right margin for even pages would be -nlo -nre.
The word spacing (-ws) option now defaults to -0.20 (the old default was 0.375). When a negative value is given, an automatic word spacing detection algorithm is now used to break apart words in lines. The algorithm will try to choose a natural word spacing value, with the minimum allowed being the absolute value of the setting (e.g. 0.2 for the default). If you want k2pdfopt to aggressively break lines (e.g. break apart long words if they don't fit on a line), use a smaller absolute value, e.g. -ws -0.01.
You can use a positive value for the older style of line breaking, and the lines are only broken where a gap exceeds that fraction of the height of a lower-case 'o'. There is also a new Windows GUI checkbox for this option.
Entire rows of text within the OCR layer can either have the words within rendered individually (the default and original behavior of k2pdfopt), or the entire row can be rendered at once with spaces used between each word. This may improve the text selection behavior for certain readers. The option that controls this is -ocrsp:
-ocrsp puts one space between each word in the row.
-ocrsp+ puts multiple spaces between each word in the row in order to better position the words since k2pdfopt does not typically match the exact font used by the source document when doing the OCR layer (it always uses Arial).
-ocrsp- reverts to the original (default) behavior.
The -m and -om options now can use units, e.g. pixels, inches, cm, and "s" for page/screen size. In addition, -m can use the other units used by -cbox, -w, and -h ("t" and "x").
The new "x" unit of measure corresponds to the OCR Layer bounding box, e.g. -cbox 0x,0x,1x,1x will correspond to a crop box that matches the bounding box of the OCR layer. This can be used by the -w, -h, and -m options as well. See -h usage.
New option -to[-] for text-only output, removes all figures as determined by a height limit (see -jf option). Use with -bp m to avoid text selection issues if using in conjunction with native output.
If -wt+ is specified for the white threshold, all pixels >= the specified value will be painted pure white (255).
While not a perfect work-around, large, stylized first letters which frequently begin a book chapter (typically the height of 2 or 3 normal-sized text rows) are now detected when wrapping text lines so that the lines adjacent to them are more-or-less correctly wrapped. Still needs improvement.
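
The -ws behavior described above can be sketched roughly as follows. A positive value is used directly as the gap threshold (a fraction of the height of a lower-case 'o'); for a negative value, the absolute value acts as a floor under an automatically chosen spacing. The automatic detection itself is not described here, so natural_spacing is a hypothetical input standing in for it, and word_breaks() is an illustrative name only.

```python
def word_breaks(gaps, o_height, ws, natural_spacing=None):
    """Return indices of gaps wide enough to count as word breaks.

    gaps            -- gap widths in pixels between consecutive glyph groups
    o_height        -- height in pixels of a lower-case 'o'
    ws              -- the -ws setting (negative means automatic mode)
    natural_spacing -- hypothetical auto-detected spacing, as a fraction
                       of o_height (stand-in for the real algorithm)
    """
    if ws >= 0:
        threshold = ws
    else:
        floor = abs(ws)  # minimum allowed spacing in automatic mode
        if natural_spacing is None:
            threshold = floor
        else:
            threshold = max(natural_spacing, floor)
    return [i for i, g in enumerate(gaps) if g > threshold * o_height]


gaps = [2, 9, 3, 12]
old_style = word_breaks(gaps, 20, 0.375)   # fixed threshold of 7.5 pixels
aggressive = word_breaks(gaps, 20, -0.01)  # floor of 0.2 pixels
auto = word_breaks(gaps, 20, -0.2, natural_spacing=0.5)
```

A smaller absolute value lowers the floor, which is why -ws -0.01 breaks lines more aggressively in this sketch.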
MS-WINDOWS GUI ENHANCEMENTS:
If the custom buttons are not used, a "2-column" and a "Fit Width" button are automatically assigned.
New MS-Windows GUI check box for -bp m option (Avoid text select overlap).
New MS-Windows GUI check box for defects (sets -de 1.5).
New MS-Windows GUI check box / text box for line break setting (-ws option)
The last settings (other than the custom button presets) are remembered between sessions (stored in the K2PDFOPT_CUSTOM0 env var). (4-15-14 e-mail)
There is now a "Restore Defaults" button since k2pdfopt remembers its last settings.
New option for GUI: -rls[+|-]. Forces/disables restoration of last settings from K2PDFOPT_CUSTOM0 environment variable.
Environment variables related to the MS Windows GUI are only read and set through Windows calls (not through getenv() and putenv())-- this was causing problems in trying to clear them.
Command-line options that don't impact the GUI are put into the "additional options" box upon launch.
For file overwriting, the user is now given a "Yes to All" option and a "No to all" option.
There are two new menu options--to save and restore the settings (stored in environment variables) to and from a file. This is done in the k2gui_save_settings_to_file() and k2gui_restore_settings_from_file() functions in k2gui.c.
If the output file cannot be opened (e.g. because another application already has it open), a message box is shown informing the user rather than just quitting the program.
BUG FIXES:
Tabbing between crop margin text fields in the MS Windows GUI keeps the entire text field selected even when you change a value before tabbing.
Fixed bug in window positioning at startup.
Fixed bug where -ds did not get properly applied in native mode.
Fixed bug where Tesseract was incorrectly initializing languages that do not have CUBE/COMBINED data (e.g. Russian). If that initialization fails, it now tries CUBE-only (and then no CUBE at all). This required a couple of minor mods to the Tesseract library itself, which are included in the latest k2pdfopt source code distribution.
SOURCE CODE MODIFICATIONS:
The bmpregion_add() function and some others use a new parameter data structure rather than having so many arguments passed to them (ADDED_REGION_INFO).
Copied web text at top of main k2 page to k2pdfopt.c intro.

New in k2pdfopt 2.18 (Jun 17, 2014)

New in k2pdfopt 2.17a (Jun 3, 2014)

New in k2pdfopt 2.17 (May 19, 2014)

New in k2pdfopt 2.16 (May 5, 2014)

New in k2pdfopt 2.15 (Apr 9, 2014)

New in k2pdfopt 2.14 (Apr 9, 2014)

New in k2pdfopt 2.13 (Apr 9, 2014)

New in k2pdfopt 2.12 (Apr 9, 2014)

New in k2pdfopt 2.11 (Apr 9, 2014)

New in k2pdfopt 2.10 (Apr 9, 2014)

NEW FEATURES:
The PDF "Outlines" tree (often called "bookmarks" by PDF viewers) that helps you navigate the PDF file and is usually shown in the left pane of the PDF viewer is now preserved in the converted file. Or you can create your own bookmarks from a simple text file if your PDF source file doesn't have one (or if you want to change it).
See the -toc, -toclist, and -tocsave command-line options. (toc = Table of Contents.) Destination page breaks are forced at outline anchor pages by default (see -bp option).
A new -cbox option allows you to specify a crop box to be applied to each page. You can specify more than one, and each separate crop box will be rendered to a different output page, similar to the way the -grid option works. See -cbox in the command usage.
Using -mode crop with -cbox, you can crop a source PDF file to a destination PDF file. You can specify different crop boxes for even and odd pages, as well.
The -bpl option now allows you to specify a list of source pages where destination page breaks will be forced.
Three new modes: -mode trim causes the source page to be trimmed and the destination to be sized to the trimmed source.
-mode fitpage is similar, but squeezes the trimmed source page into the specified device output screen size.
-mode crop is a complement to the -cbox option and causes each cropped box to be placed on a new page the size of the cropped box.
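
One way to picture -mode fitpage is as a single scale factor that makes the trimmed page fit inside the device screen. The sketch below assumes the aspect ratio is preserved, which is an assumption for illustration rather than something stated above; fit_scale() and the example dimensions are hypothetical.

```python
def fit_scale(src_w, src_h, dev_w, dev_h):
    """Largest uniform scale at which a src page fits inside the device."""
    return min(dev_w / src_w, dev_h / src_h)


# Example: a 1200 x 1600 pixel trimmed page on a 560 x 735 pixel screen.
scale = fit_scale(1200, 1600, 560, 735)
scaled_size = (round(1200 * scale), round(1600 * scale))
```

In this example the height is the limiting dimension, so the scaled page fills the screen height and leaves a small margin in width.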
ENHANCEMENTS:
Windows versions are compiled with gcc 4.8.2.
The Win64 binary is now compressed with UPX 3.91w which finally is able to compress the Win64/PE format.
BUG FIXES:
In native output, consecutive streams now delimited by white space.
Pages with no "/Contents" entry are correctly handled.
Re-wrote masterinfo_break_point() to make use of bmpregion_find_textrows() so that decisions on where to break pages in the "fitwidth" mode should be more consistent and also will be affected by the -gtr option.
Removed last vestiges of -pi option (interactive menu 'w' option was incorrectly still using it).
The vert_line_erase() function in k2bmp.c now correctly handles the cbmp pointer when it is an 8-bit bitmap.
Fixed a flow problem in k2file.c (k2pdfopt_proc_one() function) which was causing the GUI preview not to work with -mode copy.
The textrows_remove_small_rows() function no longer includes figures (REGION_TYPE_FIGURE) when doing statistics on the row heights.

New in k2pdfopt 2.03 (Apr 9, 2014)

New in k2pdfopt 2.02 (Apr 9, 2014)

New in k2pdfopt 2.01 (Apr 9, 2014)

New in k2pdfopt 2.00 (Apr 9, 2014)

MAJOR NEW FEATURES:
Added a GUI for the MS Windows version. In MS Windows, k2pdfopt now has a dual-mode operation: it runs as a GUI or a console program depending on the situation and command-line arguments. Use -gui to force the GUI, -guimin to minimize the GUI, and -gui- for no GUI.
Window size / position stored in K2PDFOPT_WINPOS env. var.
Custom presets stored in K2PDFOPT_CUSTOM env. vars. (=1-4).
For PDF documents that contain text (not scanned), k2pdfopt now uses the MuPDF text analysis functions so that even on re-flowed documents (i.e. not using native PDF output) the text can be included from the source for searching and highlighting without resorting to OCR.
This is "Virtual OCR". You can explicitly turn it on with -ocr m (m for MuPDF), though it is on by default since it adds very little time to the PDF file processing. It is not necessary if you are using native PDF output (-n).
OTHER NEW FEATURES:
Windows 64-bit version compiled with gcc 4.7.3 (on core i5-670).
The optimized Windows 32-bit version still uses gcc 4.6.3 (it does not compile correctly in gcc 4.7.3 on my MinGW platform due to some issue with the Tesseract library). The generic Windows 32-bit version uses gcc 4.7.3.
Compiled with latest versions of MuPDF (v1.3), OpenJPEG, FreeType,
Turbo-JPEG, PNG, and Z libraries.
A new option, -bmp, will write out a preview bitmap of the specified page, e.g. -bmp 2 will write out the 2nd converted page to the file k2pdfopt_out.png.
New option -sp simply echoes the page counts of the source files and exits.
OCR text can be written to a separate text file now with the -ocrout option. E.g. -ocrout %s.txt will write the OCR from myfile.pdf to myfile.txt.
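
The -ocrout example above (%s.txt turning myfile.pdf into myfile.txt) amounts to substituting the source name without its extension. A minimal sketch, assuming %s keeps any directory part of the source path; expand_output_name() is a hypothetical helper, not k2pdfopt's own code.

```python
import os


def expand_output_name(template, source_path):
    """Replace %s in template with source_path minus its extension."""
    base, _ext = os.path.splitext(source_path)
    return template.replace("%s", base)


ocr_file = expand_output_name("%s.txt", "myfile.pdf")
```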
SOURCE CODE RE-WRITE:
Changed the way a number of data structures (BMPREGION, TEXTROWS) store information, and eliminated the BREAKINFO structure.
The way gaps are placed between regions of text has been substantially re-done.
BUG FIXES:
In k2pdfopt_settings_set_margins_and_devsize(), doesn't flush the master bitmap unless the width changes.
Mode -f2p -2 (put every new red-box region onto a new page) no longer add is exactly the region size.
When expected numeric arguments on the command line don't occur, that parameter is not set, a warning is printed, and the non-numeric argument is processed as if it were the next argument on the command line.
Modified tessedit.cpp in Tesseract library to automatically detect if .cube files are present so that it uses them even for multiple language specification. Correspondingly modified ocrtess_init() in willus library.
v1.65 was crashing after not finding the Tesseract language file and switching to GOCR: This is fixed (tesseract status now stored in global static variable in k2ocr.c).
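
The handling of missing numeric arguments described in the bug fixes above can be sketched as a small parsing loop. This is only an illustration of the stated behavior (skip the option, warn, then treat the non-numeric token as the next argument); parse_args() and its return values are hypothetical and much simpler than the real command-line parser.

```python
import sys


def parse_args(argv, numeric_options):
    """Parse options that expect a number; fall through on non-numbers."""
    settings = {}
    others = []  # everything else (other options, file names)
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in numeric_options:
            value = argv[i + 1] if i + 1 < len(argv) else None
            try:
                settings[arg] = float(value)
                i += 2  # consume the option and its number
            except (TypeError, ValueError):
                print(f"warning: {arg} expects a number; not set",
                      file=sys.stderr)
                i += 1  # the next token is processed as its own argument
            continue
        others.append(arg)
        i += 1
    return settings, others


settings, others = parse_args(["-ds", "2", "-odpi", "file.pdf"],
                              {"-ds", "-odpi"})
```

Here -ds is set to 2, -odpi is left unset with a warning, and file.pdf is still picked up as an ordinary argument.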