uni2ascii



Contents

  1. Description
  2. Documentation
  3. Related Programs
  4. Details
  5. Downloads
  6. Environment
  7. Change Log
  8. Roadmap
  9. Bugs



Description

This package provides conversion in both directions between UTF-8 Unicode and more than thirty 7-bit ASCII equivalents. These include RFC 2396 URI format and RFC 2045 Quoted Printable format; the representations used in HTML, SGML, XML, OOXML, the Unicode standard, Rich Text Format, POSIX portable charmaps, POSIX locale specifications, and Apache log files; and the escapes used for including Unicode in Ada, C, Common Lisp, Java, Pascal, Perl, PostScript, Python, Scheme, and Tcl.

Such ASCII equivalents are useful when including Unicode text in program source, when debugging, and when entering text into web programs that can handle the Unicode character set but are not 8-bit safe. For example, Movable Type, the blog software, truncates posts as soon as it encounters a byte with the high bit set. However, if Unicode is entered in the form of HTML numeric character references, Movable Type will not garble the post.

It also provides ways of converting non-ASCII characters to similar ASCII characters, e.g. by stripping diacritics.

For example, here is the Chinese for regular expression in Unicode:

正規表達式
and here is the HTML hexadecimal numeric character reference output from uni2ascii:
&#x6B63;&#x898F;&#x8868;&#x9054;&#x5F0F;
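
As a rough illustration of what this output format looks like, here is a small Python sketch. It is not uni2ascii's own code, and the function name to_hex_ncr is hypothetical. It assumes that each non-ASCII character becomes "&#x", the code point in uppercase hexadecimal padded to four digits, and a semicolon; the padding and letter case that uni2ascii itself emits may differ.

```python
def to_hex_ncr(text: str) -> str:
    """Replace each non-ASCII character with an HTML hexadecimal
    numeric character reference; ASCII passes through unchanged."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):04X};"
        for ch in text
    )


# The five characters of the example above, written as escapes.
sample = "\u6b63\u898f\u8868\u9054\u5f0f"
print(to_hex_ncr(sample))  # prints &#x6B63;&#x898F;&#x8868;&#x9054;&#x5F0F;
```

Because the result is pure ASCII, it survives transport through software that is not 8-bit safe.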

The package consists of two programs: uni2ascii and ascii2uni.

Here is a list of the ASCII representations of Unicode known to me with indications of their usage.

The Unicode escapes handled include:

Microsoft-style HTML character entities and numeric character references that lack the final semicolon are converted with a warning message.
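
The lenient handling described above might look something like the following Python sketch. This is only an assumption about the general approach, not ascii2uni's implementation: the function name decode_hex_ncr_lenient and the wording of the warning are hypothetical, and only hexadecimal references are handled.

```python
import re
import sys

# The trailing semicolon is optional so that unterminated references match too.
_HEX_NCR = re.compile(r"&#[xX]([0-9A-Fa-f]+)(;?)")


def decode_hex_ncr_lenient(text: str) -> str:
    """Decode hexadecimal numeric character references, accepting
    references without a final semicolon but warning about them."""
    def repl(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            return match.group(0)  # not a valid code point: leave untouched
        if not match.group(2):
            print(f"warning: missing semicolon in {match.group(0)!r}",
                  file=sys.stderr)
        return chr(code_point)

    return _HEX_NCR.sub(repl, text)
```

Note the inherent ambiguity: without the semicolon, a reference followed directly by a letter in the range a-f or a digit cannot be told apart from a longer reference, which is one reason a warning is appropriate.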

The package can also be used to convert from one type of ASCII representation to another by passing through Unicode. For example, the pipeline:

ascii2uni -a U | uni2ascii -a J

will convert from \u-escapes (e.g. \u00e9) to RFC 2396 URI format (e.g. %C3%A9).
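
To make the two stages of this pipeline concrete, here is a Python sketch of the same idea: decode the escapes to Unicode, then re-encode in the target format. It is an illustration only, not the code of either program; the function name u_escapes_to_uri is hypothetical, and it assumes four-digit \u escapes with no surrogate pairs.

```python
import re
from urllib.parse import quote


def u_escapes_to_uri(text: str) -> str:
    # Stage 1 (the ascii2uni role): turn \uXXXX escapes into characters.
    decoded = re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    # Stage 2 (the uni2ascii role): percent-encode the UTF-8 bytes of
    # every non-ASCII character, leaving ASCII untouched.
    return "".join(ch if ord(ch) < 0x80 else quote(ch) for ch in decoded)


print(u_escapes_to_uri(r"caf\u00e9"))  # prints caf%C3%A9
```

The intermediate Unicode string is what lets any supported input format be paired with any supported output format.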

ascii2uni -a H | uni2ascii -a D

will convert HTML hexadecimal numeric character references to decimal numeric character references.

ascii2uni -a H | uni2ascii -a H -a Q

will convert HTML hexadecimal numeric character references to HTML character entities where equivalent character entities exist, and

ascii2uni -a M | uni2ascii -a H

will convert SGML hexadecimal numeric character entities to HTML.
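
The conversions between HTML reference styles described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the package's own logic: the function names are hypothetical, and the entity table is Python's html.entities.codepoint2name (the HTML 4 set), which may not match the table uni2ascii uses.

```python
import re
from html.entities import codepoint2name

_HEX_NCR = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def hex_to_decimal_ncr(text: str) -> str:
    """Rewrite &#xHHHH; references as decimal &#DDDD; references."""
    return _HEX_NCR.sub(lambda m: f"&#{int(m.group(1), 16)};", text)


def hex_to_named_entity(text: str) -> str:
    """Rewrite &#xHHHH; as a named entity where one exists;
    references with no named equivalent are left as they are."""
    def repl(match: re.Match) -> str:
        name = codepoint2name.get(int(match.group(1), 16))
        return f"&{name};" if name else match.group(0)

    return _HEX_NCR.sub(repl, text)


print(hex_to_decimal_ncr("caf&#x00E9;"))   # prints caf&#233;
print(hex_to_named_entity("caf&#x00E9;"))  # prints caf&eacute;
```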

uni2ascii can also replace non-ASCII characters with approximate ASCII equivalents. For example, it can replace stylistic variants (e.g. bold-face) with their plain counterparts, or accented characters with their unaccented equivalents.
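
One common way to obtain approximations of this kind is Unicode compatibility decomposition followed by removal of combining marks. The Python sketch below shows that technique; it is an assumption for illustration, since uni2ascii may well use its own mapping tables, and the function name approximate_ascii is hypothetical.

```python
import unicodedata


def approximate_ascii(text: str) -> str:
    """Approximate text by decomposing characters (NFKD) and dropping
    combining marks such as accents. Characters with no decomposition
    are passed through unchanged, so the result is not guaranteed to
    be pure ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(approximate_ascii("caf\u00e9"))         # prints cafe
print(approximate_ascii("\U0001D401old"))     # prints Bold
```

Here U+1D401 is MATHEMATICAL BOLD CAPITAL B, a stylistic variant whose compatibility decomposition is the plain letter B.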


Documentation

uni2ascii and ascii2uni are provided with standard Unix manual pages.

Both programs also provide a detailed summary of their command line options in response to the -h command line option.


Related Programs

If you need to convert between UTF-8 Unicode and other encodings, you may find enca, iconv, recode, and uniconv useful. If you need to convert between textual representations of numbers and machine representations, you may find the programs ascii2binary and binary2ascii helpful. If you need to find out more about what is in a Unicode file (e.g. if you don't know the writing system, don't have the necessary font, think that the Unicode may be ill-formed, or need to examine details of representation such as composition) you may find the Unicode Utilities suite of programs useful.


Details

Language:         C [basic programs], Tcl/Tk [GUI]
Environment:      POSIX
License:          GNU General Public License, version 3
Current version:  4.18
Last modified:    2011-05-15
Contact:          Bill Poser

Downloads

File                      Size (Bytes)   MD5 Sum
uni2ascii-4.18.tar.bz2    127,125        a1b1df74cccd1fa997bad79c8c4ced68
uni2ascii-4.18.tar.gz     160,182        096cf1b70a55c4796b136ff1a126a940
uni2ascii-4.18.zip        174,602        3842bcc366ca5b2d98c63c289cc550a2
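
A downloaded archive can be checked against the MD5 sums listed above. The following Python sketch shows one way to do so; the function name md5_of_file is hypothetical, and the file path you pass is wherever you saved the download.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of the file at path,
    reading in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For instance, md5_of_file("uni2ascii-4.18.tar.bz2") should return a1b1df74cccd1fa997bad79c8c4ced68 if the download is intact. MD5 detects accidental corruption but offers no protection against deliberate tampering.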

If you wish to be informed of new releases, subscribe to uni2ascii at Freshmeat.

Packages

Arch Linux:      uni2ascii
Debian:          Debian package (stable), Debian package (testing), Debian package (unstable)
FreeBSD:         Freshport
Mac OS X:        MacPorts
Mac OS X:        Fink
OpenPackage:     OpenPackage
Red Hat/Fedora:  RPMs for a variety of architectures are available here.
Red Hat/Fedora:  A source RPM and a binary RPM for the i386 architecture are available here.
SUSE Linux:      RPM
Ubuntu:          Ubuntu



Environment

uni2ascii and ascii2uni have been compiled and tested under FreeBSD, GNU/Linux, Mac OS X and SunOS. They should compile and run without modification in any POSIX-compliant environment.


Change Log

4.18 - 2011-05-15

4.17 - 2011-02-16

4.16 - 2010-12-12

4.15 - 2010-08-29


Full Change Log

Roadmap

Bugs

ascii2uni contains a bug that affects impure mode conversions of standard hex (-X option). Version 3.9.2 fixes the bug for inputs within the Basic Multilingual Plane (BMP), that is, for hex values less than or equal to 0xFFFF. A more general fix is anticipated.



