Chapter 11. Data conversion

Table of Contents

11.1. Text data conversion tools
11.1.1. Converting a text file with iconv
11.1.2. Checking file to be UTF-8 with iconv
11.1.3. Converting file names with iconv
11.1.4. EOL conversion
11.1.5. TAB conversion
11.1.6. Editors with auto-conversion
11.1.7. Plain text extraction
11.1.8. Highlighting and formatting plain text data
11.2. XML data
11.2.1. Basic hints for XML
11.2.2. XML processing
11.2.3. The XML data extraction
11.3. Type setting
11.3.1. roff typesetting
11.3.2. TeX/LaTeX
11.3.3. Pretty print a manual page
11.3.4. Creating a manual page
11.4. Printable data
11.4.1. Ghostscript
11.4.2. Merge two PS or PDF files
11.4.3. Printable data utilities
11.4.4. Printing with CUPS
11.5. The mail data conversion
11.5.1. Mail data basics
11.6. Graphic data tools
11.7. Miscellaneous data conversion

Tools and tips for converting data formats on the Debian system are described.

Tools for standard formats are in very good shape, but support for proprietary data formats is limited.

The following packages for text data conversion caught my eye.


[Tip] Tip

iconv(1) is provided as a part of the libc6 package and is always available on practically all Unix-like systems for converting character encodings.

You can convert the encoding of a text file with iconv(1) as follows.

$ iconv -f encoding1 -t encoding2 input.txt >output.txt

Encoding values are case insensitive and ignore "-" and "_" for matching. Supported encodings can be listed with the "iconv -l" command.
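As a concrete, minimal sketch of the conversion above, the following converts a short ISO-8859-1 string to UTF-8 and also uses iconv(1) as a UTF-8 validity check. The function names latin1_to_utf8 and is_utf8 are illustrative, not standard commands, and the check assumes that iconv(1) exits with a non-zero status on invalid input, as the glibc implementation does.

```shell
# Convert standard input from ISO-8859-1 to UTF-8.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Succeed only if standard input is valid UTF-8 (iconv fails otherwise).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "café" in ISO-8859-1: the accented letter is the single byte 0xE9 (octal 351).
# The hex dump shows it turned into the two-byte UTF-8 sequence.
printf 'caf\351\n' | latin1_to_utf8 | od -An -tx1

# The raw ISO-8859-1 bytes are not valid UTF-8; the converted ones are.
printf 'caf\351\n' | is_utf8 || echo "not UTF-8"
printf 'caf\351\n' | latin1_to_utf8 | is_utf8 && echo "valid UTF-8"
```

Octal escapes are used with printf because they are portable across shells, unlike "\x" escapes.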

Table 11.2. List of encoding values and their usage

encoding value  usage
ASCII           American Standard Code for Information Interchange, 7 bit code without accented characters
UTF-8           current multilingual standard for all modern operating systems
ISO-8859-1      old standard for western European languages, ASCII + accented characters
ISO-8859-2      old standard for eastern European languages, ASCII + accented characters
ISO-8859-15     old standard for western European languages, ISO-8859-1 with the euro sign
CP850           code page 850, Microsoft DOS characters with graphics for western European languages, ISO-8859-1 variant
CP932           code page 932, Microsoft Windows style Shift-JIS variant for Japanese
CP936           code page 936, Microsoft Windows style GB2312, GBK or GB18030 variant for Simplified Chinese
CP949           code page 949, Microsoft Windows style EUC-KR or Unified Hangul Code variant for Korean
CP950           code page 950, Microsoft Windows style Big5 variant for Traditional Chinese
CP1251          code page 1251, Microsoft Windows style encoding for the Cyrillic alphabet
CP1252          code page 1252, Microsoft Windows style ISO-8859-15 for western European languages
KOI8-R          old Russian UNIX standard for the Cyrillic alphabet
ISO-2022-JP     standard encoding for Japanese email which uses only 7 bit codes
eucJP           old Japanese UNIX standard 8 bit code, completely different from Shift-JIS
Shift-JIS       JIS X 0208 Appendix 1 standard for Japanese (see CP932)

[Note] Note

Some encodings are only supported for data conversion and are not usable as locale values (Section 8.3.1, “Basics of encoding”).

For character sets which fit in a single byte, such as the ASCII and ISO-8859 character sets, the character encoding means almost the same thing as the character set.

For character sets with many characters, such as JIS X 0213 for Japanese or the Universal Character Set (UCS, Unicode, ISO-10646-1) for practically all languages, there are many encoding schemes to fit them into sequences of byte data.

For these, there is a clear distinction between the character set and the character encoding.

Some vendors use the term code page as a synonym for their character encoding tables.

[Note] Note

Please note that most encoding systems share the same codes with ASCII for the 7 bit characters. But there are some exceptions. If you are converting old Japanese C programs and URL data from the casually-called shift-JIS encoding format to UTF-8 format, use "CP932" as the encoding name instead of "shift-JIS" to get the expected results: 0x5C → "\" and 0x7E → "~". Otherwise, these are converted to the wrong characters.
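The effect can be checked at the byte level. The sketch below feeds the bytes 0x5C and 0x7E through iconv(1) under both encoding names and dumps the UTF-8 result in hexadecimal. The helper name dump_as is illustrative. With CP932 both bytes should come out unchanged as ASCII; what the second command prints depends on the iconv implementation, so no output is claimed for it here.

```shell
# dump_as ENCODING: decode the bytes 0x5C 0x7E (octal 134 176) from ENCODING
# and print the resulting UTF-8 bytes in hexadecimal.
dump_as() {
    printf '\134\176' | iconv -f "$1" -t UTF-8 | od -An -tx1
}

dump_as CP932       # backslash and tilde remain plain ASCII bytes
dump_as SHIFT-JIS   # for comparison; the result may differ from the line above
```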

[Tip] Tip

recode(1) may be used too and offers more than the combined functionality of iconv(1), fromdos(1), todos(1), frommac(1), and tomac(1). For more, see "info recode".

Intelligent modern editors such as the vim program are quite smart and cope well with any encoding system and any file format. You should use these editors under the UTF-8 locale in a UTF-8 capable console for the best compatibility.

An old western European Unix text file, "u-file.txt", stored in the latin1 (iso-8859-1) encoding can be edited simply with vim by the following.

$ vim u-file.txt

This is possible since the auto detection mechanism of the file encoding in vim assumes the UTF-8 encoding first and, if it fails, assumes it to be latin1.

An old Polish Unix text file, "pu-file.txt", stored in the latin2 (iso-8859-2) encoding can be edited with vim by the following.

$ vim '+e ++enc=latin2 pu-file.txt'

An old Japanese unix text file, "ju-file.txt", stored in the eucJP encoding can be edited with vim by the following.

$ vim '+e ++enc=eucJP ju-file.txt'

An old Japanese MS-Windows text file, "jw-file.txt", stored in the so called shift-JIS encoding (more precisely: CP932) can be edited with vim by the following.

$ vim '+e ++enc=CP932 ++ff=dos jw-file.txt'

When a file is opened with the "++enc" and "++ff" options, ":w" in the Vim command line stores it in the original format and overwrites the original file. You can also specify the saving format and the file name in the Vim command line, e.g., ":w ++enc=utf8 new.txt".

Please refer to mbyte.txt, "multi-byte text support", in the vim on-line help and Table 11.2, “List of encoding values and their usage” for the values used with "++enc".

The emacs family of programs can perform the equivalent functions.

You can highlight and format plain text data by the following.

Table 11.6. List of tools to highlight plain text data

package popcon size keyword description
vim-runtime V:20, I:431 27567 highlight Vim MACRO to convert source code to HTML with ":source $VIMRUNTIME/syntax/html.vim"
cxref V:0, I:0 1157 c→html converter for the C program to latex and HTML (C language)
src2tex V:0, I:0 612 highlight convert many source codes to TeX (C language)
source-highlight V:1, I:7 2008 highlight convert many source codes to HTML, XHTML, LaTeX, Texinfo, ANSI color escape sequences and DocBook files with highlight (C++)
highlight V:1, I:16 943 highlight convert many source codes to HTML, XHTML, RTF, LaTeX, TeX or XSL-FO files with highlight (C++)
grc V:0, I:2 60 text→color generic colouriser for everything (Python)
txt2html V:0, I:4 296 text→html text to HTML converter (Perl)
markdown V:0, I:6 56 text→html markdown text document formatter to (X)HTML (Perl)
asciidoc V:1, I:14 2442 text→any AsciiDoc text document formatter to XML/HTML (Python)
pandoc V:3, I:23 69422 text→any general markup converter (Haskell)
python-docutils V:35, I:554 1653 text→any ReStructured Text document formatter to XML (Python)
txt2tags V:0, I:1 951 text→any document conversion from text to HTML, SGML, LaTeX, man page, MoinMoin, Magic Point and PageMaker (Python)
udo V:0, I:0 548 text→any universal document - text processing utility (C language)
stx2any V:0, I:0 264 text→any document converter from structured plain text to other formats (m4)
rest2web V:0, I:0 526 text→html document converter from ReStructured Text to html (Python)
aft V:0, I:0 235 text→any "free form" document preparation system (Perl)
yodl V:0, I:0 522 text→any pre-document language and tools to process it (C language)
sdf V:0, I:0 1445 text→any simple document parser (Perl)
sisu V:0, I:0 5338 text→any document structuring, publishing and search framework (Ruby)

The Extensible Markup Language (XML) is a markup language for documents containing structured information.

See introductory information at XML.COM.

XML text looks somewhat like HTML. It enables us to manage multiple formats of output for a document. One easy XML system is the docbook-xsl package, which is used here.

Each XML file starts with the standard XML declaration, as follows.

<?xml version="1.0" encoding="UTF-8"?>

The basic syntax for one XML element is marked up as follows.

<name attribute="value">content</name>

An XML element with empty content is marked up in the following short form.

<name attribute="value"/>

The "attribute="value"" in the above examples is optional.

A comment section in XML is marked up as follows.

<!-- comment -->

Other than adding markup, XML requires minor conversion of the content, using predefined entities for the following characters.
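XML 1.0 predefines five entities: "&amp;" for "&", "&lt;" for "<", "&gt;" for ">", "&quot;" for the double quote and "&apos;" for the single quote. A minimal sketch of this conversion with sed(1) follows; the function name xml_escape is illustrative. The ampersand must be replaced first, otherwise the ampersands introduced by the other replacements would be escaped again.

```shell
# Escape the five characters that have predefined entities in XML 1.0.
# "&" is handled first so that later replacements are not escaped twice.
xml_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' \
        -e "s/'/\&apos;/g"
}

# Example: make a line of markup safe to embed as XML content.
printf '%s\n' '<a href="x">R&D</a>' | xml_escape
```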


[Caution] Caution

A literal "<" or "&" cannot be used in attribute values or element content.

[Note] Note

When SGML style user defined entities, e.g. "&some-tag;", are used, the first definition wins over others. The entity definition is expressed in "<!ENTITY some-tag "entity value">".

[Note] Note

As long as the XML markup is done consistently with a certain set of tag names (with data held either as content or as attribute values), conversion to another XML format is a trivial task using Extensible Stylesheet Language Transformations (XSLT).

There are many tools available to process XML files, such as those for the Extensible Stylesheet Language (XSL).

Basically, once you create a well formed XML file, you can convert it to any format using Extensible Stylesheet Language Transformations (XSLT).
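As a small illustration, the sketch below writes a tiny XML document and an XSLT 1.0 stylesheet into a scratch directory and defines a wrapper that applies one to the other. It assumes the xsltproc(1) command (from the xsltproc package) is installed; the element names, file names and the function name xml_to_text are made up for this example.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)

# A small well formed XML document (hypothetical element names).
cat > "$workdir/note.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<note><title>Hello</title><body>XML to text</body></note>
EOF

# An XSLT 1.0 stylesheet that renders the document as one line of plain text.
cat > "$workdir/note.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/note">
    <xsl:value-of select="title"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="body"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
EOF

# xml_to_text STYLESHEET DOCUMENT: apply the stylesheet with xsltproc(1).
xml_to_text() {
    xsltproc "$1" "$2"
}
```

Running xml_to_text "$workdir/note.xsl" "$workdir/note.xml" should print the title and the body joined by ": ". Swapping in a different stylesheet gives a different output format from the same XML source.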

The Extensible Stylesheet Language for Formatting Objects (XSL-FO) is supposed to be the solution for formatting. The fop package is new to the Debian main archive due to its dependence on the Java programming language. So LaTeX code is usually generated from XML using XSLT, and the LaTeX system is used to create printable files such as DVI, PostScript, and PDF.


Since XML is a subset of Standard Generalized Markup Language (SGML), it can be processed by the extensive tools available for SGML, such as Document Style Semantics and Specification Language (DSSSL).


[Tip] Tip

GNOME's yelp is sometimes handy to read DocBook XML files directly since it renders decently on X.

The Unix troff program originally developed by AT&T can be used for simple typesetting. It is usually used to create manpages.

TeX, created by Donald Knuth, is a very powerful typesetting tool and is the de facto standard. LaTeX, originally written by Leslie Lamport, enables high-level access to the power of TeX.


Traditionally, roff is the main Unix text processing system. See roff(7), groff(7), groff(1), grotty(1), troff(1), groff_mdoc(7), groff_man(7), groff_ms(7), groff_me(7), groff_mm(7), and "info groff".

You can read or print a good tutorial and reference on "-me" macro in "/usr/share/doc/groff/" by installing the groff package.

[Tip] Tip

"groff -Tascii -me -" produces plain text output with ANSI escape codes. If you wish to get manpage-like output with many "^H" and "_", use "GROFF_NO_SGR=1 groff -Tascii -me -" instead.

[Tip] Tip

To remove "^H" and "_" from a text file generated by groff, filter it through "col -b -x".

The TeX Live software distribution offers a complete TeX system. The texlive metapackage provides a decent selection of the TeX Live packages which should suffice for the most common tasks.

There are many references available for TeX and LaTeX.

  • The teTeX HOWTO: The Linux-teTeX Local Guide

  • tex(1)

  • latex(1)

  • texdoc(1)

  • texdoctk(1)

  • "The TeXbook", by Donald E. Knuth, (Addison-Wesley)

  • "LaTeX - A Document Preparation System", by Leslie Lamport, (Addison-Wesley)

  • "The LaTeX Companion", by Goossens, Mittelbach, Samarin, (Addison-Wesley)

This is the most powerful typesetting environment. Many SGML processors use this as their back end text processor. LyX provided by the lyx package and GNU TeXmacs provided by the texmacs package offer nice WYSIWYG editing environments for LaTeX, while many use Emacs and Vim as the source editor of choice.

There are many online resources available.

When documents become bigger, sometimes TeX may cause errors. You must increase the pool size in "/etc/texmf/texmf.cnf" (or more appropriately edit "/etc/texmf/texmf.d/95NonPath" and run update-texmf(8)) to fix this.

[Note] Note

The TeX source of "The TeXbook" is available at http://tug.ctan.org/tex-archive/systems/knuth/dist/tex/texbook.tex. This file contains most of the required macros. I heard that you can process this document with tex(1) after commenting out lines 7 to 10 and adding "\input manmac \proofmodefalse". It is strongly recommended to buy this book (and all other books from Donald E. Knuth) instead of using the online version, but the source is a great example of TeX input!

Printable data is expressed in the PostScript format on the Debian system. Common Unix Printing System (CUPS) uses Ghostscript as its rasterizer backend program for non-PostScript printers.

You can merge two PostScript (PS) or Portable Document Format (PDF) files using gs(1) of Ghostscript.

$ gs -q -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=bla.ps -f foo1.ps foo2.ps
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=bla.pdf -f foo1.pdf foo2.pdf

[Note] Note

The PDF, which is a widely used cross-platform printable data format, is essentially the compressed PS format with a few additional features and extensions.

[Tip] Tip

For command line use, psmerge(1) and other commands from the psutils package are useful for manipulating PostScript documents. pdftk(1) from the pdftk package is useful for manipulating PDF documents, too.

The following packages for printable data utilities caught my eye.


Both the lp(1) and lpr(1) commands offered by the Common Unix Printing System (CUPS) provide options for customized printing of printable data.

You can print 3 copies of a file collated using one of the following commands.

$ lp -n 3 -o Collate=True filename
$ lpr -#3 -o Collate=True filename

You can further customize printer operation by using printer option such as "-o number-up=2", "-o page-set=even", "-o page-set=odd", "-o scaling=200", "-o natural-scaling=200", etc., documented at Command-Line Printing and Options.

The following packages for mail data conversion caught my eye.


[Tip] Tip

The Internet Message Access Protocol version 4 (IMAP4) server (see Section 6.7, “POP3/IMAP4 server”) may be used to move mail out of proprietary mail systems if the mail client software can be configured to use an IMAP4 server too.

Mail (SMTP) data should be limited to a series of 7 bit data. So binary data and 8 bit text data are encoded into a 7 bit format with the Multipurpose Internet Mail Extensions (MIME) and the selection of the charset (see Section 8.3.1, “Basics of encoding”).

The standard mail storage format is mbox, formatted according to RFC2822 (updated RFC822). See mbox(5) (provided by the mutt package).

For European languages, "Content-Transfer-Encoding: quoted-printable" with the ISO-8859-1 charset is usually used for mail since there are not many 8 bit characters. If European text is encoded in UTF-8, "Content-Transfer-Encoding: quoted-printable" is likely to be used since it is mostly 7 bit data.

For Japanese, "Content-Type: text/plain; charset=ISO-2022-JP" is traditionally used for mail to keep text in 7 bits. But older Microsoft systems may send mail data in Shift-JIS without a proper declaration. If Japanese text is encoded in UTF-8, Base64 is likely to be used since it contains much 8 bit data. The situation for other Asian languages is similar.
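To see what such a transfer encoding does to the data, the sketch below encodes UTF-8 text containing an 8 bit character into Base64 and decodes it back with base64(1) from coreutils. It only illustrates the Base64 body encoding and does not build the MIME headers of a real message. The function names are illustrative.

```shell
# Encode standard input as Base64 (output is plain 7 bit ASCII).
to_base64() {
    base64
}

# Decode Base64 from standard input back to the original bytes.
from_base64() {
    base64 -d
}

# "café" in UTF-8: the accented letter is the byte pair 0xC3 0xA9 (octal 303 251).
printf 'caf\303\251' | to_base64

# Round trip: the hex dump shows the original bytes again.
printf 'caf\303\251' | to_base64 | from_base64 | od -An -tx1
```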

[Note] Note

If your non-Unix mail data is accessible by non-Debian client software which can talk to an IMAP4 server, you may be able to move it out by running your own IMAP4 server (see Section 6.7, “POP3/IMAP4 server”).

[Note] Note

If you use other mail storage formats, moving them to the mbox format is a good first step. A versatile client program such as mutt(1) may be handy for this.

You can split mailbox contents to each message using procmail(1) and formail(1).
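To show what such a split amounts to, here is a minimal sketch that uses awk(1) instead of procmail(1) and formail(1). It assumes every line starting with "From " is a message separator, which holds for mbox files where such lines in message bodies are quoted as ">From ". The function name split_mbox and the output file names are made up for this example.

```shell
# split_mbox MBOX DIR: write each message of MBOX to DIR/msg.NNNN
split_mbox() {
    awk -v dir="$2" '
        /^From / {                      # a separator line starts a new message
            if (out != "") close(out)   # avoid running out of open files
            n++
            out = sprintf("%s/msg.%04d", dir, n)
        }
        n > 0 { print > out }           # copy lines once the first message began
    ' "$1"
}
```

For example, "split_mbox ~/mbox outdir" writes outdir/msg.0001, outdir/msg.0002, and so on, provided the directory outdir already exists. Each resulting file keeps its "From " separator line as the first line.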

Each mail message can be unpacked using munpack(1) from the mpack package (or other specialized tools) to obtain the MIME encoded contents.

The following packages for graphic data conversion, editing, and organization caught my eye.

Table 11.17. List of graphic data tools

package popcon size keyword description
gimp V:97, I:509 16255 image(bitmap) GNU Image Manipulation Program
imagemagick V:154, I:544 191 image(bitmap) image manipulation programs
graphicsmagick V:7, I:14 4820 image(bitmap) image manipulation programs (fork of imagemagick)
xsane V:24, I:193 913 image(bitmap) GTK+-based X11 frontend for SANE (Scanner Access Now Easy)
netpbm V:32, I:547 4230 image(bitmap) graphics conversion tools
icoutils V:8, I:72 192 png↔ico(bitmap) convert MS Windows icons and cursors to and from PNG formats (favicon.ico)
scribus V:14, I:28 19136 ps/pdf/SVG/… Scribus DTP editor
libreoffice-draw V:344, I:479 8995 image(vector) LibreOffice office suite - drawing
inkscape V:145, I:360 102751 image(vector) SVG (Scalable Vector Graphics) editor
dia-gnome V:6, I:11 20 image(vector) diagram editor (GNOME)
dia V:25, I:41 3880 image(vector) diagram editor (Gtk)
xfig V:13, I:19 1783 image(vector) Facility for Interactive Generation of figures under X11
pstoedit V:15, I:358 667 ps/pdf→image(vector) PostScript and PDF files to editable vector graphics converter (SVG)
libwmf-bin V:14, I:365 104 Windows/image(vector) Windows metafile (vector graphic data) conversion tools
fig2sxd V:0, I:0 142 fig→sxd(vector) convert XFig files to OpenOffice.org Draw format
unpaper V:2, I:15 447 image→image post-processing tool for scanned pages for OCR
tesseract-ocr V:4, I:27 558 image→text free OCR software based on the HP's commercial OCR engine
tesseract-ocr-eng I:28 37486 image→text OCR engine data: tesseract-ocr language files for English text
gocr V:2, I:25 494 image→text free OCR software
ocrad V:1, I:7 310 image→text free OCR software
eog V:101, I:337 10581 image(Exif) Eye of GNOME graphics viewer program
gthumb V:15, I:27 3238 image(Exif) image viewer and browser (GNOME)
geeqie V:17, I:25 1535 image(Exif) image viewer using GTK+
shotwell V:17, I:140 5754 image(Exif) digital photo organizer (GNOME)
gtkam V:0, I:7 965 image(Exif) application for retrieving media from digital cameras (GTK+)
gphoto2 V:1, I:14 969 image(Exif) The gphoto2 digital camera command-line client
gwenview V:33, I:104 4508 image(Exif) image viewer (KDE)
kamera V:4, I:103 230 image(Exif) digital camera support for KDE applications
digikam V:3, I:17 1760 image(Exif) digital photo management application for KDE
exiv2 V:5, I:77 242 image(Exif) EXIF/IPTC metadata manipulation tool
exiftran V:2, I:26 67 image(Exif) transform digital camera jpeg images
jhead V:1, I:13 105 image(Exif) manipulate the non-image part of Exif compliant JPEG (digital camera photo) files
exif V:1, I:10 370 image(Exif) command-line utility to show EXIF information in JPEG files
exiftags V:0, I:3 205 image(Exif) utility to read Exif tags from a digital camera JPEG file
exifprobe V:0, I:3 482 image(Exif) read metadata from digital pictures
dcraw V:3, I:25 358 image(Raw)→ppm decode raw digital camera images
findimagedupes V:0, I:1 79 image→fingerprint find visually similar or duplicate images
ale V:0, I:0 766 image→image merge images to increase fidelity or create mosaics
imageindex V:0, I:0 144 image(Exif)→html generate static HTML galleries from images
outguess V:0, I:0 217 jpeg,png universal Steganographic tool
librecad V:12, I:18 7762 DXF CAD data editor (KDE)
blender V:4, I:29 101399 blend, TIFF, VRML, … 3D content editor for animation etc
mm3d V:0, I:0 4668 ms3d, obj, dxf, … OpenGL based 3D model editor
open-font-design-toolkit I:0 28 ttf, ps, … metapackage for open font design
fontforge V:1, I:10 91 ttf, ps, … font editor for PS, TrueType and OpenType fonts
xgridfit V:0, I:0 898 ttf program for gridfitting and hinting TrueType fonts

[Tip] Tip

Search for more image tools using the regex "~Gworks-with::image" in aptitude(8) (see Section 2.2.6, “Search method options with aptitude”).

Although GUI programs such as gimp(1) are very powerful, command line tools such as imagemagick(1) are quite useful for automating image manipulation via scripts.

The de facto image file format of the digital camera is the Exchangeable Image File Format (EXIF) which is the JPEG image file format with additional metadata tags. It can hold information such as date, time, and camera settings.

The Lempel-Ziv-Welch (LZW) lossless data compression patent has expired. Graphics Interchange Format (GIF) utilities which use the LZW compression method are now freely available on the Debian system.

[Tip] Tip

Any digital camera or scanner with removable recording media works with Linux through USB storage readers since it follows the Design rule for Camera Filesystem and uses the FAT filesystem. See Section 10.1.7, “Removable storage device”.

There are many other programs for converting data. The following packages caught my eye using the regex "~Guse::converting" in aptitude(8) (see Section 2.2.6, “Search method options with aptitude”).


You can also extract data from RPM format with the following.

$ rpm2cpio file.src.rpm | cpio --extract