From: Roland Hieber
Date: Sun, 17 Nov 2013 22:58:35 +0000 (+0100)
Subject: new blag post: Optimizing XSane's scanned PDFs (also: PDF internals)
X-Git-Url: http://git.rohieb.name/www-rohieb-name.git/commitdiff_plain/dd6de2c6f67423a12913724647f30c37971885e4?ds=sidebyside

new blag post: Optimizing XSane's scanned PDFs (also: PDF internals)
---

diff --git a/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn b/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn
new file mode 100644
index 0000000..529e873
--- /dev/null
+++ b/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn
@@ -0,0 +1,371 @@
+[[!meta title="Optimizing XSane's scanned PDFs (also: PDF internals)"]]
+[[!meta author="rohieb"]]
+[[!meta license="CC-BY-SA 3.0"]]
+[[!img defaults size=x200]]
+
+[[!toc levels=2]]
+
+Problem
+-------
+
+I use [XSane][xsane] to scan documents for my digital archive. I want them to be
+in PDF format and have a reasonable resolution (at least 200 dpi, so I
+can try OCRing them afterwards). However, the PDFs created by XSane’s multipage
+mode are too large, about 250 MB for a 20-page document scanned at
+200 dpi.
+
+[xsane]: http://www.xsane.org/ "XSane homepage"
+
+[[!img xsane-multipage-mode.png caption="XSane’s Multipage mode"]]
+
+
+First (non-optimal) solution
+--------------
+
+At first, I tried to optimize the PDF using [GhostScript][gs]. I
+[[use-ghostscript-to-convert-pdf-files|already wrote]] about how GhostScript’s
+`-dPDFSETTINGS` option can be used to shrink PDFs by rendering the pictures at
+a lower resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
+(`screen` for 72 dpi, `ebook` for 150 dpi, `printer` for 300 dpi,
+and `prepress` for color-preserving 300 dpi), but they are pre-defined, and
+for my 200 dpi images, `ebook` was not enough (I would lose resolution),
+while `printer` was too high and would only enlarge the PDF.
+
+[gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"
+
+
+Interlude: PDF Internals
+------------------
+
+The best thing to do was to find out how the images were embedded in the PDF.
+Since most PDF files are also partly human-readable, I opened my file with vim.
+(Also, I was surprised that [vim has syntax highlighting for
+PDF](vim-syntax-highlighting.png).) Before we continue, I'll give a short
+introduction to the PDF file format (for the long version, see [Adobe’s PDF
+reference][pdf-ref]).
+
+[pdf-ref]: http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf "Adobe Portable Document Format, Version 1.4"
+
+### Building Blocks ###
+Every PDF file starts with the [magic string][magic] that identifies the version
+of the standard which the document conforms to, like `%PDF-1.4`. After that, a
+PDF document is made up of the following objects:
+
+[magic]: https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files "Wikipedia: Magic numbers in files"
+
+Boolean values
+:   `true` and `false`
+
+Integers and floating-point numbers
+:   for example, `1337`, `-23.42` and `.1415`
+
+Strings
+:   * interpreted as literal characters when enclosed in parentheses: `(This
+      is a string.)` These can contain escaped characters, particularly
+      escaped closing parentheses and control characters: `(This string contains a
+      literal \) and some\n newlines.\n)`.
+    * interpreted as hexadecimal data when enclosed in angle brackets:
+      `<53 61 6D 70 6C 65>` equals `(Sample)`.
+Names
+:   starting with a forward slash, like `/Type`. You can think of them like
+    identifiers in programming languages.
+
+Arrays
+:   enclosed in square brackets:
+    `[ -1 4 6 (A String) /AName [ (strings in arrays in arrays!) ] ]`
+
+Dictionaries
+:   key-value stores, which are enclosed in double angle brackets. The key must
+    be a name, the value can be any object. Keys and values alternate,
+    beginning with the first key:
+    `<< /FirstKey (First Value) /SecondKey 3.14 /ThirdKey /ANameAsValue >>`
+    Usually, the first key is `/Type` and defines what the dictionary actually
+    describes.
+
+Stream Objects
+
+:   a collection of bytes. In contrast to strings, stream objects are usually
+    used for large amounts of data that can be read incrementally, while strings
+    are always read as a whole. For example, streams can be used to embed images
+    or metadata.
+
+:   Streams consist of a dictionary, followed by the keyword `stream`, the raw
+    content of the stream, and the keyword `endstream`. The dictionary describes
+    the stream’s length and the filters that have been applied to it, which
+    basically define the encoding the data is stored in. For example, data
+    streams can be compressed with various algorithms.
+
+The Null Object
+:   Represented by the keyword `null`.
+
+Indirect Objects
+
+:   Every object in a PDF document can also be stored as an indirect object,
+    which means that it is given a label and can be used multiple times in the
+    document. The label consists of two numbers, a positive *object number*
+    (which makes the object unique) and a non-negative *generation number*
+    (which allows objects to be updated incrementally by appending to the file).
+
+:   Indirect objects are defined by their object number, followed by their
+    generation number, the keyword `obj`, the contents of the object, and the
+    keyword `endobj`. Example: `1 0 obj (I'm an object!) endobj` defines the
+    indirect object with object number 1 and generation number 0, which consists
+    only of the string “I'm an object!”. Likewise, more complex data structures
+    can be labeled with indirect objects.
+
+:   Referencing an indirect object works by giving the object and generation
+    number, followed by an uppercase R: `1 0 R` references the object created
+    above. References can be used anywhere a (direct) object could be used.
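To make two of the building blocks above more concrete, here is a minimal Python sketch. It is not a real PDF parser. It decodes a hexadecimal string and picks simple indirect objects out of a byte string with a regular expression. The function names and the toy sample data are made up for illustration, and the pattern assumes that an object body never contains the word `endobj` itself.

```python
import re


def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<53 61 6D 70 6C 65>'."""
    # Drop the enclosing '<' and '>' and any whitespace between the digits.
    digits = "".join(token.strip()[1:-1].split())
    # An odd number of digits is padded with a trailing 0 (PDF reference).
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


# Toy pattern: "<object number> <generation number> obj ... endobj".
# Assumption: the body does not itself contain the keyword "endobj".
INDIRECT_OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_indirect_objects(data: bytes):
    """Return (object number, generation number, body) for each definition."""
    return [
        (int(num), int(gen), body.strip())
        for num, gen, body in INDIRECT_OBJ.findall(data)
    ]


# Hypothetical sample data, modelled on the examples in the text above.
sample = b"1 0 obj (I'm an object!) endobj\n2 0 obj [ 1 0 R /AName ] endobj\n"
objects = list_indirect_objects(sample)

print(decode_hex_string("<53 61 6D 70 6C 65>"))
for num, gen, body in objects:
    print(num, gen, body)
```

A regular expression is only good enough for poking around in small, uncompressed files. Anything serious should use a proper PDF library or a tool like `pdf-parser`.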
+
+Using these objects, a PDF document builds up a tree structure, starting from the
+root object, which is a dictionary with the value `/Catalog` assigned to the key
+`/Type` (in the files shown here, it has the object number 1). The other values
+of this dictionary point to the objects describing the outlines and pages of the
+document, which in turn reference other objects describing single pages, which
+point to objects describing drawing operations or text blocks, etc.
+
+
+### Dissecting the PDFs created by XSane ###
+
+Now that we know what a PDF document looks like, we can go back to our initial
+problem and try to find out why my PDF file was so huge. I will walk you through
+the PDF object by object.
+
+[[!format pdf <>
+endobj
+EOF]]
+
+This is just the magic string declaring the document as PDF-1.4, and the root
+object with object number 1, which references object number 2 for outlines and
+number 3 for pages. We're not interested in outlines, so let's look at the pages.
+
+[[!format pdf <>
+endobj
+EOF]]
+
+OK, apparently this document has four pages, which are referenced by objects
+number 6, 8, 10 and 12. This makes sense, since I scanned four pages ;-)
+
+Let's start with object number 6:
+
+[[!format pdf <>
+  >>
+endobj
+EOF]]
+
+We see that object number 6 is a page object, and the actual content is in
+object number 7. More redirection, yay!
+
+[[!format pdf <>
+stream
+q
+1 0 0 1 0 0 cm
+1.000000 0.000000 -0.000000 1.000000 0 0 cm
+595.080017 0 0 841.679993 0 0 cm
+BI
+  /W 1653
+  /H 2338
+  /CS /G
+  /BPC 8
+  /F /FlateDecode
+ID
+xœ$¼[‹$;¾åù!fžú¥‡¡aæátq.4§
+% [ ...byte stream shortened... ]
+EI
+Q
+endstream
+endobj
+EOF]]
+
+Aha, here is where the magic happens. Object number 7 is a stream object of
+2,678,332 bytes (about 2.7 MB) and contains drawing operations! After skimming
+Adobe’s PDF reference (chapters 3 and 4), I came up with this annotated
+version of the stream content:
+
+[[!format pdf <>
+stream
+% [ raw byte data ]
+endstream
+EOF]]
+
+The filter `/RunLengthDecode` indicates that the stream data is compressed with
+[Run-length encoding][RLE], another simple lossless compression. Not what I
+wanted. (Apart from that, `convert` embeds images as XObjects, but there is not
+much difference from the inline images described above.)
+
+[RLE]: https://en.wikipedia.org/wiki/Run-length_encoding "Wikipedia: Run-length encoding"
+
+### Converting PNM to JPG, then to PDF ###
+
+Next, I converted the PNMs to JPG, then to PDF.
+
+    $ convert image*.pnm image.jpg
+    $ convert image*jpg document.pdf
+
+(The first command creates the output files `image-1.jpg`, `image-2.jpg`, etc.,
+since JPG does not support multiple pages in one file.)
+
+When looking at the PDF, we see that we now have DCT-compressed images inside
+the PDF:
+
+[[!format pdf <>
+stream
+% [ raw byte data ]
+endstream
+EOF]]
+
+### Converting PNM to JPG, then to PDF, and fix page size ###
+
+However, the pages in `document.pdf` are 82.47×58.31 cm, which results in
+about 72 dpi with respect to the size of the original images. But `convert`
+also allows us to specify the pixel density, so we'll set that to 200 dpi
+in X and Y direction, which was the resolution at which the images were scanned:
+
+    $ convert image*jpg -density 200x200 document.pdf
+
+With that approach, I could reduce the size of my PDF from 250 MB with
+losslessly compressed images to 38 MB with DCT compression.
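The page-size arithmetic above can be checked with a few lines of Python. This is only a sketch: the pixel dimensions are taken from the `/W` and `/H` entries of the inline image shown earlier, the helper names are made up, and it assumes the usual PDF convention that one unit of default user space is 1/72 inch.

```python
CM_PER_INCH = 2.54
POINTS_PER_INCH = 72  # assumption: default PDF user space, 1 unit = 1/72 inch

# Pixel dimensions of one scanned page (/W and /H of the inline image above).
WIDTH_PX, HEIGHT_PX = 1653, 2338


def pixels_to_cm(pixels: int, dpi: float) -> float:
    """Physical length in cm of `pixels` pixels printed at `dpi` dots per inch."""
    return pixels / dpi * CM_PER_INCH


def pixels_to_points(pixels: int, dpi: float) -> float:
    """The same length expressed in PDF points."""
    return pixels / dpi * POINTS_PER_INCH


for dpi in (72, 200):
    width_cm = pixels_to_cm(WIDTH_PX, dpi)
    height_cm = pixels_to_cm(HEIGHT_PX, dpi)
    print(f"{dpi} dpi: {width_cm:.1f} x {height_cm:.1f} cm")

# Page size in points at the scanning resolution of 200 dpi.
print(pixels_to_points(WIDTH_PX, 200), pixels_to_points(HEIGHT_PX, 200))
```

At 72 dpi this gives roughly 58.3 × 82.5 cm, the oversized page mentioned above. At 200 dpi it gives roughly 21.0 × 29.7 cm, which is A4. The size in points at 200 dpi (about 595.08 × 841.68) also matches the scaling factors in the `cm` line of XSane's content stream.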
+
+Too long, didn’t read
+-----------------
+
+Here’s the gist for you:
+
+* Read the article above, it’s very comprehensive :P
+* Use `convert` on XSane’s multipage images and specify your
+  scanning resolution:
+
+        $ convert image*.pnm image.jpg
+        $ convert image*jpg -density 200x200 document.pdf
+
+
+Further reading
+-------------
+
+There is probably software out there which does those things for you, with a
+shiny user interface, but I could not find one quickly. What I did find, though,
+was [this detailed article][scan-to-pdfa], which describes how to get
+high-resolution scans with OCR information in PDF/A and DjVu format, using
+`scantailor` and `unpaper`.
+
+Also, Didier Stevens helped me understand stream objects in his
+[illustrated blogpost][pdf-stream-objects]. He seems to write about PDF more
+often, and it was fun to poke around in his blog. There is also a nice script,
+[`pdf-parser`][pdf-tools], which helps you visualize the structure of a PDF
+document.
+
+[scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
+[pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
+[pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"
+
+[[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]
diff --git a/blag/post/optimizing-xsane-s-scanned-pdfs/vim-syntax-highlighting.png b/blag/post/optimizing-xsane-s-scanned-pdfs/vim-syntax-highlighting.png
new file mode 100644
index 0000000..56e2054
Binary files /dev/null and b/blag/post/optimizing-xsane-s-scanned-pdfs/vim-syntax-highlighting.png differ
diff --git a/blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png b/blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png
new file mode 100644
index 0000000..bca3d77
Binary files /dev/null and b/blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png differ