new blag post: Optimizing XSane's scanned PDFs (also: PDF internals)

author Roland Hieber <rohieb@rohieb.name>

Sun, 17 Nov 2013 22:58:35 +0000 (23:58 +0100)

committer Roland Hieber <rohieb@rohieb.name>

Sun, 17 Nov 2013 23:02:18 +0000 (00:02 +0100)
diff --git a/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn b/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn

new file mode 100644 (file)

index 0000000..529e873
--- /dev/null
+++ b/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn
@@ -0,0 +1,371 @@
+[[!meta title="Optimizing XSane's scanned PDFs (also: PDF internals)"]]
+[[!meta author="rohieb"]]
+[[!meta license="CC-BY-SA 3.0"]]
+[[!img defaults size=x200]]
+
+[[!toc levels=2]]
+
+Problem
+-------
+
+I use [XSane][xsane] to scan documents for my digital archive. I want them to be
+in PDF format and have a reasonable resolution (at least 200&nbsp;dpi, so I
+can try OCRing them afterwards). However, the PDFs created by XSane’s multipage
+mode are too large, about 250&nbsp;MB for a 20-page document scanned at
+200&nbsp;dpi.
+
+[xsane]: http://www.xsane.org/ "XSane homepage"
+
+[[!img xsane-multipage-mode.png caption="XSane’s Multipage mode"]]
+
+
+First (non-optimal) solution
+--------------
+
+At first, I tried to optimize the PDF using [Ghostscript][gs]. I
+[[use-ghostscript-to-convert-pdf-files|already wrote]] about how Ghostscript’s
+`-dPDFSETTINGS` option can be used to shrink PDFs by rendering the pictures at
+a lower resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
+(`screen` for 72&nbsp;dpi, `ebook` for 150&nbsp;dpi, `printer` for 300&nbsp;dpi,
+and `prepress` for color-preserving 300&nbsp;dpi), but they are pre-defined, and
+for my 200&nbsp;dpi images, `ebook` was not enough (I would lose resolution),
+while `printer` was too high and would only enlarge the PDF.
+
+[gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"
+
+
+Interlude: PDF Internals
+------------------
+
+The best thing to do was to find out how the images were embedded in the PDF.
+Since most PDF files are also partly human-readable, I opened my file with vim.
+(Also, I was surprised that [vim has syntax highlighting for
+PDF](vim-syntax-highlighting.png).) Before we continue, I'll give a short
+introduction to the PDF file format (for the long version, see [Adobe’s PDF
+reference][pdf-ref]).
+
+[pdf-ref]: http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf "Adobe Portable Document Format, Version 1.4"
+
+### Building Blocks ###
+Every PDF file starts with the [magic string][magic] that identifies the version
+of the standard which the document conforms to, like `%PDF-1.4`. After that, a
+PDF document is made up of the following objects:
+
+[magic]: https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files "Wikipedia: Magic numbers in files"
+
+Boolean values
+:   `true` and `false`
+
+Integers and floating-point numbers
+:   for example, `1337`, `-23.42` and `.1415`
+
+Strings
+:   *   interpreted as literal characters when enclosed in parentheses: `(This
+        is a string.)` These can contain escaped characters, particularly
+        escaped closing parentheses and control characters: `(This string
+        contains a literal \) and some\n newlines.\n)`.
+    *   interpreted as hexadecimal data when enclosed in angle brackets:
+        `<53 61 6D 70 6C 65>` equals `(Sample)`.
+
+Names
+:   starting with a forward slash, like `/Type`. You can think of them like
+    identifiers in programming languages.
+
+Arrays
+:   enclosed in square brackets:
+    `[ -1 4 6 (A String) /AName [ (strings in arrays in arrays!) ] ]`
+
+Dictionaries
+:   key-value stores, which are enclosed in double angle brackets. The key must
+    be a name, the value can be any object. Keys and values alternate,
+    beginning with the first key:
+    `<< /FirstKey (First Value) /SecondKey 3.14 /ThirdKey /ANameAsValue >>`
+    Usually, the first key is `/Type` and defines what the dictionary actually
+    describes.
+
+Stream Objects
+
+:   a collection of bytes. In contrast to strings, stream objects are usually
+    used for large amounts of data which may not be read entirely, while strings
+    are always read as a whole. For example, streams can be used to embed images
+    or metadata.
+
+:   Streams consist of a dictionary, followed by the keyword `stream`, the raw
+    content of the stream, and the keyword `endstream`. The dictionary describes
+    the stream’s length and the filters that have been applied to it, which
+    basically define the encoding the data is stored in. For example, data
+    streams can be compressed with various algorithms.
+
+The Null Object
+:   Represented by the keyword `null`.
+
+Indirect Objects
+
+:   Every object in a PDF document can also be stored as an indirect object,
+    which means that it is given a label and can be used multiple times in the
+    document. The label consists of two numbers, a positive *object number*
+    (which makes the object unique) and a non-negative *generation number*
+    (which allows objects to be updated incrementally by appending to the file).
+
+:   Indirect objects are defined by their object number, followed by their
+    generation number, the keyword `obj`, the contents of the object, and the
+    keyword `endobj`. Example: `1 0 obj (I'm an object!) endobj` defines the
+    indirect object with object number 1 and generation number 0, which consists
+    only of the string “I'm an object!”. Likewise, more complex data structures
+    can be labeled with indirect objects.
+
+:   Referencing an indirect object works by giving the object and generation
+    number, followed by an uppercase R: `1 0 R` references the object created
+    above. A reference can be used anywhere a direct object could be used
+    instead.
+
+Using these objects, a PDF document builds up a tree structure, starting from
+the root object, which is a dictionary with the value `/Catalog` assigned to the
+key `/Type` (in the file dissected below, it has the object number 1). The other
+values of this dictionary point to the objects describing the outlines and pages
+of the document, which in turn reference other objects describing single pages,
+which point to objects describing drawing operations or text blocks, etc.
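+
+To make the building blocks a bit more tangible, here is a minimal Python
+sketch. It decodes the hexadecimal string from the list above and collects the
+labels of indirect objects with a regular expression. The helper
+`find_object_labels` is hypothetical and deliberately naive: it is not part of
+any PDF library, and it would also match look-alike byte sequences inside binary
+stream data.
+
+```python
+import re
+
+# Hex string from the list above: <53 61 6D 70 6C 65> is the same as (Sample).
+hex_string = "<53 61 6D 70 6C 65>"
+decoded = bytes.fromhex(hex_string.strip("<>")).decode("ascii")
+print(decoded)
+
+
+def find_object_labels(pdf_source: bytes):
+    """Hypothetical helper: return (object number, generation number) pairs."""
+    pattern = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
+    return [(int(num), int(gen)) for num, gen in pattern.findall(pdf_source)]
+
+
+sample = b"1 0 obj (I'm an object!) endobj\n3 0 obj << /Type /Pages >> endobj\n"
+print(find_object_labels(sample))
+```
+
+A real parser would start from the cross-reference table at the end of the file
+instead of scanning for the `obj` keyword.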
+
+
+### Dissecting the PDFs created by XSane ###
+
+Now that we know what a PDF document looks like, we can go back to our initial
+problem and try to find out why my PDF file was so huge. I will walk you through
+the PDF object by object.
+
+[[!format pdf <<EOF
+%PDF-1.4
+
+1 0 obj
+   << /Type /Catalog
+      /Outlines 2 0 R
+      /Pages 3 0 R
+   >>
+endobj
+EOF]]
+
+This is just the magic string declaring the document as PDF-1.4, and the root
+object with object number 1, which references object number 2 for outlines and
+object number 3 for pages. We're not interested in outlines, so let's look at
+the pages.
+
+[[!format pdf <<EOF
+3 0 obj
+   << /Type /Pages
+      /Kids [
+             6 0 R
+             8 0 R
+             10 0 R
+             12 0 R
+            ]
+      /Count 4
+   >>
+endobj
+EOF]]
+
+OK, apparently this document has four pages, which are referenced by objects
+number 6, 8, 10 and 12. This makes sense, since I scanned four pages ;-)
+
+Let's start with object number 6:
+
+[[!format pdf <<EOF
+6 0 obj
+    << /Type /Page
+       /Parent 3 0 R
+       /MediaBox [0 0 596 842]
+       /Contents 7 0 R
+       /Resources << /ProcSet 8 0 R >>
+    >>
+endobj
+EOF]]
+
+We see that object number 6 is a page object, and the actual content is in
+object number 7. More redirection, yay!
+
+[[!format pdf <<EOF
+7 0 obj
+    << /Length 2678332     >>
+stream
+q
+1 0 0 1 0 0 cm
+1.000000 0.000000 -0.000000 1.000000 0 0 cm
+595.080017 0 0 841.679993 0 0 cm
+BI
+  /W 1653
+  /H 2338
+  /CS /G
+  /BPC 8
+  /F /FlateDecode
+ID
+x\9c$¼[\8b$;¾åù!\ 6f\9eú¥\87¡a\1e\ 6æátq.4§
+% [ ...byte stream shortened... ]
+EI
+Q
+endstream
+endobj
+EOF]]
+
+Aha, here is where the magic happens. Object number 7 is a stream object of
+2,678,332 bytes (about 2.7 MB) and contains drawing operations! After skipping
+around a bit in Adobe’s PDF reference (chapters 3 and 4), here is the annotated
+version of the stream content:
+
+[[!format pdf <<EOF
+q                 % Save drawing context
+1 0 0 1 0 0 cm    % Set up coordinate space for image (identity matrix)
+1.000000 0.000000 -0.000000 1.000000 0 0 cm    % No rotation (identity again)
+595.080017 0 0 841.679993 0 0 cm    % Scale unit square to page size in points
+BI                % Begin Image
+  /W 1653           % Image width is 1653 pixels
+  /H 2338           % Image height is 2338 pixels
+  /CS /G            % Color space is Gray
+  /BPC 8            % 8 bits per color component
+  /F /FlateDecode   % Filter: data is Deflate-compressed
+ID                % Image Data follows:
+x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
+EI                % End Image
+Q                 % Restore drawing context
+EOF]]
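+
+The image data in the listing starts with `x\9c`, which is the usual two-byte
+header of a zlib stream written at the default compression level. As a small
+sketch (the input bytes are made up, they are not the scanned image), Python's
+`zlib` module produces the same header, and decompressing returns the input
+unchanged:
+
+```python
+import zlib
+
+# Stand-in for image data; any bytes will do for looking at the header.
+pixels = bytes(range(256)) * 4
+compressed = zlib.compress(pixels)  # default compression level
+
+# The first two bytes are the zlib header 0x78 0x9C, i.e. "x" followed by \x9c.
+print(compressed[:2])
+
+# Decompressing restores the input exactly, which is what "lossless" means.
+restored = zlib.decompress(compressed)
+print(restored == pixels)
+```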
+
+So now we know why the PDF was so huge: the line `/F /FlateDecode` tells us that
+the image data is stored losslessly with [Deflate][] compression (this is
+basically what PNG uses). However, scanned images, as well as photographed
+pictures, tend to become very big when stored losslessly, because image
+sensors always add noise from the universe, and lossless compression has to
+preserve this noise bit by bit. In contrast, lossy
+compression like JPEG, which uses [discrete cosine transform][dct], only has to
+approximate the image (and therefore the noise from the sensor) to a certain
+degree, therefore reducing the space needed to save the image. And the PDF
+standard also allows image data to be DCT-compressed, by adding `/DCTDecode` to
+the filters.
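+
+A rough illustration of that effect, as a sketch only: the first part redoes the
+arithmetic for the image above (one byte per pixel at 8 bits gray), the second
+part compresses two synthetic "scans" with Deflate. The noise model (uniform
+noise of ±4 gray levels) and the sample size are assumptions chosen for the
+demonstration, not measurements from a real scanner.
+
+```python
+import random
+import zlib
+
+WIDTH, HEIGHT = 1653, 2338   # pixel dimensions from the image dictionary
+raw_size = WIDTH * HEIGHT    # one byte per pixel at 8 bits, grayscale
+stream_length = 2678332      # /Length of object 7 (includes a few operators)
+print(f"raw: {raw_size} bytes, stored: {stream_length / raw_size:.0%} of raw")
+
+# Synthetic data: perfectly flat paper versus the same paper plus noise.
+rng = random.Random(0)       # fixed seed so the run is repeatable
+n = 100_000
+clean = bytes([200]) * n
+noisy = bytes(200 + rng.randint(-4, 4) for _ in range(n))
+
+clean_size = len(zlib.compress(clean))
+noisy_size = len(zlib.compress(noisy))
+print(clean_size, noisy_size)
+```
+
+The flat input shrinks to almost nothing, while the noisy one stays large,
+because Deflate has to reproduce every noisy pixel value exactly.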
+
+[Deflate]: https://en.wikipedia.org/wiki/DEFLATE "Wikipedia: DEFLATE algorithm"
+[dct]: http://en.wikipedia.org/wiki/Discrete_cosine_transform "Wikipedia: Discrete cosine transform"
+
+
+Second solution: use a (better) compression algorithm
+------------------
+
+Now that I knew where the problem was, I could try to create PDFs with DCT
+compression. I still had the original, uncompressed [PNM][] files that fell out
+of XSane’s multipage mode (just look in the multipage project folder), so I
+started to play around a bit with [ImageMagick’s][im] `convert` tool, which can
+also convert images to PDF.
+
+[im]: http://www.imagemagick.org "ImageMagick homepage"
+[PNM]: https://en.wikipedia.org/wiki/Netpbm_format "Wikipedia: Netpbm format"
+
+### Converting PNM to PDF ###
+First, I tried converting the uncompressed PNM to PDF:
+
+    $ convert image*.pnm document.pdf
+
+`convert` generally takes parameters of the form `inputfile outputfile`, but it
+also allows us to specify more than one input file (which is not really
+documented in the [man page][man-convert]). In that case it tries to create
+multi-page documents, if possible. With PDF as output format, this results in
+one input file per page.
+
+[man-convert]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"
+
+The embedded image objects looked somewhat like the following:
+
+[[!format pdf <<EOF
+8 0 obj
+<<
+    /Type /XObject
+    /Subtype /Image
+    /Name /Im0
+    /Filter [ /RunLengthDecode ]
+    /Width 1653
+    /Height 2338
+    /ColorSpace 10 0 R
+    /BitsPerComponent 8
+    /Length 9 0 R
+>>
+stream
+% [ raw byte data ]
+endstream
+EOF]]
+
+The filter `/RunLengthDecode` indicates that the stream data is compressed with
+[run-length encoding][RLE], another simple lossless compression. Not what I
+wanted. (Apart from that, `convert` embeds images as XObjects, but these do not
+differ much from the inline images described above.)
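+
+For the curious, here is a minimal sketch of a decoder for this filter, written
+from the description in the PDF reference: a length byte below 128 means "copy
+the next length + 1 bytes", a length byte above 128 means "repeat the next byte
+257 − length times", and 128 marks the end of the data. The function name and
+the sample input are made up for illustration.
+
+```python
+def run_length_decode(data: bytes) -> bytes:
+    """Decode data in the format expected by the /RunLengthDecode filter."""
+    out = bytearray()
+    i = 0
+    while i < len(data):
+        length = data[i]
+        i += 1
+        if length == 128:      # end-of-data marker
+            break
+        if length < 128:       # copy the next length + 1 bytes literally
+            out += data[i:i + length + 1]
+            i += length + 1
+        else:                  # repeat the next byte 257 - length times
+            out += data[i:i + 1] * (257 - length)
+            i += 1
+    return bytes(out)
+
+
+# Literal run "abc", then four spaces as a repeated run, then end-of-data.
+encoded = bytes([2]) + b"abc" + bytes([253, 0x20, 128])
+print(run_length_decode(encoded))
+```
+
+Long runs of identical bytes shrink nicely this way, but noisy scans rarely
+contain such runs.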
+
+[RLE]: https://en.wikipedia.org/wiki/Run-length_encoding "Wikipedia: Run-length encoding"
+
+### Converting PNM to JPG, then to PDF ###
+
+Next, I converted the PNMs to JPG, then to PDF.
+
+    $ convert image*.pnm image.jpg
+    $ convert image*jpg document.pdf
+
+(The first command creates the output files `image-0.jpg`, `image-1.jpg`, etc.,
+since JPG does not support multiple pages in one file.)
+
+When looking at the PDF, we see that we now have DCT-compressed images inside
+the PDF:
+
+[[!format pdf <<EOF
+8 0 obj
+<<
+    /Type /XObject
+    /Subtype /Image
+    /Name /Im0
+    /Filter [ /DCTDecode ]
+    /Width 1653
+    /Height 2338
+    /ColorSpace 10 0 R
+    /BitsPerComponent 8
+    /Length 9 0 R
+>>
+stream
+% [ raw byte data ]
+endstream
+EOF]]
+
+### Converting PNM to JPG, then to PDF, and fix page size ###
+
+However, the pages in `document.pdf` are 82.47×58.31&nbsp;cm, which results in
+about 72&nbsp;dpi with respect to the size of the original images. But `convert`
+also allows us to specify the pixel density, so we'll set that to 200&nbsp;dpi
+in X and Y direction, which was the resolution at which the images were scanned:
+
+    $ convert image*jpg -density 200x200 document.pdf
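+
+The relation between pixel size, density and page size can be checked with a few
+lines of Python. This is only a sketch of the arithmetic; `page_size` is a
+hypothetical helper, not something ImageMagick provides. A PDF point is 1/72
+inch, so at 200&nbsp;dpi the result matches the scaling matrix
+`595.080017 0 0 841.679993 0 0 cm` that XSane wrote into its content stream.
+
+```python
+CM_PER_INCH = 2.54
+POINTS_PER_INCH = 72  # size of the PDF user space unit
+
+
+def page_size(width_px, height_px, dpi):
+    """Hypothetical helper: return ((w_cm, h_cm), (w_pt, h_pt)) for an image."""
+    width_in, height_in = width_px / dpi, height_px / dpi
+    size_cm = (round(width_in * CM_PER_INCH, 2),
+               round(height_in * CM_PER_INCH, 2))
+    size_pt = (round(width_in * POINTS_PER_INCH, 2),
+               round(height_in * POINTS_PER_INCH, 2))
+    return size_cm, size_pt
+
+
+print(page_size(1653, 2338, 72))   # the oversized pages
+print(page_size(1653, 2338, 200))  # close to A4 (21.0 x 29.7 cm)
+```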
+
+With that approach, I could reduce the size of my PDF from 250&nbsp;MB with
+losslessly compressed images to 38&nbsp;MB with DCT compression.
+
+Too long, didn’t read
+-----------------
+
+Here’s the gist for you:
+
+*   Read the article above, it’s very comprehensive :P
+*   Use `convert` on XSane’s multipage images and specify your
+    scanning resolution:
+
+        $ convert image*.pnm image.jpg
+        $ convert image*jpg -density 200x200 document.pdf
+
+
+Further reading
+-------------
+
+There is probably software out there which does these things for you, with a
+shiny user interface, but I could not find any quickly. What I did find, though,
+was [this detailed article][scan-to-pdfa], which describes how to get
+high-resolution scans with OCR information in PDF/A and DjVu format, using
+`scantailor` and `unpaper`.
+
+Also, Didier Stevens helped me understand stream objects in his
+[illustrated blogpost][pdf-stream-objects]. He seems to write about PDF more
+often, and it was fun to poke around in his blog. There is also a nice script,
+[`pdf-parser`][pdf-tools], which helps you visualize the structure of a PDF
+document.
+
+[scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
+[pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
+[pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"
+
+[[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]
diff --git a/blag/post/optimizing-xsane-s-scanned-pdfs/vim-syntax-highlighting.png b/blag/post/optimizing-xsane-s-scanned-pdfs/vim-syntax-highlighting.png

new file mode 100644 (file)

index 0000000..56e2054

Binary files /dev/null and b/blag/post/optimizing-xsane-s-scanned-pdfs/vim-syntax-highlighting.png differ
diff --git a/blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png b/blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png

new file mode 100644 (file)

index 0000000..bca3d77

Binary files /dev/null and b/blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png differ
blag/post/optimizing-xsane-s-scanned-pdfs.mdwn	[new file with mode: 0644]
blag/post/optimizing-xsane-s-scanned-pdfs/vim-syntax-highlighting.png	[new file with mode: 0644]
blag/post/optimizing-xsane-s-scanned-pdfs/xsane-multipage-mode.png	[new file with mode: 0644]