blag/post/optimizing-xsane-s-scanned-pdfs.mdwn

   1 [[!meta title="Optimizing XSane's scanned PDFs (also: PDF internals)"]]
   2 [[!meta author="rohieb"]]
   3 [[!meta license="CC-BY-SA 3.0"]]
   4 [[!img defaults size=x200]]
   5
   6 [[!toc levels=2]]
   7
   8 Problem
   9 -------
  10
  11 I use [XSane][xsane] to scan documents for my digital archive. I want them to be
  12 in PDF format and have a reasonable resolution (at least 200&nbsp;dpi, so I
  13 can try OCRing them afterwards). However, the PDFs created by XSane’s multipage
  14 mode are too large, about 250&nbsp;MB for a 20-page document scanned at
  15 200&nbsp;dpi.
  16
  17 [xsane]: http://www.xsane.org/ "XSane homepage"
  18
  19 [[!img xsane-multipage-mode.png caption="XSane’s Multipage mode"]]
  20
  21
  22 First (non-optimal) solution
  23 --------------
  24
  25 At first, I tried to optimize the PDF using [GhostScript][gs]. I
  26 [[use-ghostscript-to-convert-pdf-files|already wrote]] about how GhostScript’s
  27 `-dPDFSETTINGS` option can be used to minimize PDFs by rendering the pictures to
  28 a smaller resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
  29 (`screen` for 72&nbsp;dpi, `ebook` for 150&nbsp;dpi, `printer` for 300&nbsp;dpi,
  30 and `prepress` for color-preserving 300&nbsp;dpi), but they are pre-defined, and
  31 for my 200&nbsp;dpi images, `ebook` was not enough (I would lose resolution),
  32 while `printer` was too high and would only enlarge the PDF.
  33
  34 [gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"
  35
  36
  37 Interlude: PDF Internals
  38 ------------------
  39
  40 The best thing to do was to find out how the images were embedded in the PDF.
  41 Since most PDF files are also partly human-readable, I opened my file with vim.
  42 (Also, I was surprised that [vim has syntax highlighting for
  43 PDF](vim-syntax-highlighting.png).) Before we continue, I'll give a short
  44 introduction to the PDF file format (for the long version, see [Adobe’s PDF
  45 reference][pdf-ref]).
  46
  47 [pdf-ref]: http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf "Adobe Portable Document Format, Version 1.4"
  48
  49 ### Building Blocks ###
  50 Every PDF file starts with the [magic string][magic] that identifies the version
  51 of the standard which the document conforms to, like `%PDF-1.4`. After that, a
  52 PDF document is made up of the following objects:
  53
  54 [magic]: https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files "Wikipedia: Magic numbers in files"
  55
  56 Boolean values
  57 :   `true` and `false`
  58
  59 Integers and floating-point numbers
  60 :   for example, `1337`, `-23.42` and `.1415`
  61
  62 Strings
  63 :   *   interpreted as literal characters when enclosed in parentheses: `(This
  64         is a string.)` These can contain escaped characters, particularly
  65         escaped closing parentheses and control characters: `(This string
  66         contains a literal \) and some\n newlines.\n)`.
  67     *   interpreted as hexadecimal data when enclosed in angle brackets:
  68         `<53 61 6D 70 6C 65>` equals `(Sample)`.

  69 Names
  70 :   starting with a forward slash, like `/Type`. You can think of them like
  71     identifiers in programming languages.
  72
  73 Arrays
  74 :   enclosed in square brackets:
  75     `[ -1 4 6 (A String) /AName [ (strings in arrays in arrays!) ] ]`
  76
  77 Dictionaries
  78 :   key-value stores, which are enclosed in double angle brackets. The key must
  79     be a name, the value can be any object. Keys and values are given in turns,
  80     beginning with the first key:
  81     `<< /FirstKey (First Value) /SecondKey 3.14 /ThirdKey /ANameAsValue >>`
  82     Usually, the first key is `/Type` and defines what the dictionary actually
  83     describes.
  84
  85 Stream Objects
  86
  87 :   a collection of bytes. In contrast to strings, stream objects are usually
  88     used for large amounts of data which may not be read entirely, while strings
  89     are always read as a whole. For example, streams can be used to embed images
  90     or metadata.
  91
  92 :   Streams consist of a dictionary, followed by the keyword `stream`, the raw
  93     content of the stream, and the keyword `endstream`. The dictionary describes
  94     the stream’s length and the filters that have been applied to it, which
  95     basically define the encoding the data is stored in. For example, data
  96     streams can be compressed with various algorithms.
  97
  98 The Null Object
  99 :   Represented by the keyword `null`.
 100
 101 Indirect Objects
 102
 103 :   Every object in a PDF document can also be stored as an indirect object,
 104     which means that it is given a label and can be used multiple times in the
 105     document. The label consists of two numbers, a positive *object number*
 106     (which makes the object unique) and a non-negative *generation number*
 107     (which allows objects to be updated incrementally by appending to the file).
 108
 109 :   Indirect objects are defined by their object number, followed by their
 110     generation number, the keyword `obj`, the contents of the object, and the
 111     keyword `endobj`. Example: `1 0 obj (I'm an object!) endobj` defines the
 112     indirect object with object number 1 and generation number 0, which consists
 113     only of the string “I'm an object!”. Likewise, more complex data structures
 114     can be labeled with indirect objects.
 115
 116 :   Referencing an indirect object works by giving the object and generation
 117     number, followed by an uppercase R: `1 0 R` references the object created
 118     above. References can be used anywhere a (direct) object could be used
 119     instead.
 120
 121 Using these objects, a PDF document builds up a tree structure, starting from the
 122 root object, which the file's trailer points to (in the example below it is object
 123 number 1) and which is a dictionary with the value `/Catalog` assigned to the key
 124 `/Type`. The other values of this dictionary point to the objects describing the
 125 outlines and pages of the document, which in turn reference other objects describing
 126 single pages, which point to objects describing drawing operations or text blocks, etc.
 127
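The labelling and referencing scheme described above can be made concrete with a few lines of Python. This is only a rough sketch: the fragment is hand-written for illustration and is not a complete PDF, the helper name `scan_objects` is made up, and a regular-expression scan like this ignores streams, strings that happen to contain `endobj`, and the cross-reference table that real PDF readers rely on.

```python
import re

# A tiny hand-written fragment in the syntax described above (not a complete PDF).
FRAGMENT = b"""
1 0 obj
   << /Type /Catalog /Pages 3 0 R >>
endobj
3 0 obj
   << /Type /Pages /Kids [ 6 0 R ] /Count 1 >>
endobj
6 0 obj
   << /Type /Page /Parent 3 0 R >>
endobj
"""

# "<object number> <generation number> obj ... endobj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "/Type /SomeName" inside a dictionary
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")
# "<object number> <generation number> R"
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def scan_objects(data):
    """Map (object number, generation number) to (type name or None, references)."""
    objects = {}
    for match in OBJ_RE.finditer(data):
        label = (int(match.group(1)), int(match.group(2)))
        body = match.group(3)
        type_match = TYPE_RE.search(body)
        type_name = type_match.group(1).decode("ascii") if type_match else None
        refs = [(int(num), int(gen)) for num, gen in REF_RE.findall(body)]
        objects[label] = (type_name, refs)
    return objects


objects = scan_objects(FRAGMENT)
for label, (type_name, refs) in sorted(objects.items()):
    print(label, type_name, refs)
```

For each object this prints its label, its `/Type` and the labels it refers to, which is enough to follow the tree from the catalog down to a single page.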
 128
 129 ### Dissecting the PDFs created by XSane ###
 130
 131 Now that we know what a PDF document looks like, we can go back to our initial
 132 problem and try to find out why my PDF file was so huge. I will walk you through
 133 the PDF object by object.
 134
 135 [[!format pdf <<EOF
 136 %PDF-1.4
 137
 138 1 0 obj
 139    << /Type /Catalog
 140       /Outlines 2 0 R
 141       /Pages 3 0 R
 142    >>
 143 endobj
 144 EOF]]
 145
 146 This is just the magic string declaring the document as PDF-1.4, and the root
 147 object with object number 1, which references objects number 2 for Outlines and
 148 number 3 for pages. We're not interested in outlines, let's look at the pages.
 149
 150 [[!format pdf <<EOF
 151 3 0 obj
 152    << /Type /Pages
 153       /Kids [
 154              6 0 R
 155              8 0 R
 156              10 0 R
 157              12 0 R
 158             ]
 159       /Count 4
 160    >>
 161 endobj
 162 EOF]]
 163
 164 OK, apparently this document has four pages, which are referenced by objects
 165 number 6, 8, 10 and 12. This makes sense, since I scanned four pages ;-)
 166
 167 Let's start with object number 6:
 168
 169 [[!format pdf <<EOF
 170 6 0 obj
 171     << /Type /Page
 172        /Parent 3 0 R
 173        /MediaBox [0 0 596 842]
 174        /Contents 7 0 R
 175        /Resources << /ProcSet 8 0 R >>
 176     >>
 177 endobj
 178 EOF]]
 179
 180 We see that object number 6 is a page object, and the actual content is in
 181 object number 7. More redirection, yay!
 182
 183 [[!format pdf <<EOF
 184 7 0 obj
 185     << /Length 2678332     >>
 186 stream
 187 q
 188 1 0 0 1 0 0 cm
 189 1.000000 0.000000 -0.000000 1.000000 0 0 cm
 190 595.080017 0 0 841.679993 0 0 cm
 191 BI
 192   /W 1653
 193   /H 2338
 194   /CS /G
 195   /BPC 8
 196   /F /FlateDecode
 197 ID
 198 x\9c$¼[\8b$;¾åù!\ 6f\9eú¥\87¡a\1e\ 6æátq.4§
 199 % [ ...byte stream shortened... ]
 200 EI
 201 Q
 202 endstream
 203 endobj
 204 EOF]]
 205
 206 Aha, here is where the magic happens. Object number 7 is a stream object of
 207 2,678,332 bytes (about 2.6 MB) and contains drawing operations! After skipping
 208 around a bit in Adobe’s PDF reference (chapters 3 and 4), here is the annotated
 209 version of the stream content:
 210
 211 [[!format pdf <<EOF
 212 q                 % Save drawing context
 213 1 0 0 1 0 0 cm    % Set up coordinate space for image
 214 1.000000 0.000000 -0.000000 1.000000 0 0 cm
 215 595.080017 0 0 841.679993 0 0 cm
 216 BI                % Begin Image
 217   /W 1653           % Image width is 1653 pixels
 218   /H 2338           % Image height is 2338 pixels
 219   /CS /G            % Color space is Gray
 220   /BPC 8            % 8 bits per color component
 221   /F /FlateDecode   % Filters: data is Deflate-compressed
 222 ID                % Image Data follows:
 223 x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
 224 EI                % End Image
 225 Q                 % Restore drawing context
 226 EOF]]
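The numbers in the annotated stream can be cross-checked with a little arithmetic. An image is drawn into a 1×1 unit square, so the last `cm` line scales it to about 595.08×841.68 units, and the default PDF unit is 1/72 inch. The Python sketch below (the function name `image_dpi` is made up for this example) derives the image resolution from those values:

```python
POINTS_PER_INCH = 72  # default PDF user space unit: 1/72 inch


def image_dpi(pixels, size_in_points):
    """Resolution of an image that spans `pixels` pixels over `size_in_points` points."""
    return pixels / (size_in_points / POINTS_PER_INCH)


# Width and height from /W and /H, drawn size from the last `cm` matrix
horizontal = image_dpi(1653, 595.080017)
vertical = image_dpi(2338, 841.679993)
print(round(horizontal, 1), round(vertical, 1))
```

Both values come out at almost exactly 200, which matches the 200 dpi scanning resolution mentioned at the beginning.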
 227
 228 So now we know why the PDF was so huge: the line `/F /FlateDecode` tells us that
 229 the image data is stored losslessly with [Deflate][] compression (this is
 230 basically what PNG uses). However, scanned images, as well as photographed
 231 pictures, have the tendency to become very big when stored losslessly, due to the
 232 fact that image sensors always add noise from the universe and lossless
 233 compression also has to take account of this noise. In contrast, lossy
 234 compression like JPEG, which uses [discrete cosine transform][dct], only has to
 235 approximate the image (and therefore the noise from the sensor) to a certain
 236 degree, therefore reducing the space needed to save the image. And the PDF
 237 standard also allows image data to be DCT-compressed, by adding `/DCTDecode` to
 238 the filters.
 239
 240 [Deflate]: https://en.wikipedia.org/wiki/DEFLATE "Wikipedia: DEFLATE algorithm"
 241 [dct]: http://en.wikipedia.org/wiki/Discrete_cosine_transform "Wikipedia: Discrete cosine transform"
 242
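The effect of noise on lossless compression is easy to reproduce with Python's `zlib` module, which implements Deflate. The sketch below is a toy under stated assumptions, not a measurement of real scans: the uniform buffer stands in for an ideal, noise-free page, and the random buffer for the worst case in which every pixel is pure noise.

```python
import os
import zlib

SIZE = 100_000  # number of 8-bit gray "pixels" in each test buffer

flat = bytes([200]) * SIZE   # a perfectly uniform light-gray page
noisy = os.urandom(SIZE)     # worst case: every pixel is random noise

flat_packed = zlib.compress(flat)
noisy_packed = zlib.compress(noisy)

# With default settings, zlib output starts with the header bytes 0x78 0x9c
print(flat_packed[:2])
print(len(flat_packed), len(noisy_packed))
```

The uniform buffer shrinks to a few hundred bytes at most, while the random one does not shrink at all; real scans lie somewhere in between. The two header bytes `x\x9c` are also what the shortened byte stream in object 7 starts with.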
 243
 244 Second solution: use a (better) compression algorithm
 245 ------------------
 246
 247 Now that I knew where the problem was, I could try to create PDFs with DCT
 248 compression. I still had the original, uncompressed [PNM][] files that fell out
 249 of XSane’s multipage mode (just look in the multipage project folder), so I
 250 started to play around a bit with [ImageMagick’s][im] `convert` tool, which can
 251 also convert images to PDF.
 252
 253 [im]: http://www.imagemagick.org "ImageMagick homepage"
 254 [PNM]: https://en.wikipedia.org/wiki/Netpbm_format "Wikipedia: Netpbm format"
 255
 256 ### Converting PNM to PDF ###
 257 First, I tried converting the uncompressed PNM to PDF:
 258
 259     $ convert image*.pnm document.pdf
 260
 261 `convert` generally takes parameters of the form `inputfile outputfile`, but it
 262 also allows us to specify more than one input file (which is, oddly enough,
 263 undocumented in the [man page][man-convert]). In that case it tries to create
 264 multi-page documents, if possible. With PDF as output format, this results in
 265 one input file per page.
 266
 267 [man-convert]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"
 268
 269 The embedded image objects looked somewhat like the following:
 270
 271 [[!format pdf <<EOF
 272 8 0 obj
 273 <<
 274     /Type /XObject
 275     /Subtype /Image
 276     /Name /Im0
 277     /Filter [ /RunLengthDecode ]
 278     /Width 1653
 279     /Height 2338
 280     /ColorSpace 10 0 R
 281     /BitsPerComponent 8
 282     /Length 9 0 R
 283 >>
 284 stream
 285 % [ raw byte data ]
 286 endstream
 287 EOF]]
 288
 289 The filter `/RunLengthDecode` indicates that the stream data is compressed with
 290 [Run-length encoding][RLE], another simple lossless compression. Not what I
 291 wanted. (Apart from that, `convert` embeds images as XObjects, but these do not
 292 differ much from the inline images described above.)
 293
 294 [RLE]: https://en.wikipedia.org/wiki/Run-length_encoding "Wikipedia: Run-length encoding"
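To illustrate how simple this scheme is, here is a minimal decoder sketch for `/RunLengthDecode` data in Python, following the description of the filter in the PDF reference: a length byte below 128 means "copy the next length + 1 bytes", a length byte above 128 means "repeat the next byte 257 − length times", and 128 marks the end of the data. The function name and the sample bytes are made up for this example.

```python
def run_length_decode(data):
    """Decode bytes in the run-length format used by the /RunLengthDecode filter."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:        # end-of-data marker
            break
        if length < 128:         # literal run: copy the next length + 1 bytes
            out += data[i:i + length + 1]
            i += length + 1
        else:                    # repeated run: next byte, 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Three literal bytes "abc", then 0xFF repeated four times, then end of data
encoded = bytes([2]) + b"abc" + bytes([253, 0xFF, 128])
print(run_length_decode(encoded))
```

Long runs of identical bytes compress well this way, but noisy scan data rarely contains such runs.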
 295
 296 ### Converting PNM to JPG, then to PDF ###
 297
 298 Next, I converted the PNMs to JPG, then to PDF.
 299
 300     $ convert image*.pnm image.jpg
 301     $ convert image*jpg document.pdf
 302
 303 (The first command creates the output files `image-0.jpg`, `image-1.jpg`, etc.,
 304 since JPG does not support multiple pages in one file.)
 305
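One caveat with numbered output files: a shell glob such as `image*jpg` typically expands in plain string order, so with ten or more pages `image-10.jpg` would be listed before `image-2.jpg`, and the pages could end up out of order in the PDF. The following Python sketch shows the difference between string order and page order; the file names and the helper `natural_key` are illustrative assumptions.

```python
import re


def natural_key(name):
    """Split a file name into text and integer parts so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


names = ["image-0.jpg", "image-1.jpg", "image-10.jpg", "image-2.jpg"]
print(sorted(names))                   # plain string order
print(sorted(names, key=natural_key))  # numeric page order
```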
 306 When looking at the PDF, we see that we now have DCT-compressed images inside
 307 the PDF:
 308
 309 [[!format pdf <<EOF
 310 8 0 obj
 311 <<
 312     /Type /XObject
 313     /Subtype /Image
 314     /Name /Im0
 315     /Filter [ /DCTDecode ]
 316     /Width 1653
 317     /Height 2338
 318     /ColorSpace 10 0 R
 319     /BitsPerComponent 8
 320     /Length 9 0 R
 321 >>
 322 stream
 323 % [ raw byte data ]
 324 endstream
 325 EOF]]
 326
 327 ### Converting PNM to JPG, then to PDF, and fix page size ###
 328
 329 However, the pages in `document.pdf` are 82.47×58.31&nbsp;cm, which results in
 330 about 72&nbsp;dpi with respect to the size of the original images. But `convert`
 331 also allows us to specify the pixel density, so we'll set that to 200&nbsp;dpi
 332 in X and Y direction, which was the resolution at which the images were scanned:
 333
 334     $ convert image*jpg -density 200x200 document.pdf
 335
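The effect of the density setting can be checked with a little arithmetic. The sketch below (the function name `page_size_cm` is made up) converts the pixel dimensions of the scan to a page size for a given resolution: at 72 dpi the page comes out at roughly 58.3×82.5 cm, at 200 dpi at roughly 21.0×29.7 cm, which is A4.

```python
CM_PER_INCH = 2.54


def page_size_cm(width_px, height_px, dpi):
    """Physical page size in centimetres for an image rendered at `dpi`."""
    return (width_px / dpi * CM_PER_INCH, height_px / dpi * CM_PER_INCH)


# Pixel dimensions taken from the /Width and /Height entries shown earlier
for dpi in (72, 200):
    width_cm, height_cm = page_size_cm(1653, 2338, dpi)
    print(dpi, round(width_cm, 2), round(height_cm, 2))
```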
 336 With that approach, I could reduce the size of my PDF from 250&nbsp;MB with
 337 losslessly compressed images to 38&nbsp;MB with DCT compression.
 338
 339 Too long, didn’t read
 340 -----------------
 341
 342 Here’s the gist for you:
 343
 344 *   Read the article above, it’s very comprehensive :P
 345 *   Use `convert` on XSane’s multipage images and specify your
 346     scanning resolution:
 347
 348         $ convert image*.pnm image.jpg
 349         $ convert image*jpg -density 200x200 document.pdf
 350
 351
 352 Further reading
 353 -------------
 354
 355 There is probably software out there which does those things for you, with a
 356 shiny user interface, but I could not find one quickly. What I did find though,
 357 was [this detailed article][scan-to-pdfa], which describes how to get
 358 high-resolution scans with OCR information in PDF/A and DjVu format, using
 359 `scantailor` and `unpaper`.
 360
 361 Also, Didier Stevens helped me understand stream objects in his
 362 [illustrated blogpost][pdf-stream-objects]. He seems to write about PDF more
 363 often, and it was fun to poke around in his blog. There is also a nice script,
 364 [`pdf-parser`][pdf-tools], which helps you visualize the structure of a PDF
 365 document.
 366
 367 [scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
 368 [pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
 369 [pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"
 370
 371 [[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]