[[!meta title="Optimizing XSane's scanned PDFs (also: PDF internals)"]]
[[!meta author="rohieb"]]
[[!meta license="CC-BY-SA 3.0"]]
[[!img defaults size=x200]]

[[!toc levels=2]]

Problem
-------

I use [XSane][xsane] to scan documents for my digital archive. I want them to be
in PDF format and have a reasonable resolution (at least 200&nbsp;dpi, so I
can try OCRing them afterwards). However, the PDFs created by XSane’s multipage
mode are too large, about 250&nbsp;MB for a 20-page document scanned at
200&nbsp;dpi.

[xsane]: http://www.xsane.org/ "XSane homepage"

[[!img xsane-multipage-mode.png caption="XSane’s Multipage mode"]]


First (non-optimal) solution
--------------

At first, I tried to optimize the PDF using [Ghostscript][gs]. I
[[use-ghostscript-to-convert-pdf-files|already wrote]] about how Ghostscript’s
`-dPDFSETTINGS` option can be used to minimize PDFs by rendering the pictures at
a lower resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
(`screen` for 72&nbsp;dpi, `ebook` for 150&nbsp;dpi, `printer` for 300&nbsp;dpi,
and `prepress` for color-preserving 300&nbsp;dpi), but they are pre-defined, and
for my 200&nbsp;dpi images, `ebook` was not enough (I would lose resolution),
while `printer` was too high and would only enlarge the PDF.

[gs]: http://ghostscript.com "Ghostscript homepage"
[gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"


Interlude: PDF Internals
------------------

The best thing to do was to find out how the images were embedded in the PDF.
Since most PDF files are also partly human-readable, I opened my file with vim.
(Also, I was surprised that [vim has syntax highlighting for
PDF](vim-syntax-highlighting.png).) Before we continue, I'll give a short
introduction to the PDF file format (for the long version, see [Adobe’s PDF
reference][pdf-ref]).

[pdf-ref]: http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf "Adobe Portable Document Format, Version 1.4"

### Building Blocks ###
Every PDF file starts with the [magic string][magic] that identifies the version
of the standard which the document conforms to, like `%PDF-1.4`. After that, a
PDF document is made up of the following objects:

[magic]: https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files "Wikipedia: Magic numbers in files"

Boolean values
:   `true` and `false`

Integers and floating-point numbers
:   for example, `1337`, `-23.42` and `.1415`

Strings
:   *   interpreted as literal characters when enclosed in parentheses: `(This
        is a string.)` These can contain escaped characters, particularly
        escaped closing parentheses and control characters: `(This string contains a
        literal \) and some\n newlines.\n)`.
    *   interpreted as hexadecimal data when enclosed in angle brackets:
        `<53 61 6D 70 6C 65>` equals `(Sample)`.

Names
:   starting with a forward slash, like `/Type`. You can think of them like
    identifiers in programming languages.

Arrays
:   enclosed in square brackets:
    `[ -1 4 6 (A String) /AName [ (strings in arrays in arrays!) ] ]`

Dictionaries
:   key-value stores, which are enclosed in double angle brackets. The key must
    be a name, the value can be any object. Keys and values are given in turns,
    beginning with the first key:
    `<< /FirstKey (First Value) /SecondKey 3.14 /ThirdKey /ANameAsValue >>`
    Usually, the first key is `/Type` and defines what the dictionary actually
    describes.

Stream Objects

:   a collection of bytes. In contrast to strings, stream objects are usually
    used for large amounts of data which may not be read entirely, while strings
    are always read as a whole. For example, streams can be used to embed images
    or metadata.

:   Streams consist of a dictionary, followed by the keyword `stream`, the raw
    content of the stream, and the keyword `endstream`. The dictionary describes
    the stream’s length and the filters that have been applied to it, which
    basically define the encoding the data is stored in. For example, data
    streams can be compressed with various algorithms.

The Null Object
:   Represented by the keyword `null`.

Indirect Objects

:   Every object in a PDF document can also be stored as an indirect object,
    which means that it is given a label and can be used multiple times in the
    document. The label consists of two numbers, a positive *object number*
    (which makes the object unique) and a non-negative *generation number*
    (which allows objects to be updated incrementally by appending to the file).

:   Indirect objects are defined by their object number, followed by their
    generation number, the keyword `obj`, the contents of the object, and the
    keyword `endobj`. Example: `1 0 obj (I'm an object!) endobj` defines the
    indirect object with object number 1 and generation number 0, which consists
    only of the string “I'm an object!”. Likewise, more complex data structures
    can be labeled with indirect objects.

:   Referencing an indirect object works by giving the object and generation
    number, followed by an uppercase R: `1 0 R` references the object created
    above. References can be used everywhere a (direct) object could be
    used instead.

Using these objects, a PDF document builds up a tree structure, starting from the
root object, which is a dictionary with the value `/Catalog` assigned to the key
`/Type` (in the file examined below, it has the object number 1). The other
values of this dictionary point to the objects describing the outlines and pages
of the document, which in turn reference other objects describing single pages,
which point to objects describing drawing operations or text blocks, etc.

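To make the object syntax a bit more tangible, here is a minimal Python sketch (definitely not a real PDF parser). It decodes a hexadecimal string and picks indirect object definitions out of a piece of text with a regular expression. The names `decode_hex_string` and `find_indirect_objects` are made up for this illustration, and the regex approach assumes a small, text-only snippet without binary streams.

```python
import re


def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<53 61 6D 70 6C 65>'."""
    digits = "".join(token.strip("<>").split())  # whitespace inside is ignored
    if len(digits) % 2:
        digits += "0"  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits)


# "<object number> <generation number> obj ... endobj"
INDIRECT_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)


def find_indirect_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the raw object body."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in INDIRECT_OBJECT.finditer(data)
    }


sample = (
    b"1 0 obj (I'm an object!) endobj\n"
    b"2 0 obj << /Type /Example /Ref 1 0 R >> endobj\n"
)
objects = find_indirect_objects(sample)

print(decode_hex_string("<53 61 6D 70 6C 65>"))  # b'Sample'
print(objects[(1, 0)])                            # b"(I'm an object!)"
```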

### Dissecting the PDFs created by XSane ###

Now that we know what a PDF document looks like, we can go back to our initial
problem and try to find out why my PDF file was so huge. I will walk you through
the PDF object by object.

[[!format pdf <<EOF
%PDF-1.4

1 0 obj
   << /Type /Catalog
      /Outlines 2 0 R
      /Pages 3 0 R
   >>
endobj
EOF]]

This is just the magic string declaring the document as PDF-1.4, and the root
object with object number 1, which references object number 2 for Outlines and
number 3 for Pages. We're not interested in outlines, so let's look at the pages.

[[!format pdf <<EOF
3 0 obj
   << /Type /Pages
      /Kids [
             6 0 R
             8 0 R
             10 0 R
             12 0 R
            ]
      /Count 4
   >>
endobj
EOF]]

OK, apparently this document has four pages, which are referenced by objects
number 6, 8, 10 and 12. This makes sense, since I scanned four pages ;-)

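If you want to follow such references with a script instead of scrolling around in vim, the `/Kids` array can be picked apart with two regular expressions. This is only a sketch: `kid_references` and `page_count` are names I made up, and it assumes a flat page tree like the one above, where `/Count` equals the number of entries in `/Kids`.

```python
import re

PAGES_OBJECT = """\
3 0 obj
   << /Type /Pages
      /Kids [
             6 0 R
             8 0 R
             10 0 R
             12 0 R
            ]
      /Count 4
   >>
endobj
"""


def kid_references(pages_object: str) -> list:
    """Return the (object number, generation number) pairs listed in /Kids."""
    kids = re.search(r"/Kids\s*\[(.*?)\]", pages_object, re.DOTALL)
    if kids is None:
        return []
    return [
        (int(num), int(gen))
        for num, gen in re.findall(r"(\d+)\s+(\d+)\s+R\b", kids.group(1))
    ]


def page_count(pages_object: str) -> int:
    """Return the value of the /Count key."""
    return int(re.search(r"/Count\s+(\d+)", pages_object).group(1))


kids = kid_references(PAGES_OBJECT)
print(kids)                                   # [(6, 0), (8, 0), (10, 0), (12, 0)]
print(len(kids) == page_count(PAGES_OBJECT))  # True
```
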
Let's start with object number 6:

[[!format pdf <<EOF
6 0 obj
    << /Type /Page
       /Parent 3 0 R
       /MediaBox [0 0 596 842]
       /Contents 7 0 R
       /Resources << /ProcSet 8 0 R >>
    >>
endobj
EOF]]

We see that object number 6 is a page object, and the actual content is in
object number 7. More redirection, yay!

[[!format pdf <<EOF
7 0 obj
    << /Length 2678332     >>
stream
q
1 0 0 1 0 0 cm
1.000000 0.000000 -0.000000 1.000000 0 0 cm
595.080017 0 0 841.679993 0 0 cm
BI
  /W 1653
  /H 2338
  /CS /G
  /BPC 8
  /F /FlateDecode
ID
x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
EI
Q
endstream
endobj
EOF]]

Aha, here is where the magic happens. Object number 7 is a stream object of
2,678,332 bytes (about 2.7 MB) and contains drawing operations! After skipping
around a bit in Adobe’s PDF reference (chapters 3 and 4), here is the annotated
version of the stream content:

[[!format pdf <<EOF
q                 % Save drawing context
1 0 0 1 0 0 cm    % Set up coordinate space for image
1.000000 0.000000 -0.000000 1.000000 0 0 cm
595.080017 0 0 841.679993 0 0 cm
BI                % Begin Image
  /W 1653           % Image width is 1653 pixels
  /H 2338           % Image height is 2338 pixels
  /CS /G            % Color space is Gray
  /BPC 8            % 8 bits per component (= per pixel for Gray)
  /F /FlateDecode   % Filters: data is Deflate-compressed
ID                % Image Data follows:
x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
EI                % End Image
Q                 % Restore drawing context
EOF]]

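A bit of arithmetic on these numbers is quite telling. The following Python lines are just a back-of-the-envelope check, assuming the default PDF unit of 1/72 inch and treating the whole stream length as image data (it also contains the few operators around the image, so the ratio is slightly overestimated). The image comes out at the 200 dpi I scanned with, and Deflate only shrinks the raw pixels to roughly 69 % of their size.

```python
WIDTH_PX, HEIGHT_PX = 1653, 2338              # /W and /H
BITS_PER_COMPONENT = 8                        # /BPC, one component for /CS /G
WIDTH_PT, HEIGHT_PT = 595.080017, 841.679993  # scale factors of the last `cm`
STREAM_LENGTH = 2678332                       # /Length of object 7

POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch

dpi_x = WIDTH_PX / (WIDTH_PT / POINTS_PER_INCH)
dpi_y = HEIGHT_PX / (HEIGHT_PT / POINTS_PER_INCH)
raw_bytes = WIDTH_PX * HEIGHT_PX * BITS_PER_COMPONENT // 8
ratio = STREAM_LENGTH / raw_bytes

print(f"resolution: {dpi_x:.1f} x {dpi_y:.1f} dpi")   # resolution: 200.0 x 200.0 dpi
print(f"uncompressed: {raw_bytes} bytes")             # uncompressed: 3864714 bytes
print(f"compressed to {ratio:.0%} of the raw size")   # compressed to 69% of the raw size
```
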
So now we know why the PDF was so huge: the line `/F /FlateDecode` tells us that
the image data is stored losslessly with [Deflate][] compression (this is
basically what PNG uses). However, scanned images, as well as photographed
pictures, tend to become very big when stored losslessly, because image sensors
always add noise from the universe and lossless compression has to preserve
this noise as well. In contrast, lossy compression like JPEG, which uses
[discrete cosine transform][dct], only has to approximate the image (and
therefore the noise from the sensor) to a certain degree, which reduces the
space needed to save the image. And the PDF standard also allows image data to
be DCT-compressed, by adding `/DCTDecode` to the filters.

[Deflate]: https://en.wikipedia.org/wiki/DEFLATE "Wikipedia: DEFLATE algorithm"
[dct]: http://en.wikipedia.org/wiki/Discrete_cosine_transform "Wikipedia: Discrete cosine transform"

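The effect of noise on lossless compression is easy to reproduce with a toy experiment. The sketch below uses Python's `zlib` module (which implements Deflate) on a synthetic gray gradient, once clean and once with a little random noise added. This is made-up test data, not my scans, and the function name `gradient` and the noise amplitude of 4 gray levels are arbitrary choices for the illustration.

```python
import random
import zlib

WIDTH, HEIGHT = 256, 256


def gradient(noise_amplitude: int, seed: int = 0) -> bytes:
    """8-bit gray gradient from left to right, optionally with sensor-like noise."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    pixels = bytearray()
    for _y in range(HEIGHT):
        for x in range(WIDTH):
            value = x + rng.randint(-noise_amplitude, noise_amplitude)
            pixels.append(min(255, max(0, value)))  # clamp to the 8-bit range
    return bytes(pixels)


clean_size = len(zlib.compress(gradient(0), 9))
noisy_size = len(zlib.compress(gradient(4), 9))

print(f"raw:   {WIDTH * HEIGHT} bytes")
print(f"clean: {clean_size} bytes after Deflate")
print(f"noisy: {noisy_size} bytes after Deflate")
```

On the clean gradient every row is identical, so Deflate can replace almost everything with back-references. With noise, hardly any byte sequence repeats exactly, and the compressed data ends up many times larger.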

Second solution: use a (better) compression algorithm
------------------

Now that I knew where the problem was, I could try to create PDFs with DCT
compression. I still had the original, uncompressed [PNM][] files that fell out
of XSane’s multipage mode (just look in the multipage project folder), so I
started to play around a bit with [ImageMagick’s][im] `convert` tool, which can
also convert images to PDF.

[im]: http://www.imagemagick.org "ImageMagick homepage"
[PNM]: https://en.wikipedia.org/wiki/Netpbm_format "Wikipedia: Netpbm format"

### Converting PNM to PDF ###
First, I tried converting the uncompressed PNM to PDF:

    $ convert image*.pnm document.pdf

`convert` generally takes parameters of the form `inputfile outputfile`, but it
also allows us to specify more than one input file (which is, for some reason,
not documented in the [man page][man-convert]). In that case it tries to create
multi-page documents, if possible. With PDF as output format, this results in
one input file per page.

[man-convert]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"

The embedded image objects looked somewhat like the following:

[[!format pdf <<EOF
8 0 obj
<<
    /Type /XObject
    /Subtype /Image
    /Name /Im0
    /Filter [ /RunLengthDecode ]
    /Width 1653
    /Height 2338
    /ColorSpace 10 0 R
    /BitsPerComponent 8
    /Length 9 0 R
>>
stream
% [ raw byte data ]
endstream
EOF]]

The filter `/RunLengthDecode` indicates that the stream data is compressed with
[Run-length encoding][RLE], another simple lossless compression. Not what I
wanted. (Apart from that, `convert` embeds images as XObjects, but there is not
much difference to the inline images described above.)

[RLE]: https://en.wikipedia.org/wiki/Run-length_encoding "Wikipedia: Run-length encoding"

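To get a feeling for how simple this filter is, here is a small decoder sketch in Python, written from my reading of the filter description in the PDF reference: a length byte below 128 means "copy the next length + 1 bytes", a length byte above 128 means "repeat the next byte 257 - length times", and 128 marks the end of the data. The function name `run_length_decode` is my own. Since noisy scans rarely contain long runs of identical bytes, this scheme cannot save much on them.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode data stored with the PDF /RunLengthDecode filter (sketch)."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:      # end-of-data marker
            break
        if length < 128:       # copy the next length + 1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:                  # repeat the next byte 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# 3 literal bytes "ABC", then "D" repeated 257 - 254 = 3 times, then end of data
encoded = bytes([2]) + b"ABC" + bytes([254]) + b"D" + bytes([128])
print(run_length_decode(encoded))  # b'ABCDDD'
```
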
### Converting PNM to JPG, then to PDF ###

Next, I converted the PNMs to JPG, then to PDF.

    $ convert image*.pnm image.jpg
    $ convert image*jpg document.pdf

(The first command creates the output files `image-0.jpg`, `image-1.jpg`, etc.,
since JPG does not support multiple pages in one file.)

When looking at the PDF, we see that we now have DCT-compressed images inside
the PDF:

[[!format pdf <<EOF
8 0 obj
<<
    /Type /XObject
    /Subtype /Image
    /Name /Im0
    /Filter [ /DCTDecode ]
    /Width 1653
    /Height 2338
    /ColorSpace 10 0 R
    /BitsPerComponent 8
    /Length 9 0 R
>>
stream
% [ raw byte data ]
endstream
EOF]]

### Converting PNM to JPG, then to PDF, and fix page size ###

However, the pages in `document.pdf` are 82.47×58.31&nbsp;cm, which results in
about 72&nbsp;dpi with respect to the size of the original images. But `convert`
also allows us to specify the pixel density, so we'll set that to 200&nbsp;dpi
in both X and Y direction, which was the resolution at which the images were
scanned:

    $ convert image*jpg -density 200x200 document.pdf

With that approach, I could reduce the size of my PDF from 250&nbsp;MB with
losslessly compressed images to 38&nbsp;MB with DCT compression.

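The page sizes are easy to verify: the physical size is just the pixel count divided by the density. Here is a quick Python check (the helper `page_size_cm` is my own name), assuming the image dimensions of 1653×2338 pixels from the PDF objects above. At 72 dpi it reproduces the oversized pages (up to rounding), and at 200 dpi it gives almost exactly A4, which is 21.0×29.7 cm.

```python
CM_PER_INCH = 2.54


def page_size_cm(width_px: int, height_px: int, dpi: float) -> tuple:
    """Physical size of an image that is printed at the given pixel density."""
    return (width_px / dpi * CM_PER_INCH, height_px / dpi * CM_PER_INCH)


for dpi in (72, 200):
    width_cm, height_cm = page_size_cm(1653, 2338, dpi)
    print(f"{dpi:>3} dpi: {width_cm:.2f} x {height_cm:.2f} cm")

#  72 dpi: 58.31 x 82.48 cm
# 200 dpi: 20.99 x 29.69 cm
```
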
Too long, didn’t read
-----------------

Here’s the gist for you:

*   Read the article above, it’s very comprehensive :P
*   Use `convert` on XSane’s multipage images and specify your
    scanning resolution:

        $ convert image*.pnm image.jpg
        $ convert image*jpg -density 200x200 document.pdf

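One thing to watch out for with ten or more pages: the shell expands `image*jpg` in lexicographic order, so a file called `image-10.jpg` would end up before `image-2.jpg` and the pages would be shuffled. The following Python sketch builds the argument list for the second command with the files sorted by their numbers instead. The helper names `natural_key` and `build_pdf_command` and the file names are only examples; the resulting list could be handed to `subprocess.run(command, check=True)`, assuming ImageMagick's `convert` is installed.

```python
import re


def natural_key(name: str) -> list:
    """Split 'image-10.jpg' into ['image-', 10, '.jpg'] so numbers sort numerically."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def build_pdf_command(jpg_files, dpi: int, output: str = "document.pdf") -> list:
    """Build the argument list for the second `convert` call, pages in numeric order."""
    ordered = sorted(jpg_files, key=natural_key)
    return ["convert", *ordered, "-density", f"{dpi}x{dpi}", output]


files = ["image-10.jpg", "image-2.jpg", "image-1.jpg", "image-0.jpg"]
print(build_pdf_command(files, 200))
# ['convert', 'image-0.jpg', 'image-1.jpg', 'image-2.jpg', 'image-10.jpg',
#  '-density', '200x200', 'document.pdf']
```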

Further reading
-------------

There is probably software out there which does those things for you, with a
shiny user interface, but I could not find one quickly. What I did find though,
was [this detailed article][scan-to-pdfa], which describes how to get
high-resolution scans with OCR information in PDF/A and DjVu format, using
`scantailor` and `unpaper`.

Also, Didier Stevens helped me understand stream objects in his
[illustrated blogpost][pdf-stream-objects]. He seems to write about PDF more
often, and it was fun to poke around in his blog. There is also a nice script,
[`pdf-parser`][pdf-tools], which helps you visualize the structure of a PDF
document.

[scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
[pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
[pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"

[[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]