[[!meta title="Optimizing XSane's scanned PDFs (also: PDF internals)"]]
[[!meta author="rohieb"]]
[[!meta license="CC-BY-SA 3.0"]]
[[!img defaults size=x200]]

I use [XSane][xsane] to scan documents for my digital archive. I want them to be
in PDF format and have a reasonable resolution (better than 200 dpi, so I
can try OCRing them afterwards). However, the PDFs created by XSane’s multipage
mode are too large, about 250 MB for a 20-page document scanned at 200 dpi.

[xsane]: http://www.xsane.org/ "XSane homepage"

[[!img xsane-multipage-mode.png caption="XSane’s Multipage mode"]]

First (non-optimal) solution
----------------------------

At first, I tried to optimize the PDF using [GhostScript][gs]. I
[[use-ghostscript-to-convert-pdf-files|already wrote]] about how GhostScript’s
`-dPDFSETTINGS` option can be used to shrink PDFs by rendering the pictures at
a lower resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
(`screen` for 72 dpi, `ebook` for 150 dpi, `printer` for 300 dpi,
and `prepress` for color-preserving 300 dpi), but they are pre-defined, and
for my 200 dpi images, `ebook` was not enough (I would lose resolution),
while `printer` was too high and would only enlarge the PDF.

[gs]: http://ghostscript.com "Ghostscript homepage"
[gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"

Interlude: PDF Internals
------------------------

The best thing to do was to find out how the images were embedded in the PDF.
Since most PDF files are also partly human-readable, I opened my file with vim.
(Also, I was surprised that [vim has syntax highlighting for
PDF](vim-syntax-highlighting.png).) Before we continue, I'll give a short
introduction to the PDF file format (for the long version, see [Adobe’s PDF
reference][pdf-ref]).

[pdf-ref]: http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf "Adobe Portable Document Format, Version 1.4"

### Building Blocks ###

Every PDF file starts with the [magic string][magic] that identifies the version
of the standard which the document conforms to, like `%PDF-1.4`. After that, a
PDF document is made up of the following objects:

[magic]: https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files "Wikipedia: Magic numbers in files"

Integers and floating-point numbers
:   for example, `1337`, `-23.42` and `.1415`

Strings
:   * interpreted as literal characters when enclosed in parentheses:
      `(This is a string.)` These can contain escaped characters, particularly
      escaped closing parentheses and control characters:
      `(This string contains a literal \) and some\n newlines.\n)`.
    * interpreted as hexadecimal data when enclosed in angle brackets:
      `<53 61 6D 70 6C 65>` equals `(Sample)`.

Names
:   starting with a forward slash, like `/Type`. You can think of them as
    identifiers in programming languages.

Arrays
:   enclosed in square brackets:
    `[ -1 4 6 (A String) /AName [ (strings in arrays in arrays!) ] ]`

Dictionaries
:   key-value stores, which are enclosed in double angle brackets. The key must
    be a name, the value can be any object. Keys and values alternate,
    beginning with the first key:
    `<< /FirstKey (First Value) /SecondKey 3.14 /ThirdKey /ANameAsValue >>`
    Usually, the first key is `/Type` and defines what the dictionary actually
    describes.

Stream objects
:   a collection of bytes. In contrast to strings, stream objects are usually
    used for large amounts of data which may not be read entirely, while strings
    are always read as a whole. For example, streams can be used to embed images.

:   Streams consist of a dictionary, followed by the keyword `stream`, the raw
    content of the stream, and the keyword `endstream`. The dictionary describes
    the stream’s length and the filters that have been applied to it, which
    basically define the encoding the data is stored in. For example, data
    streams can be compressed with various algorithms.

Null object
:   Represented by the keyword `null`.

Indirect objects
:   Every object in a PDF document can also be stored as an indirect object,
    which means that it is given a label and can be used multiple times in the
    document. The label consists of two numbers, a positive *object number*
    (which makes the object unique) and a non-negative *generation number*
    (which allows objects to be updated incrementally by appending to the file).

:   Indirect objects are defined by their object number, followed by their
    generation number, the keyword `obj`, the contents of the object, and the
    keyword `endobj`. Example: `1 0 obj (I'm an object!) endobj` defines the
    indirect object with object number 1 and generation number 0, which consists
    only of the string “I'm an object!”. Likewise, more complex data structures
    can be labeled with indirect objects.

:   Referencing an indirect object works by giving the object and generation
    number, followed by an uppercase R: `1 0 R` references the object created
    above. References can be used everywhere a (direct) object could be used.
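
To make a few of these building blocks concrete, here is a minimal Python sketch that reads the magic string, decodes a hexadecimal string, and finds indirect references. The function names are made up for this illustration, and the regular expressions only cover the simple cases shown above; this is not a real PDF parser.

```python
import re


def pdf_version(data: bytes) -> str:
    """Return the version from the %PDF-x.y magic string at the start of a file."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("data does not start with a PDF magic string")
    return match.group(1).decode("ascii")


def decode_hex_string(token: str) -> bytes:
    """Decode a hexadecimal PDF string such as <53 61 6D 70 6C 65>."""
    digits = "".join(token.strip("<>").split())  # whitespace is ignored
    if len(digits) % 2:
        digits += "0"  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits)


def find_references(text: str) -> list:
    """Find indirect references like `1 0 R` as (object, generation) pairs."""
    return [(int(o), int(g)) for o, g in re.findall(r"(\d+)\s+(\d+)\s+R\b", text)]


print(pdf_version(b"%PDF-1.4\n"))
print(decode_hex_string("<53 61 6D 70 6C 65>"))
print(find_references("<< /Type /Catalog /Outlines 2 0 R /Pages 3 0 R >>"))
```

The last call uses a made-up catalog dictionary; it only shows that references are ordinary tokens inside other objects.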

Using these objects, a PDF document builds up a tree structure, starting from the
root object, which is a dictionary with the value `/Catalog` assigned to the key
`/Type`. (In the files discussed here it has the object number 1; in general,
the `/Root` entry of the file trailer points to it.) The other values of this
dictionary point to the objects describing the outlines and pages of the
document, which in turn reference other objects describing single pages, which
point to objects describing drawing operations or text blocks, etc.
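
The tree idea can be modelled in a few lines of Python. The object table below is hypothetical and heavily simplified (real page trees can nest and carry many more keys); it only shows how following references leads from the catalog to the page contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as `3 0 R`."""
    number: int
    generation: int = 0


# Hypothetical object table of a two-page document, keyed by object number.
objects = {
    1: {"/Type": "/Catalog", "/Pages": Ref(2)},
    2: {"/Type": "/Pages", "/Kids": [Ref(3), Ref(5)], "/Count": 2},
    3: {"/Type": "/Page", "/Contents": Ref(4)},
    4: "content stream of page 1",
    5: {"/Type": "/Page", "/Contents": Ref(6)},
    6: "content stream of page 2",
}


def resolve(value):
    """Follow an indirect reference; direct objects are returned unchanged."""
    return objects[value.number] if isinstance(value, Ref) else value


def page_contents(root_number=1):
    """Walk catalog -> pages -> page -> contents and collect the content objects."""
    catalog = objects[root_number]
    pages = resolve(catalog["/Pages"])
    return [resolve(resolve(kid)["/Contents"]) for kid in pages["/Kids"]]


print(page_contents())
```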

### Dissecting the PDFs created by XSane ###

Now that we know what a PDF document looks like, we can go back to our initial
problem and try to find out why my PDF file was so huge. I will walk you through
the PDF object by object.

The file begins with the magic string declaring the document as PDF-1.4, followed
by the root object with object number 1, which references object number 2 for
Outlines and number 3 for Pages. We're not interested in outlines, so let's look
at the pages.

The Pages object shows that this document has four pages, which are referenced
by objects number 6, 8, 10 and 12. This makes sense, since I scanned four
pages ;-)

Let's start with object number 6:

    /MediaBox [0 0 596 842]
    /Resources << /ProcSet 8 0 R >>

We see that object number 6 is a page object, and the actual content is in
object number 7. More redirection, yay!

    << /Length 2678332 >>
    1.000000 0.000000 -0.000000 1.000000 0 0 cm
    595.080017 0 0 841.679993 0 0 cm
    x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]

Aha, here is where the magic happens. Object number 7 is a stream object of
2,678,332 bytes (about 2.6 MB) and contains drawing operations! After skipping
around a bit in Adobe’s PDF reference (chapters 3 and 4), here is the annotated
version of the stream content:

    q                                   % Save drawing context
    1 0 0 1 0 0 cm                      % Set up coordinate space for image
    1.000000 0.000000 -0.000000 1.000000 0 0 cm
    595.080017 0 0 841.679993 0 0 cm
    /W 1653                             % Image width is 1653 pixels
    /H 2338                             % Image height is 2338 pixels
    /CS /G                              % Color space is Gray
    /BPC 8                              % 8 bits per color component
    /F /FlateDecode                     % Filters: data is Deflate-compressed
    ID                                  % Image Data follows:
    x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
    Q                                   % Restore drawing context
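
As a side note, the numbers in this stream already tell us the scan resolution. The last `cm` operator scales the image to 595.08 × 841.68 units, and a unit in default PDF user space is 1/72 inch. A small Python sketch of the arithmetic (the function name is invented for this example):

```python
POINTS_PER_INCH = 72  # one unit of default PDF user space is 1/72 inch


def effective_dpi(pixels: int, points: float) -> float:
    """Pixels per inch when `pixels` are drawn across `points` user space units."""
    return pixels / (points / POINTS_PER_INCH)


dpi_x = effective_dpi(1653, 595.080017)  # /W and the horizontal scale factor
dpi_y = effective_dpi(2338, 841.679993)  # /H and the vertical scale factor
print(round(dpi_x, 1), round(dpi_y, 1))  # prints: 200.0 200.0
```

So XSane embedded the scan at its original 200 dpi, without downsampling.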

So now we know why the PDF was so huge: the line `/F /FlateDecode` tells us that
the image data is stored losslessly with [Deflate][] compression (this is
basically what PNG uses). However, scanned images, as well as photographs,
tend to become very big when stored losslessly, because image sensors always
add noise and lossless compression has to preserve this noise, too. In
contrast, lossy compression like JPEG, which uses the [discrete cosine
transform][dct], only has to approximate the image (and therefore the noise
from the sensor) to a certain degree, which reduces the space needed to save
the image. And the PDF standard also allows image data to be DCT-compressed, by
adding `/DCTDecode` to the filters.

[Deflate]: https://en.wikipedia.org/wiki/DEFLATE "Wikipedia: DEFLATE algorithm"
[dct]: http://en.wikipedia.org/wiki/Discrete_cosine_transform "Wikipedia: Discrete cosine transform"
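
The effect of noise on lossless compression is easy to reproduce. The following Python sketch uses synthetic data, not my scans: it compresses a smooth gray gradient with `zlib` (which implements Deflate), then the same gradient with a few gray levels of random noise added to every pixel. The image size and noise amplitude are arbitrary assumptions.

```python
import random
import zlib

WIDTH, HEIGHT = 256, 256
random.seed(42)  # fixed seed so the run is repeatable

# A smooth horizontal gray gradient, one byte per pixel.
clean = bytes(x for _ in range(HEIGHT) for x in range(WIDTH))

# The same gradient with up to 3 gray levels of random "sensor" noise per pixel.
noisy = bytes(min(255, max(0, p + random.randint(-3, 3))) for p in clean)

clean_size = len(zlib.compress(clean, 9))
noisy_size = len(zlib.compress(noisy, 9))
print(f"clean: {clean_size} bytes, noisy: {noisy_size} bytes")
```

The noisy version should come out many times larger, although the two images would look almost identical to the eye.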

Second solution: use a (better) compression algorithm
------------------------------------------------------

Now that I knew where the problem was, I could try to create PDFs with DCT
compression. I still had the original, uncompressed [PNM][] files that fell out
of XSane’s multipage mode (just look in the multipage project folder), so I
started to play around a bit with [ImageMagick’s][im] `convert` tool, which can
also convert images to PDF.

[im]: http://www.imagemagick.org "ImageMagick homepage"
[PNM]: https://en.wikipedia.org/wiki/Netpbm_format "Wikipedia: Netpbm format"

### Converting PNM to PDF ###

First, I tried converting the uncompressed PNM to PDF:

    $ convert image*.pnm document.pdf

`convert` generally takes parameters of the form `inputfile outputfile`, but it
also allows us to specify more than one input file (which is not really
documented in the [man page][man-convert]). In that case it tries to create
multi-page documents, if possible. With PDF as output format, each input file
becomes one page.

[man-convert]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"

The embedded image objects looked somewhat like the following:

    /Filter [ /RunLengthDecode ]

The filter `/RunLengthDecode` indicates that the stream data is compressed with
[Run-length encoding][RLE], another simple lossless compression. Not what I
wanted. (Apart from that, `convert` embeds images as XObjects, but there is not
much difference to the inline images described above.)

[RLE]: https://en.wikipedia.org/wiki/Run-length_encoding "Wikipedia: Run-length encoding"
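
For the curious, here is a short Python sketch of how a `/RunLengthDecode` stream is decoded, following the description of the filter in the PDF reference: a length byte below 128 means "copy the next length + 1 bytes", a length byte above 128 means "repeat the next byte 257 − length times", and 128 marks the end of the data. The function name and the sample bytes are made up for this example.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode data in the run-length format used by PDF's /RunLengthDecode."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:    # end-of-data marker
            break
        if length < 128:     # copy the next length + 1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:                # repeat the next byte 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


encoded = bytes([2, 0x41, 0x42, 0x43, 253, 0x5A, 128])
print(run_length_decode(encoded))  # prints: b'ABCZZZZ'
```

Runs of identical bytes are rare in a noisy grayscale scan, so this scheme cannot save much space there.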

### Converting PNM to JPG, then to PDF ###

Next, I converted the PNMs to JPG, then to PDF.

    $ convert image*.pnm image.jpg
    $ convert image*jpg document.pdf

(The first command creates the output files `image-1.jpg`, `image-2.jpg`, etc.,
since JPG does not support multiple pages in one file.)

When looking at the PDF, we see that it now contains DCT-compressed images:

    /Filter [ /DCTDecode ]

### Converting PNM to JPG, then to PDF, and fix page size ###

However, the pages in `document.pdf` are 82.47×58.31 cm, which results in
about 72 dpi with respect to the size of the original images. But `convert`
also allows us to specify the pixel density, so we'll set that to 200 dpi
in X and Y direction, which was the resolution at which the images were scanned:

    $ convert image*jpg -density 200x200 document.pdf
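
The relation between pixel size, density and page size is simple arithmetic. As a quick check, this Python sketch (the function name is invented here) computes the page size for the 1653×2338 pixel scans from above at both densities:

```python
CM_PER_INCH = 2.54


def page_size_cm(width_px: int, height_px: int, dpi: float) -> tuple:
    """Physical size in cm of an image when printed at the given density."""
    return (width_px / dpi * CM_PER_INCH, height_px / dpi * CM_PER_INCH)


for dpi in (72, 200):
    width, height = page_size_cm(1653, 2338, dpi)
    print(f"{dpi} dpi: {width:.2f} x {height:.2f} cm")
```

At 72 dpi this gives roughly 58.3 × 82.5 cm, and at 200 dpi roughly 21.0 × 29.7 cm, which is A4.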

*Update:* You can also use the [`-page` parameter][page] to set the page size
directly. It accepts a multitude of predefined paper formats (see link) and will
do the pixel density calculation for you, as well as adding any necessary
offset if the image ratio is not quite exact:

    $ convert image*jpg -page A4 document.pdf

With that approach, I could reduce the size of my PDF from 250 MB with
losslessly compressed images to 38 MB with DCT compression.

Too long, didn’t read
---------------------

Here’s the gist for you:

* Read the article above, it’s very comprehensive :P
* Use `convert` on XSane’s multipage images and specify your
  scanning resolution:

        $ convert image*.pnm image.jpg
        $ convert image*jpg -density 200x200 document.pdf
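
If you would rather script the two steps, they could be wrapped in Python roughly like this. This is only a sketch: the helper names are made up, it assumes ImageMagick's `convert` is on the `PATH`, and it assumes the intermediate files are named `image-<number>.jpg` as described above. The JPGs are sorted by their number so that page 10 does not end up before page 2.

```python
import glob
import re
import subprocess


def jpg_command(pnm_files):
    """Arguments for step 1: convert all PNM scans to numbered JPG files."""
    return ["convert", *pnm_files, "image.jpg"]


def pdf_command(jpg_files, dpi, output="document.pdf"):
    """Arguments for step 2: bundle the JPGs into one PDF at the scan resolution."""
    return ["convert", *jpg_files, "-density", f"{dpi}x{dpi}", output]


def page_number(path):
    """Numeric suffix of names like image-12.jpg, used as the sort key."""
    match = re.search(r"(\d+)\.jpg$", path)
    return int(match.group(1)) if match else 0


def scans_to_pdf(dpi=200):
    """Run both steps in the current directory (requires ImageMagick)."""
    subprocess.run(jpg_command(sorted(glob.glob("image*.pnm"))), check=True)
    jpgs = sorted(glob.glob("image*.jpg"), key=page_number)
    subprocess.run(pdf_command(jpgs, dpi), check=True)
```

Calling `scans_to_pdf(200)` in the multipage project folder would then run both conversions.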

There is probably software out there which does those things for you, with a
shiny user interface, but I could not find one quickly. What I did find though,
was [this detailed article][scan-to-pdfa], which describes how to get
high-resolution scans with OCR information in PDF/A and DjVu format, using
`scantailor` and `unpaper`.

Also, Didier Stevens helped me understand stream objects in his
[illustrated blogpost][pdf-stream-objects]. He seems to write about PDF more
often, and it was fun to poke around in his blog. There is also a nice script,
[`pdf-parser`][pdf-tools], which helps you visualize the structure of a PDF
document.

[scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
[pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
[pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"
[page]: http://www.imagemagick.org/script/command-line-options.php#page "ImageMagick: Command-line Options"

[[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]