[[!meta title="Optimizing XSane's scanned PDFs (also: PDF internals)"]]
[[!meta author="rohieb"]]
[[!meta license="CC-BY-SA 3.0"]]
[[!img defaults size=x200]]

[[!toc levels=2]]

Problem
-------

I use [XSane][xsane] to scan documents for my digital archive. I want them to be
in PDF format and have a reasonable resolution (at least 200 dpi, so I
can try OCRing them afterwards). However, the PDFs created by XSane’s multipage
mode are too large, about 250 MB for a 20-page document scanned at
200 dpi.

[xsane]: http://www.xsane.org/ "XSane homepage"

[[!img xsane-multipage-mode.png caption="XSane’s Multipage mode"]]


First (non-optimal) solution
--------------

At first, I tried to optimize the PDF using [Ghostscript][gs]. I
[[use-ghostscript-to-convert-pdf-files|already wrote]] about how Ghostscript’s
`-dPDFSETTINGS` option can be used to shrink PDFs by rendering the pictures at
a lower resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
(`screen` for 72 dpi, `ebook` for 150 dpi, `printer` for 300 dpi,
and `prepress` for color-preserving 300 dpi), but they are pre-defined, and
for my 200 dpi images, `ebook` was not enough (I would lose resolution),
while `printer` was too high and would only enlarge the PDF.

[gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"


Interlude: PDF Internals
------------------

The best thing to do was to find out how the images were embedded in the PDF.
Since most PDF files are also partly human-readable, I opened my file with vim.
(Also, I was surprised that [vim has syntax highlighting for
PDF](vim-syntax-highlighting.png).) Before we continue, I'll give a short
introduction to the PDF file format (for the long version, see [Adobe’s PDF
reference][pdf-ref]).

[pdf-ref]: http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf "Adobe Portable Document Format, Version 1.4"

### Building Blocks ###
Every PDF file starts with the [magic string][magic] that identifies the version
of the standard which the document conforms to, like `%PDF-1.4`. After that, a
PDF document is made up of the following objects:

[magic]: https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files "Wikipedia: Magic numbers in files"

Boolean values
:   `true` and `false`

Integers and floating-point numbers
:   for example, `1337`, `-23.42` and `.1415`

Strings
:   * interpreted as literal characters when enclosed in parentheses: `(This
      is a string.)` These can contain escaped characters, particularly
      escaped closing parentheses and control characters: `(This string
      contains a literal \) and some\n newlines.\n)`.
    * interpreted as hexadecimal data when enclosed in angle brackets:
      `<53 61 6D 70 6C 65>` equals `(Sample)`.

Names
:   starting with a forward slash, like `/Type`. You can think of them like
    identifiers in programming languages.

Arrays
:   enclosed in square brackets:
    `[ -1 4 6 (A String) /AName [ (strings in arrays in arrays!) ] ]`

Dictionaries
:   key-value stores, which are enclosed in double angle brackets. The key must
    be a name, the value can be any object. Keys and values alternate,
    beginning with the first key:
    `<< /FirstKey (First Value) /SecondKey 3.14 /ThirdKey /ANameAsValue >>`
    Usually, the first key is `/Type` and defines what the dictionary actually
    describes.

Stream Objects

:   a collection of bytes. In contrast to strings, stream objects are usually
    used for large amounts of data which may not be read entirely, while strings
    are always read as a whole. For example, streams can be used to embed images
    or metadata.

:   Streams consist of a dictionary, followed by the keyword `stream`, the raw
    content of the stream, and the keyword `endstream`. The dictionary describes
    the stream’s length and the filters that have been applied to it, which
    basically define the encoding the data is stored in. For example, data
    streams can be compressed with various algorithms.

The Null Object
:   Represented by the keyword `null`.

Indirect Objects

:   Every object in a PDF document can also be stored as an indirect object,
    which means that it is given a label and can be used multiple times in the
    document. The label consists of two numbers, a positive *object number*
    (which makes the object unique) and a non-negative *generation number*
    (which allows objects to be updated incrementally by appending to the
    file).

:   Indirect objects are defined by their object number, followed by their
    generation number, the keyword `obj`, the contents of the object, and the
    keyword `endobj`. Example: `1 0 obj (I'm an object!) endobj` defines the
    indirect object with object number 1 and generation number 0, which consists
    only of the string “I'm an object!”. Likewise, more complex data structures
    can be labeled with indirect objects.

:   Referencing an indirect object works by giving the object and generation
    number, followed by an uppercase R: `1 0 R` references the object created
    above. References can be used anywhere a (direct) object could be used
    instead.
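
To make the string and indirect-object syntax a bit more tangible, here is a minimal Python sketch that decodes a hexadecimal string and splits an indirect object definition into its parts. The function names are made up for this post and do not come from any PDF library, and the regular expression only copes with simple one-object snippets like the example above, not with real files.

```python
import re


def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<53 61 6D 70 6C 65>'."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a hexadecimal string: %r" % token)
    digits = "".join(token[1:-1].split())  # whitespace between digits is ignored
    if len(digits) % 2:                    # a missing final digit counts as 0
        digits += "0"
    return bytes.fromhex(digits)


# Simplified pattern: object number, generation number, 'obj', body, 'endobj'.
INDIRECT_OBJECT = re.compile(r"(\d+)\s+(\d+)\s+obj\s+(.*?)\s+endobj", re.DOTALL)


def parse_indirect_object(text: str):
    """Return (object number, generation number, body) of the first object."""
    match = INDIRECT_OBJECT.search(text)
    if match is None:
        raise ValueError("no indirect object found")
    return int(match.group(1)), int(match.group(2)), match.group(3)


print(decode_hex_string("<53 61 6D 70 6C 65>"))                  # b'Sample'
print(parse_indirect_object("1 0 obj (I'm an object!) endobj"))  # (1, 0, "(I'm an object!)")
```

A real parser would tokenize the file properly instead of using a regular expression, since strings and streams may themselves contain the word `endobj`.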

Using these objects, a PDF document builds up a tree structure, starting from the
root object, which is a dictionary with the value `/Catalog` assigned to the key
`/Type`. In XSane’s output it has the object number 1; in general, the `/Root`
entry of the trailer at the end of the file says which object it is. The other
values of this dictionary point to the objects describing the outlines and pages
of the document, which in turn reference other objects describing single pages,
which point to objects describing drawing operations or text blocks, etc.
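
The tree idea can be sketched in a few lines of Python. The object table below is a hand-written toy example (object numbers and contents are assumptions, not taken from a real file), with dictionaries standing in for PDF dictionaries and a small `Ref` type standing in for references like `3 0 R`.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """Stand-in for an indirect reference such as '3 0 R'."""
    number: int
    generation: int = 0


# Toy object table: object number -> object (hypothetical contents).
objects = {
    1: {"/Type": "/Catalog", "/Outlines": Ref(2), "/Pages": Ref(3)},
    2: {"/Type": "/Outlines", "/Count": 0},
    3: {"/Type": "/Pages", "/Kids": [Ref(4), Ref(5)], "/Count": 2},
    4: {"/Type": "/Page", "/Parent": Ref(3)},
    5: {"/Type": "/Page", "/Parent": Ref(3)},
}


def resolve(value):
    """Follow a reference to the object it points to; pass direct objects through."""
    return objects[value.number] if isinstance(value, Ref) else value


def collect_pages(node):
    """Walk the page tree and return all /Page dictionaries in document order."""
    node = resolve(node)
    if node["/Type"] == "/Page":
        return [node]
    pages = []
    for kid in node["/Kids"]:  # a /Pages node may contain further /Pages nodes
        pages.extend(collect_pages(kid))
    return pages


catalog = objects[1]
print(len(collect_pages(catalog["/Pages"])))  # 2
```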


### Dissecting the PDFs created by XSane ###

Now that we know what a PDF document looks like, we can go back to our initial
problem and try to find out why my PDF file was so huge. I will walk you through
the PDF object by object.

[[!format pdf <<EOF
%PDF-1.4

1 0 obj
<< /Type /Catalog
   /Outlines 2 0 R
   /Pages 3 0 R
>>
endobj
EOF]]

This is just the magic string declaring the document as PDF-1.4, and the root
object with object number 1, which references object number 2 for Outlines and
number 3 for Pages. We're not interested in outlines, so let's look at the pages.

[[!format pdf <<EOF
3 0 obj
<< /Type /Pages
   /Kids [
     6 0 R
     8 0 R
     10 0 R
     12 0 R
   ]
   /Count 4
>>
endobj
EOF]]

OK, apparently this document has four pages, which are referenced by objects
number 6, 8, 10 and 12. This makes sense, since I scanned four pages ;-)
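
If you don't feel like counting references by hand, a short Python sketch can pull them out of the object text. The helper names are hypothetical, and the regular expressions assume the simple layout shown above, so treat this as an illustration rather than a parser.

```python
import re

# The /Pages object from above, pasted as plain text.
PAGES_OBJECT = """\
3 0 obj
<< /Type /Pages
   /Kids [
     6 0 R
     8 0 R
     10 0 R
     12 0 R
   ]
   /Count 4
>>
endobj
"""


def kid_object_numbers(obj_text: str) -> list:
    """Return the object numbers referenced in the /Kids array."""
    kids = re.search(r"/Kids\s*\[(.*?)\]", obj_text, re.DOTALL)
    if kids is None:
        return []
    references = re.findall(r"(\d+)\s+(\d+)\s+R\b", kids.group(1))
    return [int(number) for number, _generation in references]


def declared_count(obj_text: str):
    """Return the value of /Count, or None if the key is missing."""
    match = re.search(r"/Count\s+(\d+)", obj_text)
    return int(match.group(1)) if match else None


print(kid_object_numbers(PAGES_OBJECT))  # [6, 8, 10, 12]
print(declared_count(PAGES_OBJECT))      # 4
```

For a flat page tree like this one, the number of kids and the value of `/Count` agree.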

Let's start with object number 6:

[[!format pdf <<EOF
6 0 obj
<< /Type /Page
   /Parent 3 0 R
   /MediaBox [0 0 596 842]
   /Contents 7 0 R
   /Resources << /ProcSet 8 0 R >>
>>
endobj
EOF]]

We see that object number 6 is a page object, and the actual content is in
object number 7. More indirection, yay!

[[!format pdf <<EOF
7 0 obj
<< /Length 2678332 >>
stream
q
1 0 0 1 0 0 cm
1.000000 0.000000 -0.000000 1.000000 0 0 cm
595.080017 0 0 841.679993 0 0 cm
BI
  /W 1653
  /H 2338
  /CS /G
  /BPC 8
  /F /FlateDecode
ID
x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
EI
Q
endstream
endobj
EOF]]

Aha, here is where the magic happens. Object number 7 is a stream object of
2,678,332 bytes (about 2.7 MB) and contains drawing operations! After skipping
around a bit in Adobe’s PDF reference (chapters 3 and 4), here is the annotated
version of the stream content:

[[!format pdf <<EOF
q                                % Save drawing context
1 0 0 1 0 0 cm                   % Set up coordinate space for image
1.000000 0.000000 -0.000000 1.000000 0 0 cm
595.080017 0 0 841.679993 0 0 cm
BI                               % Begin Image
  /W 1653                        % Image width is 1653 pixels
  /H 2338                        % Image height is 2338 pixels
  /CS /G                         % Color space is Gray
  /BPC 8                         % 8 bits per color component
  /F /FlateDecode                % Filters: data is Deflate-compressed
ID                               % Image Data follows:
x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
EI                               % End Image
Q                                % Restore drawing context
EOF]]
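
As a side note, the numbers in this stream can be cross-checked with a bit of arithmetic. The following Python sketch assumes the default PDF unit of 1/72 inch and uses only values from the listing above. It computes the resolution at which the image is placed on the page, and how much the Deflate compression actually saved compared to the raw pixel data.

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch

# Values taken from the content stream above.
width_px, height_px = 1653, 2338
width_pt, height_pt = 595.080017, 841.679993  # scale factors of the last cm operator
stream_bytes = 2678332                        # /Length of the stream

# Resolution = pixels divided by the printed size in inches.
dpi_x = width_px / (width_pt / POINTS_PER_INCH)
dpi_y = height_px / (height_pt / POINTS_PER_INCH)

# Gray color space with 8 bits per component: one byte per pixel.
raw_bytes = width_px * height_px

print(round(dpi_x, 1), round(dpi_y, 1))             # 200.0 200.0
print(raw_bytes, round(stream_bytes / raw_bytes, 2))  # 3864714 0.69
```

So the image really is placed at 200 dpi, and Deflate only shaves off about 30 % of the raw data. (The stream length also includes the few drawing operators, but they are negligible next to the image data.)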

So now we know why the PDF was so huge: the line `/F /FlateDecode` tells us that
the image data is stored losslessly with [Deflate][] compression (this is
basically what PNG uses). However, scanned images, as well as photographs, tend
to become very large when stored losslessly, because image sensors always add
noise from the universe and lossless compression has to preserve this noise as
well. In contrast, lossy compression like JPEG, which uses the [discrete cosine
transform][dct], only has to approximate the image (and therefore the noise from
the sensor) to a certain degree, which reduces the space needed to save the
image. And the PDF standard also allows image data to be DCT-compressed, by
adding `/DCTDecode` to the filters.

[Deflate]: https://en.wikipedia.org/wiki/DEFLATE "Wikipedia: DEFLATE algorithm"
[dct]: http://en.wikipedia.org/wiki/Discrete_cosine_transform "Wikipedia: Discrete cosine transform"
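
The effect of noise on lossless compression is easy to reproduce. This Python sketch is a toy experiment, not a model of a real scanner: it builds a synthetic grayscale gradient, once clean and once with a little random noise added (image size and noise amplitude are arbitrary assumptions), and compresses both with `zlib`, which implements Deflate.

```python
import random
import zlib

WIDTH, HEIGHT = 400, 300  # arbitrary size of the synthetic test image


def gradient_page(noise_amplitude: int, seed: int = 0) -> bytes:
    """Return 8-bit gray pixels of a horizontal gradient with optional noise."""
    rng = random.Random(seed)
    pixels = bytearray()
    for _y in range(HEIGHT):
        for x in range(WIDTH):
            value = 255 * x // (WIDTH - 1)  # smooth left-to-right gradient
            value += rng.randint(-noise_amplitude, noise_amplitude)
            pixels.append(min(255, max(0, value)))  # clamp to the valid range
    return bytes(pixels)


clean = gradient_page(noise_amplitude=0)
noisy = gradient_page(noise_amplitude=4)

clean_size = len(zlib.compress(clean, 9))
noisy_size = len(zlib.compress(noisy, 9))

print("raw size:  ", len(clean))
print("clean size:", clean_size)
print("noisy size:", noisy_size)
```

The clean gradient consists of identical rows and shrinks to a tiny fraction of its raw size, while the noisy version should come out many times larger, because every random fluctuation has to be stored exactly.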


Second solution: use a (better) compression algorithm
------------------

Now that I knew where the problem was, I could try to create PDFs with DCT
compression. I still had the original, uncompressed [PNM][] files that fell out
of XSane’s multipage mode (just look in the multipage project folder), so I
started to play around a bit with [ImageMagick’s][im] `convert` tool, which can
also convert images to PDF.

[im]: http://www.imagemagick.org "ImageMagick homepage"
[PNM]: https://en.wikipedia.org/wiki/Netpbm_format "Wikipedia: Netpbm format"

### Converting PNM to PDF ###
First, I tried converting the uncompressed PNM to PDF:

    $ convert image*.pnm document.pdf

`convert` generally takes parameters of the form `inputfile outputfile`, but it
also allows us to specify more than one input file (which is somehow not
documented in the [man page][man-convert]). In that case it tries to create
multi-page documents, if possible. With PDF as output format, this results in
one page per input file.

[man-convert]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"

The embedded image objects looked somewhat like the following:

[[!format pdf <<EOF
8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
/Filter [ /RunLengthDecode ]
/Width 1653
/Height 2338
/ColorSpace 10 0 R
/BitsPerComponent 8
/Length 9 0 R
>>
stream
% [ raw byte data ]
endstream
EOF]]

The filter `/RunLengthDecode` indicates that the stream data is compressed with
[Run-length encoding][RLE], another simple lossless compression. Not what I
wanted. (Apart from that, `convert` embeds images as XObjects, but these do not
differ much from the inline images described above.)

[RLE]: https://en.wikipedia.org/wiki/Run-length_encoding "Wikipedia: Run-length encoding"
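
To show how simple this scheme is, here is a small decoder for the run-length format as the PDF reference describes it for `/RunLengthDecode`: a length byte below 128 means "copy the next length + 1 bytes", a length byte above 128 means "repeat the next byte 257 − length times", and 128 marks the end of the data. The function name is my own, and the sketch does no error handling for truncated input.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode data in the run-length format used by PDF's /RunLengthDecode."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:        # end-of-data marker
            break
        if length < 128:         # literal run: copy the next length + 1 bytes
            out += data[i:i + length + 1]
            i += length + 1
        else:                    # repeated run: next byte, 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Three literal bytes "ABC", then "D" repeated three times, then end-of-data.
encoded = bytes([2]) + b"ABC" + bytes([254]) + b"D" + bytes([128])
print(run_length_decode(encoded))  # b'ABCDDD'
```

Run-length encoding only pays off when neighboring bytes are identical, which is rarely the case in a noisy grayscale scan.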

### Converting PNM to JPG, then to PDF ###

Next, I converted the PNMs to JPG, then to PDF.

    $ convert image*.pnm image.jpg
    $ convert image*jpg document.pdf

(The first command creates the output files `image-1.jpg`, `image-2.jpg`, etc.,
since JPG does not support multiple pages in one file.)

When looking at the PDF, we see that we now have DCT-compressed images inside
the PDF:

[[!format pdf <<EOF
8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
/Filter [ /DCTDecode ]
/Width 1653
/Height 2338
/ColorSpace 10 0 R
/BitsPerComponent 8
/Length 9 0 R
>>
stream
% [ raw byte data ]
endstream
EOF]]
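
Instead of opening every PDF in vim, the filters can also be listed with a few lines of Python. This is only a rough sketch with a made-up function name: it scans the raw bytes for `/Filter` entries and for the abbreviated `/F` key of inline images, so it can be fooled by binary stream data that happens to look like such an entry, and it will not see anything stored inside compressed object streams.

```python
import re
from collections import Counter

# Matches '/Filter /Name', '/Filter [ /Name ...' and the inline-image form '/F /Name'.
FILTER_ENTRY = re.compile(rb"/(?:Filter|F)\s*\[?\s*/([A-Za-z0-9]+)")


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count the first filter name of every filter entry found in the data."""
    names = FILTER_ENTRY.findall(pdf_bytes)
    return Counter(name.decode("ascii") for name in names)


# Shortened stand-ins for the two kinds of image shown in this article.
sample = (
    b"BI /W 1653 /H 2338 /CS /G /BPC 8 /F /FlateDecode ID ... EI\n"
    b"8 0 obj << /Type /XObject /Subtype /Image"
    b" /Filter [ /DCTDecode ] >> stream ... endstream\n"
)

for name, count in count_filters(sample).items():
    print(name, count)
```

For a real file, read it in binary mode first, for example `count_filters(open("document.pdf", "rb").read())`.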

### Converting PNM to JPG, then to PDF, and fix page size ###

However, the pages in `document.pdf` are 82.47×58.31&nbsp;cm, which corresponds
to about 72&nbsp;dpi relative to the pixel dimensions of the original images.
But `convert` also allows us to specify the pixel density, so we'll set that to
200&nbsp;dpi in X and Y direction, which was the resolution at which the images
were scanned:

    $ convert image*jpg -density 200x200 document.pdf

With that approach, I could reduce the size of my PDF from 250&nbsp;MB with
losslessly compressed images to 38&nbsp;MB with DCT compression.
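
The page sizes follow directly from the pixel dimensions and the assumed resolution, as this short Python calculation shows (the helper name is made up; the pixel dimensions are the ones from the image objects above):

```python
CM_PER_INCH = 2.54


def page_size_cm(width_px: int, height_px: int, dpi: float) -> tuple:
    """Return the (width, height) in centimeters of an image printed at dpi."""
    width_cm = width_px / dpi * CM_PER_INCH
    height_cm = height_px / dpi * CM_PER_INCH
    return round(width_cm, 2), round(height_cm, 2)


print(page_size_cm(1653, 2338, 72))   # (58.31, 82.48)
print(page_size_cm(1653, 2338, 200))  # (20.99, 29.69)
```

At 72 dpi we get the oversized pages from above (up to rounding in the last digit), and at 200 dpi the result is practically A4 (21.0 × 29.7 cm).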

Too long, didn’t read
-----------------

Here’s the gist for you:

* Read the article above, it’s very comprehensive :P
* Use `convert` on XSane’s multipage images and specify your
  scanning resolution:

        $ convert image*.pnm image.jpg
        $ convert image*jpg -density 200x200 document.pdf
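
One thing to watch out for with longer documents: the shell expands `image*jpg` in alphabetical order, so if the files are numbered without zero padding, `image-10.jpg` ends up before `image-2.jpg`. The following Python sketch is one possible way around that, under the assumption that the page number is the last number before the file extension. It sorts the file names numerically and builds the argument list for the second `convert` call; the helper names are hypothetical.

```python
import re


def page_number(filename: str) -> int:
    """Return the number directly before the file extension, or -1 if none."""
    match = re.search(r"(\d+)(?=\.[^.]+$)", filename)
    return int(match.group(1)) if match else -1


def build_pdf_command(jpg_files, dpi=200, output="document.pdf"):
    """Build the argument list for convert with the pages in numeric order."""
    ordered = sorted(jpg_files, key=page_number)
    return ["convert", *ordered, "-density", f"{dpi}x{dpi}", output]


files = ["image-10.jpg", "image-2.jpg", "image-1.jpg"]
print(build_pdf_command(files))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to actually run ImageMagick, with `glob.glob("image*jpg")` supplying the file names.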


Further reading
-------------

There is probably software out there which does those things for you, with a
shiny user interface, but I could not find one quickly. What I did find, though,
was [this detailed article][scan-to-pdfa], which describes how to get
high-resolution scans with OCR information in PDF/A and DjVu format, using
`scantailor` and `unpaper`.

Also, Didier Stevens helped me understand stream objects in his
[illustrated blogpost][pdf-stream-objects]. He seems to write about PDF more
often, and it was fun to poke around in his blog. There is also a nice script,
[`pdf-parser`][pdf-tools], which helps you visualize the structure of a PDF
document.

[scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
[pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
[pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"

[[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]