--------------
At first, I tried to optimize the PDF using [GhostScript][gs]. I
-[[use-ghostscript-to-convert-pdf-files|already wrote]] about how GhostScript’s
+[[already wrote|use-ghostscript-to-convert-pdf-files]] about how GhostScript’s
`-dPDFSETTINGS` option can be used to minimize PDFs by rendering the pictures to
a smaller resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
(`screen` for 72 dpi, `ebook` for 150 dpi, `printer` for 300 dpi,
for my 200 dpi images, `ebook` was not enough (I would lose resolution),
while `printer` was too high and would only enlarge the PDF.
+[gs]: http://ghostscript.com "Ghostscript homepage"
[gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"
literal \) and some\n newlines.\n)`.
* interpreted as hexadecimal data when enclosed in angled brackets:
`<53 61 6D 70 6C 65>` equals `(Sample)`.
+
Names
: starting with a forward slash, like `/Type`. You can think of them like
identifiers in programming languages.
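
The hexadecimal string notation described above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `decode_pdf_hex_string` is made up for illustration, and it handles nothing beyond the angle-bracket form shown in the example.

```python
def decode_pdf_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<53 61 6D 70 6C 65>'."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("a PDF hex string is enclosed in angle brackets")
    # Whitespace between the hex digits carries no meaning and is dropped.
    digits = "".join(token[1:-1].split())
    # With an odd number of digits, the missing last digit is taken as 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


print(decode_pdf_hex_string("<53 61 6D 70 6C 65>"))  # b'Sample'
```

The result is a byte string, because PDF strings are sequences of bytes rather than text in a particular encoding.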
This is just the magic string declaring the document as PDF-1.4, and the root
object with object number 1, which references objects number 2 for Outlines and
-number 3 for pages. We're not interested in outlines, let's look at the pages.
+number 3 for Pages. We're not interested in outlines, let's look at the pages.
[[!format pdf <<EOF
3 0 obj
/BPC 8
/F /FlateDecode
ID
-x\9c$¼[\8b$;¾åù!\ 6f\9eú¥\87¡a\1e\ 6æátq.4§
-% [ ...byte stream shortened... ]
+x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
EI
Q
endstream
EOF]]
So now we know why the PDF was so huge: the line `/F /FlateDecode` tells us that
-the image ata is stored losslessly with [Deflate][] compression (this is
+the image data is stored losslessly with [Deflate][] compression (this is
basically what PNG uses). However, scanned images, as well as photographed
pictures, have the tendency to become very big when stored losslessly, due to the
fact that image sensors always add noise from the universe and lossless
multi-page documents, if possible. With PDF as output format, this results in
one input file per page.
-[man-converted]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"
+[man-convert]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"
The embedded image objects looked somewhat like the following:
$ convert image*jpg document.pdf
(The first command creates the output files `image-1.jpg`, `image-2.jpg`, etc.,
-since JPG does nut support multiple pages in one file.)
+since JPG does not support multiple pages in one file.)
When looking at the PDF, we see that we now have DCT-compressed images inside
the PDF:
$ convert image*jpg -density 200x200 document.pdf
+*Update:* You can also use the [`-page` parameter][page] to set the page size
+directly. It takes a multitude of predefined paper formats (see link) and will
+do the pixel density calculation for you, as well as adding any necessary
+offset if the image ratio is not quite exact:
+
+ $ convert image*jpg -page A4 document.pdf
+
With that approach, I could reduce the size of my PDF from 250 MB with
losslessly compressed images to 38 MB with DCT compression.
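
The pixel density arithmetic behind `-density 200x200` can be sketched in Python. The function names below are hypothetical, and the 72 dpi fallback is an assumption about what happens when an image carries no resolution information; the sketch only shows how pixel counts, dpi and PDF page size (in points of 1/72 inch) relate to each other.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # PDF page sizes are given in points (1/72 inch)


def pixels_for_paper(width_mm, height_mm, dpi):
    """Pixel dimensions of a scan of the given paper size at `dpi`."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def page_size_in_points(width_px, height_px, dpi):
    """PDF page size that results from placing the image at `dpi`."""
    return (width_px / dpi * POINTS_PER_INCH,
            height_px / dpi * POINTS_PER_INCH)


# An A4 sheet (210 mm x 297 mm) scanned at 200 dpi:
width_px, height_px = pixels_for_paper(210, 297, 200)
print(width_px, height_px)  # 1654 2339

# Declared as 200 dpi, the page comes out close to A4 (595 x 842 points):
print(page_size_in_points(width_px, height_px, 200))

# Treated as 72 dpi (assumed fallback), the page is almost three times too large:
print(page_size_in_points(width_px, height_px, 72))
```

This is why the density has to match the scan resolution: the same pixels can describe an A4 page or a poster, depending on the dpi value attached to them.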
[scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
[pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
[pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"
+[page]: http://www.imagemagick.org/script/command-line-options.php#page "ImageMagick: Command-line Options"
[[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]