wishlist update

diff --git a/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn b/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn
index 529e873..5169a60 100644
--- a/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn
+++ b/blag/post/optimizing-xsane-s-scanned-pdfs.mdwn
@@ -23,7 +23,7 @@ First (non-optimal) solution
  --------------
  
  At first, I tried to optimize the PDF using [GhostScript][gs]. I
-[[use-ghostscript-to-convert-pdf-files|already wrote]] about how GhostScript’s
+[[already wrote|use-ghostscript-to-convert-pdf-files]] about how GhostScript’s
  `-dPDFSETTINGS` option can be used to minimize PDFs by rendering the pictures to
  a smaller resolution. In fact, there are [multiple rendering modes][gs-ps-pdf]
  (`screen` for 96&nbsp;dpi, `ebook` for 150&nbsp;dpi, `printer` for 300&nbsp;dpi,
@@ -31,6 +31,7 @@ and `prepress` for color-preserving 300&nbsp;dpi), but they are pre-defined, and
  for my 200&nbsp;dpi images, `ebook` was not enough (I would lose resolution),
  while `printer` was too high and would only enlarge the PDF.
  
+[gs]: http://ghostscript.com "Ghostscript homepage"
  [gs-ps-pdf]: http://milan.kupcevic.net/ghostscript-ps-pdf/#refs "Ghostscript PDF Reference & Tips"
  
  
@@ -66,6 +67,7 @@ Strings
          literal \) and some\n newlines.\n)`.
      *   interpreted as hexadecimal data when enclosed in angled brackets:
          `<53 61 6D 70 6C 65>` equals `(Sample)`.
+
  Names
  :   starting with a forward slash, like `/Type`. You can think of them like
      identifiers in programming languages.
@@ -145,7 +147,7 @@ EOF]]
  
  This is just the magic string declaring the document as PDF-1.4, and the root
  object with object number 1, which references objects number 2 for Outlines and
-number 3 for pages. We're not interested in outlines, let's look at the pages.
+number 3 for Pages. We're not interested in outlines, let's look at the pages.
  
  [[!format pdf <<EOF
  3 0 obj
@@ -195,8 +197,7 @@ BI
    /BPC 8
    /F /FlateDecode
  ID
-x\9c$¼[\8b$;¾åù!\ 6f\9eú¥\87¡a\1e\ 6æátq.4§
-% [ ...byte stream shortened... ]
+x$¼[$;¾åù!fú¥¡aæátq.4§ [ ...byte stream shortened... ]
  EI
  Q
  endstream
@@ -226,7 +227,7 @@ Q                 % Restore drawing context
  EOF]]
  
  So now we know why the PDF was so huge: the line `/F /FlateDecode` tells us that
-the image ata is stored losslessly with [Deflate][] compression (this is
+the image data is stored losslessly with [Deflate][] compression (this is
  basically what PNG uses). However, scanned images, as well as photographed
  pictures, have the tendency to become very big when stored losslessly, due to the
  fact that image sensors always add noise from the universe and lossless
@@ -264,7 +265,7 @@ undocumented in the [man page][man-convert]). In that case it tries to create
  multi-page documents, if possible. With PDF as output format, this results in
  one input file per page.
  
-[man-converted]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"
+[man-convert]: http://manpages.debian.net/cgi-bin/man.cgi?query=convert "man convert(1)"
  
  The embedded image objects looked somewhat like the following:
  
@@ -301,7 +302,7 @@ Next, I converted the PNMs to JPG, then to PDF.
      $ convert image*jpg document.pdf
  
  (The first command creates the output files `image-1.jpg`, `image-2.jpg`, etc.,
-since JPG does nut support multiple pages in one file.)
+since JPG does not support multiple pages in one file.)
  
  When looking at the PDF, we see that we now have DCT-compressed images inside
  the PDF:
@@ -333,9 +334,27 @@ in X and Y direction, which was the resolution at which the images were scanned:
  
      $ convert image*jpg -density 200x200 document.pdf
  
+*Update:* You can also use the [`-page` parameter][page] to set the page size
+directly. It takes a multitude of predefined paper formats (see link) and will
+do the pixel density calculation for you, as well as adding any necessary
+offset if the image ratio is not quite exact:
+
+    $ convert image*jpg -page A4 document.pdf
+
  With that approach, I could reduce the size of my PDF from 250&nbsp;MB with
  losslessly compressed images to 38&nbsp;MB with DCT compression.
  
+*Another update (2023):* Marcus notified me that it is possible to use
+ImageMagick's `-compress jpeg` option; this way we can leave out the
+intermediate step and convert PNM to PDF directly:
+
+    $ convert image*.pnm -compress jpeg -quality 85 output.pdf
+
+You can also play around with the `-quality` parameter to set the JPEG
+compression level (100% makes almost pristine, but huge images; 1% makes very
+small, very blocky images); 85% should still be readable for most documents
+at that resolution.
+
  Too long, didn’t read
  -----------------
  
@@ -367,5 +386,6 @@ document.
  [scan-to-pdfa]: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ "Konrad Voelkel: Linux, OCR and PDF: Scan to PDF/A"
  [pdf-stream-objects]: http://blog.didierstevens.com/2008/05/19/pdf-stream-objects/ "Didier Stevens: PDF Stream Objects"
  [pdf-tools]: http://blog.didierstevens.com/programs/pdf-tools/ "Didier Stevens: PDF Tools"
+[page]: http://www.imagemagick.org/script/command-line-options.php#page "ImageMagick: Command-line Options"
  
  [[!tag PDF note_to_self howto ImageMagic convert file_formats reference longpost]]
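
A note on the `-density 200x200` and `-page A4` hunk above: the pixel density calculation mentioned there can be sketched in a few lines of Python. This is only an illustration, not part of the commit. It assumes the default PDF unit of 1/72 inch per point, and the function name `pixels_to_points` and the 1654 x 2338 pixel scan size are made up for the example.

```python
def pixels_to_points(pixels: int, dpi: float) -> float:
    """Convert a pixel count to PDF points (1 point = 1/72 inch)."""
    return pixels / dpi * 72.0


# Hypothetical A4 page scanned at 200 dpi: 1654 x 2338 pixels.
width_pt = pixels_to_points(1654, 200)
height_pt = pixels_to_points(2338, 200)

# A4 is nominally 595 x 842 points, so the result should land close to that.
print(f"{width_pt:.1f} x {height_pt:.1f} pt")
```

If the density passed to `convert` does not match the scan resolution, the same arithmetic gives a page that is proportionally too large or too small, which is presumably why the post sets it to the scanned 200 dpi.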