Last updated: January 12, 2013
While working on recovering the print magazine Data Bus, I ran into an unexpected problem: the generated PDFs were ENORMOUS, much larger than the original files.
The problem is that reportlab converts every image that is not a JPEG to RGB, which is massive overkill for bilevel images. The solution was to patch the library. The conversation (part 1 and part 2) is in the mailing list archives.
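To put rough numbers on that overhead, here is a minimal sketch of the raw (pre-compression) pixel data size for the same page stored as bilevel, greyscale and RGB. The page dimensions are a hypothetical example (roughly A4 at 300 dpi), not taken from the Data Bus scans, and the helper name `raw_size` is made up for illustration.

```python
# Bits per pixel of the raw sample data for a few PIL image modes.
BITS_PER_PIXEL = {"1": 1, "L": 8, "RGB": 24, "CMYK": 32}


def raw_size(width, height, mode):
    """Bytes of uncompressed pixel data, each row padded to a whole byte."""
    bytes_per_row = (width * BITS_PER_PIXEL[mode] + 7) // 8
    return bytes_per_row * height


# Hypothetical page size: about A4 at 300 dpi.
WIDTH, HEIGHT = 2480, 3508

for mode in ("1", "L", "RGB"):
    print(mode, raw_size(WIDTH, HEIGHT, mode))
# 1 1087480
# L 8699840
# RGB 26099520
```

Before any ZLIB compression, the RGB form carries 24 times more data than the bilevel form of the same page. How much of that survives compression depends on the image content.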
Message-ID: <4F0322A4.1090605@jcea.es>
Date: Tue, 03 Jan 2012 16:45:40 +0100
From: Jesus Cea <jcea@jcea.es>
To: reportlab-users@lists2.reportlab.com
Subject: Re: [reportlab-users] Optimizing greyscale and bilevel images

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 31/12/11 20:18, Glenn Linderman wrote:
> In my work with bilevel images, I have found that TIFF with Group
> 4 compression seems to produce the smallest files with lossless
> compression, or DJVU even smaller, but I think just slightly lossy.
> I think the bilevel compression used by PDF is also Group 4?

PDF 1.4 and higher support JBIG2 natively.

> Am I correct that PNG doesn't support Group 4 compression, or am
> I missing an option that would allow bilevel PNG files to be as
> small as TIFF/Group 4 ?

The problem is this: Current ReportLab takes the picture and, if it is
a JPEG file, includes it as is in the PDF (if it is a greyscale JPEG,
it exploits that). But if the image source is a PIL object, it ALWAYS
converts it to RGB, exports the raw pixel data and simply applies ZLIB
(deflate) compression to the pixel data.

So using a PIL "true color" image with reportlab will produce huge
files. It is better to save the image as JPEG and import it in
reportlab as a JPEG file, not as a PIL object.

But because of this internal conversion, non "true color" images, like
greyscale, bilevel or indexed images, will waste a lot. If your image
is a pie chart with ten colors, it will be inserted in the PDF as an
RGB blob (24 bits per pixel) with the only "improvement" of a ZLIB
wrapping. This is wasteful and slow.

Here we have a few choices:

1. If I remember correctly, when current code is given an image
filename, it includes the JPEG directly if it is a JPEG. But if the
file has ANY other format (PNG, for instance), it will be imported in
PIL, converted to RGB and inserted as an RGB blob+ZLIB. Besides the
size expansion, you are limited to PIL recognized formats.

It would be nice, for instance, to be able to insert jbig2 files or
TIFF/Group 4 files. Recent versions of the PDF standard support them
natively, so reportlab doesn't need to decode the image; it can insert
the file in the PDF with minimal header parsing, if any (like it
already does with jpeg files). This is nice because, for instance, we
don't have to worry about patents.

Reportlab is generating PDF 1.3 files. I don't know what could go
wrong if we increase the version to 1.4, which allows native jbig2.

2. When inserting a PIL image, current reportlab always converts it to
RGB. I think the lib should natively support bilevel, greyscale and
indexed PIL images. Seems easy enough.

I have written a small and trivial patch to support bilevel (with a
width multiple of 8) and greyscale PIL images. I have patched
"drawInlineImage" because it was way easier than "drawImage" and
enough for my immediate needs. I don't understand the PDF standard
description well enough to implement indexed images.

My patch reduces my generated PDF files to half size, so I am happy,
but I can't invest any more time on this. Holidays are over :).

My patch:

"""
jcea@ubuntu:/usr/lib/python2.6/dist-packages/reportlab/pdfgen$ diff -u pdfimages.py.OLD pdfimages.py
--- pdfimages.py.OLD 2009-02-03 22:26:43.000000000 +0100
+++ pdfimages.py 2011-12-29 00:54:36.923812086 +0100
@@ -100,21 +100,30 @@
         if image.mode == 'CMYK':
             myimage = image
             colorSpace = 'DeviceCMYK'
-            bpp = 4
+            bpp = 4*8
+        elif image.mode == '1' :
+            myimage = image
+            colorSpace = 'DeviceGray'
+            bpp = 1
+        elif image.mode == 'L' :
+            myimage = image
+            colorSpace = 'DeviceGray'
+            bpp = 1*8
         else:
             myimage = image.convert('RGB')
             colorSpace = 'RGB'
-            bpp = 3
+            bpp = 3*8
         imgwidth, imgheight = myimage.size
 
         # this describes what is in the image itself
         # *NB* according to the spec you can only use the short form in inline images
         #imagedata=['BI /Width %d /Height /BitsPerComponent 8 /ColorSpace /%s /Filter [/Filter [ /ASCII85Decode /FlateDecode] ID]' % (imgwidth, imgheight,'RGB')]
-        imagedata=['BI /W %d /H %d /BPC 8 /CS /%s /F [/A85 /Fl] ID' % (imgwidth, imgheight,colorSpace)]
+        imagedata=['BI /W %d /H %d /BPC %d /CS /%s /F [/A85 /Fl] ID' %
+                (imgwidth, imgheight,1 if bpp<8 else 8,colorSpace)]
 
         #use a flate filter and Ascii Base 85 to compress
         raw = myimage.tostring()
-        assert len(raw) == imgwidth*imgheight*bpp, "Wrong amount of data for image"
+        assert len(raw) == imgwidth*imgheight*bpp/8.0, "Wrong amount of data for image"
         compressed = zlib.compress(raw) #this bit is very fast...
         encoded = pdfutils._AsciiBase85Encode(compressed) #...sadly this may not be
         #append in blocks of 60 characters
"""

The bilevel images MUST have a width multiple of 8. I guess this
condition can be lifted, but I needed something "yesterday".

Have a good year!

-- 
Jesus Cea Avion                         _/_/      _/_/_/        _/_/_/
jcea@jcea.es - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
jabber / xmpp:jcea@jabber.org         _/_/    _/_/          _/_/_/_/_/
.                              _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQCVAwUBTwMipJlgi5GaxT1NAQKk4wQAj+dPM5mLxOxKVKENJ7VVoJg0e8HFx1Gt
6lDxm6WPMkeOxuv6R34wwirfwgjezspp04WeNTDtbEzfxA1mCXGm//5ckItk92St
yAecoZlcbDlN+PztuL011/j0Hn6GSkDYwnkhdVwAnPln4cfic23D4zeAYA3UZJ7d
sMgQZzX1PyE=
=yAYn
-----END PGP SIGNATURE-----
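The mode handling in the patch can be restated as a small standalone sketch, which may be easier to follow than the diff. The function names below are hypothetical and are not part of ReportLab; the code only mirrors the branch logic of the patch and builds the same inline image header string. The rounded-up row length is an assumption about how the "width multiple of 8" restriction might be lifted: it relies on PIL padding each row of a mode '1' image to a whole byte when exporting raw data, which has not been checked here against the patched library.

```python
def inline_image_params(mode):
    """Map a PIL mode to (target mode, colour space, bits per component, components).

    Mirrors the branches of the patch: CMYK, bilevel and greyscale images
    are kept as they are, anything else is converted to RGB.
    """
    if mode == "CMYK":
        return "CMYK", "DeviceCMYK", 8, 4
    if mode == "1":
        return "1", "DeviceGray", 1, 1
    if mode == "L":
        return "L", "DeviceGray", 8, 1
    return "RGB", "RGB", 8, 3


def expected_raw_length(width, height, bits_per_component, components):
    """Raw data length in bytes, assuming every row is padded to a whole byte."""
    bits_per_row = width * bits_per_component * components
    return ((bits_per_row + 7) // 8) * height


def inline_image_header(width, height, mode):
    """Build the short-form inline image header used in the patch."""
    _, color_space, bpc, _ = inline_image_params(mode)
    return "BI /W %d /H %d /BPC %d /CS /%s /F [/A85 /Fl] ID" % (
        width, height, bpc, color_space)


print(inline_image_header(100, 50, "1"))
# BI /W 100 /H 50 /BPC 1 /CS /DeviceGray /F [/A85 /Fl] ID
print(expected_raw_length(96, 50, 1, 1))   # width multiple of 8: 600
print(expected_raw_length(100, 50, 1, 1))  # rows padded to 13 bytes: 650
```

For widths that are a multiple of 8, the rounded-up length equals the `imgwidth*imgheight*bpp/8.0` value asserted in the patch, so the sketch agrees with the original check in the case it was written for.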