pdf spam solution idea arni Thu Jun 28 12:03:21 2007


It has come up several times now that people ask for a way to detect PDF spam directly from the PDF content, and not only through headers or other means (hashes, Bayes). I've found a solution that should be pretty easy to realise in a FuzzyOCR-like plugin. Here is what it should do:

Use xpdf (http://www.foolabs.com/xpdf/download.html) to read the PDF document
Export the images to PPM files using `pdfimages`
Export the text parts to a plain text file using `pdftotext`

This plugin should run as one of the first, so that the extracted raw text is available to the plugins that run after it (for example by attaching it as an extra MIME part, or somehow internally). The extracted images should be made available to FuzzyOCR or similar in the same way.
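
To make the steps above a bit more concrete, here is a rough sketch in Python. It is not a SpamAssassin plugin, only an illustration of the extraction part. It assumes `pdfimages` and `pdftotext` from xpdf are on the PATH; the function names, the `img` and `body.txt` file names, the timeout and the UTF-8 decoding are all made up for this example.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the pdfimages and pdftotext command lines for one PDF.

    The output names ("img", "body.txt") are hypothetical choices.
    """
    out_dir = Path(out_dir)
    image_root = out_dir / "img"      # pdfimages writes img-NNN.ppm / img-NNN.pbm
    text_file = out_dir / "body.txt"  # pdftotext writes the plain text here
    return (
        ["pdfimages", str(pdf_path), str(image_root)],
        ["pdftotext", str(pdf_path), str(text_file)],
    )


def extract_pdf(pdf_path, out_dir, timeout=30):
    """Run both tools and return (text, sorted list of image paths).

    The timeout is an assumed safeguard against PDFs that make the
    tools hang; it is not something xpdf itself prescribes.
    """
    out_dir = Path(out_dir)
    images_cmd, text_cmd = build_commands(pdf_path, out_dir)
    for cmd in (images_cmd, text_cmd):
        # check=True raises CalledProcessError if the tool fails
        subprocess.run(cmd, check=True, timeout=timeout)
    # Assumption: decode as UTF-8 and replace anything that does not fit,
    # since the real output encoding depends on the pdftotext settings.
    text = Path(text_cmd[-1]).read_text(encoding="utf-8", errors="replace")
    images = sorted(
        p for p in out_dir.glob("img-*") if p.suffix in (".ppm", ".pbm")
    )
    return text, images
```

A real plugin would call something like `extract_pdf(attachment_path, tmp_dir)` on each PDF attachment, hand the returned text to the text rules and pass the image paths on to FuzzyOCR. How that handover is done inside the mail filter is left open here.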

Unfortunately I won't be able to write such a plugin myself. It should be rather easy to do, but I can't start learning Perl just for this ;-)

Maybe I gave some hints ...