Dear All, I would like to have an application that can read through a document, say a PDF, to extract its contents and do a word count, etc. Is there a script I can follow? Thank you.
The easiest solution on Linux for PDF files might be to use the pdftotext command-line utility: http://pnaplinux.blogspot.com/2008/11/in-linux-flavours-how-to-pdf2txt.html
1) Convert the PDF file to a text file.
2) Read the content of the text file into a string, split this string on whitespace, and then count the tokens.
See here for an example of how to split a string into words: http://www.php.net/manual/en/function.preg-split.php#92632
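The two steps above can be sketched roughly as follows. This is only an illustration in Python, not a tested recipe for your setup: it assumes pdftotext is installed and on the PATH, and the function names (pdf_to_text, split_words, word_count) are made up for this example. Passing "-" as the output file tells pdftotext to write the text to standard output instead of a file. In PHP, the same split can be done with preg_split and the pattern \s+.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Step 1: convert a PDF to plain text.

    Assumes the pdftotext utility is installed and on the PATH.
    The "-" argument sends the extracted text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise an error if pdftotext fails
    )
    return result.stdout


def split_words(text):
    """Step 2a: split the text on runs of whitespace, dropping empty tokens."""
    return [token for token in re.split(r"\s+", text) if token]


def word_count(text):
    """Step 2b: count the whitespace-separated tokens."""
    return len(split_words(text))
```

With a file called, say, report.pdf (a placeholder name), the count would be word_count(pdf_to_text("report.pdf")). Note that splitting on whitespace counts punctuation attached to a word as part of that word, which is usually fine for a simple total.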
I don't think images matter to the pdftotext utility. Since images are not text, they will be ignored, so any words that appear only inside an image (a scanned page, for example) will not be counted.