I am looking for effective ways of storing text strings from a pdf document in a database. Are there effective pdf text string readers, or is there a way to "un" pdf the document (into text strings)?
As for products, there is one called Text Extraction Toolkit.
A word of warning: whenever possible, work with the data generator to have the output generated in a digestible format for later use. Most composition engines can also give a line data version of the document, saving you this step. I have seen a lot of output where the resulting text was fragmented because of how the composition engine created the document in the first place. The text will appear in the order it was applied to the PDF page, not in the order you would normally read it.
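If you do end up extracting from the PDF itself, the ordering problem above can often be worked around by sorting the fragments by their position on the page before storing them. Here is a rough sketch in Python, assuming your extraction tool can hand you each fragment with its x/y coordinates. The `(x, y, text)` tuple format, the `fragments_to_lines` and `store_lines` helpers, the `pdf_text` table, and the 2-point tolerance are all made up for illustration; they are not part of any particular product. It also assumes a simple single-column page with PDF-style coordinates (y increases upward).

```python
import sqlite3
from typing import List, Tuple

# Hypothetical fragment format: (x, y, text), with y growing upward
# from the bottom of the page as in PDF user space.
Fragment = Tuple[float, float, str]


def fragments_to_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines and return them in reading order."""
    lines: List[Tuple[float, List[Fragment]]] = []
    # Visit fragments from the top of the page downward.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Same line if the baseline is close to the line's first fragment.
        if lines and abs(lines[-1][0] - frag[1]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))
    # Within each line, order fragments left to right and join them.
    return [
        " ".join(text for _, _, text in sorted(group, key=lambda f: f[0]))
        for _, group in lines
    ]


def store_lines(conn: sqlite3.Connection, doc_name: str, page: int, lines: List[str]) -> None:
    """Store one row per reconstructed line (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_text "
        "(doc TEXT, page INTEGER, line_no INTEGER, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO pdf_text VALUES (?, ?, ?, ?)",
        [(doc_name, page, i, line) for i, line in enumerate(lines, start=1)],
    )
    conn.commit()


if __name__ == "__main__":
    # Fragments listed in the order they were drawn, not the reading order.
    drawn = [
        (300.0, 700.0, "Number"),
        (72.0, 650.0, "Total:"),
        (72.0, 700.0, "Account"),
        (150.0, 650.0, "42.00"),
    ]
    conn = sqlite3.connect(":memory:")
    store_lines(conn, "sample.pdf", 1, fragments_to_lines(drawn))
    for row in conn.execute("SELECT line_no, content FROM pdf_text ORDER BY line_no"):
        print(row)
```

Multi-column layouts, rotated text, and tables would need more than this, which is another reason to ask for line data at the source when you can.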
Douglas Hanna is a member of the Production Print Technology team at Aon.
www.aonhewitt.com