Our company (www.alkhawarizmy.com/en/index_en.php) has developed an Arabic search engine, and we need to convert Arabic PDF files correctly to HTML files so that we can include PDF files in our new desktop search product.
At the moment Arabic PDF files are not converted correctly. We are willing to purchase any component that will allow us to perform the conversion properly on the .NET platform.
Is there ANYTHING available (preferably a .NET DLL or ActiveX/COM component) that can do this?
I would very much appreciate your help, as we would like to include support for Arabic PDF files in our KSearch Desktop Edition product.
Thanks and best regards.
Hossam Mahgoub
President and CEO
AlKhawarizmy Language Software
Arabic, along with Hebrew and Thai, is considered a Level IV language: the complexity is multiplied because the shape and the meaning of each glyph depend on how it relates to the glyphs around it.
As you have developed an Arabic search engine, I am sure you are already aware of this - the above is for others that may be following along at home.
One of the difficult problems you will run into when extracting a Level IV language from PDF has to do with how the glyphs were put in there. In other words, how were the typography and the selected font handled? If they just used character overlay (very common), with no relation to Unicode, you will have to manually dissect the document. If they used a full Unicode font, but the composition engine wasn't Arabic aware, you are better off, but will still have text extraction issues.
The best you can hope for is a Unicode font and an Arabic-aware composition engine.
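To tell which of these cases a given document falls into, it can help to inspect the text that your extraction tool returns. The following is only a rough sketch in Python, assuming you already have the extracted text as a string; the function names (diagnose, normalize_arabic) are made up for illustration and are not part of any product mentioned here. It counts characters that have no usable Unicode mapping and characters that came out as positional presentation forms, and folds the latter back to base Arabic letters with NFKC normalization. It does not attempt to fix visual-order (reversed) text.

```python
import unicodedata

# Arabic Presentation Forms-A and -B hold positional glyph variants
# (isolated, initial, medial, final) and ligatures. Some PDF producers
# emit these instead of the base letters in U+0600..U+06FF.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFC))


def is_presentation_form(ch):
    """True if ch lies in one of the Arabic presentation-form ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES)


def is_unmapped(ch):
    """True for Private Use Area characters and U+FFFD, which usually
    mean the font carried no usable Unicode mapping for the glyph."""
    return unicodedata.category(ch) == "Co" or ch == "\ufffd"


def diagnose(text):
    """Hypothetical helper: count the problem characters in extracted text."""
    return {
        "presentation_forms": sum(is_presentation_form(c) for c in text),
        "unmapped": sum(is_unmapped(c) for c in text),
        "total": len(text),
    }


def normalize_arabic(text):
    """Fold presentation forms and ligatures to base letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # LAM INITIAL FORM + ALEF FINAL FORM, then a Private Use character.
    sample = "\ufedf\ufe8e\ue000"
    print(diagnose(sample))
    print([hex(ord(c)) for c in normalize_arabic(sample)])
```

A high "unmapped" count suggests the overlay case, where manual work is likely. A high "presentation_forms" count suggests the text is recoverable and should be normalized before it is indexed, so that a search for the base letters still matches.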
As for the text extraction, I cannot give the name of the company, but they sell a product called 'Text Extraction Toolkit'. It is Unicode aware. That should provide you with the development tool you are looking for.
Douglas Hanna is a member of the Production Print Technology team at Aon.
www.aonhewitt.com