Arabic PDF Conversion to other file formats

hemahgoub
Registered: Sep 24 2007
Posts: 4

Our company (www.alkhawarizmy.com/en/index_en.php) has developed an Arabic search engine and we need to be able to convert Arabic pdf files correctly to html files, in order to include pdf files in our new desktop search product.
 
At the moment Arabic pdf files are not converted correctly. We are willing to purchase any component that will allow us to perform the conversion properly on the .NET platform.
Is there ANYTHING available (preferably a .NET DLL or ActiveX COM component) that can do this?
 
I would very much appreciate your help, as we would like to include support for Arabic pdf files in our KSearch Desktop Edition Product.
 
Thanks and best regards.
Hossam Mahgoub
President and CEO
AlKhawarizmy Language Software

My Product Information:
Acrobat Standard 8.1, Windows

dthanna
ExpertTeam
Registered: Sep 28 2005
Posts: 248
I am not a native speaker of Arabic, or even a fluent second-language one, but I will take a stab at this.

Arabic, along with Hebrew and Thai, is considered a Level IV language: its complexity is multiplied because the glyphs, and the meaning of those glyphs, depend on how they relate to each other.

As you have developed an Arabic search engine, I am sure you are already aware of this - the above is for others that may be following along at home.

One of the difficult problems you will run into when extracting a Level IV language from a PDF has to do with how the glyphs were put in there. In other words, how were the typography and the selected font handled? If they just used character overlay (very common), with no relation to UNICODE, you will have to manually dissect the document. If they used a full UNICODE font, but the composition engine wasn't Arabic aware, you are better off, but will still have text extraction issues.

The best you can hope for is a UNICODE font and an Arabic aware composition engine.
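
For anyone following along, here is a rough sketch of one extraction issue of this kind, under the assumption that the extractor returns positional glyph code points (the Unicode "Arabic Presentation Forms" blocks) instead of the base letters a search engine would index. The function names are made up for illustration. It uses Python's standard unicodedata module; NFKC normalization folds those presentation forms back to base letters, but it does not repair visual (reversed) ordering or glyphs that have no Unicode mapping at all.

```python
import unicodedata

# Arabic Presentation Forms-A and -B: positional glyph variants that a
# non-Arabic-aware workflow may leave in the extracted text.
# (U+FEFF, the byte order mark, is deliberately excluded.)
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFC))


def has_presentation_forms(text):
    """Return True if any character is an Arabic presentation-form glyph."""
    return any(
        low <= ord(ch) <= high
        for ch in text
        for (low, high) in PRESENTATION_RANGES
    )


def to_base_letters(text):
    """Fold presentation forms to base letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical extractor output: BEH initial form, ALEF final form,
# BEH isolated form.
extracted = "\ufe91\ufe8e\ufe8f"
if has_presentation_forms(extracted):
    extracted = to_base_letters(extracted)
```

After the call, extracted holds the base letters BEH, ALEF, BEH (U+0628, U+0627, U+0628), which is the form a query typed on an Arabic keyboard would match.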

As for the text extraction - I cannot give the name of the company, but they sell a product called 'Text Extraction Toolkit'. It is UNICODE aware. That should provide you with the development tool you are looking for.

Douglas Hanna is a member of the Production Print Technology team at Aon.
www.aonhewitt.com

hemahgoub
Registered: Sep 24 2007
Posts: 4
Thanks very much for your very comprehensive answer. In fact, the PDF-to-Text converter I am currently using puts out UNICODE characters in the text stream; this would mean I still need to "re-compose" the double-byte characters to single-byte characters for display purposes (whether it be UTF-8 or CP-1256), but that is not a very big deal.
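
A minimal sketch of that re-encoding step, assuming the extracted text is already available as a Unicode string (the sample word and the helper name are illustrative only). It uses Python's built-in codecs. One thing it makes visible: CP-1256 stores each Arabic letter in a single byte, while UTF-8 uses two bytes per Arabic letter, and characters outside CP-1256 need an explicit error policy.

```python
def encode_for_display(text, encoding):
    """Encode Unicode text; unmappable characters become '?'."""
    return text.encode(encoding, errors="replace")


# Sample word: SEEN, LAM, ALEF, MEEM (four base Arabic letters).
word = "\u0633\u0644\u0627\u0645"

utf8_bytes = encode_for_display(word, "utf-8")      # 2 bytes per letter
cp1256_bytes = encode_for_display(word, "cp1256")   # 1 byte per letter

# Both encodings round-trip back to the same Unicode string.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == word
    and cp1256_bytes.decode("cp1256") == word
)

# A presentation-form glyph has no CP-1256 slot, so it is replaced.
lossy = encode_for_display("\ufe8f", "cp1256")
```

The lossy case suggests normalizing presentation forms to base letters before targeting CP-1256, since the code page only covers the base Arabic letters.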

It would be useful, though, if Adobe could put some effort into supporting correct export of Arabic PDF documents, especially since the Arabic customer base will be expanding rapidly in the next decade. We know this from having developed an award-winning business plan to cover the Arabic market.

Thanks again and best wishes,
Hossam Mahgoub