Unable to read an Arabic PDF file that uses a custom font

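Problem description

I have an Arabic PDF file that uses a custom font. When I try to read the file, some words come out unreadable and some characters are replaced by other characters or symbols.

This is the link to the PDF file I am working with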

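import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaAnalysis {
    // Extracts the text of the document with the Tika facade and also saves it to a Word file.
    public static String extractContentUsingFacade(InputStream stream) throws IOException, TikaException {
        Tika tika = new Tika();
        String content = tika.parseToString(stream);
        try {
            // Pass the extracted text; the original referenced an undefined variable here.
            WriteOnWordDoc(content);
        } catch (Exception e) {
            e.printStackTrace();
        }

        return content;
    }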

    // Writes the extracted text into a single paragraph of extractedContent.docx.
    public static void WriteOnWordDoc(String fileContent) throws Exception {
        // try-with-resources closes the document and the output stream even if writing fails
        try (XWPFDocument document = new XWPFDocument();
             FileOutputStream fos = new FileOutputStream(new File("extractedContent.docx"))) {
            XWPFParagraph tmpParagraph = document.createParagraph();
            XWPFRun tmpRun = tmpParagraph.createRun();
            tmpRun.setText(fileContent);
            tmpRun.setFontSize(10);
            document.write(fos);
        }
    }

    public static void main(String[] args) {
        String path = "File.pdf";
        // The buffered stream is the one handed to Tika, and try-with-resources
        // closes it exactly once, whether or not extraction succeeds.
        try (InputStream input = new BufferedInputStream(new FileInputStream(new File(path)))) {
            TikaAnalysis.extractContentUsingFacade(input);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Tags: java, arabic

Solution
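
One way to narrow this down is to look at which kinds of characters the extraction actually produced. The sketch below is a diagnostic, not a fix, and it rests on assumptions: the class `ArabicTextDiagnostics` and its methods are hypothetical names introduced here, and the input is the string returned by `extractContentUsingFacade`. It counts code points per Unicode block and, separately, applies NFKC normalization, which maps Arabic presentation forms (for example U+FE8D) back to the standard letters (U+0627).

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Tika or POI: inspects text extracted from a PDF.
public class ArabicTextDiagnostics {

    /** Per-category code point counts for one extracted string. */
    public static final class Counts {
        public int arabic;            // standard Arabic block, U+0600..U+06FF
        public int presentationForms; // Arabic Presentation Forms-A and -B
        public int privateUse;        // Private Use Area, U+E000..U+F8FF
        public int replacement;       // U+FFFD REPLACEMENT CHARACTER
        public int other;             // everything else except whitespace
    }

    /** Classifies every non-whitespace code point of the text. */
    public static Counts classify(String text) {
        Counts counts = new Counts();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (cp == 0xFFFD) {
                counts.replacement++;
            } else if (block == Character.UnicodeBlock.ARABIC) {
                counts.arabic++;
            } else if (block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                    || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B) {
                counts.presentationForms++;
            } else if (block == Character.UnicodeBlock.PRIVATE_USE_AREA) {
                counts.privateUse++;
            } else {
                counts.other++;
            }
        }
        return counts;
    }

    /** Maps presentation forms to base letters via NFKC (this also folds other compatibility characters). */
    public static String normalizePresentationForms(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }
}
```

If most of the output falls under `presentationForms`, calling `normalizePresentationForms` before `WriteOnWordDoc` may already make the text readable. A large share of `privateUse`, `replacement`, or unexpected `other` characters would instead suggest that the custom font carries no usable Unicode mapping; in that case no post-processing of the string can recover the letters, and running OCR on the rendered pages is the usual fallback.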
