java - How can I change the final HTML output when using pdf2dom with Java?

Problem description

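I want to convert a PDF document to an HTML file and keep the HTML output as close as possible to the original PDF. For this I am using Pdf2Dom. However, for business reasons, I need to move the <style> element from the head into the body. The naive solution I tried was to read the text content of the <style> element and write it at the end of my document, as follows: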

 public InputStream fileToHtml(InputStream inputStream) throws IOException {

    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    Writer writer = new BufferedWriter(new OutputStreamWriter(outputStream));
    PDFDomTree parser = new PDFDomTree();
    PDDocument pdf = PDDocument.load(inputStream);
    Document dom = parser.createDOM(pdf);

    Node styleNode = dom.getElementsByTagName("style").item(0);
  
    String content = styleNode.getTextContent();
    outputStream.write(("<style>" + content + "</style>").getBytes());
    parser.writeText(pdf, writer);


    return new ByteArrayInputStream(outputStream.toByteArray());
}

However, I have two problems with this solution:

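1. The <style> element is not in the body but at the very end of the document, which I do not want.
2. The <style> element is duplicated: one copy is at the end of the document (see 1.) and one is still in the head, because it was never removed.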

So I tried another approach and modified the node before the conversion, as follows:

 public InputStream fileToHtml(InputStream inputStream) throws IOException {

    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    Writer writer = new BufferedWriter(new OutputStreamWriter(outputStream));
    PDFDomTree parser = new PDFDomTree();
    PDDocument pdf = PDDocument.load(inputStream);
    Document dom = parser.createDOM(pdf);

    Node node = dom.getElementsByTagName("body").item(0);

    // I just change the content of the body part to check if the final HTML output changed 
    node.setTextContent("my new content");
    // I do not get "my new content" in the final HTML output, however the content of the node is "my new content" according to the terminal
    System.out.println(node.getTextContent());

    parser.writeText(pdf, writer);


    return new ByteArrayInputStream(outputStream.toByteArray());
}

However, instead of the expected plain "my new content", I get the original PDF content.

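The last thing I tried was to create a new document from the initial one, manipulate it, and then convert it. (In this example I do not modify the content at all; I only create a new document from the original to check whether the approach is feasible.)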

 public InputStream fileToHtml(InputStream inputStream) throws IOException, ParserConfigurationException, TransformerException {

    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    Writer writer = new BufferedWriter(new OutputStreamWriter(outputStream));
    PDFDomTree parser = new PDFDomTree();
    PDDocument pdf = PDDocument.load(inputStream);
    Document dom = parser.createDOM(pdf);

    ByteArrayOutputStream outputStream1 = new ByteArrayOutputStream();
    Source xmlSource = new DOMSource(dom);
    Result outputTarget = new StreamResult(outputStream1);
    TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);
    InputStream is = new ByteArrayInputStream(outputStream1.toByteArray());
    PDDocument newPdf = PDDocument.load(is);

    parser.writeText(newPdf, writer);


    return new ByteArrayInputStream(outputStream.toByteArray());
}

However, I get the following error message: java.io.IOException: Error: End-of-File, expected line
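
A plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: PDDocument.load expects the bytes of a PDF file, while the stream built in the last attempt holds the markup serialized from the DOM. The sketch below uses plain Java to check for the "%PDF-" marker that opens a PDF file. The names PdfHeaderCheck and startsWithPdfHeader are hypothetical and are not part of PDFBox or Pdf2Dom.

```java
import java.nio.charset.StandardCharsets;

class PdfHeaderCheck {

    /** Returns true if the data begins with the "%PDF-" marker that opens a PDF file. */
    static boolean startsWithPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-ins for real input: a PDF header and the start of serialized markup.
        byte[] pdfLike = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] markup = "<html><head></head><body></body></html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithPdfHeader(pdfLike));
        System.out.println(startsWithPdfHeader(markup));
    }
}
```

Running such a check on outputStream1.toByteArray() before calling PDDocument.load would show whether the bytes are PDF data at all.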

Tags: java

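Solution

I solved this by using the HTML parser Jsoup. I first parse the PDF file, then convert the resulting DOM into an input stream that I pass to the Jsoup parser, and apply my modifications there: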

    PDFDomTree parser = new PDFDomTree();
    PDDocument pdf = PDDocument.load(inputStream);
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

    org.w3c.dom.Document dom = parser.createDOM(pdf);
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    Source xmlSource = new DOMSource(dom);
    Result outputTarget = new StreamResult(byteArrayOutputStream);
    TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);
    InputStream is = new ByteArrayInputStream(byteArrayOutputStream.toByteArray());

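    org.jsoup.nodes.Document htmlDoc = Jsoup.parse(is, null, "");
    // Assumption: pdf2domStyle (not defined in the original snippet) is the <style> markup taken from the head
    String pdf2domStyle = htmlDoc.head().select("style").outerHtml();
    Element body = htmlDoc.body();
    body.append(pdf2domStyle);
    // Only the body, which now carries the style, is written out
    outputStream.write(body.outerHtml().getBytes());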

    outputStream.close();
    byteArrayOutputStream.close();

    return new ByteArrayInputStream(outputStream.toByteArray());
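
As a possible alternative that avoids a second parser, the same move can be sketched directly on the org.w3c.dom.Document. This is a minimal sketch under stated assumptions, not a tested Pdf2Dom recipe: a small hand-written document stands in for the result of parser.createDOM(pdf), and the names StyleMover, moveStyleToBody and moveStyleAndSerialize are hypothetical. It serializes the modified DOM with a Transformer instead of calling parser.writeText, on the assumption (consistent with the second attempt in the question) that writeText rebuilds its output from the PDDocument rather than from the edited DOM.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class StyleMover {

    /** Moves the first style element into body; appendChild detaches it from its old parent. */
    static void moveStyleToBody(Document dom) {
        Node style = dom.getElementsByTagName("style").item(0);
        Node body = dom.getElementsByTagName("body").item(0);
        if (style != null && body != null) {
            body.appendChild(style);
        }
    }

    /** Parses the markup, moves the style element, and serializes the modified DOM as HTML. */
    static String moveStyleAndSerialize(String markup) {
        try {
            // Stand-in for parser.createDOM(pdf): parse well-formed markup into a DOM.
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            moveStyleToBody(dom);

            // Serialize the DOM itself, so the edit is reflected in the output.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not move the style element", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html><head><style>.p{color:red}</style></head>"
                + "<body><div>text</div></body></html>";
        System.out.println(moveStyleAndSerialize(sample));
    }
}
```

Because appendChild relocates the existing node instead of copying it, this approach leaves a single style element, placed at the end of the body.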
