How to extract other symbols from a PDF file in Scala

Problem description

Given: a PDF file. I want to extract the symbols from that PDF file.

Tried:

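import akka.stream.scaladsl.{Flow, Sink}
import akka.util.ByteString
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import scala.concurrent.Await
import scala.concurrent.duration._

// Concatenate all incoming chunks into a single ByteString.
val foldedFlow = Flow[ByteString].fold(ByteString.empty)(_ ++ _)

// Log the size of each chunk as it passes through.
val logFlow = Flow.fromFunction { bytes: ByteString =>
  logger.info("Received test bytes: " + bytes.length)
  bytes
}

// Block for up to 10 seconds until the whole response body has been read.
val result: ByteString = Await.result(
  response.entity.dataBytes
    .via(logFlow)
    .via(foldedFlow)
    .runWith(Sink.head[ByteString])(client.materializer),
  10.seconds)

// Extract the text with PDFBox; the document is closed even if extraction fails.
// The try expression evaluates to the extracted text.
val pdf = PDDocument.load(result.toArray[Byte])
try {
  new PDFTextStripper().getText(pdf)
} finally {
  pdf.close()
}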

Input:

 私は素晴らしいよ原因こんにちは、これは日本語のテキストの例は、正しくレンダリ
ングです!
S0001 HEADACHE Mar 22, 2014
S0008 NAUSEA May 18, 2014
S0011 STOMACACHE Feb 12, 2008
S0001 HEADACHE Mar 22, 2014
S0008 NAUSEA May 18, 2014
S0011 STOMACACHE Feb 12, 2008

Output:

S0001 HEADACHE Mar 22, 2014
S0008 NAUSEA May 18, 2014
S0011 STOMACACHE Feb 12, 2008
S0001 HEADACHE Mar 22, 2014
S0008 NAUSEA May 18, 2014
S0011 STOMACACHE Feb 12, 2008

PDFTextStripper fails to extract "私は素晴らしいよ原因こんにちは、これは日本語のテキストの例は、正しくレン" from the file. How can I fix this? Any suggestions?

Tags: scala, pdf, pdfbox

Solution
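
One way to narrow this down is to check which fonts the PDF uses and whether each one carries a ToUnicode map. PDFTextStripper can only return characters it can map back to Unicode, and a font without a usable mapping is one common reason for CJK text going missing. It is not confirmed to be the cause here. The sketch below assumes PDFBox 2.x (the line that provides PDDocument.load, as in the question). The object FontDiagnostics and the helper describeFonts are hypothetical names, not part of PDFBox. It only inspects fonts declared directly in each page's resources, so fonts used inside form XObjects are not listed.

```scala
import org.apache.pdfbox.cos.COSName
import org.apache.pdfbox.pdmodel.PDDocument
// On Scala 2.13+ scala.jdk.CollectionConverters._ is the non-deprecated equivalent.
import scala.collection.JavaConverters._

// Hypothetical diagnostic helper, not part of PDFBox.
object FontDiagnostics {

  // Returns one description line per font declared in the page resources.
  def describeFonts(pdfBytes: Array[Byte]): List[String] = {
    val doc = PDDocument.load(pdfBytes) // PDFBox 2.x API
    try {
      // Strict Lists are used so everything is read before the document is closed.
      for {
        (page, index) <- doc.getPages.asScala.toList.zipWithIndex
        resources     <- Option(page.getResources).toList // may be null
        fontName      <- resources.getFontNames.asScala.toList
        font          <- Option(resources.getFont(fontName)).toList // may be null
      } yield {
        // True when the font dictionary has a /ToUnicode entry.
        val hasToUnicode = font.getCOSObject.containsKey(COSName.TO_UNICODE)
        s"page ${index + 1}: ${font.getName} " +
          s"embedded=${font.isEmbedded} toUnicode=$hasToUnicode"
      }
    } finally {
      doc.close()
    }
  }
}
```

Calling FontDiagnostics.describeFonts(result.toArray[Byte]).foreach(println) on the downloaded bytes lists the fonts per page. If the font used for the Japanese line reports toUnicode=false, the PDF itself may lack the information needed for extraction, and regenerating it with a properly embedded font or falling back to OCR are the usual options to consider. PDFBox also logs warnings when it cannot map a character code to Unicode, so the application log is worth checking as well.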

