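python - How do I put text extracted from a PDF with Tika into JSON?
Question
I would like to know whether text extracted from a PDF with Tika Python can be put into JSON, so that I can later import it into the corresponding records in my system. Below is the code I use to return the parsed text from a PDF.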
from tika import parser

def extract_text(file):
    parsed = parser.from_file(file)
    parsed_text = parsed['content']
    return parsed_text

file_name_with_extension = input("Enter File Name:")
text = extract_text(file_name_with_extension)
print(text)
Answer
Is this what you mean:
from tika import parser
import json

def extract_text(file):
    parsed = parser.from_file(file)
    # Serialize the metadata dict returned by Tika as pretty-printed JSON
    parsed_text = json.dumps(parsed['metadata'], indent=2)
    return parsed_text

text = extract_text('Untitled.pdf')
print(text)
Output:
{
  "Content-Type": "application/pdf",
  "Creation-Date": "2021-07-31T12:15:55Z",
  "Last-Modified": "2021-07-31T12:15:55Z",
  "Last-Save-Date": "2021-07-31T12:15:55Z",
  "X-Parsed-By": [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.pdf.PDFParser"
  ],
  "X-TIKA:content_handler": "ToTextContentHandler",
  "X-TIKA:embedded_depth": "0",
  "X-TIKA:parse_time_millis": "26",
  "access_permission:assemble_document": "true",
  "access_permission:can_modify": "true",
  "access_permission:can_print": "true",
  "access_permission:can_print_degraded": "true",
  "access_permission:extract_content": "true",
  "access_permission:extract_for_accessibility": "true",
  "access_permission:fill_in_form": "true",
  "access_permission:modify_annotations": "true",
  "created": "2021-07-31T12:15:55Z",
  "date": "2021-07-31T12:15:55Z",
  "dc:format": "application/pdf; version=1.3",
  "dc:title": "Untitled",
  "dcterms:created": "2021-07-31T12:15:55Z",
  "dcterms:modified": "2021-07-31T12:15:55Z",
  "meta:creation-date": "2021-07-31T12:15:55Z",
  "meta:save-date": "2021-07-31T12:15:55Z",
  "modified": "2021-07-31T12:15:55Z",
  "pdf:PDFVersion": "1.3",
  "pdf:charsPerPage": "1393",
  "pdf:docinfo:created": "2021-07-31T12:15:55Z",
  "pdf:docinfo:creator_tool": "Pages",
  "pdf:docinfo:modified": "2021-07-31T12:15:55Z",
  "pdf:docinfo:producer": "",
  "pdf:docinfo:title": "Untitled",
  "pdf:encrypted": "false",
  "pdf:hasMarkedContent": "true",
  "pdf:hasXFA": "false",
  "pdf:hasXMP": "false",
  "pdf:unmappedUnicodeCharsPerPage": "0",
  "producer": "",
  "resourceName": "b'Untitled.pdf'",
  "title": "Untitled",
  "xmp:CreatorTool": "Pages",
  "xmpTPg:NPages": "1"
}
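The snippet above serializes only the metadata, while the question asks about the extracted text. A minimal sketch of one way to store both in a single JSON file follows. The function names (`build_record`, `save_record`, `pdf_to_json`) and the record keys (`source`, `text`, `metadata`) are made up for illustration and are not part of Tika; adapt them to whatever your target system expects. The sketch also assumes `parsed['content']` may be `None` (for example, for a PDF with no extractable text) and falls back to an empty string.

```python
import json


def build_record(parsed, source):
    """Turn a Tika parse result (a dict) into a JSON-serializable record."""
    # 'content' may be None when no text could be extracted
    content = parsed.get("content") or ""
    return {
        "source": source,                           # hypothetical key: originating file
        "text": content.strip(),                    # extracted text without surrounding whitespace
        "metadata": parsed.get("metadata") or {},   # Tika metadata dict, if any
    }


def save_record(record, json_path):
    """Write the record to disk as UTF-8 JSON."""
    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        json.dump(record, f, ensure_ascii=False, indent=2)


def pdf_to_json(pdf_path, json_path):
    """Parse a PDF with Tika and save its text and metadata as JSON."""
    # Imported here so the helpers above can be used without Tika installed
    from tika import parser

    parsed = parser.from_file(pdf_path)
    record = build_record(parsed, pdf_path)
    save_record(record, json_path)
    return record
```

Calling `pdf_to_json('Untitled.pdf', 'Untitled.json')` would write one JSON file per PDF, which can later be read back with `json.load` and imported into the matching record.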