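python - How to ignore tables and their contents when extracting text from a PDF
Problem description
So far I have successfully extracted the text content from a PDF file. I am now stuck at the point where I need to extract only the text outside the tables (ignoring the tables and their contents), and I need help with that.
The PDF can be downloaded from here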
import pdfplumber

pdfinstance = pdfplumber.open(r'\List of Reportable Jurisdictions for 2020 CRS information reporting_9 Feb.pdf')

for epage in range(len(pdfinstance.pages)):
    page = pdfinstance.pages[epage]
    text = page.extract_text(x_tolerance=3, y_tolerance=3)
    print(text)
Solution
For the PDF you have shared, you can use the following code to extract the text outside the tables:
import pdfplumber


def not_within_bboxes(obj):
    """Return True if the object lies outside every table's bbox."""

    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)

    # `bboxes` is the module-level list rebuilt for each page in the loop below.
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)


with pdfplumber.open("file.pdf") as pdf:
    for page in pdf.pages:
        print("\n\n\n\n\nAll text:")
        print(page.extract_text())

        # Get the bounding boxes of the tables on the page.
        bboxes = [
            table.bbox
            for table in page.find_tables(
                table_settings={
                    "vertical_strategy": "explicit",
                    "horizontal_strategy": "explicit",
                    "explicit_vertical_lines": page.curves + page.edges,
                    "explicit_horizontal_lines": page.curves + page.edges,
                }
            )
        ]

        print("\n\n\n\n\nText outside the tables:")
        print(page.filter(not_within_bboxes).extract_text())
I am using the .filter() method provided by pdfplumber to drop any objects that fall inside the bounding box of any of the tables (the check is done in not_within_bboxes(...)). This creates a filtered version of the page that contains only the objects that fall outside all of the tables.
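The midpoint test behind that filter can be tried on its own, without opening a PDF. The following is only a minimal sketch: make_outside_filter, the sample bounding box and the two character-like dictionaries are hypothetical names and values made up for illustration, and the bounding boxes are passed in explicitly instead of being read from a module-level variable.

```python
def make_outside_filter(bboxes):
    """Return a predicate that is True for objects lying outside every bbox."""

    def midpoint_in_bbox(obj, bbox):
        # Same midpoint rule as in the answer: an object belongs to a box
        # when its centre point falls inside (x0, top, x1, bottom).
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = bbox
        return x0 <= h_mid < x1 and top <= v_mid < bottom

    def outside_all(obj):
        return not any(midpoint_in_bbox(obj, bbox) for bbox in bboxes)

    return outside_all


# Hypothetical table bounding box and two character-like objects.
table_bboxes = [(50, 100, 300, 200)]
char_in_table = {"x0": 60, "x1": 66, "top": 120, "bottom": 130}
char_in_body = {"x0": 60, "x1": 66, "top": 40, "bottom": 50}

keep = make_outside_filter(table_bboxes)
print(keep(char_in_table))  # False: the midpoint (63, 125) is inside the box
print(keep(char_in_body))   # True: the midpoint (63, 45) is above the box
```

Assuming the per-page bboxes list is built as in the answer, the same predicate could be used as page.filter(make_outside_filter(bboxes)).extract_text(), which avoids relying on a global variable.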