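python - Open, save and extract text from PDF links in a Python data frame
Question
I want to loop through PDF links stored in a Python data frame. The goal is to open each PDF link, save the PDF, extract its text, and store the text for each link in a new column.
The data frame looks like this: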
URL
0 https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf
1 https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.pdf
2 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-2_Full-Application-Template_Senegal-RMNCAH-Activity_English-Version.docx
3 https://westafricatradehub.com/wp-content/uploads/2021/07/Pi%C3%A8ce-Jointe-2_Mod%C3%A8le-de-Demande-Complet_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.docx
4 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-3_Trade-Hub-Performance-Indicators-Table.xlsx
5 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-10_Project-Budget-Template-RMNCAH.xlsx
6 https://westafricatradehub.com/wp-content/uploads/2021/08/Senegal-Health-RFA-Webinar-QA.pdf
7 https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Catalytic-Business-Concepts-Round-2.pdf
8 https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Concepts-d%E2%80%99Affaires-Catalytiques-2ieme-Tour.pdf
9 https://westafricatradehub.com/wp-content/uploads/2021/06/APS-WATIH-1247_Research-Development-Round-2.pdf
I can do this for a single link, but not for the whole data frame:
import urllib.request

pdf_link = "https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf"

def download_file(download_url, filename):
    # Download the file and save it as <filename>.pdf in the working directory
    response = urllib.request.urlopen(download_url)
    file = open(filename + ".pdf", 'wb')
    file.write(response.read())
    file.close()

download_file(pdf_link, "Test")

# Code to extract text from PDF
import textract
text = textract.process("/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/Test.pdf")
print(text)
Thanks!
Answer
Here you go:
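import urllib.request
import textract

def download_file(download_url, filename):
    # Save the downloaded bytes as <filename>.pdf in the working directory
    response = urllib.request.urlopen(download_url)
    with open(filename + ".pdf", 'wb') as file:
        file.write(response.read())

df['Text'] = ''
for i in range(df.shape[0]):
    pdf_link = df.iloc[i, 0]
    download_file(pdf_link, f"pdf_{i}")
    # The path must point at the file saved above (the working directory is assumed to be this folder)
    text = textract.process(f"/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/pdf_{i}.pdf")
    # textract.process returns bytes; decode them and assign by position rather than via chained indexing
    df.iloc[i, df.columns.get_loc('Text')] = text.decode('utf-8')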
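Note that several links in the question's data frame point to .docx and .xlsx files, and a single failed download or extraction would stop the loop above. Below is a minimal sketch of a more defensive variant, not a definitive implementation. The helper names local_name and extract_texts, the file_{i} naming scheme, and the folder parameter are my own assumptions for illustration. The extraction function is passed in as an argument so you can hand it textract.process (assuming your textract install has the backends for the file types involved).

```python
from pathlib import Path
from urllib.parse import unquote, urlparse
import urllib.request


def local_name(url, index):
    """Build a local file name that keeps the extension found in the URL (hypothetical naming scheme)."""
    suffix = Path(unquote(urlparse(url).path)).suffix.lower()
    return f"file_{index}{suffix}"


def download_file(url, target):
    """Download url to target (a pathlib.Path) and return target."""
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target


def extract_texts(urls, extract, download=download_file, folder="."):
    """Return one text per URL, in order; None marks a link that failed."""
    texts = []
    for i, url in enumerate(urls):
        target = Path(folder) / local_name(url, i)
        try:
            raw = extract(str(download(url, target)))
        except Exception as exc:  # broad on purpose: one bad link should not stop the loop
            print(f"skipping {url}: {exc}")
            texts.append(None)
            continue
        # Extractors such as textract.process return bytes; decode them to str
        texts.append(raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw)
    return texts
```

With the data frame from the question this could be used as df['Text'] = extract_texts(df['URL'], textract.process), which fills the new column in one assignment and leaves None in the rows that could not be processed.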