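python - Loop over multiple HTML files and convert them to CSV
Problem description
I have 32 separate html files. Each one holds data in table format with 8 columns, and each file covers one particular species of fungus.
I need to convert the 32 html files into 32 csv files containing the data. I have a script that works for a single file, but I can't figure out how to process all 32 files systematically with a few commands instead of running the same command 32 times.
This is the script I am using to try to make it loop over all 32 files: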
import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []  # one list shared by every file
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # First table on the page, skipping the header row
        HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
        for element in HTML_data:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except:
                    continue
            data.append(sub_data)
data
Here is some of the output from the script above, shortened for reproducibility:
[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Kenya',
'Present',
'',
'Introduced',
'',
'',
'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Malawi, Ministry of Agriculture (1990)',
''],
['Mozambique',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
''],
['Nigeria',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
''],
['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Casulli (1979); Martin et al. (1997)',
''],
['Zambia',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
''],
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Ethiopia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Libya',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Morocco',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Mozambique',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['South Africa',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Sudan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
['Uganda',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['Afghanistan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Armenia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Azerbaijan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]
I think what I need is for each species to be formatted more like this: [[info_species1],[info_species1],[info_species1]], [[info_species2],[info_species2],[info_species2]]. In other words, in my output I need:
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
'']], # AN EXTRA SQUARE BRACKET RIGHT HERE
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
Solution
Have you considered just reading the table tags with pandas?
import os
import pandas as pd

directory = r'../html/species'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        # One CSV per species, named after the source HTML file
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns one DataFrame per <table>; take the first
            table = pd.read_html(f)[0]
        table.to_csv(csv_filename, index=False)
        print(csv_filename)
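If pandas is not available, the same per-file idea can be sketched with only the standard library. The point is that the row list is created fresh for each file, so every species ends up in its own CSV instead of one shared list. This is only a rough sketch, not a drop-in replacement: the names `FirstTableRows`, `html_to_rows` and `convert_directory` are made up for illustration, and it assumes the files are UTF-8, that every `<td>`/`<th>` is explicitly closed, and that the first table contains no nested tables.

```python
import csv
import os
from html.parser import HTMLParser


class FirstTableRows(HTMLParser):
    """Collect the cell text of each row in the first <table> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text fragments of the cell being read, or None
        self._in_table = False  # True while inside the first table
        self._done = False      # True once the first table has closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._in_table:
            self._done = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_to_rows(html):
    """Return the rows of the first table, skipping the header row."""
    parser = FirstTableRows()
    parser.feed(html)
    return parser.rows[1:]


def convert_directory(src_dir, out_dir):
    """Write one CSV per .html file in src_dir; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for filename in sorted(os.listdir(src_dir)):
        if not filename.endswith(".html"):
            continue
        with open(os.path.join(src_dir, filename), encoding="utf-8") as f:
            rows = html_to_rows(f.read())  # fresh list for every file
        out_path = os.path.join(out_dir, filename[:-len(".html")] + ".csv")
        with open(out_path, "w", newline="", encoding="utf-8") as out:
            csv.writer(out).writerows(rows)
        written.append(out_path)
    return written
```

Calling `convert_directory(r'../html/species', 'csv')` would then write one file per species into a `csv` folder (the output folder name is just an example). Cell text is stripped of surrounding whitespace, so the region rows such as `Africa` come out as clean single-cell rows.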