Loop over multiple HTML files and convert them to CSV

Problem description

I have 32 separate HTML files containing data in table format, with 8 columns of data in each. Each file covers a particular species of fungus.

I need to convert the 32 HTML files into 32 CSV files of data. I have a script that works for a single file, but I can't figure out how to run it systematically over all 32 files instead of running the command 32 times.

Here is the script I'm using while trying to get it to loop over all 32 files:

import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            # first table in the file, skipping the header row
            HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
            for element in HTML_data:
                sub_data = []
                for sub_element in element:
                    try:
                        sub_data.append(sub_element.get_text())
                    except:
                        continue
                data.append(sub_data)
data

Here is some of the output from the script above, trimmed down for reproducibility:

[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Kenya',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Malawi, Ministry of Agriculture (1990)',
  ''],
 ['Mozambique',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
  ''],
 ['Nigeria',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
  ''],
 ['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Casulli (1979); Martin et al. (1997)',
  ''],
 ['Zambia',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
 ['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
 ['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
 ['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  ''],
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Ethiopia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Libya',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Morocco',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Mozambique',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['South Africa',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Sudan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
 ['Uganda',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['Afghanistan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Armenia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Azerbaijan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]

I think what I need is for each species to be formatted more like this: [[info_species1],[info_species1],[info_species1]], [[info_species2],[info_species2],[info_species2]]. Or, in terms of my output, I need the following (one way to get there is sketched after this snippet):

['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  '']], # AN EXTRA SQUARE BRACKET RIGHT HERE
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',

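A minimal way to get that extra level of nesting, sketched against the loop from the question: accumulate each file's rows in a per-file list, then append that whole list to data once per file. The name file_data is introduced here purely for illustration.

import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []  # one entry per species file
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            rows = soup.find_all("table")[0].find_all("tr")[1:]
            file_data = []  # rows belonging to this file only
            for element in rows:
                file_data.append([cell.get_text() for cell in element.find_all(['td', 'th'])])
            data.append(file_data)  # nest per file instead of flattening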
Tags: python, csv, web-scraping, beautifulsoup, data-cleaning

Solution


Have you considered just reading the table tags with pandas?

import pandas as pd
import os

directory = r'../html/species'

for filename in os.listdir(directory):
    if filename.endswith('.html'):
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns one DataFrame per <table>; take the first
            table = pd.read_html(f.read())[0]
            table.to_csv(csv_filename, index=False)
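Note that, as written, the CSV files land in the current working directory rather than next to the HTML sources; join csv_filename with directory if you want them alongside the originals. And if you also want the per-species nested list from the question rather than CSV files, a minimal sketch along the same lines, assuming the first table in each file holds the data:

import pandas as pd
import os

directory = r'../html/species'
data = []  # [[rows for species 1], [rows for species 2], ...]

for filename in sorted(os.listdir(directory)):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # fillna('') keeps empty cells as empty strings,
            # matching the output shown in the question
            table = pd.read_html(f.read())[0].fillna('')
            # values.tolist() gives one list of cell values per row,
            # so each file contributes its own nested list
            data.append(table.values.tolist())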
