Creating a multidimensional list while parsing HTML in Python

Question

With some help, I managed to correct the code from the link and extend it to:

from bs4 import BeautifulSoup
import re
import os

BeforeFinalJob = []
TheFinalResultThatUsed = []
ListOfFiles = []

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            ListOfFiles += [thefile]
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                TokensToClean = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']


                for somewords in range(len(TokensToClean)):
                    if TokensToClean[somewords] not in removementWords:
                        BeforeFinalJob.append(TokensToClean[somewords])
TheFinalResultThatUsed = list( dict.fromkeys(BeforeFinalJob) )

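However, the list of files and the list of words end up in separate lists. How can I get the result as list[a][b], where list[a] is the file name and list[a][b] is a word from file list[a]? Also, list[a] should not contain any duplicates.

For example, list[a] == file and list[a][b] == some word that appears in that file

UPD: a ZIP archive with some of the HTML files: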

https://www94.zippyshare.com/v/vJgy2sk1/file.html

Tags: python, parsing, multidimensional-array, beautifulsoup

Solution


The following code builds one entry of this shape per file:

'filename.html': {
    'path': 'D:\\folder\\filename.html',
    'words': [
        'one',
        'two'
    ]
}

You can access the data like this:

filename = 'filename.html'
print(result[filename]['words'][1])
# two
print(result[filename]['path'])
# D:\folder\filename.html
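
The question asks for list[a][b] style indexing, while the structure above is a dictionary keyed by file name. If a plain nested list is really needed, one possible conversion is sketched below. The sample data and the helper name `to_rows` are assumptions made for illustration; they are not part of the original answer. Each row holds the file name at index 0, followed by that file's words.

```python
# Hypothetical sample in the same shape as the structure shown above.
result = {
    'filename.html': {
        'path': 'D:\\folder\\filename.html',
        'words': ['one', 'two'],
    },
}


def to_rows(data):
    """Return [[filename, word, word, ...], ...], one row per file."""
    return [[name] + entry['words'] for name, entry in data.items()]


rows = to_rows(result)
print(rows[0][0])  # filename.html
print(rows[0][2])  # two
```

The dictionary is usually easier to work with, because a file can be looked up by name instead of by position.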

Full code

from bs4 import BeautifulSoup
from os import path, walk
from pprint import pprint
import re

removement_words = ['here', 'than']
source = r'D:\folder'  # raw string, so that '\f' is not read as a form feed
result = {}

for root, folders, files in walk(source):
    for filename in files:
        if not filename.endswith('.html'):
            continue

        filepath = path.join(root, filename)
        result[filename] = {
            'path': filepath,
            'words': []
        }

        # Use a context manager so the file handle is closed after parsing.
        with open(filepath) as f:
            text = BeautifulSoup(f, 'lxml').get_text()

        for word in text.lower().split():
            if word not in removement_words and re.match(r'[^\W\d]*$', word):
                result[filename]['words'].append(word)

pprint(result)
# 'y2k.html': {'path': 'D:\\folder\\y2k.html',
#              'words': ['the',
#                        'digest',
#                        'rhf',
#                        'joke',
#                        ...
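
The question also asks that a file's word list contain no duplicates, which the loop above does not enforce. A minimal sketch of one way to do that follows, reusing the `dict.fromkeys` idea from the question's own code so that the order of first occurrence is kept. The helper name `unique_words` and the sample sentence are assumptions for illustration. The function takes plain text, so it can be tried without any HTML files.

```python
import re

# Same pattern as in the code above: letters/underscores only, no digits.
WORD_RE = re.compile(r'[^\W\d]*$')


def unique_words(text, stop_words=('here', 'than')):
    """Lower-case and split text, drop stop words and tokens that are not
    purely alphabetic, and remove duplicates while keeping first-seen order."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in stop_words and WORD_RE.match(t)]
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(kept))


print(unique_words('The joke is here the Joke 42 y2k'))
# ['the', 'joke', 'is']
```

In the full code, the same effect could be had by passing the text from `get_text()` to such a helper and storing its return value under 'words'.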
