How to parse HTML text containing single and double quotes

Problem description

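I'm trying to build a scraper with Selenium for a web novel I want to read, but when I parse the HTML and write it to a file, the single and double quotes turn into diamonds with question marks. I searched but couldn't find anything. I think it has something to do with Unicode, but I don't know much about that. Anyway, here is my code: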

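import os
from selenium import webdriver

# Assumption: the snippet does not show how driver was created.
driver = webdriver.Chrome()

url = 'https://parahumans.wordpress.com/2011/06/11/1-1/'
driver.get(url)

chapter_name = driver.find_element_by_class_name('entry-title')
print(chapter_name.text)

text_div = driver.find_element_by_class_name('entry-content')
text = text_div.find_elements_by_tag_name('p')

# path is not defined in this snippet. No encoding is passed to open(),
# so the platform default encoding is used for the output file.
with open(os.path.join(os.path.dirname(__file__), path), 'w') as file:
    for paragraph in text[3:]:
        file.write(paragraph.text + '\n')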

The output in the .txt file is:

Since the start of the semester, I had been looking forward to the part of Mr. Gladly�s World 
Issues class where we�d start discussing capes.  Now that it had finally arrived, I couldn�t 
focus.  I fidgeted, my pen moving from hand to hand, tapping, or absently drawing some figure 
in the corner of the page to join the other doodles.  My eyes were restless too, darting from 
the clock above the door to Mr. Gladly and back to the clock.  I wasn�t picking up enough of 
his lesson to follow along.  Twenty minutes to twelve; five minutes left before class ended.
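
The diamonds show up exactly where the page uses a curly apostrophe (U+2019), which is consistent with an encoding mismatch between writing and reading the file. The following is a minimal sketch of one way this can happen, assuming the file was written in cp1252 (a common Windows default) and then viewed as UTF-8. Neither assumption is confirmed by the question, and the sample string is only a stand-in for the scraped text.

```python
# Hypothetical reproduction: a curly apostrophe (U+2019) written in a legacy
# Windows code page (cp1252 is assumed here) and then read back as UTF-8.
original = "Mr. Gladly\u2019s World Issues class"

# cp1252 stores U+2019 as the single byte 0x92.
raw_bytes = original.encode("cp1252")

# 0x92 is not a valid UTF-8 sequence on its own, so a tolerant decoder
# substitutes U+FFFD, the replacement character drawn as a diamond.
garbled = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the same encoding that was used for writing round-trips.
restored = raw_bytes.decode("cp1252")

# ascii() is used so the result prints safely on any console encoding.
print(ascii(garbled))
print(restored == original)
```

If this is what happened, the text in the file is intact and only the viewer's choice of encoding is wrong. Writing the file as UTF-8 in the first place avoids the ambiguity.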

Tags: python, selenium-webdriver, web-scraping, html-parsing

Solution


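Here you go, my friend. This scrapes every chapter of the web serial and saves it to a file named Worm.txt, which you can rename to anything you like. I also use a progress bar from the tqdm package (a third-party library, not part of the standard library) so you can watch the progress. There are over 300 chapters and each one takes about 1 s to scrape, so expect it to take at least 5 minutes, which is still much faster than using selenium.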

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Fetch the table of contents and collect the chapter links.
toc = requests.get("https://parahumans.wordpress.com/table-of-contents/")
soup = BeautifulSoup(toc.text, "lxml")
text_div = soup.find(class_="entry-content")
links = text_div.find_all("a", href=True)[:-2]  # drop the last two links

# Write as UTF-8 so curly quotes do not depend on the platform default encoding.
with open("Worm.txt", "w", encoding="utf-8") as f:
    for url in tqdm(links):
        page = requests.get(url["href"])
        soup = BeautifulSoup(page.text, "lxml")
        text_div = soup.find(class_="entry-content")
        # Skip the first three <p> elements, as in the question's code.
        for paragraph in text_div.find_all("p")[3:]:
            f.write(paragraph.text + "\n")
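
If you would rather keep the Selenium script from the question, the garbled quotes can usually be avoided by passing an explicit encoding to open(). Here is a small sketch of that idea. The helper name write_paragraphs and the sample strings are made up for illustration, and it assumes the file will later be opened as UTF-8.

```python
import os
import tempfile


def write_paragraphs(path, paragraphs):
    """Write one paragraph per line, always encoded as UTF-8."""
    # encoding="utf-8" overrides the platform default (often cp1252 on Windows);
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as file:
        for paragraph in paragraphs:
            file.write(paragraph + "\n")


# Hypothetical sample data standing in for the scraped paragraph text.
sample = [
    "Mr. Gladly\u2019s World Issues class",
    "where we\u2019d start discussing capes",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "chapter.txt")
    write_paragraphs(out_path, sample)
    # Read back with the same encoding to confirm the quotes survived.
    with open(out_path, encoding="utf-8") as file:
        round_trip = file.read().splitlines()

print(round_trip == sample)
```

In the question's script this would be called with something like write_paragraphs(output_path, [p.text for p in text[3:]]), where output_path is whatever path the script already builds. The editor or viewer used to open the file also has to read it as UTF-8.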
