python - How to parse HTML text that contains single and double quotes
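Problem description
I'm trying to use Selenium to build a scraper for a web novel I want to read, but when I parse the HTML and write it to a file, the single and double quotes turn into diamonds with question marks. I searched but couldn't find anything. I think it has something to do with Unicode, but I don't know much about that. Anyway, here is my code: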
url = 'https://parahumans.wordpress.com/2011/06/11/1-1/'
driver.get(url)
chapter_name = driver.find_element_by_class_name('entry-title')
print(chapter_name.text)
text_div = driver.find_element_by_class_name('entry-content')
text = text_div.find_elements_by_tag_name('p')
with open(os.path.join(os.path.dirname(__file__), path), 'w') as file:
    for paragraph in text[3:]:
        file.write(paragraph.text + '\n')
The output in the .txt file is:
Since the start of the semester, I had been looking forward to the part of Mr. Gladly�s World
Issues class where we�d start discussing capes. Now that it had finally arrived, I couldn�t
focus. I fidgeted, my pen moving from hand to hand, tapping, or absently drawing some figure
in the corner of the page to join the other doodles. My eyes were restless too, darting from
the clock above the door to Mr. Gladly and back to the clock. I wasn�t picking up enough of
his lesson to follow along. Twenty minutes to twelve; five minutes left before class ended.
Solution
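Here you go, my friend. This scrapes every chapter of the web serial and saves them all to a file named Worm.txt, which you can change to any file name you like. I also use the tqdm progress bar (a third-party package) so you can check the progress. There are over 300 chapters and each takes about 1 s to scrape, so expect it to take at least 5 minutes, which is still much faster than using Selenium.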
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Fetch the table of contents and collect the chapter links.
a = requests.get("https://parahumans.wordpress.com/table-of-contents/")
soup = BeautifulSoup(a.text, "lxml")
text_div = soup.find(class_="entry-content")
links = text_div.find_all("a", href=True)[:-2]  # the last two links are skipped

# encoding="utf-8" keeps curly quotes intact whatever the platform default is.
with open("Worm.txt", "w", encoding="utf-8") as f:
    for url in tqdm(links):
        a = requests.get(url['href'])
        soup = BeautifulSoup(a.text, "lxml")
        text_div = soup.find(class_="entry-content")
        text = text_div.find_all("p")
        # Skip the first three paragraphs of each chapter page.
        for paragraph in text[3:]:
            f.write(paragraph.text + '\n')
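As for why the diamonds appear in the first place: a likely explanation, assuming the script ran on a system where open() defaults to a legacy code page such as cp1252 (common on Windows), is that the page's curly quotes were written as single bytes that are not valid UTF-8, and the editor then read the file as UTF-8. The sketch below reproduces that mismatch with one sentence from the output and shows a writer that always uses UTF-8. The helper name write_paragraphs and the file name quote_demo.txt are made up for illustration, and cp1252 is an assumption about the original environment.

```python
import os
import tempfile

# A right single quotation mark (U+2019), as in "Gladly's" on the page.
sample = "Mr. Gladly\u2019s World Issues class"

# Assumed platform default: cp1252 stores the curly quote as the byte 0x92.
legacy_bytes = sample.encode("cp1252")

# A viewer that assumes UTF-8 cannot decode 0x92 and substitutes U+FFFD,
# the replacement character that is drawn as a diamond with a question mark.
garbled = legacy_bytes.decode("utf-8", errors="replace")


def write_paragraphs(path, paragraphs):
    """Write each paragraph on its own line, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as file:
        for paragraph in paragraphs:
            file.write(paragraph + "\n")


if __name__ == "__main__":
    # ascii() escapes non-ASCII characters, so the demo itself does not
    # depend on the console encoding.
    print(ascii(garbled))

    out_path = os.path.join(tempfile.gettempdir(), "quote_demo.txt")
    write_paragraphs(out_path, [sample])
    with open(out_path, encoding="utf-8") as file:
        print(ascii(file.read()))
```

In the Selenium version from the question, the same idea amounts to passing encoding="utf-8" to open() and collecting paragraph.text for each element before writing.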