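python - How to scrape the body text of multiple pages from URLs listed in a txt file
Question
I am trying to write code that requests several URLs and saves all of the scraped text to a single txt file, but I can't work out where to put the loop without breaking everything.
This is what the code looks like right now: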
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

def getReadMe():
    with open('urls.txt','r') as file:
        return file.read()

def getHtml(readMe):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status()
    return response.content

readMe = getReadMe()
print(readMe)
html = getHtml(readMe)
soup = BeautifulSoup(html, 'html.parser')
text = soup.find_all(text=True)
output =''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    'style'
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
with open("copy.txt", "w") as file:
    file.write(str(output))
Answer
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed

# Tags whose text should be left out of the output
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    'style'
]

def getReadMe():
    # Return the URLs in urls.txt as a list, one per non-empty line
    with open('urls.txt', 'r') as file:
        return [line.strip() for line in file if line.strip()]

def getHtml(url):
    ua = UserAgent()
    header = {'user-agent': ua.random}
    response = requests.get(url, headers=header, timeout=3)
    response.raise_for_status()
    return response.content

readMe = getReadMe()
print(readMe)

# Fetch and process each URL in turn
for line in readMe:
    html = getHtml(line)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.find_all(text=True)
    output = ''
    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)
    print(output)
    # mode "a" appends the new data to the file instead of overwriting it
    with open("copy.txt", "a") as file:
        file.write(str(output))
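Try this and see if it works. The scraping steps run inside a loop, once per URL, and copy.txt is opened in append mode ("a") so that each page's text is added to the file rather than replacing what was written before.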
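One thing to watch when looping over the contents of urls.txt: file.read() returns a single string, and iterating over a string yields individual characters, not lines. Below is a minimal sketch of splitting that text into URLs and looping over them so that one failing page does not stop the run. The names read_urls and scrape_all are hypothetical, and fetch stands in for whatever function downloads and extracts the text of one page (an assumption made here so the sketch runs without network access).

```python
from typing import Callable, Iterable, List


def read_urls(text: str) -> List[str]:
    """Split the contents of a URL list into one URL per non-empty line."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            urls.append(line)
    return urls


def scrape_all(urls: Iterable[str], fetch: Callable[[str], str]) -> str:
    """Call fetch(url) for each URL and join the results with newlines.

    fetch is a placeholder (assumption) for the real download-and-extract
    step. A URL whose fetch raises is reported and skipped.
    """
    parts = []
    for url in urls:
        try:
            parts.append(fetch(url))
        except Exception as exc:  # deliberately broad for this sketch
            print("skipping {}: {}".format(url, exc))
    return "\n".join(parts)
```

In the script above, read_urls would be applied to the string returned by file.read(), and fetch would wrap the requests and BeautifulSoup steps for a single URL.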