How to scrape body text from multiple pages based on URLs in a txt file

Problem description

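I am trying to write code that requests multiple URLs and then saves all of the scraped text in a single txt file, but I can't figure out where to put the loop without breaking everything.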

This is what the code looks like right now:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed


def getReadMe():
    with open('urls.txt','r') as file:
        return file.read()

def getHtml(readMe):
    ua = UserAgent()
    header = {'user-agent':ua.random}
    response = requests.get(readMe,headers=header,timeout=3)
    response.raise_for_status() 
    return response.content

readMe = getReadMe()


print(readMe)


html = getHtml(readMe)
soup = BeautifulSoup(html, 'html.parser')
text = soup.find_all(text=True)


output =''


blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    'style'
    
]


for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)

with open("copy.txt", "w") as file:
    file.write(str(output))

Tags: python, python-3.x, web-scraping, beautifulsoup

Solution
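
A detail worth keeping in mind when looping over the URL file: file.read() returns the whole file as one string, and a for loop over a string yields single characters rather than lines. The sketch below illustrates the difference with a hypothetical helper, parse_url_list, which is not part of the original code; the example.com URLs are placeholders.

```python
def parse_url_list(raw_text):
    """Split the raw contents of a URL file into a list of URLs.

    Hypothetical helper for illustration. Blank lines and surrounding
    whitespace are dropped.
    """
    urls = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line:
            urls.append(line)
    return urls


# Placeholder file contents: two URLs separated by a blank line
raw = "https://example.com/a\n\nhttps://example.com/b\n"

# Iterating over the raw string yields single characters
print(list(raw)[:5])        # ['h', 't', 't', 'p', 's']

# Iterating over the parsed list yields whole URLs
print(parse_url_list(raw))  # ['https://example.com/a', 'https://example.com/b']
```

Passing each element of the parsed list to getHtml gives one request per URL, which is what the loop is meant to do.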


import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Tags whose text content should not appear in the output
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    'style',
]


def getReadMe():
    # Return one URL per line. file.read() would return a single string,
    # and looping over a string yields characters, not URLs.
    with open('urls.txt', 'r') as file:
        return [line.strip() for line in file if line.strip()]


def getHtml(url):
    ua = UserAgent()
    header = {'user-agent': ua.random}
    response = requests.get(url, headers=header, timeout=3)
    response.raise_for_status()
    return response.content


urls = getReadMe()
print(urls)

for url in urls:
    html = getHtml(url)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.find_all(string=True)  # every text node in the page

    output = ''
    for t in text:
        if t.parent.name not in blacklist:
            output += '{} '.format(t)

    print(output)
    # Mode 'a' appends, so each page's text is added to the end of the file
    with open('copy.txt', 'a') as file:
        file.write(output)

Try this and see if it works.
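
One possible refinement, offered as a sketch rather than as part of the answer: because getHtml calls raise_for_status(), a single failing URL stops the whole run. The loop can instead record the failure and move on. The names scrape_all, fake_fetch, and fake_extract below are hypothetical, and the two stand-in callables exist only so the example runs without network access.

```python
def scrape_all(urls, fetch, extract):
    """Run fetch and extract for each URL, collecting results and failures.

    `fetch` and `extract` are caller-supplied callables (hypothetical names):
    fetch(url) returns HTML, extract(html) returns the text to keep.
    """
    results = {}
    failures = {}
    for url in urls:
        try:
            results[url] = extract(fetch(url))
        except Exception as exc:  # broad on purpose: a failure skips only this URL
            failures[url] = str(exc)
    return results, failures


# Stand-in callables so the sketch runs offline
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("could not fetch " + url)
    return "<p>text from " + url + "</p>"


def fake_extract(html):
    return html.replace("<p>", "").replace("</p>", "")


results, failures = scrape_all(
    ["https://example.com/ok", "https://example.com/bad"],
    fake_fetch,
    fake_extract,
)
print(results)
print(failures)
```

In the real script, getHtml could be passed as fetch, and the BeautifulSoup filtering step could be wrapped in a function and passed as extract.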

