BeautifulSoup response - Beautiful Soup is not an HTTP client

Problem description

The script is supposed to find the addresses of the sub-pages that contain articles and collect the required data from them. The data should end up in a database, and it should be gathered by processing the HTML documents.

What it should do:

1. Find the 10 most frequently used words and their counts.
2. Find the 10 most frequently used words for each author and their counts.
3. Publish the author names.
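The counting itself is not the hard part; once the article texts and author names are collected, I imagine something along the lines of the sketch below for points 1 and 2 (collections.Counter over placeholder data):

from collections import Counter
import re

# Placeholder data standing in for the scraped results (hypothetical)
contents = ['first article text goes here', 'second article text goes here']
authors = ['Author A', 'Author B']

# 1. Ten most common words over all articles, with their counts
words = re.findall(r'\w+', ' '.join(contents).lower())
print(Counter(words).most_common(10))

# 2. Ten most common words per author
per_author = {}
for author, text in zip(authors, contents):
    per_author.setdefault(author, Counter()).update(re.findall(r'\w+', text.lower()))
for author, counter in per_author.items():
    print(author, counter.most_common(10))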

I'm not sure whether the rest of the code works correctly, but for now I'm getting the error mentioned in the title (Beautiful Soup is not an HTTP client). Here is the code:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import psycopg2 as pg2
from sqlalchemy.dialects.postgresql import psycopg2

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])

    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

    all_links = [item for i in all_links for item in i]

    d = webdriver.Chrome()

    for article in all_links:
        d.get(article)
        soup = bs(d.page_source, 'lxml')
        [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()

        try:
            print(soup.select_one('.post-title').text)
        except:
            print(article)
            print(soup.select_one('h1').text)
            break

    # not mine !!!!!!

    # 2.2. Post contents
    contents = []
    for article_links in all_links:
        soup = bs((article), 'html.parser')
        content = soup.find('section', attrs={'class': 'post-content'})
        contents.append(content)

    # 2.1. Authors

    authors = []
    for article in all_links:
        soup = bs(article, 'html.parser')
        author = soup.find('span', attrs={'class': 'author-content'})
        authors.append(author)

    # POSTGRESQL CONNECTION
    # 1. Connect to local database using psycopg2

    import psycopg2

    hostname = 'balarama.db.elephantsql.com'
    username = 'yagoiucf'
    password = 'jxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    database = 'yagoiucf'

    conn = psycopg2.connect(host='balarama.db.elephantsql.com', user='yagoiucf',
                            password='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', dbname='yagoiucf')
    conn.close()

Tags: python, postgresql, web-scraping

Solution


There are several problems here.

Look at this part of the code:

# 2.2. Post contents
contents = []
for article_links in all_links:
    soup = bs((article), 'html.parser')
    content = soup.find('section', attrs={'class': 'post-content'})
    contents.append(content)


# 2.1. Authors

authors = []
for article in all_links:
    soup = bs(article, 'html.parser')
    author = soup.find('span', attrs={'class': 'author-content'})
    authors.append(author)

In the first loop you iterate over article_links, but you pass article to BeautifulSoup. First of all, article is a leftover from the previous loop and holds a URL. I guess you actually meant to use article_links, but note that each entry is still just a URL string, not HTML.
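To see why this is a problem: BeautifulSoup parses whatever string you hand it as markup, so a bare URL becomes the whole document, every find() returns None, and you get the warning from the title that Beautiful Soup is not an HTTP client. For example:

from bs4 import BeautifulSoup

# Parsing a URL string instead of the page's HTML (example URL)
soup = BeautifulSoup('https://teonite.com/blog/some-post/', 'html.parser')
print(soup.find('section', attrs={'class': 'post-content'}))   # -> None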

Secondly, in the code above that snippet you used selenium to retrieve the page source:

d = webdriver.Chrome()

for article in all_links:
    d.get(article)
    soup = bs(d.page_source, 'lxml')

You need to do the same thing here again (or use requests, if that works for the site).
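A minimal sketch of that, reusing s, headers, bs and all_links from your code and assuming the article pages render without JavaScript (otherwise keep the d.get(article) / d.page_source approach):

contents = []
authors = []

for article in all_links:
    r = s.get(article, headers=headers)      # fetch the HTML first; don't parse the URL itself
    soup = bs(r.content, 'lxml')

    content = soup.find('section', attrs={'class': 'post-content'})
    author = soup.find('span', attrs={'class': 'author-content'})

    contents.append(content.get_text(strip=True) if content else '')
    authors.append(author.get_text(strip=True) if author else '')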

