python - BeautifulSoup response - Beautiful Soup is not an HTTP client
Problem description
The script is supposed to find the addresses of the sub-pages that contain articles and collect the required data from them. The data should go into a database, and it should be collected by processing the HTML documents.
What it should do: 1. Find the 10 most frequently used words and their counts. 2. Find the 10 most frequently used words per author and their counts. 3. Output the authors' names.
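For reference, the word-counting part of this goal can be sketched with collections.Counter; the tokenizer, the function name top_words, and the sample text below are made-up illustrations, not part of the original script:

```python
from collections import Counter
import re

def top_words(text, n=10):
    # Lower-case the text, split it into runs of letters, count them,
    # and keep the n most common (word, count) pairs.
    words = re.findall(r'[a-z]+', text.lower())
    return Counter(words).most_common(n)

print(top_words('the cat sat on the mat and the cat slept', 2))
# [('the', 3), ('cat', 2)]
```

Per-author counts would apply the same function to the text collected for each author.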
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import psycopg2 as pg2
from sqlalchemy.dialects.postgresql import psycopg2

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}

with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])
    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

all_links = [item for i in all_links for item in i]

d = webdriver.Chrome()
for article in all_links:
    d.get(article)
    soup = bs(d.page_source, 'lxml')
    [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
    visible_text = soup.getText()
    try:
        print(soup.select_one('.post-title').text)
    except:
        print(article)
        print(soup.select_one('h1').text)
        break

# not mine !!!!!!
# 2.2. Post contents
contents = []
for article_links in all_links:
    soup = bs((article), 'html.parser')
    content = soup.find('section', attrs={'class': 'post-content'})
    contents.append(content)

# 2.1. Authors
authors = []
for article in all_links:
    soup = bs(article, 'html.parser')
    author = soup.find('span', attrs={'class': 'author-content'})
    authors.append(author)

# POSTGRESQL CONNECTION
# 1. Connect to local database using psycopg2
import psycopg2

hostname = 'balarama.db.elephantsql.com'
username = 'yagoiucf'
password = 'jxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
database = 'yagoiucf'

conn = psycopg2.connect(host='balarama.db.elephantsql.com', user='yagoiucf',
                        password='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', dbname='yagoiucf')
conn.close()
Solution
There are multiple problems here.
Look at this part of the code:
# 2.2. Post contents
contents = []
for article_links in all_links:
    soup = bs((article), 'html.parser')
    content = soup.find('section', attrs={'class': 'post-content'})
    contents.append(content)

# 2.1. Authors
authors = []
for article in all_links:
    soup = bs(article, 'html.parser')
    author = soup.find('span', attrs={'class': 'author-content'})
    authors.append(author)
In the first loop your loop variable is article_links, but you pass article to BeautifulSoup. First, article is an artifact left over from the previous loop, and it holds a URL. I guess you actually meant to use article_links.
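Either way, passing a URL string to BeautifulSoup cannot work: Beautiful Soup only parses markup, it does not download anything. A quick demonstration (the URL and the HTML snippet below are made-up stand-ins for the real pages):

```python
from bs4 import BeautifulSoup as bs

# A URL string is just text -- Beautiful Soup finds no tags in it.
url = 'https://teonite.com/blog/some-post/'
soup = bs(url, 'html.parser')
print(soup.find('section', attrs={'class': 'post-content'}))  # None

# Parsing actual markup (a stand-in for d.page_source or r.content) works:
html = '<section class="post-content"><p>Hello</p></section>'
soup = bs(html, 'html.parser')
print(soup.find('section', attrs={'class': 'post-content'}).get_text())  # Hello
```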
Second, in the code above that snippet you used Selenium to retrieve the page source:
d = webdriver.Chrome()
for article in all_links:
    d.get(article)
    soup = bs(d.page_source, 'lxml')
You need to do the same thing again in these later loops: fetch each page's HTML first, then parse it (or use requests instead of Selenium, if these pages render without JavaScript).
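Putting it together, the two broken loops could be rebuilt along these lines. This is a sketch, not the exact original code: parse_article and scrape_articles are names I made up, the combined single loop is my restructuring, and the CSS classes are taken from the code in the question:

```python
import requests
from bs4 import BeautifulSoup as bs

def parse_article(html):
    """Pull the post body and author element out of one article's HTML."""
    soup = bs(html, 'html.parser')
    content = soup.find('section', attrs={'class': 'post-content'})
    author = soup.find('span', attrs={'class': 'author-content'})
    return content, author

def scrape_articles(all_links):
    """all_links holds URL strings, so each one is downloaded before parsing."""
    contents, authors = [], []
    with requests.Session() as s:
        for article in all_links:
            content, author = parse_article(s.get(article).content)
            contents.append(content)
            authors.append(author)
    return contents, authors
```

If the pages do need JavaScript, keep the Selenium loop from the question and feed parse_article(d.page_source) instead of making the requests call.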