python - BeautifulSoup for loop only extracts data from the first page
Question
I have a txt file containing 2 URLs:
https://www.kununu.com/de/volkswagen/kommentare
https://www.kununu.com/de/audi/kommentare
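I want to use BeautifulSoup to extract some data from every page under each of those URLs. The code below extracts the data, but only for the first page. I must be missing something. Could you update the code so that it extracts from all pages?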
firma = []
lineList2 = [line.rstrip('\n') for line in open(r"C:/myfolder/555.txt")]
print(lineList2)
for url in lineList2:
    with requests.Session() as session:
        session.headers = {
            'x-requested-with': 'XMLHttpRequest'
        }
        page = 1
        while True:
            print(f"Processing page {page}..")
            url = f'{url}/{page}'
            response = session.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            articles = soup.find_all('article')
            print("Number of articles: " + str(len(articles)))
            for article in articles:
                try:
                    firmaText = article.find('div', text=re.compile(r'Firma')).find_next('div').text.strip()
                    firma.append(firmaText)
                except:
                    firma.append('N/A')
            page += 1
            pagination = soup.find_all('div', {'class': 'paginationControl'})
            if not pagination:
                break
df = pd.DataFrame({
    'Company': firma
})
print(df)
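Answer
The loop in the question assigns the page URL back to url, the same variable that holds the base URL, so each iteration appends another page number to the previous page's address. Keep the base URL in its own variable (lurl below) and build the page URL from it on every iteration: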
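To see the effect in isolation, here is a minimal sketch that replays only the URL construction from the question's loop, with no network access. The names base, buggy and fixed are illustrative and do not appear in the original code.

```python
# Replay the URL construction only; nothing is requested over the network.
base = "https://www.kununu.com/de/volkswagen/kommentare"

# Question's pattern: the loop variable is overwritten on every iteration.
url = base
buggy = []
for page in range(1, 4):
    url = f"{url}/{page}"  # suffixes pile up: /1, then /1/2, then /1/2/3
    buggy.append(url)

# Corrected pattern: always format from the untouched base URL.
fixed = [f"{base}/{page}" for page in range(1, 4)]

for bad, good in zip(buggy, fixed):
    print(bad, "|", good)
```

Only the first buggy URL matches a real page address, which is consistent with data being collected for the first page only.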
from bs4 import BeautifulSoup
import re
import requests
import pandas as pd

firma = []

# Read one base URL per line, skipping blank lines.
with open('555.txt', 'r') as file:
    lineList2 = [line.strip() for line in file if line.strip()]
print(lineList2)

for lurl in lineList2:
    with requests.Session() as session:
        session.headers = {
            'x-requested-with': 'XMLHttpRequest'
        }
        page = 1
        while True:
            print(f"Processing page {page}..")
            # Build the page URL from the unchanged base URL (lurl).
            url = f'{lurl}/{page}'
            print(url)
            response = session.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            articles = soup.find_all('article')
            print("Number of articles: " + str(len(articles)))
            for article in articles:
                try:
                    firmaText = article.find('div', string=re.compile(r'Firma')).find_next('div').text.strip()
                    firma.append(firmaText)
                except AttributeError:
                    # find() or find_next() returned None: no 'Firma' entry here.
                    firma.append('N/A')
            page += 1
            # Stop once the page no longer contains pagination controls.
            pagination = soup.find_all('div', {'class': 'paginationControl'})
            if not pagination:
                break

df = pd.DataFrame({
    'Company': firma
})
print(df)
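As a variation on the loop above, the paging logic can be separated from the request and parsing step, which makes it possible to try out offline. This is only a sketch under stated assumptions: crawl_all_pages, fetch_items, max_pages, FAKE_SITE and fake_fetch are hypothetical names, the example.com URLs are placeholders, and the stop condition (an empty page marks the end) is an assumption rather than something the site is known to guarantee.

```python
from typing import Callable, List


def crawl_all_pages(
    base_url: str,
    fetch_items: Callable[[str], List[str]],
    max_pages: int = 500,
) -> List[str]:
    """Collect items from base_url/1, base_url/2, ... until a page is empty."""
    collected: List[str] = []
    for page in range(1, max_pages + 1):
        # base_url is never reassigned, so page numbers cannot accumulate.
        page_url = f"{base_url.rstrip('/')}/{page}"
        items = fetch_items(page_url)
        if not items:  # assumption: an empty page marks the end
            break
        collected.extend(items)
    return collected


# Stand-in for the real request + parsing step, so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/a/1": ["A1", "A2"],
    "https://example.com/a/2": ["A3"],
    "https://example.com/b/1": ["B1"],
}


def fake_fetch(url: str) -> List[str]:
    return FAKE_SITE.get(url, [])


results: List[str] = []
for base in ["https://example.com/a", "https://example.com/b"]:
    results.extend(crawl_all_pages(base, fake_fetch))

print(results)
```

In the real script, fetch_items would wrap the session.get call and the soup.find_all('article') extraction from the answer. The max_pages cap is a guard against looping forever if the stop condition never triggers.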