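python - BeautifulSoup: extracting data with a for/if loop
Question
I want to extract data from the website below using for/if loops. The code below already extracts the per-article data successfully with a for/if loop, but I would like to extend it so that it also extracts the company name, the satisfied percentage and the overall rating (these are always the same), again using a loop.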
overall = []
satisfied = []
company = []
arbeitsatmosphare = []
vorgesetztenverhalten = []
kollegenzusammenhalt = []
lurl = 'https://www.kununu.com/de/volkswagenconsulting/kommentare'
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'{lurl}/{page}'
        print(url)
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            for key in [{'label': 'Arbeitsatmosphäre', 'list': arbeitsatmosphare},
                        {'label': 'Vorgesetztenverhalten', 'list': vorgesetztenverhalten},
                        {'label': 'Kollegenzusammenhalt', 'list': kollegenzusammenhalt}]:
                span = article.find('span', text=re.compile(key['label']))
                #print(span)
                if span and span.find_next('span'):
                    key['list'].append(span.find_next('span').text.strip())
                else:
                    key['list'].append('N/A')
            # THIS PART IS NOT WORKING
            div = soup.find(class_="company-profile-container")
            for key2 in [{'label2': 'company-name', 'list': company},
                         {'label2': 'review-recommend-value', 'list': satisfied},
                         {'label2': 'review-rating-value', 'list': overall}]:
                span2 = div.find('span', text=re.compile(key2['label2']))
                #print(span2)
                if span2 and span2.find('span'):
                    key2['list'].append(span2.find('span').text.strip())
                else:
                    key2['list'].append('N/A')
        page += 1
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break
#print(overall)
df = pd.DataFrame({'Arbeitsatmosphäre': arbeitsatmosphare,
                   'Vorgesetztenverhalten': vorgesetztenverhalten,
                   'Kollegenzusammenhalt': kollegenzusammenhalt,
                   'company': company,
                   'satisfied': satisfied,
                   'overall': overall
                   })
print(df)
I modelled my part on the working loop above, but it does not seem to work. I can't find the problem. Can you help?
Answer
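If the company name, the satisfied percentage and the overall rating are the same for every row, you don't need to append them to lists inside the for loop. In your attempt, text=re.compile('company-name') is compared against the text inside each span, not against its class attribute, so nothing matches and 'N/A' is appended. Instead, extract the values once at the end by CSS class and repeat them, for example with the list * operator: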
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

# One list per per-review rating; each review appends exactly one value to each.
arbeitsatmosphare = []
vorgesetztenverhalten = []
kollegenzusammenhalt = []
lurl = 'https://www.kununu.com/de/volkswagenconsulting/kommentare'
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'{lurl}/{page}'
        print(url)
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:
            for key in [{'label': 'Arbeitsatmosphäre', 'list': arbeitsatmosphare},
                        {'label': 'Vorgesetztenverhalten', 'list': vorgesetztenverhalten},
                        {'label': 'Kollegenzusammenhalt', 'list': kollegenzusammenhalt}]:
                # Find the label span by its text; the score is in the next span.
                span = article.find('span', text=re.compile(key['label']))
                if span and span.find_next('span'):
                    key['list'].append(span.find_next('span').text.strip())
                else:
                    key['list'].append('N/A')
        page += 1
        # Stop when the page has no pagination control.
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

# Company-level values are identical on every page, so read them once,
# by CSS class, from the last page that was parsed.
company = soup.select_one('.company-name').get_text(strip=True)
satisfied = soup.select_one('.review-recommend-value').get_text(strip=True)
overall = soup.select_one('.review-rating-value').get_text(strip=True)

# Repeat each constant once per review so all columns have the same length.
df = pd.DataFrame({'Arbeitsatmosphäre': arbeitsatmosphare,
                   'Vorgesetztenverhalten': vorgesetztenverhalten,
                   'Kollegenzusammenhalt': kollegenzusammenhalt,
                   'company': [company] * len(arbeitsatmosphare),
                   'satisfied': [satisfied] * len(arbeitsatmosphare),
                   'overall': [overall] * len(arbeitsatmosphare)
                   })
print(df)
Prints:
   Arbeitsatmosphäre  Vorgesetztenverhalten  Kollegenzusammenhalt                company  satisfied  overall
0               5,00                   5,00                  5,00  Volkswagen Consulting        86%     4,27
1               5,00                   5,00                  5,00  Volkswagen Consulting        86%     4,27
2               5,00                   5,00                  5,00  Volkswagen Consulting        86%     4,27
3               5,00                   5,00                  5,00  Volkswagen Consulting        86%     4,27
4               2,00                   1,00                  3,00  Volkswagen Consulting        86%     4,27
5               5,00                   5,00                  5,00  Volkswagen Consulting        86%     4,27
6               5,00                   5,00                  5,00  Volkswagen Consulting        86%     4,27
7               5,00                   5,00                  4,00  Volkswagen Consulting        86%     4,27
... and so on.
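If you still want a loop over the three class names, here is a minimal offline sketch of the idea. The HTML string is hypothetical (it only borrows the class names used above; the real page may be structured differently), and text_or_na is a helper made up for this example. It shows that filtering by text (string=, the newer name for text=) does not match a class name, while a CSS class selector does, and that the single values can then be repeated with the list * operator.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; it reuses the class names above.
HTML = """
<div class="company-profile-container">
  <span class="company-name">Example GmbH</span>
  <span class="review-recommend-value">86%</span>
  <span class="review-rating-value">4,27</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# string= (formerly text=) filters on the text inside the tag,
# so searching for a class name this way finds nothing.
by_text = soup.find("span", string=re.compile("company-name"))

# A CSS class selector matches on the class attribute instead.
by_class = soup.select_one(".company-name")


def text_or_na(node):
    """Return the stripped text of a tag, or 'N/A' if the tag was not found."""
    return node.get_text(strip=True) if node is not None else "N/A"


# Loop over the class names and look each one up once.
labels = ["company-name", "review-recommend-value", "review-rating-value"]
values = {label: text_or_na(soup.select_one(f".{label}")) for label in labels}

# Repeat each page-level value once per review row (3 rows assumed here).
n_rows = 3
columns = {label: [value] * n_rows for label, value in values.items()}

print(by_text)
print(text_or_na(by_class))
print(columns)
```

The text_or_na guard also avoids an AttributeError when select_one returns None, which would happen if a class were missing from the page.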