BeautifulSoup: extracting data with a for/if loop

Problem description

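I want to extract data from the website below using for/if loops. The code below already extracts the per-article data successfully with a for/if loop, but I would like to extend it so that it also extracts the company name, the satisfied percentage, and the overall rating (which are always the same), again using a loop.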

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

overall = []
satisfied = []
company = []

arbeitsatmosphare = []
vorgesetztenverhalten = []
kollegenzusammenhalt = []

lurl='https://www.kununu.com/de/volkswagenconsulting/kommentare'
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'{lurl}/{page}'
        print(url)
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:

            for key in [{'label': 'Arbeitsatmosphäre', 'list': arbeitsatmosphare},
                        {'label': 'Vorgesetztenverhalten', 'list': vorgesetztenverhalten},
                        {'label': 'Kollegenzusammenhalt', 'list': kollegenzusammenhalt}]:
                span = article.find('span', text=re.compile(key['label']))
                #print(span)
                if span and span.find_next('span'):
                    key['list'].append(span.find_next('span').text.strip())
                else:
                    key['list'].append('N/A')



            # THIS PART IS NOT WORKING
            div = soup.find(class_="company-profile-container")
            for key2 in [{'label2': 'company-name', 'list': company},
                         {'label2': 'review-recommend-value', 'list': satisfied},
                         {'label2': 'review-rating-value', 'list': overall}]:
                span2 = div.find('span', text=re.compile(key2['label2']))
                #print(span2)
                if span2 and span2.find('span'):
                    key2['list'].append(span2.find('span').text.strip())
                else:
                    key2['list'].append('N/A')
        page += 1
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

    #print(overall)
    df = pd.DataFrame({'Arbeitsatmosphäre': arbeitsatmosphare,
                       'Vorgesetztenverhalten': vorgesetztenverhalten,
                       'Kollegenzusammenhalt': kollegenzusammenhalt,
                       'company': company,
                       'satisfied': satisfied,
                       'overall':overall
                       })

print(df)

I used the code above as a template, but my part does not seem to work. I can't find the problem. Can you help?

Tags: python, loops, for-loop, if-statement, beautifulsoup

Solution


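If the company name, satisfied percentage, and overall rating are the same for every row, you don't need to collect them inside the for loop. The attempt in the question also fails for a second reason: text=re.compile(...) matches a tag's text content, whereas company-name, review-recommend-value, and review-rating-value are CSS class names, so nothing is found and 'N/A' is appended every time. Instead, select those elements by class once after the loop and repeat each value to the right length, for example with the list * operator: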

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

arbeitsatmosphare = []
vorgesetztenverhalten = []
kollegenzusammenhalt= []

lurl='https://www.kununu.com/de/volkswagenconsulting/kommentare'
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'{lurl}/{page}'
        print(url)
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:

            for key in [{'label': 'Arbeitsatmosphäre', 'list': arbeitsatmosphare},
                        {'label': 'Vorgesetztenverhalten', 'list': vorgesetztenverhalten},
                        {'label': 'Kollegenzusammenhalt', 'list': kollegenzusammenhalt}]:
                span = article.find('span', text=re.compile(key['label']))
                if span and span.find_next('span'):
                    key['list'].append(span.find_next('span').text.strip())
                else:
                    key['list'].append('N/A')

        page += 1
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

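    # These values are the same on every page, so read them once
    # (by CSS class) from the last page that was fetched.
    company = soup.select_one('.company-name').get_text(strip=True)
    satisfied = soup.select_one('.review-recommend-value').get_text(strip=True)
    overall = soup.select_one('.review-rating-value').get_text(strip=True)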

    df = pd.DataFrame({'Arbeitsatmosphäre': arbeitsatmosphare,
                       'Vorgesetztenverhalten': vorgesetztenverhalten,
                       'Kollegenzusammenhalt': kollegenzusammenhalt,
                       'company': [company] * len(arbeitsatmosphare),
                       'satisfied': [satisfied] * len(arbeitsatmosphare),
                       'overall':[overall] * len(arbeitsatmosphare)
                       })

print(df)

Prints:

   Arbeitsatmosphäre Vorgesetztenverhalten Kollegenzusammenhalt                company satisfied overall
0               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
1               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
2               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
3               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
4               2,00                  1,00                 3,00  Volkswagen Consulting       86%    4,27
5               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
6               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
7               5,00                  5,00                 4,00  Volkswagen Consulting       86%    4,27
... and so on.
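
The difference between matching on text and matching on class can be checked offline with a small sketch. The HTML snippet below is hypothetical and only loosely modelled on the class names used above (the real page may be structured differently), and text_or_na is an illustrative helper that is not part of the original answer. It uses string=, the newer spelling of the text= argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption for illustration, not the real page.
html = """
<div class="company-profile-container">
  <span class="company-name">Volkswagen Consulting</span>
  <span class="review-recommend-value">86%</span>
  <span class="review-rating-value">4,27</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# string= matches a tag's text content, so searching for a class name
# this way finds nothing.
by_text = soup.find("span", string=re.compile("company-name"))

# A CSS class selector matches on the class attribute instead.
by_class = soup.select_one(".company-name")


def text_or_na(node):
    """Return the stripped text of a tag, or 'N/A' if the tag is missing."""
    return node.get_text(strip=True) if node else "N/A"


company = text_or_na(by_class)
missing = text_or_na(soup.select_one(".no-such-class"))

# Repeat the single value once per scraped row so the column lengths match.
ratings = ["5,00", "2,00", "5,00"]
company_column = [company] * len(ratings)

print(by_text)
print(company_column)
print(missing)
```

Guarding against a missing element, as text_or_na does, avoids an AttributeError if select_one returns None, for instance when the site changes its markup.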
