Web scraping with Python / loop problem with dynamic content and headers with pandas

Problem description

I'm a complete newbie (and French). I'm writing a web-scraping script to get all the car sale information (kilometers, age, color, price, etc.) from a website (here, as a future dad).

My first problem is with the loop: it returns the same page over and over ("i" times...), instead of iterating over the dynamic content to request the next page after the previous one.

The second is the DataFrame header in the CSV, which gets repeated for every row.

Thank you very much,

I'm really desperate about this.
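For the second issue specifically (the header being written again on every append), here is a minimal sketch of appending rows to a CSV so the header is written only once; the file name and example row are purely illustrative, not taken from the script below:

import os
import pandas as pd

out_path = "car.csv"  # illustrative output file

# one chunk of rows scraped from a single listing (example values)
df_chunk = pd.DataFrame([{"Modèle": "Toyota Yaris Hybride", "Prix": "19 900 €"}])

# write the header only when the file does not exist yet; later appends skip it
df_chunk.to_csv(out_path, mode="a", index=False,
                header=not os.path.exists(out_path))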

import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
import csv
 
df = pd.DataFrame()
 
#dynamic content modifiers
"""{
    "modelevh[]": "toyota",
    "motorisation[]": "Hybride",
    "pmin": "5000",
    "pmax": "80000",
    "couleur[]": "Rouge",
    "orderby": "date",
    "page": "2"
}"""
 
#how many pages ?
 
url ='https://www.teamcolin-lexus.fr/vehicules-occasion/'
s = requests.Session()
response=s.get(url)
if response.ok:
    soup=BeautifulSoup(response.content, 'lxml')
    page= soup.find('span',{'class':'meta-nav'})
    pages= str(page)
    page_nb=re.search(' sur (.*)<', pages)
    pages_nb=int(page_nb.group(1))
 
"""print(page_nb.group(1))
print(page_nb)"""
 
 
#loop on pages
 
for i in range (1,pages_nb):
    
    payload = {'page': i, 'modelevh[]': 'toyota'}
 
    _ = s.post(url, data=payload)
    r = s.get("https://www.teamcolin-lexus.fr/vehicules-occasion/")
 
#interesting data on the page

    if r.ok:
        soup=BeautifulSoup(r.content, 'lxml')
        cellules= soup.findAll(class_="liste-vehicule col-12 col-sm-6 col-md-4 col-lg-4 d-flex flex-column justify-content-between align-items-center")
 
 
    links = []
    for class_ in cellules:
 
    #get links

            link =class_.get('href')
            links.append(link)
 
    for link in links:
            url=link.strip()  
            response=requests.get(url)
            if response.ok:
                    soup_=BeautifulSoup(response.text)
 
                    prix=soup_.find('b',{'class':'price-value'}).text
 
 
                    modele_voiture=soup_.find('title').text.replace('| LEXUS - Team Colin Lexus', '')
                    #find the specifications table
                    table=soup_.find('table',{'class':'table table-sm table-striped d-none d-md-table'})
                    headers=[]

            #get the basic data table
     
                    for th in table.find_all("th"):
                        title = th.text
                        headers.append(title)
                        df_data = pd.DataFrame(columns = headers)
                    for j in table.find_all('tr')[1:]:
                        row_data = j.find_all('td')
                        row = [tr.text for tr in row_data]
                        length = len(df_data)
                        df_data.loc[length] = row    
 
            #add columns
 
                        df_data['Lien annonce']=link
 
                        df_data['Prix']=prix
                        df_data['Modèle']=modele_voiture
 
                        df_data.to_csv('car.csv', encoding='utf-16', mode='a', index = True, header=False)
                        print (df_data)

https://pastebin.com/7wYHyBtp

Tags: python, pandas, dataframe, loops, web-scraping

Solution


You can use this example of how to get the information from the pages and save it to a pandas DataFrame:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.teamcolin-lexus.fr/vehicules-occasion/"
payload = {"orderby": "date", "page": ""}

data = []
current_page = 1
while True:
    # request each results page by POSTing the page number in the payload
    payload["page"] = current_page
    soup = BeautifulSoup(
        requests.post(url, data=payload).content, "html.parser"
    )

    # every listing is an <a class="liste-vehicule">; an empty page means we are past the last one
    cars = soup.select("a.liste-vehicule")
    if not cars:
        break

    for a in cars:
        print("Getting {}".format(a["href"]))
        # open each car's detail page and collect its data into one dict
        soup = BeautifulSoup(requests.get(a["href"]).content, "html.parser")
        row = {
            "Type": soup.title.text.split("|")[0].strip(),
            "Price": soup.select_one(".price-value").text,
            "Url": a["href"],
        }
        data.append(row)
        # the spec table holds <th>label</th>/<td>value</td> pairs; add each pair to the row
        for th in soup.select(".mx-auto th"):
            row[th.get_text(strip=True)] = th.find_next("td").get_text(
                strip=True
            )

    current_page += 1

df = pd.DataFrame(data)
print(df)
df.to_csv("data.csv", index=False)
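Two details worth noting: the row dict is appended to data before the spec-table fields are added, which still works because the same dict object keeps being updated in place; and the while loop stops on the first page that returns no liste-vehicule links, so you don't need to know the page count up front. Writing the DataFrame once at the end, instead of appending inside the loop, is also what avoids the repeated-header problem from the question.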

This saves all ~700 cars to data.csv (screenshot from LibreOffice):

[screenshot: data.csv opened in LibreOffice]
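To sanity-check the result, the CSV can be read back with pandas (assuming data.csv was produced by the script above):

import pandas as pd

df = pd.read_csv("data.csv")
print(len(df), "cars")       # roughly 700 according to the screenshot
print(df.columns.tolist())   # Type, Price, Url plus the spec-table columns
print(df.head())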

