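python - Web scraping different URLs - limit
Question
I wrote a web scraper for Auto Trader, but for some reason, when I iterate through the URLs, my DataFrame never grows beyond 1300 rows. There are 13 results per page, so is there some kind of 100-page limit, or am I just doing something wrong? Any help would be appreciated :) I have attached my code below.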
# Import required libraries
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

# List of urls (pages 1 to 499)
path = 'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=RH104JJ&make=&price-from=500&price-to=100000&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&page='
urls = []
for i in range(1, 500):
    url = path + str(i)
    urls.append(url)

# Lists to store the scraped data in
makes = []
prices = []
ratings = []
dates = []
types = []
miles = []
litres = []
bhps = []
transmissions = []
fuels = []
owners = []
attributes = [makes, ratings, dates, types, miles, litres, bhps, transmissions, fuels, owners]

# Body types recognised in the key specs
typelist = ['hatchback', 'saloon', 'convertible', 'coupe', 'suv', 'mpv', 'estate', 'limousine',
            'pickup']

# Iterate through urls
page_count = 0  # renamed from `sum`, which shadowed the built-in
for url in urls:
    page_count += 1
    if page_count % 10 == 0:
        print(page_count)  # progress indicator
    # Attempt to connect to the url; skip this page if the request fails
    try:
        response = get(url)
    except Exception as error:
        print('oops', error)
        continue
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # Get a list of individual cars and iterate through it
    car_containers = html_soup.find_all('li', class_='search-page__result')
    for container in car_containers:
        # The rating badge is optional; find() returns None when it is missing
        try:
            rating = container.find("div", {"class": "js-tooltip"}).find("div", {"class": "pi-indicator js-tooltip-trigger"}).text.strip()
        except AttributeError:
            rating = ''
        ratings.append(rating)
        make = container.h2.text.strip().title().split(' ')[0]
        makes.append(make)
        price = container.find("div", {"class": "vehicle-price"}).text[1:]  # drop the currency symbol
        prices.append(price)
        specs = container.find("ul", {"class": "listing-key-specs"}).find_all("li", recursive=True)
        for spec in specs:
            # Registration year: first token is a four-digit number
            if spec.text.split(' ')[0].isdigit() and len(spec.text.split(' ')[0]) == 4:
                date = spec.text.split(' ')[0]
                dates.append(date)
            if 'mile' in str(spec):
                mile = spec.text.split(' ')[0]
                miles.append(mile)
            # Engine size such as "1.6L"
            if 'l' in str(spec).lower() and str(spec.text)[:-1].replace('.', '').isnumeric() and not spec.text.split(' ')[0].isdigit():
                litre = spec.text[:-1]
                litres.append(litre)
            if any(x in str(spec).lower() for x in ['automatic', 'manual']):
                transmission = spec.text
                transmissions.append(transmission)
            if any(x in str(spec).lower() for x in ['bhp', 'ps']):
                bhp = spec.text
                bhps.append(bhp)
            if any(x in str(spec).lower() for x in ['petrol', 'diesel']):
                fuel = spec.text
                fuels.append(fuel)
            if 'owner' in str(spec):
                owner = spec.text
                owners.append(owner.split(' ')[0])
            if any(x in str(spec).lower() for x in typelist):
                typ = spec.text
                types.append(typ)
        # Filling in empty spaces so every list stays as long as prices
        for attribute in attributes:
            if len(attribute) < len(prices):
                attribute.append('')

# Creating a dataframe from the lists
data = {'makes': makes,
        'Price': prices,
        'Rating': ratings,
        'Year': dates,
        'Type': types,
        'Miles': miles,
        'Litres': litres,
        'BHP': bhps,
        'Transmission': transmissions,
        'Fuel': fuels,
        'Owners': owners
        }
df = pd.DataFrame(data)
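
The code above keeps eleven parallel lists the same length by padding them after each car. One alternative is to build one dictionary per car so the columns cannot drift out of step. The sketch below is only an illustration: `classify_spec`, `build_row`, and the sample spec strings are hypothetical, and the matching rules are simplified assumptions rather than Auto Trader's actual markup. It checks mileage before year, so a spec such as "5000 miles" is not mistaken for a registration year.

```python
import re

# Column names mirror the lists in the question (assumption: one value per car).
FIELDS = ["Year", "Type", "Miles", "Litres", "BHP", "Transmission", "Fuel", "Owners"]
BODY_TYPES = ("hatchback", "saloon", "convertible", "coupe", "suv", "mpv",
              "estate", "limousine", "pickup")


def classify_spec(text):
    """Return (field, value) for one spec string, or None if unrecognised."""
    stripped = text.strip()
    lower = stripped.lower()
    first = lower.split(" ")[0] if lower else ""
    if "mile" in lower:                       # checked before the year rule
        return "Miles", first
    if re.fullmatch(r"\d{4}", first):         # e.g. "2015 (65 reg)"
        return "Year", first
    if re.fullmatch(r"\d+(\.\d+)?l", lower):  # e.g. "1.6L"
        return "Litres", lower[:-1]
    if lower in ("automatic", "manual"):
        return "Transmission", stripped
    if lower.endswith("bhp") or lower.endswith("ps"):
        return "BHP", stripped
    if lower in ("petrol", "diesel"):
        return "Fuel", stripped
    if "owner" in lower:
        return "Owners", first
    if lower in BODY_TYPES:
        return "Type", stripped
    return None


def build_row(make, price, spec_texts):
    """Build one row with every column present; missing specs stay ''."""
    row = {"Make": make, "Price": price}
    row.update({field: "" for field in FIELDS})
    for text in spec_texts:
        result = classify_spec(text)
        if result is not None:
            field, value = result
            row[field] = value
    return row


# Made-up sample data standing in for scraped spec text.
rows = [
    build_row("Ford", "7,995", ["2015 (65 reg)", "Hatchback", "30,000 miles",
                                "1.0L", "99bhp", "Manual", "Petrol"]),
    build_row("Bmw", "12,500", ["Saloon", "Diesel"]),
]
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which produces one column per key.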
Answer
Maybe just use a URL shortener if the URL is too long.
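
Separately from URL length, it may be worth checking where the results actually stop, since 1300 rows at 13 per page corresponds to exactly 100 pages. Whether the site caps paging at 100 is not confirmed here. The following is a minimal sketch under that assumption: `crawl_until_exhausted` and `fake_fetch` are hypothetical names, and `fake_fetch` merely simulates a site that returns nothing after page 100. In a real run it would be replaced by a function that requests the page and returns the parsed adverts.

```python
def crawl_until_exhausted(fetch_page, max_pages=500):
    """Collect results page by page; stop when a page is empty or repeats."""
    results = []
    previous = None
    last_page = 0
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        # An empty page, or the same page served again, means there is no more data.
        if not items or items == previous:
            break
        results.extend(items)
        previous = items
        last_page = page
    return results, last_page


def fake_fetch(page):
    # Stand-in for a real request and parse step (assumption):
    # 13 adverts per page, nothing after page 100.
    if page > 100:
        return []
    return [f"advert-{page}-{n}" for n in range(13)]


results, last_page = crawl_until_exhausted(fake_fetch)
print(len(results), last_page)  # prints "1300 100"
```

If the real fetch function reports the same last page every time, the cap comes from the site rather than from the scraper, and narrowing the search (for example by make or price band) is one way to reach more adverts.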