How to scrape data that shares the same class name

Problem description

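I am trying to scrape a real-estate website, but I ran into a div nested under another div with the same class name, and that div in turn contains two more divs that also share a class name. I want to scrape the data in the child class (I think).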

I want to scrape the data in the following class:

<div class="m-srp-card__summary__info">New Property</div>

Below is the whole block of markup I am trying to scrape:

<div class="m-srp-card__collapse js-collapse" aria-collapsed="collapsed" data-container="srp-card-summary">
   <div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">   
   <input type="hidden" id="propertyArea42679361" value="888 sqft">
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">carpet area</div>
        <div class="m-srp-card__summary__info">888&nbsp;sqft</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">status</div>
        <div class="m-srp-card__summary__info">Ready to Move</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">floor</div>
        <div class="m-srp-card__summary__info">9 out of 13 floors</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">transaction</div>
        <div class="m-srp-card__summary__info">New Property</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">furnishing</div>
        <div class="m-srp-card__summary__info">Unfurnished</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">facing</div>
        <div class="m-srp-card__summary__info">South -West</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">overlooking</div>
        <div class="m-srp-card__summary__info">Garden/Park, Main Road</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">society</div>
        <div class="m-srp-card__summary__info">
        <a id="project-link-42679361" class="m-srp-card__summary__link" 
        href="https://www.magicbricks.com/skylights-bopal-ahmedabad-pdpid-4d4235303936323633" 
        target="_blank">Skylights</a>
        </div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">car parking</div>
        <div class="m-srp-card__summary__info">1 Covered</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">bathroom</div>
        <div class="m-srp-card__summary__info">3</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">balcony</div>
        <div class="m-srp-card__summary__info">2</div>
      </div>
      <div class="m-srp-card__summary__item">
        <div class="m-srp-card__summary__title">ownership</div>
        <div class="m-srp-card__summary__info">Co-operative Society</div>
      </div>
    </div>
    <div class="m-srp-card__collapse__control js-collapse__control" data-toggle="list-collapse" 
     data-target="srp-card-summary" onclick="stopPage=true;">
  <div class="ico m-srp-card__ico">
  <svg role="icon">
   <use xlink:href="#icon-caret-down"></use>
  </svg>
</div>

I tried indexing but got nothing.

Here is my code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')
containers = soup.find_all('div', {'class': 'm-srp-card__desc flex__item'})
container = containers[0]
no_apartment = container.find('h3').find('span', {'class': 'm-srp-card__title__bhk'}).getText()
c_area = container.find('div', {'class': 'm-srp-card__summary__info'}).getText()
p_price = container.find('div', {'class': 'm-srp-card__info flex__item'})
p_type = container.find('div', {'class': 'm-srp-card__summary js-collapse__content'})[3].find('div', {'class': 'm-srp-card__summary__info'})

Thanks in advance!

Tags: python, web-scraping, beautifulsoup

Solution
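
A note on the indexing attempt in the question: `find()` returns a single element (or `None`), not a list, so putting `[3]` after it does not select the fourth match; on a Tag, square brackets look up an HTML attribute instead. `find_all()` returns a list-like result that can be indexed. A minimal sketch, assuming the page markup matches the snippet in the question (the `html` string is a trimmed, hand-copied sample, not a live response):

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup from the question (first four items only).
html = """
<div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">carpet area</div>
    <div class="m-srp-card__summary__info">888&nbsp;sqft</div>
  </div>
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">status</div>
    <div class="m-srp-card__summary__info">Ready to Move</div>
  </div>
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">floor</div>
    <div class="m-srp-card__summary__info">9 out of 13 floors</div>
  </div>
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">transaction</div>
    <div class="m-srp-card__summary__info">New Property</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
summary = soup.find("div", class_="m-srp-card__summary")

# find_all() returns a list-like ResultSet, so positional indexing works here.
items = summary.find_all("div", class_="m-srp-card__summary__item")
p_type = items[3].find("div", class_="m-srp-card__summary__info").get_text(strip=True)
print(p_type)
```

Positional indexing only works if "transaction" is always the fourth item, which the question does not confirm, so looking the value up by its title is likely more robust.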


import csv
import re

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad')
soup = BeautifulSoup(r.text, 'html.parser')

category = []
size = []
price = []
floor = []
# Apartment type taken from each card title.
for item in soup.find_all('span', {'class': 'm-srp-card__title__bhk'}):
    category.append(item.get_text(strip=True))
# Any title text ending in "area" (e.g. "carpet area"); the value is the next div.
for item in soup.find_all(string=re.compile('area$')):
    size.append(item.find_next('div').text)
for item in soup.find_all('span', {'class': 'm-srp-card__price'}):
    price.append(item.text)
# The title text is exactly "floor"; the value is the next div.
for item in soup.find_all(string='floor'):
    floor.append(item.find_next('div').text)

# zip() stops at the shortest list, so rows are only built while all four lists have values.
data = list(zip(category, size, price, floor))

with open('output.csv', 'w', newline='', encoding='UTF-8-SIG') as file:
    writer = csv.writer(file)
    writer.writerow(['Category', 'Size', 'Price', 'Floor'])
    writer.writerows(data)
    print("Operation Completed")
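
The loops above build four independent lists and pair them with `zip`, which could misalign rows if a listing lacks one of the fields. One alternative, sketched here under the assumption that every listing uses the title/info pairs shown in the question, is to read each summary block into a dict keyed by title, so the `transaction` value the question asks about can be looked up by name. The helper name `summary_to_dict` and the two-card sample `html` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample: two summary blocks, the second one has no "transaction" item.
html = """
<div class="m-srp-card__summary js-collapse__content">
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">floor</div>
    <div class="m-srp-card__summary__info">9 out of 13 floors</div>
  </div>
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">transaction</div>
    <div class="m-srp-card__summary__info">New Property</div>
  </div>
</div>
<div class="m-srp-card__summary js-collapse__content">
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">floor</div>
    <div class="m-srp-card__summary__info">2 out of 4 floors</div>
  </div>
</div>
"""


def summary_to_dict(summary):
    """Map each title (e.g. 'transaction') to its info text for one summary block."""
    fields = {}
    for item in summary.find_all("div", class_="m-srp-card__summary__item"):
        title = item.find("div", class_="m-srp-card__summary__title")
        info = item.find("div", class_="m-srp-card__summary__info")
        if title is not None and info is not None:
            fields[title.get_text(strip=True)] = info.get_text(" ", strip=True)
    return fields


soup = BeautifulSoup(html, "html.parser")
rows = [summary_to_dict(s) for s in soup.find_all("div", class_="m-srp-card__summary")]

for row in rows:
    # .get() returns None when a listing has no such field, keeping rows aligned.
    print(row.get("floor"), "|", row.get("transaction"))
```

Because each row is built from one listing at a time, a missing field shows up as an empty value in that row rather than shifting values between listings.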


