New to web scraping: how to extract a title from a div, and how to put scraped data into a DataFrame

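Problem description

I recently started my data science journey. I am using a Jupyter notebook in Google Colab for this task.

First question

I am trying to scrape data from a real estate website. For each listing I want the property name, price, location, number of beds, and number of baths.

https://www.zameen.com/Homes/Lahore-1-1.html

Inspecting the page shows: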

<span class="_4720d1a0 "><span class="_0c8a5353 c1b40987"></span><span aria-label="Beds" class="b6a29bc0">5</span></span>,
 <span class="_0c8a5353 c1b40987"></span>,
 <span aria-label="Beds" class="b6a29bc0">5</span>,
 <span class="_4720d1a0 "><span class="_0c8a5353 fa6c05cc"></span><span aria-label="Baths" class="b6a29bc0">6</span></span>,
 <span class="_0c8a5353 fa6c05cc"></span>,
 <span aria-label="Baths" class="b6a29bc0">6</span>,
 <span class="_4720d1a0 "><span class="_0c8a5353 d2db01cb"></span><span aria-label="Area" class="b6a29bc0"><div class="_7ac32433" title="1 Kanal Luxury Bungalow For Sale In Lahore Dha"><div class="_1e0ca152 _026d7bff"><div><span>1 Kanal</span></div></div></div></span></span>,
 <span class="_0c8a5353 d2db01cb"></span>

I was able to get the price, location, beds, and baths as lists.

Finding the area of each property

property = soup.find_all("span", attrs={"aria-label":"Area"})

Finding the price of each property

property = soup.find_all("span", attrs={"class":"f343d9ce"})

But I cannot work out how to extract the property title, which sits inside a span and then inside a div, in its title attribute.

<span aria-label="Area" class="b6a29bc0"><div class="_7ac32433" title="1 Kanal Luxury Bungalow For Sale In Lahore Dha"><div class="_1e0ca152 _026d7bff"><div><span>1 Kanal</span></div></div></div></span>

Finding the title of each property

property = soup.find_all("div", class_="_7ac32433")
for i in property:
  print(i.get_text())

It only prints

PKR5.5 Crore
1 Kanal
PKR6.5 Crore
1 Kanal
PKR69.9 Lakh
5 Marla
PKR4.45 Crore
1 Kanal
PKR6.29 Crore
1 Kanal
PKR2.25 Crore
10 Marla
PKR55 Lakh
5 Marla
PKR5.28 Crore
1 Kanal
PKR1.4 Crore
5.5 Marla
PKR1.05 Crore
4 Marla
PKR5.15 Crore
1.1 Kanal
PKR6.35 Crore
1 Kanal
PKR1.15 Crore
5 Marla
PKR68 Lakh
3 Marla
PKR3.6 Crore
1 Kanal
PKR2.25 Crore
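
The output above contains only visible text because get_text() returns the text nodes inside each element, while the listing title is stored in the element's title attribute. The following is a minimal sketch of reading that attribute, assuming the markup matches the snippet quoted in the question. The sample_html string is copied from that snippet rather than fetched from the live site, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; the live page may differ.
sample_html = (
    '<span aria-label="Area" class="b6a29bc0">'
    '<div class="_7ac32433" title="1 Kanal Luxury Bungalow For Sale In Lahore Dha">'
    '<div class="_1e0ca152 _026d7bff"><div><span>1 Kanal</span></div></div>'
    '</div></span>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# Select only the div elements that actually carry a title attribute.
title_divs = soup.find_all("div", class_="_7ac32433", title=True)

# .get("title") reads the attribute; .get_text() reads the visible text.
titles = [div.get("title") for div in title_divs]
visible_text = [div.get_text(strip=True) for div in title_divs]

print(titles)
print(visible_text)
```

Filtering with title=True skips any element of the same class that has no title attribute, so the list does not pick up None values.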

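Second question

Once I can extract the data I need from the URL, how do I create a DataFrame and load the data into it for a data science project? I am very new to this, so I cannot even structure the code yet.

Tags: python, html, jupyter

Solution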


Here is an example that extracts the area, price, and title and adds them to a DataFrame:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.zameen.com/Homes/Lahore-1-1.html'

# Download the listing page
page = requests.get(url)

# Parse it with the parser that ships with Python
html = BeautifulSoup(page.text, 'html.parser')

# Area: visible text of each span labelled "Area"
area_elements = html.find_all("span", attrs={"aria-label":"Area"})
areas = [el.text for el in area_elements]

# Price: visible text of each price span
price_elements = html.find_all("span", attrs={"class":"f343d9ce"})
prices = [el.text for el in price_elements]

# Title: stored in the title attribute of each link, not in its text
title_elements = html.find_all("a", attrs={"class":"_7ac32433"})
titles = [el.get('title') for el in title_elements]

# Create the DataFrame (all three lists must have the same length)
df = pd.DataFrame({
    'area': areas,
    'title': titles,
    'price': prices
})

df.head()
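
pd.DataFrame raises a ValueError when the lists passed in a dict have different lengths, which can happen if a listing lacks one of the fields. The sketch below shows one possible way to handle that: it reports the mismatch and wraps each list in pd.Series so that shorter columns are padded with NaN. The sample values are illustrative, taken from the text quoted in the question, and the titles list is deliberately short. Padding only makes the frame constructible; it does not guarantee that the values in a row belong to the same listing.

```python
import pandas as pd

# Illustrative values; in practice these come from the scraping step.
areas = ["1 Kanal", "1 Kanal", "5 Marla"]
titles = ["1 Kanal Luxury Bungalow For Sale In Lahore Dha"]
prices = ["PKR5.5 Crore", "PKR6.5 Crore", "PKR69.9 Lakh"]

columns = {"area": areas, "title": titles, "price": prices}

# Report length mismatches before building the frame.
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) > 1:
    print("Column lengths differ:", lengths)

# Series align on their integer index, so shorter columns get NaN.
df = pd.DataFrame(
    {name: pd.Series(values, dtype="object") for name, values in columns.items()}
)

print(df)
```

If the lengths differ on the real page, a more reliable approach is to loop over each listing's container element and extract all fields from inside it, so every row comes from a single listing.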
