python - Python Returning An Empty List Using Beautiful Soup HTML Parsing
问题描述
I'm currently working on a project that involves web scraping a real estate website (for educational purposes). I'm taking data from home listings like address, price, bedrooms, etc.
After building and testing along the way with the print function (it worked successfully!), I'm now building a dictionary for each data point in the listing. I'm storing that dictionary in a list in order to eventually use Pandas to create a table and send to a CSV.
Here is my problem. My list is displaying an empty dictionary with no error. Please note, I've successfully scraped the data already and have seen the data when using the print function. Now its displaying nothing after adding each data point to a dictionary and putting it in a list. Here is my code:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all("div", {"class":"infinite-item"})
all[0].find("a",{"class":"listing-price"}).text.replace("\n","").replace(" ","")
l=[]
for item in all:
d={}
try:
d["Price"]=item.find("a",{"class":"listing-price"}.text.replace("\n","").replace(" ",""))
d["Address"]=item.find("div",{"class":"property-address"}).text.replace("\n","").replace(" ","")
d["City"]=item.find_all("div",{"class":"property-city"})[0].text.replace("\n","").replace(" ","")
try:
d["Beds"]=item.find("div",{"class":"property-beds"}).find("strong").text
except:
d["Beds"]=None
try:
d["Baths"]=item.find("div",{"class":"property-baths"}).find("strong").text
except:
d["Baths"]=None
try:
d["Area"]=item.find("div",{"class":"property-sqft"}).find("strong").text
except:
d["Area"]=None
except:
pass
l.append(d)
When I call l
(the list that contains my dictionary) - this is what I get:
[{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{}]
I'm using Python 3.8.2 with Beautiful Soup 4. Any ideas or help with this would be greatly appreciated. Thanks!
解决方案
This does what you want much more concisely and is more pythonic (using nested list comprehension):
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c = r.content
soup = BeautifulSoup(c, "html.parser")
css_classes = [
"listing-price",
"property-address",
"property-city",
"property-beds",
"property-baths",
"property-sqft",
]
pl = [{css_class.split('-')[1]: item.find(class_=css_class).text.strip() # this shouldn't error if not found
for css_class in css_classes} # find each class in the class list
for item in soup.find_all('div', class_='property-card-primary-info')] # find each property card div
print(pl)
Output:
[{'address': '512 Silver Oak Grove',
'baths': '6 baths',
'beds': '4 beds',
'city': 'Colorado Springs CO 80906',
'price': '$1,595,000',
'sqft': '6,958 sq. ft'},
{'address': '8910 Edgefield Drive',
'baths': '5 baths',
'beds': '5 beds',
'city': 'Colorado Springs CO 80920',
'price': '$499,900',
'sqft': '4,557 sq. ft'},
{'address': '135 Mayhurst Avenue',
'baths': '3 baths',
'beds': '3 beds',
'city': 'Colorado Springs CO 80906',
'price': '$420,000',
'sqft': '1,889 sq. ft'},
{'address': '7925 Bard Court',
'baths': '4 baths',
'beds': '5 beds',
'city': 'Colorado Springs CO 80920',
'price': '$405,000',
'sqft': '3,077 sq. ft'},
{'address': '7641 N Sioux Circle',
'baths': '3 baths',
'beds': '4 beds',
'city': 'Colorado Springs CO 80915',
'price': '$389,900',
'sqft': '3,384 sq. ft'},
...
]
推荐阅读
- python - 在 TensorFlow 图中使用预训练的 Keras 模型
- powershell - 使用调试器,查看函数的输出流
- java - 为什么默认值会覆盖已通过覆盖方法设置的值
- python - Pandas DataFrame:如何根据特定列的值巧妙地选择数据?
- cordova - 带有离子和电容器的 PWA,本机插件在浏览器和设备上抱怨 cordova 不可用
- django - 如何在 Django 中使用 TimestampSigner 反向匹配 url?
- c# - NHibernate 将本地日期年份和月份与日期集合进行比较
- c# - 文本框不更新 CanExecute
- java - 我如何知道准备好的语句是否被缓存?
- embedded - 链接描述文件是否始终确定代码放置在哪些地址