python - 如何将 BeautifulSoup 结果中的 URL 存储到列表中,然后存储到表中
问题描述
我正在抓取一个房地产网页,试图获取一些 URL,然后创建一个表格。 https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html 我有好几天要尝试
- 将结果存储到列表或字典中,然后
- 创建一个表,但我真的卡住了
from bs4 import BeautifulSoup
import requests
import re
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')
#Extract URL
link_text = ''
URL=[]
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
link_text = a['href']
URL='https://www.zonaprop.com.ar'+link_text
print(URL)
好的,输出对我来说没问题:
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-esquina-en-alquiler-s-lote-propio-con-43776599.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html#map
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/excelente-local-en-alquiler-palermo-hollywood-fitz-44505027.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-palermo-hollywood-44550855.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-comercial-o-edificio-corporativo-oficinas-500-43164952.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html#map
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/local-palermo-viejo-44622843.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html#map
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
https://www.zonaprop.com.ar/propiedades/alquiler-de-local-comercial-en-palermo-hollywood-44571635.html
问题是输出是真实的链接(您可以单击它们并转到页面)
但是当我尝试将它存储在一个新变量中(列表或字典,列名为“地址”以加入“PlacesDf”(相同的列名“地址”))/转换为表/或任何我找不到解决方案的技巧. 事实上,当我尝试转换为 pandas 时:
Address = pd.dataframe(URL)
它只创建一个单行表。
我希望看到类似的东西
Adresses=['https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html#map','
https://www.zonaprop.com.ar/propiedades/local-en-alquiler-soler-6000-palermo-hollywood-a-44227001.html',...]
或字典或任何我可以转向熊猫的桌子
解决方案
您应该执行以下操作:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
source=requests.get('https://www.zonaprop.com.ar/locales-comerciales-alquiler-palermo-hollywood-0-ambientes-publicado-hace-menos-de-1-mes.html').text
soup=BeautifulSoup(source,'lxml')
#Extract URL
all_url = []
link_text = ''
PlacesDf = pd.DataFrame(columns=['Address', 'Location.lat', 'Location.lon'])
for a in soup.find_all('a', attrs={'href': re.compile("/propiedades/")}):
link_text = a['href']
URL='https://www.zonaprop.com.ar'+link_text
print(URL)
all_url.append(URL)
df = pd.DataFrame({"URLs":all_url}) #replace "URLs" with your desired column name
希望这可以帮助
推荐阅读
- android - 使用 Intent putExtra Serializable 传输时应用程序终止
- java - 通过身份验证将 pdf 内容在线传输到 BufferedInputStream
- c# - 如何从错误 1216-无法添加或更新子行中获得更透明的消息:MySQL 中的外键约束失败?
- php - htaccess 重定向在实时站点中不起作用
- flutter - Flutter - 如何同步两个或多个 PageController
- c++ - rs2_device_info 详细定义在哪里?
- python - 检测到带有 Chromedriver 的 Python Selenium
- jupyterhub - pre-commit run 路径python3.6(来自--python=python3.6)不存在
- c# - VS2019 IDE 和命令行编译器在 C# 语句上存在分歧
- reactjs - CSS在推送到网站时不显示,但在本地主机上显示 - ReactJS / Gatsby