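python - Scraping deep web data with Python in Google Colaboratory
Problem description
I have code from an answer to another question of mine.
It extracts the data from each listing page. My next question is how to pull the data for each individual dress, such as the model's name, the model's size, and the features.
On top of that, each dress has more than one model (for example, the BOHO BIRD Amore Wrap Dress has 3 models, wearing sizes 10, 14, and 16).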
import json
import requests
from bs4 import BeautifulSoup

cookies = {
    "ServerID": "1033",
    "__zlcmid": "10tjXhWpDJVkUQL",
}
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
}


def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
    return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]


all_pages = []
for page in range(1, 29):
    # Progress output: everything collected so far, plus a running count.
    print(f"{all_pages}\nFound: {len(all_pages)} dresses.")
    current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
    source = requests.get(current_page, headers=headers, cookies=cookies)
    soup = BeautifulSoup(source.content, 'html.parser')
    brand = extract_info(soup, tag="strong", attr_value="brand")
    name = extract_info(soup, tag="h2", attr_value="name")
    price = extract_info(soup, tag="span", attr_value="price")
    # One dict per dress; zip pairs up the three parallel lists.
    all_pages.extend(
        [
            {
                "brand": b,
                "name": n,
                "price": p,
            } for b, n, p in zip(brand, name, price)
        ]
    )

# Write the combined results once, after all pages have been fetched.
with open("all_the_dresses2.json", "w") as jf:
    json.dump(all_pages, jf, indent=4)
Solution
The information you want is generated dynamically, so you won't get it with requests. I suggest using Selenium for this.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

link = 'https://www.birdsnest.com.au/brands/boho-bird/73067-amore-wrap-dress'

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

# Selenium 3 style: the chromedriver path is passed positionally.
# Selenium 4 expects it via selenium.webdriver.chrome.service.Service instead.
driver = webdriver.Chrome('C:/Users/../Downloads/../chromedriver.exe', options=options)
driver.get(link)
time.sleep(3)  # give the page's JavaScript time to render the model info

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.close()

page_new = soup.find('div', class_='model-info clearfix')
results = page_new.find_all('p')
for result in results:
    print(result.text)
Output
Marnee usually wears a size 8.
She is wearing a size 10 in this style.
Her height is 178 cm.
Show Marnee’s body measurements
Marnee’s body measurements are:
Bust 81 cm
Waist 64 cm
Hips 89 cm
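If you need these values as fields rather than sentences, one option is to run a few regular expressions over the printed lines. This is only a rough sketch: the `PATTERNS` table and the `parse_model_lines` helper are hypothetical names, and the patterns are written against the wording of the sample output above, so they assume numeric sizes and would need adjusting if the site phrases things differently.

```python
import re

# Sample lines copied from the output above.
lines = [
    "Marnee usually wears a size 8.",
    "She is wearing a size 10 in this style.",
    "Her height is 178 cm.",
    "Show Marnee’s body measurements",
    "Marnee’s body measurements are:",
    "Bust 81 cm",
    "Waist 64 cm",
    "Hips 89 cm",
]

# Hypothetical patterns, based only on the sample wording (assumption).
PATTERNS = {
    "name": re.compile(r"^(\w+) usually wears a size"),
    "usual_size": re.compile(r"usually wears a size (\d+)"),
    "wearing_size": re.compile(r"is wearing a size (\d+)"),
    "height_cm": re.compile(r"height is (\d+) cm"),
    "bust_cm": re.compile(r"^Bust (\d+) cm"),
    "waist_cm": re.compile(r"^Waist (\d+) cm"),
    "hips_cm": re.compile(r"^Hips (\d+) cm"),
}


def parse_model_lines(lines):
    """Collect the first match of each pattern into a single dict."""
    record = {}
    for line in lines:
        text = line.strip()
        for key, pattern in PATTERNS.items():
            match = pattern.search(text)
            if match and key not in record:
                value = match.group(1)
                # Store numbers as int, everything else as str.
                record[key] = int(value) if value.isdigit() else value
    return record


model = parse_model_lines(lines)
print(model)
```

Fields that never match are simply left out of the dict, so a missing measurement does not raise an error.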
<div class="model-info-header">
<p>
<strong><span class="model-info__name">Marnee</span></strong> usually wears a size <strong><span class="model-info__standard-size">8</span></strong>.
She is wearing a size <strong><span class="model-info__wears-size">10</span></strong> in this style.
</p>
<p class="model-info-header__height">Her height is <strong><span class="model-info__height">178 cm</span></strong>.</p>
<p>
<span class="js-model-info-more model-info__link model-info-header__more">Show <span class="model-info__name">Marnee</span>’s body measurements</span>
</p>
</div>
With requests, you would miss all of the bold data you want.
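Since the question mentions dresses with several models, here is a minimal sketch of pulling the values out by the class names visible in the markup above, once the rendered page source is available. The `FIELDS` mapping and the `parse_models` helper are hypothetical names, and the sketch assumes each model gets its own `model-info-header` div, which I have not verified for pages with more than one model.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the snippet above (a single model).
html = """
<div class="model-info-header">
<p>
<strong><span class="model-info__name">Marnee</span></strong> usually wears a size <strong><span class="model-info__standard-size">8</span></strong>.
She is wearing a size <strong><span class="model-info__wears-size">10</span></strong> in this style.
</p>
<p class="model-info-header__height">Her height is <strong><span class="model-info__height">178 cm</span></strong>.</p>
</div>
"""

# Output key -> span class name, taken from the markup above.
FIELDS = {
    "name": "model-info__name",
    "usual_size": "model-info__standard-size",
    "wearing_size": "model-info__wears-size",
    "height": "model-info__height",
}


def parse_models(markup):
    """Return one dict per model-info-header block (assumed one per model)."""
    soup = BeautifulSoup(markup, "html.parser")
    models = []
    for header in soup.find_all("div", class_="model-info-header"):
        record = {}
        for key, css_class in FIELDS.items():
            span = header.find("span", class_=css_class)
            # None marks a field that is absent from this block.
            record[key] = span.get_text(strip=True) if span else None
        models.append(record)
    return models


print(parse_models(html))
```

In the Selenium code above you would pass `driver.page_source` to `parse_models` instead of the sample string.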