Scraping deep web data with Python in Google Colaboratory

Problem description

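I have the code below, taken from an answer to another question of mine.

It extracts the data from every listing page. My next question is how to pull the data for each individual dress, such as the model's name, the model's size, and the dress's features.

On top of that, each dress has more than one model (for example, the BOHO BIRD Amore Wrap Dress has 3 models, wearing sizes 10, 14, and 16).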

    import json
        
    import requests
    from bs4 import BeautifulSoup

    cookies = {
        "ServerID": "1033",
        "__zlcmid": "10tjXhWpDJVkUQL",
    }
    
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"
    }

    def extract_info(bs: BeautifulSoup, tag: str, attr_value: str) -> list:
        return [i.text.strip() for i in bs.find_all(tag, {"itemprop": attr_value})]

    all_pages = []
    for page in range(1, 29):
        print(f"{all_pages}\nFound: {len(all_pages)} dresses.")
    
        current_page = f"https://www.birdsnest.com.au/womens/dresses?page={page}"
        source = requests.get(current_page, headers=headers, cookies=cookies)
        soup = BeautifulSoup(source.content, 'html.parser')
    
        brand = extract_info(soup, tag="strong", attr_value="brand")
        name = extract_info(soup, tag="h2", attr_value="name")
        price = extract_info(soup, tag="span", attr_value="price")
    
        all_pages.extend(
            [
                {
                    "brand": b,
                    "name": n,
                    "price": p,
                } for b, n, p in zip(brand, name, price)
            ]
        )
    
    with open("all_the_dresses2.json", "w") as jf:
        json.dump(all_pages, jf, indent=4)

Tags: python, web-scraping, beautifulsoup, google-colaboratory

Solution


The information you want is generated dynamically by JavaScript, so you won't get it with requests. I suggest using Selenium for this.

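    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    link = 'https://www.birdsnest.com.au/brands/boho-bird/73067-amore-wrap-dress'

    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')

    # Selenium 4 takes the driver path through a Service object
    # rather than as a positional argument.
    driver = webdriver.Chrome(
        service=Service('C:/Users/../Downloads/../chromedriver.exe'),
        options=options,
    )
    driver.get(link)
    time.sleep(3)  # crude wait for the JavaScript-rendered content

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()  # end the browser session

    # The model details live in this container on the rendered page.
    page_new = soup.find('div', class_='model-info clearfix')
    results = page_new.find_all('p')
    for result in results:
        print(result.text)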

Output

    Marnee usually wears a size 8.
                    She is wearing a size 10 in this style.

    Her height is 178 cm.

    Show Marnee’s body measurements

    Marnee’s body measurements are:
    Bust 81 cm
    Waist 64 cm
    Hips 89 cm
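
If structured fields are more useful than raw paragraphs, the printed text can be parsed with a few regular expressions. This is only a minimal sketch: `parse_model_text` and the dictionary keys are hypothetical names, and the patterns assume the wording stays exactly as in the output above.

```python
import re

# Sample text copied from the output above.
SAMPLE = """Marnee usually wears a size 8.
                She is wearing a size 10 in this style.

Her height is 178 cm.

Show Marnee’s body measurements

Marnee’s body measurements are:
Bust 81 cm
Waist 64 cm
Hips 89 cm
"""


def parse_model_text(text: str) -> dict:
    """Hypothetical helper: pull fields out of the printed paragraph text."""
    flat = " ".join(text.split())  # collapse the ragged whitespace
    info = {}

    match = re.search(r"(\w+) usually wears a size (\d+)", flat)
    if match:
        info["name"] = match.group(1)
        info["usual_size"] = int(match.group(2))

    match = re.search(r"wearing a size (\d+)", flat)
    if match:
        info["wearing_size"] = int(match.group(1))

    match = re.search(r"height is (\d+) cm", flat)
    if match:
        info["height_cm"] = int(match.group(1))

    # Assumes the three measurements are labelled Bust, Waist, and Hips.
    for part, value in re.findall(r"(Bust|Waist|Hips) (\d+) cm", flat):
        info[f"{part.lower()}_cm"] = int(value)

    return info


model = parse_model_text(SAMPLE)
print(model)
```

In the Selenium script, the text passed in would be the joined `result.text` values rather than a hard-coded sample. Fields that do not match are simply left out of the dictionary.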

The relevant part of the rendered page's markup:

    <div class="model-info-header">
      <p>
        <strong><span class="model-info__name">Marnee</span></strong> usually wears a size <strong><span class="model-info__standard-size">8</span></strong>.
        She is wearing a size <strong><span class="model-info__wears-size">10</span></strong> in this style.
      </p>
      <p class="model-info-header__height">Her height is <strong><span class="model-info__height">178 cm</span></strong>.</p>
      <p>
        <span class="js-model-info-more model-info__link model-info-header__more">Show <span class="model-info__name">Marnee</span>’s body measurements</span>
      </p>
    </div>

With requests alone, you would miss all of the bold data you are after (the values inside the <strong> tags).
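
Once the rendered HTML is available, those bold values can be read directly from their `<span>` classes instead of printing whole paragraphs. The sketch below uses only the class names visible in the markup shown above. `parse_models`, `FIELDS`, and the output keys are hypothetical names, and it is an assumption (not confirmed by the source) that each model on a dress page gets its own `model-info-header` block.

```python
from bs4 import BeautifulSoup

# Markup copied from the rendered page shown above.
SAMPLE_HTML = """
<div class="model-info-header">
  <p>
    <strong><span class="model-info__name">Marnee</span></strong> usually wears a size
    <strong><span class="model-info__standard-size">8</span></strong>.
    She is wearing a size <strong><span class="model-info__wears-size">10</span></strong> in this style.
  </p>
  <p class="model-info-header__height">Her height is
    <strong><span class="model-info__height">178 cm</span></strong>.</p>
</div>
"""

# Output key -> CSS class of the <span> holding that value.
FIELDS = {
    "name": "model-info__name",
    "usual_size": "model-info__standard-size",
    "wearing_size": "model-info__wears-size",
    "height": "model-info__height",
}


def parse_models(html: str) -> list:
    """Hypothetical helper: build one dict per model-info-header block."""
    soup = BeautifulSoup(html, "html.parser")
    models = []
    # Assumption: every model on the page has its own header block.
    for header in soup.find_all("div", class_="model-info-header"):
        record = {}
        for key, css_class in FIELDS.items():
            span = header.find("span", class_=css_class)
            # A missing span becomes None instead of raising an error.
            record[key] = span.get_text(strip=True) if span else None
        models.append(record)
    return models


models = parse_models(SAMPLE_HTML)
print(models)
```

With Selenium, `driver.page_source` would be passed in place of `SAMPLE_HTML`. If a dress page really does contain several such blocks, the returned list would hold one entry per model, which is what the question asks for.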

