Finding items in web scraping

Question

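I'm looking for a way to scrape the author and the price of each item in the Amazon store, and then strip the dollar sign so that the output keeps only the number (for example, 3.99).

So far I have managed to get the title and the rank, but I'm not sure how to retrieve the author's name.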

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = '             '
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "html.parser")

rate = []

for x in soup.select("li.zg-item-immersion"):
    item = {}
    
    item['name'] = x.select_one('a').get_text(strip=True)

    item['rank'] = x.select_one('span span').get_text(strip=True)

    rate.append(item)
        
rate

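Running the code above gives me a list of entries, each with a name and a rank. (The original post showed this output as a screenshot.)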

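I'd also like to know how to remove the part in parentheses from each name.

For example, "Right Behind Her (Bree Taggert Book 4)" should become "Right Behind Her".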

Tags: python, web-scraping, beautifulsoup

Answer

You can get the price without the dollar sign like this:

x.find("span", {"class": "p13n-sc-price"}).get_text().split('$')[1]
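The one-liner above assumes every item has a price element and that its text contains a "$"; if either is missing, it raises an exception. As a minimal sketch of a more forgiving alternative, the hypothetical helper below (parse_price is not part of the original answer) extracts the first number from a price string with a regular expression and returns None when there is none. It takes the element's text, so a missing element still has to be checked for before calling it.

```python
import re
from typing import Optional

# Hypothetical helper, not from the original answer: matches the first
# number in a string, allowing thousands separators and a decimal part.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: Optional[str]) -> Optional[str]:
    """Return the numeric part of a price string, or None if none is found."""
    if not text:
        return None
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators so "1,299.00" becomes "1299.00".
    return match.group(0).replace(",", "")


print(parse_price("$3.99"))
print(parse_price("Free"))
```

In the scraping loop, this could be called on the price element's text once you have confirmed that find() did not return None.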

Full code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://www.amazon.com/Best-Sellers-Kindle-Store/zgbs/digital-text'
html_content = requests.get(url).text
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html_content, "html.parser")

rate = []

for x in soup.select("li.zg-item-immersion"):
    item = {}

    # Keep only the text before the first "(" to drop the series suffix.
    item['name'] = x.select_one('a').get_text(strip=True).split('(')[0].strip()

    item['rank'] = x.select_one('span span').get_text(strip=True)

    # "$14.99" -> "14.99"
    item['price'] = x.find("span", {"class": "p13n-sc-price"}).get_text().split('$')[1]

    # Some entries have no author link; find() then returns None.
    author = x.find("a", {"class": "a-size-small a-link-child"})
    item['author'] = author.text if author is not None else 'Not Found Author Name'

    rate.append(item)

rate

Output:

[{'name': 'Peril', 'rank': '#1', 'price': '14.99', 'author': 'Bob Woodward'},
 {'name': 'Apples Never Fall',
  'rank': '#2',
  'price': '14.99',
  'author': 'Liane Moriarty'},
...
]
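The name cleanup in the full code, split('(')[0], cuts the title at the first opening parenthesis, wherever it appears. If a title could contain parentheses earlier in the text, a narrower rule may be safer. The sketch below is an alternative, not the answer's method: the hypothetical helper strip_series_suffix removes only a single parenthesized group at the very end of the title.

```python
import re

# Hypothetical helper, not from the original answer: matches one "(...)"
# group at the end of the string, plus any whitespace around it.
_TRAILING_PARENS_RE = re.compile(r"\s*\([^()]*\)\s*$")


def strip_series_suffix(title: str) -> str:
    """Remove a trailing parenthesized suffix such as "(Bree Taggert Book 4)"."""
    return _TRAILING_PARENS_RE.sub("", title).strip()


print(strip_series_suffix("Right Behind Her (Bree Taggert Book 4)"))
print(strip_series_suffix("Peril"))
```

Titles without a trailing parenthesized group are returned unchanged apart from surrounding whitespace.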
