python - Scraping specific text from a specific table element returns the wrong data
Question
Thanks to SO and @QHarr, the code below works well for URLs such as
https://www.amazon.com/dp/B00FSCBQV2
but it does not work for URLs like this one:
https://www.amazon.com/dp/B01N1ZD912/
My result is:
'R1_NO' :'.zg_hrsr { margin: 0; padding: 0; list-style-type:
none;}\n.zg_hrsr_item { margin: 0 0 0 10px; }\n.zg_hrsr_rank {
display:inline-block; width: 80px; text-align: right; }'}'
It should return
R1_NO = 42553
R1_CAT = Baby Care Products
R2_NO = 6452
R2_CAT = Baby Bathing Products (Health & Household)
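This happens because the rank data is not in the first row. What needs to change to get the expected result? Can the script also be made shorter or more efficient?
I tried handling it with bs4 select_one, get_text and strip, but nothing I did worked. Please help!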
fields = ['Amazon Best Sellers Rank']
temp_dict = {}

for field in fields:
    element = soup.select_one('li:contains("' + field + '")')
    if element is None:
        temp_dict[field] = 'NA'
    else:
        if field == 'Amazon Best Sellers Rank':
            item = 'NA'
            item = [re.sub(r'#|\(', '', string).strip() for string in soup.select_one('li:contains("' + field + '")').stripped_strings][1].split(' in ')
            temp_dict[field] = item
        else:
            item = [string for string in element.stripped_strings][1]
            temp_dict[field] = item.replace('(', '').strip()

ranks = soup.select('.zg_hrsr_rank')
ladders = soup.select('.zg_hrsr_ladder')

if ranks:
    cat_nos = [item.text.split('#')[1] for item in ranks]
else:
    cat_nos = ['NA']

if ladders:
    cats = [item.text.split('\xa0')[1] for item in soup.select('.zg_hrsr_ladder')]
else:
    cats = ['NA']

rankings = dict(zip(cat_nos, cats))

map_dict = {'Amazon Best Sellers Rank': ['R1_NO', 'R1_CAT']}

final_dict = {}
final_dict['R2_NO'] = 'NA'
final_dict['R2_CAT'] = 'NA'
final_dict['R3_NO'] = 'NA'
final_dict['R3_CAT'] = 'NA'
final_dict['R4_NO'] = 'NA'
final_dict['R4_CAT'] = 'NA'

for k, v in temp_dict.items():
    if k == 'Amazon Best Sellers Rank' and v != 'NA':
        item = dict(zip(map_dict[k], v))
        final_dict = {**final_dict, **item}
    elif k == 'Amazon Best Sellers Rank' and v == 'NA':
        item = dict(zip(map_dict[k], [v, v]))
        final_dict = {**final_dict, **item}
    else:
        final_dict[map_dict[k]] = v

for k, v in enumerate(rankings):
    # print(k + 1, v, rankings[v])
    prefix = 'R' + str(k + 2) + '_'
    final_dict[prefix + 'NO'] = v
    final_dict[prefix + 'CAT'] = rankings[v]
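I would like it to handle both URLs posted in the question and return the values for each.
Solution
Because the HTML layout differs between the two pages, taking the second stripped string returns the inline CSS instead of the rank. You could shorten the code and use a regex on the element's text instead. The regex could be tightened, but I'll wait to see whether you find a failing case first.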
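To see the idea in isolation, here is a minimal sketch that runs the regex over a plain string instead of a live page. The `sample` text is made up to resemble what `element.text` might contain for the second URL (inline CSS first, then the ranks), and `parse_ranks` is a hypothetical helper name, not part of bs4. Unlike the full script, this sketch also strips the thousands separators from the rank.

```python
import re

# "#<rank> in <category>": the rank may contain commas, the category stops
# at the first character that is not a letter, "&" or whitespace (e.g. "(").
RANK_RE = re.compile(r'#([0-9][0-9,]*)\s+in\s+([A-Za-z&\s]+)')


def parse_ranks(text):
    """Return {'R1_NO': ..., 'R1_CAT': ..., 'R2_NO': ...} for every rank found."""
    result = {}
    for i, (number, category) in enumerate(RANK_RE.findall(text), start=1):
        result[f'R{i}_NO'] = number.replace(',', '')  # assumption: plain digits wanted
        result[f'R{i}_CAT'] = category.strip()
    return result


# Hypothetical text, shaped like the list item that carries inline CSS.
sample = (
    ".zg_hrsr { margin: 0; padding: 0; list-style-type: none; }\n"
    "Amazon Best Sellers Rank: #42,553 in Health & Household "
    "(See Top 100 in Health & Household)\n"
    "#6,452 in Baby Bathing Products"
)

print(parse_ranks(sample))
```

Because the pattern anchors on `#`, the CSS rules and the "See Top 100 in ..." link text are skipped without any special handling.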
import re

import requests
from bs4 import BeautifulSoup as bs

links = [
    'https://www.amazon.com/dp/B00FSCBQV2?th=1',
    'https://www.amazon.com/dp/B01N1ZD912/',
    'https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4',
    'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20',
]
map_dict = {'Product Dimensions': 'dimensions', 'Shipping Weight': 'weight', 'Item model number': 'Item_No', 'Amazon Best Sellers Rank': ['R1_NO', 'R1_CAT']}

# Matches "#<rank> in <category>"; handles rankings from 1 to x,xxx,xxx
p = re.compile(r'#([0-9][0-9,]*)\s+in\s+([A-Za-z&\s]+)')

with requests.Session() as s:
    for link in links:
        # The User-Agent needs a forward slash; 'Mozilla\5.0' would embed a control character
        r = s.get(link, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']
        final_dict = {}
        for field in fields:
            # Newer soupsieve releases deprecate :contains in favour of :-soup-contains
            element = soup.select_one('li:contains("' + field + '")')
            if element is None:
                if field == 'Amazon Best Sellers Rank':
                    item = dict(zip(map_dict[field], ['N/A', 'N/A']))
                    final_dict = {**final_dict, **item}
                else:
                    final_dict[map_dict[field]] = 'N/A'
            else:
                if field == 'Amazon Best Sellers Rank':
                    # Search the whole element text so inline CSS before the rank is skipped
                    text = element.text
                    i = 1
                    for x, y in p.findall(text):
                        prefix = 'R' + str(i) + '_'
                        final_dict[prefix + 'NO'] = x
                        final_dict[prefix + 'CAT'] = y.strip()
                        i += 1
                else:
                    item = [string for string in element.stripped_strings][1]
                    final_dict[map_dict[field]] = item.replace('(', '').strip()
        print(final_dict)
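The code in the question pre-fills `R2_NO` through `R4_CAT` with `'NA'` so that every product ends up with the same keys. If you still want that with the regex approach, one option is to pad the result afterwards. This is only a sketch: `pad_ranks`, `max_ranks` and `missing` are names made up for illustration, and four rank slots is an assumption taken from the question's code.

```python
def pad_ranks(parsed, max_ranks=4, missing='NA'):
    """Return a dict with R1..R<max_ranks> NO/CAT keys, filling gaps with `missing`."""
    padded = {}
    for i in range(1, max_ranks + 1):
        for suffix in ('NO', 'CAT'):
            key = f'R{i}_{suffix}'
            # Keep the scraped value when present, otherwise use the placeholder.
            padded[key] = parsed.get(key, missing)
    return padded


# Example with only one rank found (values are illustrative).
row = pad_ranks({'R1_NO': '42553', 'R1_CAT': 'Health & Household'})
print(row)
```

Keys outside the `R<n>_NO` / `R<n>_CAT` pattern (such as `dimensions`) are dropped by this helper, so merge it back with something like `{**final_dict, **pad_ranks(final_dict)}` if you need those fields too.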