Getting the rating from aria-label with Beautiful Soup

Question

I have a soup object, for example:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.yelp.com/biz/panera-bread-markham')
soup = BeautifulSoup(r.text, 'html.parser')

I am trying to find the ratings with the following code:

rating_list = soup.find_all('span', {"class":"lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"})
rating_list

The output is a list like this:

[<span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><div aria-label="3 star rating" class="lemon--div__373c0__1mboc i-stars__373c0__Y2F3O i-stars--large-3__373c0__2oM4P border-color--default__373c0__2oFDT overflow--hidden__373c0__8Jq2I" role="img"><img alt="" class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars.yelp_design_web.yji-9bec2045845c24d3bff3ddb582884eda.png" width="132"/></div></span>,
 <span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><div aria-label="4 star rating" class="lemon--div__373c0__1mboc i-stars__373c0__Y2F3O i-stars--regular-4__373c0__3acau border-color--default__373c0__2oFDT overflow--hidden__373c0__8Jq2I" role="img"><img alt="" class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars.yelp_design_web.yji-9bec2045845c24d3bff3ddb582884eda.png" width="132"/></div></span>,
 <span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><div aria-label="5 star rating" class="lemon--div__373c0__1mboc i-stars__373c0__Y2F3O i-stars--regular-5__373c0__ySHIl border-color--default__373c0__2oFDT overflow--hidden__373c0__8Jq2I" role="img"><img alt="" class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars.yelp_design_web.yji-9bec2045845c24d3bff3ddb582884eda.png" width="132"/></div></span>,
 <span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><div aria-label="3 star rating" class="lemon--div__373c0__1mboc i-stars__373c0__Y2F3O i-stars--regular-3__373c0__1DXMK border-color--default__373c0__2oFDT overflow--hidden__373c0__8Jq2I" role="img"><img alt="" class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars.yelp_design_web.yji-9bec2045845c24d3bff3ddb582884eda.png" width="132"/></div></span>,
 <span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><p class="lemon--p__373c0__3Qnnj text__373c0__2pB8f text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_ text-size--small__373c0__3SGMi"><span aria-hidden="true" class="lemon--span__373c0__3997G icon__373c0__ehCWV icon--18-check-in" style="width:18px;height:18px;fill:#0077bc"><svg class="icon_svg" height="18" viewbox="0 0 18 18" width="18" xmlns="http://www.w3.org/2000/svg"><path d="M18 9l-2.136-1.84.932-2.66-2.772-.525-.524-2.77-2.66.93L8.997 0 7.163 2.136 4.5 1.206l-.525 2.77-2.77.524.932 2.66L0 9l2.137 1.84-.932 2.66 2.77.525.526 2.77 2.664-.932L8.998 18l1.84-2.137 2.662.932.524-2.77 2.772-.524-.932-2.66L18 9zm-9.85 3.23L5.324 9.4l1.13-1.13 1.698 1.696 3.396-3.395 1.13 1.134-4.525 4.525z"></path></svg></span> <!-- -->1 check-in</p></span>,
 <span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><div aria-label="1 star rating" class="lemon--div__373c0__1mboc i-stars__373c0__Y2F3O i-stars--regular-1__373c0__14nrQ border-color--default__373c0__2oFDT overflow--hidden__373c0__8Jq2I" role="img"><img alt="" class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" height="560" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars.yelp_design_web.yji-9bec2045845c24d3bff3ddb582884eda.png" width="132"/></div></span>,
 <span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><p class="lemon--p__373c0__3Qnnj text__373c0__2pB8f text-color--mid__373c0__3G312 text-align--left__373c0__2pnx_ text-size--small__373c0__3SGMi"><span aria-hidden="true" class="lemon--span__373c0__3997G icon__373c0__ehCWV icon--18-check-in" style="width:18px;height:18px;fill:#0077bc"><svg class="icon_svg" height="18" viewbox="0 0 18 18" width="18" xmlns="http://www.w3.org/2000/svg"><path d="M18 9l-2.136-1.84.932-2.66-2.772-.525-.524-2.77-2.66.93L8.997 0 7.163 2.136 4.5 1.206l-.525 2.77-2.77.524.932 2.66L0 9l2.137 1.84-.932 2.66 2.77.525.526 2.77 2.664-.932L8.998 18l1.84-2.137 2.662.932.524-2.77 2.772-.524-.932-2.66L18 9zm-9.85 3.23L5.324 9.4l1.13-1.13 1.698 1.696 3.396-3.395 1.13 1.134-4.525 4.525z"></path></svg></span> <!-- -->1 check-in</p></span>,
         <span class="lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT"><div aria-label="1 star .....
    .
    .
    .

Any suggestions on how to extract the rating from <div aria-label="3 star rating" ?

Tags: web-scraping, beautifulsoup, text-mining

Solution


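There are actually several ways to do this, such as loading the JSON from the script tag or locating the div in question. But I think the following approach is clear :)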

import requests
from bs4 import BeautifulSoup


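def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Each review has a <meta itemprop="author"> tag holding the reviewer's name
    target = soup.find_all("meta", itemprop="author")
    for tar in target:
        # Judging by the output below, the next <meta> tag holds that review's rating
        print(tar['content'], tar.find_next("meta")['content'])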


main("https://www.yelp.com/biz/panera-bread-markham")

Output:

Shia L. 4.0
Ryan L. 5.0
Chi K. 3.0
Joan T. 1.0
Nicky D S. 4.0
Matthew K. 3.0
Michelle W. 1.0
Jennifer C. 4.0
Niral P. 3.0
Shajitha R. 1.0
Veronica C. 3.0
Tanveer K. 1.0
Joey J. 2.0
Broadwaygirl M. 1.0
Sheena Y. 3.0
Wendy B. 4.0
Jacqueline L. 2.0
Mi S. 3.0
Sharon M. 2.0
Eduni C. 1.0
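
For completeness, the other route mentioned above, locating the div and reading its aria-label directly, might look roughly like the sketch below. It runs against an abbreviated sample of the markup shown in the question rather than the live page, since Yelp's generated class names may change. The HTML sample, the RATING_RE pattern, and the extract_ratings helper are illustrative assumptions and not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Abbreviated sample modeled on the output in the question (assumption:
# the real page keeps the "<N> star rating" wording in aria-label).
HTML = '''
<span><div aria-label="3 star rating" role="img"></div></span>
<span><p>1 check-in</p></span>
<span><div aria-label="4 star rating" role="img"></div></span>
'''

# Matches labels such as "3 star rating" or "4.5 star rating"
RATING_RE = re.compile(r"^(\d+(?:\.\d+)?) star rating$")


def extract_ratings(html):
    """Return the numeric ratings found in aria-label attributes of divs."""
    soup = BeautifulSoup(html, "html.parser")
    ratings = []
    # Passing a compiled regex as the attribute value keeps only matching divs
    for div in soup.find_all("div", attrs={"aria-label": RATING_RE}):
        match = RATING_RE.match(div["aria-label"])
        ratings.append(float(match.group(1)))
    return ratings


print(extract_ratings(HTML))
```

Filtering on the aria-label attribute avoids depending on the long generated class names, and it skips spans such as the "1 check-in" entries, which contain no rating div.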
