Web scraping with Beautiful Soup - how do I get all the categories?

Problem description

How can I get all the categories mentioned on each listing page of the website "https://www.sfma.org.sg/member/category"? For example, when I select the Alcoholic Beverage category on that page, the listings shown there carry category information like this:

Catergory: Alcoholic Beverage, Bottled Beverage, Spirit / Liquor / Hard Liquor, Wine, Distributor, Exporter, Importer, Supplier

How can I extract the categories mentioned here using the same variable?

The code I have written for this is:

category = soup_2.find_all('a', attrs={'class': 'plink'})
links = [link['href'] for link in category]

But it is producing the following output, which is every link on the page rather than the text from the href:

['http://www.sfma.org.sg/about/singapore-food-manufacturers-association',
 'http://www.sfma.org.sg/about/council-members',
 'http://www.sfma.org.sg/about/history-and-milestones',
 'http://www.sfma.org.sg/membership/',
 'http://www.sfma.org.sg/member/',
 'http://www.sfma.org.sg/member/alphabet/',
 'http://www.sfma.org.sg/member/category/',
 'http://www.sfma.org.sg/resources/sme-portal',
 'http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore',
 'http://www.sfma.org.sg/resources/import-export-requirements-and-procedures',
 'http://www.sfma.org.sg/resources/labelling-guidelines',
 'http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes',
 'http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard',
 'http://www.sfma.org.sg/resources/p-max',
 'http://www.sfma.org.sg/event/',
  .....]

Please excuse me if this seems like a newbie question; I am very new to Python.

Thanks!!!

Tags: python, html, python-3.x, web-scraping, beautifulsoup

Solution


If you just want the links from the results you have already posted, you can get them like this:

import requests
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')

# Every navigation link on this page carries the "plink" class,
# so this prints the href of each one.
links = soup.find_all('a', attrs={'class': 'plink'})
for link in links:
    print(link['href'])

Output:

../info/{{permalink}}
http://www.sfma.org.sg/about/singapore-food-manufacturers-association
http://www.sfma.org.sg/about/council-members
http://www.sfma.org.sg/about/history-and-milestones
http://www.sfma.org.sg/membership/
http://www.sfma.org.sg/member/
http://www.sfma.org.sg/member/alphabet/
http://www.sfma.org.sg/member/category/
http://www.sfma.org.sg/resources/sme-portal
http://www.sfma.org.sg/resources/setting-up-food-establishments-in-singapore
http://www.sfma.org.sg/resources/import-export-requirements-and-procedures
http://www.sfma.org.sg/resources/labelling-guidelines
http://www.sfma.org.sg/resources/wsq-continuing-education-modular-programmes
http://www.sfma.org.sg/resources/holistic-industry-productivity-scorecard
http://www.sfma.org.sg/resources/p-max
http://www.sfma.org.sg/event/
http://www.sfma.org.sg/news/
http://www.fipa.com.sg/
http://www.sfma.org.sg/stp
http://www.sgfoodgifts.sg/
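
As a side note, if what you were after in your own snippet was the visible text of each plink anchor rather than its href, BeautifulSoup's get_text() gives you that. A minimal sketch along the same lines as the snippet above:

import requests
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
soup = BeautifulSoup(requests.get(page).content, 'html.parser')

# Print the visible text of each "plink" anchor instead of its href attribute.
for link in soup.find_all('a', attrs={'class': 'plink'}):
    print(link.get_text(strip=True))

Judging by the hrefs above, though, those anchors are the site's navigation menu, so the text will be menu labels rather than the member categories you are ultimately after.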

However, if you want a link for every entry on the site, you need to join the permalink value onto the base URL. I have extended this answer from nag to help pull the data you want from the site you are looking at. Some permalink values turned up in a second list and do not work (they are food/drink types, not companies), so I removed them.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')

url_list = []
pattern = re.compile(r'permalink:\'(.*?)\'')

# The member entries are embedded in <script> blocks, so pull the
# permalink values out of each script's text with a regex.
for script in soup.find_all('script'):
    if not script.contents:
        continue
    txt = script.contents[0]
    for permalink in pattern.findall(txt):
        # The page's own template link is "../info/{{permalink}}", so fill
        # in the permalink and resolve it against the page URL.
        full_url = urljoin(page, "../info/" + permalink)
        if full_url in url_list:
            # Values that turn up a second time come from the second list
            # (food/drink types, not companies), so drop them entirely.
            url_list.remove(full_url)
        else:
            url_list.append(full_url)

for url in url_list:
    print(url)

Output (truncated):

https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
https://www.sfma.org.sg/member/info/a-linkz-marketing-pte-ltd
https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd
https://www.sfma.org.sg/member/info/abb-pte-ltd
https://www.sfma.org.sg/member/info/ace-synergy-international-pte-ltd
https://www.sfma.org.sg/member/info/acez-instruments-pte-ltd
https://www.sfma.org.sg/member/info/acorn-investments-holding-pte-ltd
https://www.sfma.org.sg/member/info/ad-wright-communications-pte-ltd
https://www.sfma.org.sg/member/info/added-international-s-pte-ltd
https://www.sfma.org.sg/member/info/advance-carton-pte-ltd
https://www.sfma.org.sg/member/info/agroegg-pte-ltd
https://www.sfma.org.sg/member/info/airverclean-pte-ltd
...
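
From there, if you want the category line the question quotes (e.g. "Catergory: Alcoholic Beverage, ..."), one option is to fetch one of the member URLs produced above and search its visible text. This is only a rough sketch, under the assumption that the line appears in the served HTML exactly as quoted in the question; if the info pages also build their content from embedded script data, you would need to parse that instead:

import re
import requests
from bs4 import BeautifulSoup

# One of the member URLs produced by the script above.
member_url = "https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd"
member_soup = BeautifulSoup(requests.get(member_url).content, 'html.parser')

# Look for the "Catergory: ..." line in the page's visible text
# (spelled "Catergory" in the question's quote). This assumes the line
# is present in the HTML as served and not rendered by JavaScript.
text = member_soup.get_text(separator='\n')
match = re.search(r'Catergory:\s*(.+)', text)
if match:
    categories = [c.strip() for c in match.group(1).split(',')]
    print(categories)
else:
    print("Category line not found - the markup may differ.")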
