Why doesn't BeautifulSoup scrape the whole web page?

Problem description

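Premise: I'm completely new to Python and web scraping. I'm trying to scrape data about brands from this page: https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/, but BeautifulSoup only extracts the HTML up to a certain point. There doesn't seem to be anything unusual in the HTML at that point, because the five nearly identical tags before the one where BeautifulSoup stops are extracted without any problem.

I have already tried three different parsers (the built-in one, lxml, and html5lib), but I always get the same result.

Here is the code: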

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/")
soup = BeautifulSoup(page.content, 'html5lib')
print(soup.prettify())
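
When printed output looks cut off, it may help to count the elements each parser actually finds instead of reading the prettified HTML by eye. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical stand-in for `page.content`, the helper `count_matches` is made up for this example, and the selector `div.brand-name` is an assumption about the page's markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for page.content; the real page's markup may differ.
SAMPLE_HTML = """
<html><body>
  <div class="brand-name" title="Brand name: Apple"></div>
  <div class="brand-name" title="Brand name: Google"></div>
  <div class="brand-name" title="Brand name: Amazon"></div>
</body></html>
"""


def count_matches(html, selector, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of elements matching selector}.

    Parsers that are not installed are skipped rather than raising.
    """
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
        counts[parser] = len(soup.select(selector))
    return counts


counts = count_matches(SAMPLE_HTML, "div.brand-name")
print(counts)
```

If every parser reports the same count and that count matches what the browser shows, the parsed tree is probably complete and only the displayed text is being truncated. If the counts are lower than expected, the missing elements may not be present in the downloaded HTML at all.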

Tags: python, web-scraping, beautifulsoup

Solution


Use CSS selectors to get the output.

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/")
soup = BeautifulSoup(page.content, 'lxml')

# Each list collects the title attribute of the matching <div> elements.
Brand = []
Country = []
Region = []
Sector = []

for brnd in soup.select('div.brand-name'):
    Brand.append(brnd['title'])

for region in soup.select('div.brand-region'):
    Region.append(region['title'])

for country in soup.select('div.brand-country'):
    Country.append(country['title'])

for sector in soup.select('div.brand-sector'):
    Sector.append(sector['title'])

print(Brand)
print(Region)
print(Country)
print(Sector)

Output:

['Brand name: Apple', 'Brand name: Google', 'Brand name: Amazon', 'Brand name: Microsoft', 'Brand name: Coca-Cola', 'Brand name: Samsung', 'Brand name: Toyota', 'Brand name: Mercedes-Benz', 'Brand name: Facebook', "Brand name: McDonald's", 'Brand name: Intel', 'Brand name: IBM', 'Brand name: BMW', 'Brand name: Disney', 'Brand name: Cisco', 'Brand name: GE', 'Brand name: Nike', 'Brand name: Louis Vuitton', 'Brand name: Oracle', 'Brand name: Honda', 'Brand name: SAP', 'Brand name: Pepsi', 'Brand name: Chanel', 'Brand name: American Express', 'Brand name: Zara', 'Brand name: J.P. Morgan', 'Brand name: IKEA', 'Brand name: Gillette', 'Brand name: UPS', 'Brand name: H&M', 'Brand name: Pampers', 'Brand name: Hermès', 'Brand name: Budweiser', 'Brand name: Accenture', 'Brand name: Ford', 'Brand name: Hyundai', 'Brand name: NESCAFÉ', 'Brand name: eBay', 'Brand name: Gucci', 'Brand name: Nissan', 'Brand name: Volkswagen', 'Brand name: Audi', 'Brand name: Philips', 'Brand name: Goldman Sachs', 'Brand name: Citi', 'Brand name: HSBC', 'Brand name: AXA', "Brand name: L'Oréal", 'Brand name: Allianz', 'Brand name: adidas', 'Brand name: Adobe', 'Brand name: Porsche', "Brand name: Kellogg's", 'Brand name: HP', 'Brand name: Canon', 'Brand name: Siemens', 'Brand name: Starbucks', 'Brand name: Danone', 'Brand name: Sony', 'Brand name: 3M', 'Brand name: Visa', 'Brand name: Nestlé', 'Brand name: Morgan Stanley', 'Brand name: Colgate', 'Brand name: Hewlett Packard Enterprise', 'Brand name: Netflix', 'Brand name: Cartier', 'Brand name: Huawei', 'Brand name: Banco Santander', 'Brand name: Mastercard', 'Brand name: Kia', 'Brand name: FedEx', 'Brand name: PayPal', 'Brand name: LEGO', 'Brand name: Salesforce.com', 'Brand name: Panasonic', 'Brand name: Johnson & Johnson', 'Brand name: Land Rover', 'Brand name: DHL', 'Brand name: Ferrari', 'Brand name: Discovery', 'Brand name: Caterpillar', 'Brand name: Tiffany & Co.', "Brand name: Jack Daniel's", 'Brand name: Corona', 'Brand name: KFC', 'Brand name: Heineken', 'Brand name: John Deere', 'Brand name: Shell', 'Brand name: MINI', 'Brand name: Dior', 'Brand name: Spotify', 'Brand name: Harley-Davidson', 'Brand name: Burberry', 'Brand name: Prada', 'Brand name: Sprite', 'Brand name: Johnnie Walker', 'Brand name: Hennessy', 'Brand name: Nintendo', 'Brand name: Subaru']
['Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Asia Pacific', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Asia Pacific', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & 
Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: Asia Pacific']
['Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: South Korea', 'Country: Japan', 'Country: Germany', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: Germany', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: France', 'Country: United States', 'Country: Japan', 'Country: Germany', 'Country: United States', 'Country: France', 'Country: United States', 'Country: Spain', 'Country: United States', 'Country: Sweden', 'Country: United States', 'Country: United States', 'Country: Sweden', 'Country: United States', 'Country: France', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: South Korea', 'Country: Switzerland', 'Country: United States', 'Country: Italy', 'Country: Japan', 'Country: Germany', 'Country: Germany', 'Country: Netherlands', 'Country: United States', 'Country: United States', 'Country: United Kingdom', 'Country: France', 'Country: France', 'Country: Germany', 'Country: Germany', 'Country: United States', 'Country: Germany', 'Country: United States', 'Country: United States', 'Country: Japan', 'Country: Germany', 'Country: United States', 'Country: France', 'Country: Japan', 'Country: United States', 'Country: United States', 'Country: Switzerland', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: France', 'Country: China', 'Country: Spain', 'Country: United States', 'Country: South Korea', 'Country: United States', 'Country: United States', 'Country: Denmark', 'Country: United States', 'Country: Japan', 'Country: United States', 'Country: United Kingdom', 'Country: United States', 'Country: Italy', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: Mexico', 'Country: United States', 'Country: Netherlands', 'Country: United States', 'Country: Netherlands', 'Country: United Kingdom', 'Country: France', 'Country: Sweden', 'Country: United States', 'Country: United Kingdom', 'Country: Italy', 'Country: United States', 'Country: United Kingdom', 'Country: France', 'Country: Japan', 'Country: Japan']
['Sector: Technology', 'Sector: Technology', 'Sector: Retail', 'Sector: Technology', 'Sector: Beverages', 'Sector: Technology', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Technology', 'Sector: Restaurants', 'Sector: Technology', 'Sector: Business Services', 'Sector: Automotive', 'Sector: Media', 'Sector: Technology', 'Sector: Diversified', 'Sector: Sporting Goods', 'Sector: Luxury', 'Sector: Technology', 'Sector: Automotive', 'Sector: Technology', 'Sector: Beverages', 'Sector: Luxury', 'Sector: Financial Services', 'Sector: Apparel', 'Sector: Financial Services', 'Sector: Retail', 'Sector: FMCG', 'Sector: Logistics', 'Sector: Apparel', 'Sector: FMCG', 'Sector: Luxury', 'Sector: Alcohol', 'Sector: Business Services', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Beverages', 'Sector: Retail', 'Sector: Luxury', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Electronics', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Financial Services', 'Sector: Sporting Goods', 'Sector: Technology', 'Sector: Automotive', 'Sector: FMCG', 'Sector: Technology', 'Sector: Electronics', 'Sector: Diversified', 'Sector: Restaurants', 'Sector: FMCG', 'Sector: Electronics', 'Sector: Diversified', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Technology', 'Sector: Media', 'Sector: Luxury', 'Sector: Technology', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: Automotive', 'Sector: Logistics', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Business Services', 'Sector: Electronics', 'Sector: FMCG', 'Sector: Automotive', 'Sector: Logistics', 'Sector: Automotive', 'Sector: Media', 'Sector: Diversified', 'Sector: Luxury', 'Sector: Alcohol', 'Sector: Alcohol', 'Sector: Restaurants', 'Sector: Alcohol', 'Sector: Diversified', 'Sector: Energy', 'Sector: Automotive', 'Sector: Luxury', 'Sector: Media', 'Sector: Automotive', 'Sector: Luxury', 'Sector: Luxury', 'Sector: Beverages', 'Sector: Alcohol', 'Sector: Alcohol', 'Sector: Electronics', 'Sector: Automotive']
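
The four lists above are parallel, and every value carries a label prefix such as "Brand name: ". As a possible follow-up, the sketch below strips the prefix and zips the lists into one row per brand. It is an illustration under assumptions: `SAMPLE_HTML` is a hypothetical two-brand excerpt (the wrapping `row` class is invented), and the helper `titles` is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt that mimics the title attributes seen in the output;
# the real page's surrounding markup may differ.
SAMPLE_HTML = """
<div class="row">
  <div class="brand-name" title="Brand name: Apple"></div>
  <div class="brand-region" title="Region: The Americas"></div>
  <div class="brand-country" title="Country: United States"></div>
  <div class="brand-sector" title="Sector: Technology"></div>
</div>
<div class="row">
  <div class="brand-name" title="Brand name: Samsung"></div>
  <div class="brand-region" title="Region: Asia Pacific"></div>
  <div class="brand-country" title="Country: South Korea"></div>
  <div class="brand-sector" title="Sector: Technology"></div>
</div>
"""


def titles(soup, selector):
    """Collect the title attribute of every match, dropping the 'Label: ' prefix."""
    return [tag["title"].split(": ", 1)[-1] for tag in soup.select(selector)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# zip pairs the n-th brand with the n-th region, country, and sector.
rows = list(zip(
    titles(soup, "div.brand-name"),
    titles(soup, "div.brand-region"),
    titles(soup, "div.brand-country"),
    titles(soup, "div.brand-sector"),
))

for row in rows:
    print(row)
```

Note that zip stops at the shortest list, so this pairing is only reliable when every brand has all four elements.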
