How do I get all values of the same class with XPath?

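Question

I'm trying to learn scraping with XPath, but I can't get it to work.

When I use the XPath Helper extension in Chrome, I get data like this: about 99 ports, the last of which is "$PORT".

(Screenshot of the XPath information)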

import requests
import csv
from lxml import etree

url = 'https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
source = requests.get(url,headers=headers).content.decode('UTF-8')

html = etree.HTML(source)

portList = html.xpath('//*[@class="cr-city-name"]')

for port in portList:
    print(port.xpath('string()'))

With this code, only "$PORT" is returned. Why can't I get the other 98 ports with this XPath?

Tags: python, xpath, scrapy, python-requests

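Answer

JavaScript populates the page dynamically from JSON data, so the port elements do not exist in the HTML that requests downloads; only the "$PORT" template placeholder does. The JSON is not fetched via XHR, though. It is embedded in the HTML itself, so you can extract it with a regex and convert it to a dictionary.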

import ast
import re
import requests

url = 'https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
response = requests.get(url, headers=headers)

# Extract the body of the JavaScript object assigned to _ports from the HTML.
json_data = re.findall(r"_ports = \{\n\s\s(.+?)\n\s\s\};", response.text)

# Convert the string to a dictionary. ast.literal_eval accepts only literals,
# so unlike eval() it cannot run code embedded in the downloaded page.
json_data = ast.literal_eval('{' + json_data[0] + '}')

print(json_data.values())

Output:

dict_values(['Aqaba, Jordan', 'Valencia, Spain', 'Cairs, Australia', 'Venice, Italy', 'Shekou, China', 'Shanghai, China', 'Goteborg, Sweden', 'Darwin, Australia', 'George Town, Cayman Islands', 'Siracusa, Italy', 'Genoa, Italy', 'Reykjavik, Iceland', 'Havana, Cuba', 'Singapore, Republic of Singapore', 'Arica, Chile', 'Hamburg,Germany', 'Kusadasi, Turkey', 'Yokohama, Japan', 'Valparaiso,Chile', 'Copenhagen, Denmark', 'Civitavecchia, Italy', 'Barcelona, Spain', 'Auckland, New Zealand', 'Livorno, Italy', 'Montevideo, Uruguay', 'Brindisi, Italy', 'Kiel,Germany', 'San Juan, Puerto Rico', 'Callao, Peru', 'Funchal, Portugal', 'Haifa, Israel', 'Lisbon, Portugal', 'Papeete, Tahiti', 'Trieste, Italy', 'Piraeus, Greece', 'Rio de Janeiro, Brazil', 'Keelung, Taiwan', 'Buenos Aires, Argentina', 'New York, United States', 'Salvador, Brazil', 'Tianjin, China', 'Valletta, Malta', 'Santos, Brazil', 'Cannes, France', 'Naples, Italy', 'Fukuoka, Japan', 'Ushuaia,Argentina', 'Philipsburg, St. Maarten', 'Zeebrugge, Belgium', 'Durban, South Africa', 'Istanbul, Turkey', 'Cagliari, Italy', 'Vigo, Spain', 'Dubai,U.Arab Emirates', 'Amsterdam, Netherlands', 'Tampa, United States', 'Doha, Qatar', 'Abu Dhabi,U.Arab Emirates', 'Itajai, Brazil', 'Port Kembla, Australia', 'Tokyo, Japan', 'Cartagena, Spain', 'Nassau, Bahamas', 'Messina, Italy', 'Benoa/Bali, Indonesia', 'Nansha,China', 'Heraklion, Greece', 'Mumbai/Bombay, India', 'Muscat, Oman', 'Wellington, New Zealand', 'Warnemunde,Germany', 'Fort de France, Martinique', 'Isafjordur, Iceland', 'Bridgetown, Barbados', 'Marseille, France', 'Sydney, Australia', 'Miami, Florida', 'Cozumel, Mexico', 'Rotterdam, Netherlands', 'Izmir, Turkey', 'Cape Town, South Africa', 'Qingdao, China', 'Palma de Mallorca, Spain', 'San Francisco, United states', 'Hobart, Australia', 'Malaga, Spain', 'Palermo, Italy', 'St Nazaire, France', 'Mindelo, Cape Verde', 'Pointe-a-Pitre, Guadeloupe', 'Hong Kong,Hong Kong', 'Le Havre, France', 'Ocean Cay MSC Marine Reserve', 'St Petersburg, Russian Fed.', 'Ilhabela, Brazil', 'Ancona, Italy', ......., 'Corner Brook, Canada', 'Brunsbuttel,Germany', 'Newcastle, Australia', 'Busan, Korea, Republic of', 'Maputo, Mozambique'])
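
To see why the XPath from the question finds only "$PORT", it can help to reproduce the situation offline. The following is a minimal sketch under stated assumptions: SAMPLE_PAGE is a made-up, heavily simplified stand-in for the real page, the port codes used as keys are hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml so the snippet runs without third-party packages. The element query sees only the template placeholder, while the regex recovers the names from the embedded object.

```python
import ast
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw HTML returned by requests (not the real page).
SAMPLE_PAGE = """<html>
<body>
<div id="template"><span class="cr-city-name">$PORT</span></div>
<script>
var _ports = {
  "AQJ": "Aqaba, Jordan", "VLC": "Valencia, Spain", "VCE": "Venice, Italy"
  };
</script>
</body>
</html>"""


def static_port_nodes(page):
    """Return the text of every element with class cr-city-name in the static markup."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(".//*[@class='cr-city-name']")]


def embedded_ports(page):
    """Return the _ports object embedded in a script tag as a dict ({} if absent)."""
    match = re.search(r"_ports = \{\n\s\s(.+?)\n\s\s\};", page)
    if match is None:
        return {}
    # literal_eval parses literals only; it does not execute code.
    return ast.literal_eval("{" + match.group(1) + "}")


print(static_port_nodes(SAMPLE_PAGE))
# ['$PORT']
print(list(embedded_ports(SAMPLE_PAGE).values()))
# ['Aqaba, Jordan', 'Valencia, Spain', 'Venice, Italy']
```

The regex depends on the exact whitespace around the object, so it would need adjusting if the site changes how it prints `_ports`.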

Alternatively, you can use Selenium with ChromeDriver, which runs the JavaScript and gives you the rendered HTML. You can then parse that HTML with lxml as before.

from selenium import webdriver
from lxml import etree

driver = webdriver.Chrome(executable_path=r"***YOUR_CHROME-DRIVER_PATH***")
driver.get('https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false')

html = etree.HTML(driver.page_source)
driver.quit()  # quit() ends the whole browser session, not just the current window

portList = html.xpath('//*[@class="cr-city-name"]')

for port in portList:
    print(port.xpath('string()'), end=' | ')

Output:

Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | $PORT | 

You can download ChromeDriver from here.
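
The Selenium output above still ends with the "$PORT" template placeholder and repeats the same names for every itinerary. If only the distinct values are wanted, a small post-processing step along these lines might help. This is a sketch, not part of the original answer: clean_ports is a hypothetical helper, and the sample list is a shortened excerpt of the output shown above.

```python
def clean_ports(names, placeholder="$PORT"):
    """Drop the template placeholder and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for name in names:
        name = name.strip()
        # Skip empty strings, the unrendered placeholder, and repeats.
        if not name or name == placeholder or name in seen:
            continue
        seen.add(name)
        cleaned.append(name)
    return cleaned


# Shortened excerpt of the values printed by the Selenium example.
scraped = [
    "Marseille, France", "Genoa, Italy", "Marseille, France",
    "MSC Grandiosa", "Genoa, Italy", "$PORT",
]
print(clean_ports(scraped))
# ['Marseille, France', 'Genoa, Italy', 'MSC Grandiosa']
```

Ship names such as "MSC Grandiosa" share the cr-city-name class, so they survive this filter; separating them from ports would need a more specific XPath.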
