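python - How do I get all values of the same class with XPath?
Problem description
I'm trying to learn scraping with XPath, but I can't get it to work.
When I use the XPath Helper extension in Chrome, I get the data I expect: about 99 ports, the last of which is "$PORT".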
import requests
import csv
from lxml import etree
url = 'https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
source = requests.get(url,headers=headers).content.decode('UTF-8')
html = etree.HTML(source)
portList = html.xpath('//*[@class="cr-city-name"]')
for port in portList:
    print(port.xpath('string()'))
With this code, only "$PORT" is returned. Why can't I get the other 98 ports with this XPath?
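Solution
The page's data is filled in dynamically by JavaScript from JSON, which is why the static HTML returned by requests contains only the "$PORT" template placeholder. The JSON is not fetched via XHR, however. It is embedded in the HTML itself, so you can extract it with a regular expression and convert it into a dictionary.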
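To see why the XPath query in the question matches only the placeholder, here is a minimal sketch run against a hypothetical, heavily simplified stand-in for the server response. The markup, the port codes, and the variable names are assumptions for illustration, and the standard library's ElementTree (with its limited XPath subset) is swapped in for lxml so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the static server response:
# one template node holding the "$PORT" placeholder, while the real
# port names live inside a <script> block that no parser executes.
STATIC_HTML = """
<html>
  <body>
    <div id="results">
      <span class="cr-city-name">$PORT</span>
    </div>
    <script>var _ports = {"GOA": "Genoa, Italy", "MRS": "Marseille, France"};</script>
  </body>
</html>
"""

root = ET.fromstring(STATIC_HTML)

# Same idea as the XPath in the question, written in ElementTree's subset.
nodes = root.findall(".//*[@class='cr-city-name']")
port_texts = [node.text for node in nodes]

print(port_texts)  # only the placeholder is present before JavaScript runs
```

A browser extension sees the other entries because it queries the DOM after the page's scripts have cloned and filled the template node.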
import re
import requests
url = 'https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
response = requests.get(url,headers=headers)
# Extract JSON from HTML.
json_data = re.findall(r"_ports = {\n\s\s(.+?)\n\s\s};", response.text)
# Convert String to Dictionary.
json_data = eval('{' + json_data[0] + '}')
print(json_data.values())
Output:
dict_values(['Aqaba, Jordan', 'Valencia, Spain', 'Cairs, Australia', 'Venice, Italy', 'Shekou, China', 'Shanghai, China', 'Goteborg, Sweden', 'Darwin, Australia', 'George Town, Cayman Islands', 'Siracusa, Italy', 'Genoa, Italy', 'Reykjavik, Iceland', 'Havana, Cuba', 'Singapore, Republic of Singapore', 'Arica, Chile', 'Hamburg,Germany', 'Kusadasi, Turkey', 'Yokohama, Japan', 'Valparaiso,Chile', 'Copenhagen, Denmark', 'Civitavecchia, Italy', 'Barcelona, Spain', 'Auckland, New Zealand', 'Livorno, Italy', 'Montevideo, Uruguay', 'Brindisi, Italy', 'Kiel,Germany', 'San Juan, Puerto Rico', 'Callao, Peru', 'Funchal, Portugal', 'Haifa, Israel', 'Lisbon, Portugal', 'Papeete, Tahiti', 'Trieste, Italy', 'Piraeus, Greece', 'Rio de Janeiro, Brazil', 'Keelung, Taiwan', 'Buenos Aires, Argentina', 'New York, United States', 'Salvador, Brazil', 'Tianjin, China', 'Valletta, Malta', 'Santos, Brazil', 'Cannes, France', 'Naples, Italy', 'Fukuoka, Japan', 'Ushuaia,Argentina', 'Philipsburg, St. Maarten', 'Zeebrugge, Belgium', 'Durban, South Africa', 'Istanbul, Turkey', 'Cagliari, Italy', 'Vigo, Spain', 'Dubai,U.Arab Emirates', 'Amsterdam, Netherlands', 'Tampa, United States', 'Doha, Qatar', 'Abu Dhabi,U.Arab Emirates', 'Itajai, Brazil', 'Port Kembla, Australia', 'Tokyo, Japan', 'Cartagena, Spain', 'Nassau, Bahamas', 'Messina, Italy', 'Benoa/Bali, Indonesia', 'Nansha,China', 'Heraklion, Greece', 'Mumbai/Bombay, India', 'Muscat, Oman', 'Wellington, New Zealand', 'Warnemunde,Germany', 'Fort de France, Martinique', 'Isafjordur, Iceland', 'Bridgetown, Barbados', 'Marseille, France', 'Sydney, Australia', 'Miami, Florida', 'Cozumel, Mexico', 'Rotterdam, Netherlands', 'Izmir, Turkey', 'Cape Town, South Africa', 'Qingdao, China', 'Palma de Mallorca, Spain', 'San Francisco, United states', 'Hobart, Australia', 'Malaga, Spain', 'Palermo, Italy', 'St Nazaire, France', 'Mindelo, Cape Verde', 'Pointe-a-Pitre, Guadeloupe', 'Hong Kong,Hong Kong', 'Le Havre, France', 'Ocean Cay MSC Marine Reserve', 'St Petersburg, Russian Fed.', 'Ilhabela, Brazil', 'Ancona, Italy', ......., 'Corner Brook, Canada', 'Brunsbuttel,Germany', 'Newcastle, Australia', 'Busan, Korea, Republic of', 'Maputo, Mozambique'])
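Because eval executes whatever text the page happens to contain, a more defensive variant of the regex approach may be worth considering. The sketch below swaps in ast.literal_eval, which accepts only Python literals, and runs against a hypothetical excerpt of the page source. The sample markup, the port codes used as keys, and the helper name extract_ports are assumptions, not taken from the real site.

```python
import ast
import re

# Hypothetical excerpt of the page source; the real layout may differ.
SAMPLE_HTML = '''<script>
var _ports = {
  "GOA": "Genoa, Italy", "MRS": "Marseille, France", "BCN": "Barcelona, Spain"
  };
</script>'''


def extract_ports(page_source):
    """Return the `_ports` mapping embedded in the page source as a dict."""
    match = re.search(r"_ports = \{\n\s\s(.+?)\n\s\s\};", page_source)
    if match is None:
        raise ValueError("_ports object not found in page source")
    # literal_eval parses literals only, so page content is never executed.
    return ast.literal_eval("{" + match.group(1) + "}")


ports = extract_ports(SAMPLE_HTML)
print(list(ports.values()))
```

If the embedded object turns out to be strict JSON, json.loads would be another option. Either way, checking for a missing match avoids an IndexError when the page layout changes.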
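Alternatively, you can use Selenium with ChromeDriver, which executes the JavaScript and renders the result into the HTML. You can then parse the rendered page source with lxml.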
from selenium import webdriver
from lxml import etree
driver = webdriver.Chrome(executable_path=r"***YOUR_CHROME-DRIVER_PATH***")
driver.get('https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false')
html = etree.HTML(driver.page_source)
driver.close()
portList = html.xpath('//*[@class="cr-city-name"]')
for port in portList:
    print(port.xpath('string()'), end=' | ')
Output:
Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | $PORT |
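The rendered output repeats each port once per itinerary and still ends with the "$PORT" template node. If a list of distinct names is the goal, a small post-processing step could look like the sketch below. The helper name unique_ports and the sample list are illustrative assumptions; the sample reuses a few values from the output above.

```python
def unique_ports(names, placeholder="$PORT"):
    """Drop the template placeholder and duplicates, keeping first-seen order."""
    cleaned = (name.strip() for name in names)
    # dict.fromkeys keeps insertion order (Python 3.7+), so duplicates
    # are removed without reordering the remaining names.
    return list(dict.fromkeys(n for n in cleaned if n and n != placeholder))


scraped = ["Marseille, France", "Genoa, Italy", "MSC Grandiosa",
           "Marseille, France", "Genoa, Italy", "$PORT"]
print(unique_ports(scraped))
```

Note that ship names such as MSC Grandiosa carry the same class as the ports, so they survive this step. Filtering them out would need a site-specific rule that is not shown here.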
You can download ChromeDriver from its official download page.