Unable to parse the website link from a webpage

Problem description

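I've written a script in Python using selenium to scrape the website address located under Contact details on a web page. The problem is that there is no URL associated with that link (although I can click it).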

How can I parse the website link located under Contact details?

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = 'https://www.truelocal.com.au/business/vitfit/sydney'

def get_website_link(driver, link):
    driver.get(link)
    # This span carries the visible link text but no href attribute.
    website = driver.find_element(By.CSS_SELECTOR, "[ng-class*='getHaveSecondaryWebsites'] > span").text
    print(website)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_website_link(driver,URL)
    finally:
        driver.quit()

When I run the script, I only get the visible text associated with that link, i.e. Visit website.

Tags: python, python-3.x, selenium, selenium-webdriver, web-scraping

Solution


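The element with the "Visit website" text is a span that has the JavaScript vm.openLink(vm.getReadableUrl(vm.getPrimaryWebsite()),'_blank') attached instead of an actual href. My suggestion: if your goal is scraping rather than testing, you can use the solution below, which uses the requests package to fetch the data as JSON so you can extract whatever information you need.
The other option is to actually click the element, as you did.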

import requests
import re

headers = {
    'Referer': 'https://www.truelocal.com.au/business/vitfit/sydney',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/73.0.3683.75 Safari/537.36',
    'DNT': '1',
}
# Fetch the site's JavaScript configuration file, which contains the API token.
response = requests.get('https://www.truelocal.com.au/www-js/configuration.constant.js?v=1552032205066',
                        headers=headers)
assert response.ok

# Extract the token from the response text.
token = re.search(r"token:\s'([^']*)'", response.text)[1]

headers['Accept'] = 'application/json, text/plain, */*'
headers['Origin'] = 'https://www.truelocal.com.au'

# Request the listing from the API, passing the token as a query parameter.
response = requests.get(f'https://api.truelocal.com.au/rest/listings/vitfit/sydney?passToken={token}', headers=headers)
assert response.ok
# Inspect response.text to see the full JSON and what else can be extracted.

contact = response.json()["data"]["listing"][0]["contacts"]["contact"]
# Take the first contact entry whose type is "website".
website = next(c["value"] for c in contact if c["type"] == "website")
print(website)
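
The other option mentioned earlier, actually clicking the element, could look roughly like the sketch below. It assumes the click opens the business website in a new tab (the '_blank' argument suggests this, but it has not been verified against the live page). The function name get_website_by_click and the timeout and poll parameters are illustrative and not part of the original answer. The string "css selector" is the locator strategy that Selenium's By.CSS_SELECTOR stands for, and the selector itself is the one from the question.

```python
import time

# Locator strategy string; this is the value behind Selenium's By.CSS_SELECTOR.
CSS = "css selector"
# Selector taken from the question's script.
LINK_SELECTOR = "[ng-class*='getHaveSecondaryWebsites'] > span"


def get_website_by_click(driver, page_url, timeout=10.0, poll=0.2):
    """Click the 'Visit website' span and return the URL of the tab it opens.

    Returns None if no new tab with a real URL appears within `timeout` seconds.
    """
    driver.get(page_url)
    original = driver.current_window_handle
    known = set(driver.window_handles)

    driver.find_element(CSS, LINK_SELECTOR).click()

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Any handle that was not present before the click is the new tab.
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            driver.switch_to.window(new_handles[0])
            url = driver.current_url
            # A freshly opened tab may briefly report about:blank.
            if url and url != "about:blank":
                driver.close()
                driver.switch_to.window(original)
                return url
        time.sleep(poll)

    # Timed out: go back to the listing page's tab before giving up.
    driver.switch_to.window(original)
    return None
```

With a real browser this would be called as get_website_by_click(webdriver.Chrome(), URL), followed by driver.quit() as in the question. Note that the returned address is whatever the browser ends up on after any redirects, which may differ from the value stored in the listing data.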
