Parsing a URL with Python

Problem description

I want to parse the following URL: Espacenet link

I want to get the URL that corresponds to this text:

  1. Battery pack with a bus bar having a novel structure

I am using Python, but I am not familiar with JavaScript. How can I get this done?

So far, I have looked at requests_html and tried the following code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

publication_number_to_scrape = "EP2814089"
# Append the publication number once as the value of the query parameter
url = "https://worldwide.espacenet.com/searchResults?ST=singleline&locale=fr_EP&submitted=true&DB=&query=" + publication_number_to_scrape
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}

# create an HTML Session object
session = HTMLSession()

# Use the object above to connect to needed webpage
resp = session.get(url, headers=headers)
print(resp.content)

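# Run the page's JavaScript; render() updates resp.html in place and returns None
resp.html.render()

# Parse the rendered HTML rather than the original response body
soup = BeautifulSoup(resp.html.html, 'html.parser')
print(soup)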

In the printed output, I see this section:

</li>
<li class="bendractive"><a accesskey="b" href="">Liste de résultats</a></li>
<li class="bendr"><a accesskey="c" class="ptn" href="/mydocumentslist?submitted=true&amp;locale=fr_EP" id="menuPnStar">Ma liste de brevets (<span id="menuPnCount"></span>)</a></li>
<li class="bendr"><a accesskey="d" href="/queryHistory?locale=fr_EP">Historique des requêtes</a></li>
<li class="spacer"></li>
<li class="bendl"><a accesskey="e" href="/settings?locale=fr_EP">Paramètres</a></li>
<li class="bendl last">
<a accesskey="f" href="/help?locale=fr_EP&amp;method=handleHelpTopic&amp;topic=index">Aide</a>
</li>

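My goal is to get the following URL from the results: Wanted URL

My ultimate goal is to get a list containing one string for each document that appears at that URL:

Documents I need to parse into a list of strings

I do not need the URLs of those documents, only the following list: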

result = ['EP2814089 (A4)', 'EP2814089 (B1)', ....]

Tags: python, parsing

Solution

I think this will do the job:

import requests
from bs4 import BeautifulSoup

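# Cookies copied from a browser session; JSESSIONID is session-specific,
# so this value will probably need to be replaced with a current one
cookies = {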
    'JSESSIONID': '9ULYIsd9+RmCkgzGPoLdCWMP.espacenet_levelx_prod_1',
    'org.springframework.web.servlet.i18n.CookieLocaleResolver.LOCALE': 'fr_EP',
    'menuCurrentSearch': '%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'currentUrl': 'https%3A%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'PGS': '10',
}

headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    'Sec-Fetch-User': '?1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'tr,tr-TR;q=0.9',
}

params = (
    ('DB', ''),
    ('ST', 'singleline'),
    ('locale', 'fr_EP'),
    ('query', 'ep2814089'),
)

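# Request the search results page with the query parameters, headers, and cookies above
response = requests.get('https://worldwide.espacenet.com/searchResults', headers=headers, params=params, cookies=cookies)

# Parse the returned HTML so it can be searched
soup = BeautifulSoup(response.text, 'html.parser')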
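
The code above stops once the page is parsed. To get from there to the list of strings asked for in the question, one option is to search the page text with a regular expression. This is only a minimal sketch: it assumes the publication numbers appear in the returned HTML as plain text in the form shown in the question, such as 'EP2814089 (A4)', which I have not verified against the live page. The names PUBLICATION_PATTERN, extract_publications, and sample_text are hypothetical, and the sample text stands in for soup.get_text().

```python
import re

# Assumed format: a two-letter country code and digits, followed by a kind
# code in parentheses, e.g. "EP2814089 (A4)". This pattern is an illustration
# based on the desired result in the question, not Espacenet's own rule.
PUBLICATION_PATTERN = re.compile(r"\b([A-Z]{2}\d+)\s*\(([A-Z]\d?)\)")


def extract_publications(page_text):
    """Return unique 'NUMBER (KIND)' strings in order of first appearance."""
    seen = []
    for number, kind in PUBLICATION_PATTERN.findall(page_text):
        label = f"{number} ({kind})"
        if label not in seen:
            seen.append(label)
    return seen


# Hypothetical sample text standing in for soup.get_text()
sample_text = "EP2814089 (A4) 2015-10-07 ... EP2814089\xa0(B1) ... EP2814089 (A4)"
result = extract_publications(sample_text)
print(result)
```

With the real page, the call would be extract_publications(soup.get_text()). If the list comes back empty, the document entries are probably filled in by JavaScript after the page loads, and the rendered HTML would be needed instead.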
