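python - Parsing a URL with Python
Problem description
I want to parse the following URL: Espacenet link
I want to get the URL that corresponds to this text:
- Battery pack having a bus bar with a novel structure
I am using Python, but I am not familiar with JavaScript. How can I get this done?
So far I have looked at requests_html and tried the following code: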
from requests_html import HTMLSession
from bs4 import BeautifulSoup
publication_number_to_scrape = "EP2814089"
url = "https://worldwide.espacenet.com/searchResults?ST=singleline&locale=fr_EP&submitted=true&DB=&query=" + publication_number_to_scrape
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url, headers=headers)
print(resp.content)
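# Run JavaScript code on webpage; render() updates resp.html in place and returns None
resp.html.render()
# Parse the rendered HTML rather than the raw response body
soup = BeautifulSoup(resp.html.html, 'html.parser')
print(soup)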
In the printed output, I see this part:
</li>
<li class="bendractive"><a accesskey="b" href="">Liste de résultats</a></li>
<li class="bendr"><a accesskey="c" class="ptn" href="/mydocumentslist?submitted=true&locale=fr_EP" id="menuPnStar">Ma liste de brevets (<span id="menuPnCount"></span>)</a></li>
<li class="bendr"><a accesskey="d" href="/queryHistory?locale=fr_EP">Historique des requêtes</a></li>
<li class="spacer"></li>
<li class="bendl"><a accesskey="e" href="/settings?locale=fr_EP">Paramètres</a></li>
<li class="bendl last">
<a accesskey="f" href="/help?locale=fr_EP&method=handleHelpTopic&topic=index">Aide</a>
</li>
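My goal is to get the following URL from the results: Wanted URL
My ultimate goal is to get a list containing one string for each document that appears at that URL.
I don't need the URLs of those documents, only the following list: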
result = ['EP2814089 (A4)', 'EP2814089 (B1)', ....]
Solution
I think this will do the job:
import requests
from bs4 import BeautifulSoup
cookies = {
    'JSESSIONID': '9ULYIsd9+RmCkgzGPoLdCWMP.espacenet_levelx_prod_1',
    'org.springframework.web.servlet.i18n.CookieLocaleResolver.LOCALE': 'fr_EP',
    'menuCurrentSearch': '%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'currentUrl': 'https%3A%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
    'PGS': '10',
}
headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
    'Sec-Fetch-User': '?1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'tr,tr-TR;q=0.9',
}
params = (
    ('DB', ''),
    ('ST', 'singleline'),
    ('locale', 'fr_EP'),
    ('query', 'ep2814089'),
)
response = requests.get('https://worldwide.espacenet.com/searchResults', headers=headers, params=params, cookies=cookies)
soup = BeautifulSoup(response.text, 'html.parser')
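The snippet above stops once the page has been parsed. Below is a minimal sketch of the remaining step, assuming the publication strings appear in the page text in the form shown in the question (for example 'EP2814089 (A4)'). The helper name extract_publications and the sample text are hypothetical, and the actual Espacenet markup has not been verified here.

```python
import re


def extract_publications(text, publication_number):
    """Return the unique 'NUMBER (KIND)' strings found in text, in order of appearance."""
    # Assumption: the kind code is one capital letter plus an optional digit, e.g. A4 or B1.
    pattern = re.compile(re.escape(publication_number) + r"\s*\(([A-Z]\d?)\)")
    found = []
    for match in pattern.finditer(text):
        entry = f"{publication_number} ({match.group(1)})"
        if entry not in found:  # skip duplicates, keep first-seen order
            found.append(entry)
    return found


# Hypothetical sample standing in for the text of the results page.
sample_text = "EP2814089 (A4)  EP2814089 (B1)  US2014363712 (A1)  EP2814089 (A4)"
result = extract_publications(sample_text, "EP2814089")
print(result)
```

With the code above, the input text could come from soup.get_text(). If the site fills in the list with JavaScript, the plain requests response may not contain it, and the rendered HTML from requests_html would be needed instead.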