python-3.x - Get all hrefs for a specific author
Problem description
Hi, I'm trying to get all the /pubmed/ numbers that link to the article abstracts of a specific author. The problem is that when I try, I just get the same number over and over again until the for loop ends.
The href I'm trying to get should come from the output of the for line in lines
loop (the specific href is shown in the output sample). That loop seems to run fine, but the for abstract in abstracts
loop just repeats the same href.
Any suggestions on what I'm missing or doing wrong? I don't have much experience with bs4, so I may not be using the library correctly.
#Obtain all the papers of a scientific author and write its abstract in a new file
from bs4 import BeautifulSoup
import re
import requests

url = "https://www.ncbi.nlm.nih.gov/pubmed/?term=valvano"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

lines = soup.find_all("div", {"class": "rslt"})
authors = soup.find_all("p", {"class": "desc"})

scientist = []
for author in authors:
    #print('\n', author.text)
    scientist.append(author.text)

s = []
for i in scientist:
    L = i.split(',')
    s.append(L)

n = 0
for line in lines:
    if ' Valvano MA' in s[n] or 'Valvano MA' in s[n]:
        print('\n', line.text)
        #part of one output:
        #<a href="/pubmed/32146294" ...
        found = soup.find("a", {"class": "status_icon nohighlight"})
        web_abstract = 'https://www.ncbi.nlm.nih.gov{}'.format(found['href'])
        response0 = requests.get(web_abstract)
        sopa = BeautifulSoup(response0.content, 'lxml')
        abstracts = sopa.find("div", {"class": "abstr"})
        for abstract in abstracts:
            #print(abstract.text)
            print('https://www.ncbi.nlm.nih.gov{}'.format(found['href']))
        #output (the same URL, eleven times):
        #https://www.ncbi.nlm.nih.gov/pubmed/31919170
        #https://www.ncbi.nlm.nih.gov/pubmed/31919170
        #...
        n = n + 1
    else:
        n = n + 1
#expected output:
https://www.ncbi.nlm.nih.gov/pubmed/32146294
https://www.ncbi.nlm.nih.gov/pubmed/32064693
https://www.ncbi.nlm.nih.gov/pubmed/31978399
https://www.ncbi.nlm.nih.gov/pubmed/31919170
https://www.ncbi.nlm.nih.gov/pubmed/31896348
https://www.ncbi.nlm.nih.gov/pubmed/31866961
https://www.ncbi.nlm.nih.gov/pubmed/31722994
https://www.ncbi.nlm.nih.gov/pubmed/31350337
https://www.ncbi.nlm.nih.gov/pubmed/31332863
https://www.ncbi.nlm.nih.gov/pubmed/31233657
https://www.ncbi.nlm.nih.gov/pubmed/31133642
https://www.ncbi.nlm.nih.gov/pubmed/30913267
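The repetition happens because soup.find(...) is called inside the loop: find always returns the first match in the whole document, no matter which result line is currently being processed. Searching within each result element instead yields a different link per result. A minimal offline sketch of the difference, using a made-up HTML snippet (not the real PubMed markup) and the stdlib html.parser:

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for a results page: two result divs, each with its own link.
html = """
<div class="rslt"><a class="status_icon nohighlight" href="/pubmed/111">one</a></div>
<div class="rslt"><a class="status_icon nohighlight" href="/pubmed/222">two</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Buggy pattern: soup.find() always returns the document's first matching <a>.
same = [soup.find("a", {"class": "status_icon nohighlight"})['href']
        for _ in soup.find_all("div", {"class": "rslt"})]
print(same)       # ['/pubmed/111', '/pubmed/111']

# Fix: search inside each result element, not the whole soup.
different = [div.find("a", {"class": "status_icon nohighlight"})['href']
             for div in soup.find_all("div", {"class": "rslt"})]
print(different)  # ['/pubmed/111', '/pubmed/222']
```

In the question's code, replacing soup.find(...) with line.find(...) inside the for line in lines loop follows the same principle.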
Solution
Given that the URL https://www.ncbi.nlm.nih.gov/pubmed/?term=valvano+MA returns the correct results, you can use the following regular-expression example.
from bs4 import BeautifulSoup
import re
import requests

url = "https://www.ncbi.nlm.nih.gov/pubmed/?term=valvano+MA"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

for a in soup.select('div.rprt p a'):
    if re.match('^/pubmed/[0-9]*$', a['href']) is not None:
        print('https://www.ncbi.nlm.nih.gov{}'.format(a['href']))
This gets all 20 results, plus the erratum for result 17. If you don't want that erratum, change line 10 to
if re.match('^/pubmed/[0-9]*$', a['href']) is not None and a.get('ref') is not None:
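The ^/pubmed/[0-9]*$ pattern only accepts hrefs that are a bare /pubmed/ path followed by digits, which filters out search, navigation, and fragment links on the page. A quick offline illustration with invented sample hrefs (not scraped from the site):

```python
import re

# Illustrative hrefs like those that can appear on a results page.
hrefs = [
    '/pubmed/32146294',           # plain article link: kept
    '/pubmed/?term=valvano+MA',   # the search URL itself: rejected
    '/pubmed/32146294#comments',  # trailing fragment: rejected
    '#',                          # navigation placeholder: rejected
]
kept = [h for h in hrefs if re.match(r'^/pubmed/[0-9]*$', h) is not None]
print(kept)  # ['/pubmed/32146294']
```

Note that * allows zero digits, so the bare path '/pubmed/' would also pass; using [0-9]+ instead would require at least one digit.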