How do I fix my Python web scraper code based on BeautifulSoup?

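Problem description

I am a beginner in Python.

I am trying to build a web scraper (for PubMed in particular).

With my code, I want to print not only the paper titles but also the DOI (or any link to the paper), like this: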

Title: ABCD ABCD ABCD ABCD [http://~~~~]

Title: ABCD ABCD ABCD ABCD [http://~~~~]

Title: ABCD ABCD ABCD ABCD [http://~~~~]

...

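However, at the final step, I cannot print the title and the link together.

When I print each of them separately, it works.

Also, I am not sure how to use 'for' loops correctly here.

I would really appreciate any help with this.

Thank you.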

import requests
from bs4 import BeautifulSoup
from pprint import pprint

search = str(input("Search: "))
arttype = str(input("Is ir Review ? (y/n): "))
perpage = str(input("How many results do you want ? (10/20/50/100/200): "))
sort = str(input("Which options do you want ? (date/match): "))

if arttype == "y":
    arttype_in = "&filter=pubt.review"
else:
    arttype_in = ""

if sort == "data":
    sort2 = "&sort=data"
else:
    sort2 = ""

url = "https://pubmed.ncbi.nlm.nih.gov/?term=" + search + arttype_in + "&format=abstract" + sort2 + "&size=" + perpage
req = requests.get(url)
html = req.text
status = req.status_code


if status != 200:
    print ("")
else:
    print ("Stuck")
    

soup = BeautifulSoup(html, "html.parser")

contain_amount = soup.find ("div", {"class":"search-results"})
specific_amount = contain_amount.find ("div", {"class":"results-amount"}).text

print("Number of papers: " + str(specific_amount))

list_titles = soup.find_all ("div", {"class":"short-view"})
list_dois = soup.find_all ("a", {"class":"link-item dialog-focus"})


for i in list_dois:
    for j in list_titles:
        titles = j.find ("h1", {"class":"heading-title"}).text
        print ("Title: " + str(titles))
    dois = i.attrs["href"]
    print ("[" + str(dois) + "]")

Tags: python, web-scraping, beautifulsoup

Solution


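Change the selectors; about half of your code is already correct. Rather than collecting titles and links in two separate lists and nesting one loop inside the other, loop once over each result block (the div elements with class results-article) and read the title and the link from inside that block.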
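To see why the nested loops in the question never put a title next to its own link, here is a minimal sketch on placeholder data. The lists and the function names `nested_lines` and `paired_lines` are illustrative assumptions, not real PubMed results or part of the original code.

```python
# Placeholder data standing in for the scraped values (not real PubMed results).
titles = ["Paper A", "Paper B", "Paper C"]
links = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]


def nested_lines(titles, links):
    """Reproduce the question's loop shape: every title is repeated for each link."""
    lines = []
    for link in links:
        for title in titles:
            lines.append("Title: " + title)
        lines.append("[" + link + "]")
    return lines


def paired_lines(titles, links):
    """Walk both lists in step so each title sits next to its own link."""
    return ["Title: {} [{}]".format(t, l) for t, l in zip(titles, links)]


for line in paired_lines(titles, links):
    print(line)
```

Note that zip stops at the shorter list, so it only pairs correctly when both lists hold exactly one entry per article in the same order. Reading both values from the same result block, as the code in this solution does, avoids relying on that.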

import requests
from bs4 import BeautifulSoup
from pprint import pprint

search = str(input("Search: "))
arttype = str(input("Is ir Review ? (y/n): "))
perpage = str(input("How many results do you want ? (10/20/50/100/200): "))
sort = str(input("Which options do you want ? (date/match): "))

if arttype == "y":
    arttype_in = "&filter=pubt.review"
else:
    arttype_in = ""

if sort == "date":
    sort2 = "&sort=date"
else:
    sort2 = ""

url = "https://pubmed.ncbi.nlm.nih.gov/?term=" + search + arttype_in + "&format=abstract" + sort2 + "&size=" + perpage
print(url)
req = requests.get(url)
html = req.text
status = req.status_code


if status != 200:
    print ("Stuck")
    

soup = BeautifulSoup(html, "html.parser")

search_divs = soup.find_all("div", class_="results-article")

for div in search_divs:
    print("Title - {}".format(div.find("h1", class_="heading-title").get_text(strip=True)))
    print("Link - {}".format("https://pubmed.ncbi.nlm.nih.gov" + div.find("a")["href"]))
    print("---" * 25)

print("Number of papers - {}".format(soup.find("div", class_="results-amount").get_text(strip=True)))

Output:

Search: corona
Is ir Review ? (y/n): n
How many results do you want ? (10/20/50/100/200): 20
Which options do you want ? (date/match): match
https://pubmed.ncbi.nlm.nih.gov/?term=corona&format=abstract&size=20
Title - The history and epidemiology of Middle East respiratory syndrome corona virus
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Multidiscip+Respir+Med%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Personalized protein corona on nanoparticles and its clinical implications
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Biomater+Sci%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Nanoparticle-Protein Interaction: The Significance and Role of Protein Corona
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Adv+Exp+Med+Biol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Gold nanoparticle should understand protein corona for being a clinical nanomaterial
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Control+Release%22%5Bjour%5D
---------------------------------------------------------------------------
Title - The impact of protein corona on the behavior and targeting capability of nanoparticle-based delivery system
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Int+J+Pharm%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Liposome protein corona characterization as a new approach in nanomedicine
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Anal+Bioanal+Chem%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Shell-corona microgels from double interpenetrating networks
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Soft+Matter%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Protein corona: Opportunities and challenges
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Int+J+Biochem+Cell+Biol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Biomolecular Corona Dictates Aβ Fibrillation Process
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22ACS+Chem+Neurosci%22%5Bjour%5D
---------------------------------------------------------------------------
Title - A health concern regarding the protein corona, aggregation and disaggregation
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Biochim+Biophys+Acta+Gen+Subj%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Formation and Characterization of Protein Corona Around Nanoparticles: A Review
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Nanosci+Nanotechnol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Silver nanoparticle protein corona and toxicity: a mini-review
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Nanobiotechnology%22%5Bjour%5D
---------------------------------------------------------------------------
Title - The prevalence and morphology of the corona mortis (Crown of death): A meta-analysis with implications in abdominal wall and pelvic surgery
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Injury%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Possibilities and Limitations of Different Separation Techniques for the Analysis of the Protein Corona
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Angew+Chem+Int+Ed+Engl%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Translating Current Bioanalytical Techniques for Studying Corona Activity
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Trends+Biotechnol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - The Crown and the Scepter: Roles of the Protein Corona in Nanomedicine
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Adv+Mater%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Protein corona - from molecular adsorption to physiological complexity
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Beilstein+J+Nanotechnol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Understanding the nanoparticle-protein corona complexes using computational and experimental methods
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22Int+J+Biochem+Cell+Biol%22%5Bjour%5D
---------------------------------------------------------------------------
Title - Structure of corona radiata and tapetum fibers in ventricular surgery
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Clin+Neurosci%22%5Bjour%5D
---------------------------------------------------------------------------
Title - A protein corona primer for physical chemists
Link - https://pubmed.ncbi.nlm.nih.gov/?term=%22J+Chem+Phys%22%5Bjour%5D
---------------------------------------------------------------------------

Number of papers - 954results
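One further detail about the code above: the URL is built by plain string concatenation, so a search term containing spaces or characters such as & is not percent-encoded before it is sent. A minimal sketch of one way to handle this with the standard library follows. The helper name `build_url` and its keyword arguments are hypothetical, and the `sort=date` value is an assumption taken from the "date" option in the prompt; the other parameter names come from the URL in the code above.

```python
from urllib.parse import urlencode

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def build_url(search, review_only=False, by_date=False, size="10"):
    """Build the search URL with percent-encoded query values."""
    params = {"term": search, "format": "abstract", "size": size}
    if review_only:
        params["filter"] = "pubt.review"
    if by_date:
        # Assumed value, based on the "date" choice offered by the prompt.
        params["sort"] = "date"
    return BASE_URL + "?" + urlencode(params)


print(build_url("protein corona", size="20"))
```

Alternatively, requests can do the same encoding itself if the dictionary is passed through the `params` argument of `requests.get`.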
