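python - Using BioPython, how do I print DOI citations for a given search-term pair on a single (comma-separated) line instead of on multiple lines?
Problem description
StackOverflow folks, first of all, thank you for your patience. I know this is my third thread on this subject, but since I have nowhere else to go and don't even know where to begin (I don't know what I don't know), I thought I would ask here anyway. I am trying to use Biopython to pull citations from PMC and write them back to a CSV file that contains the plant name, the associated disease or condition it treats (or its medicinal action), and the DOI URLs of the citations for the given plant-disease pair. After many hours of trying to work out what to do, and after discussing the code with people more experienced than me, this is what ended up being typed into Visual Studio Code: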
for plant, disease in plant_disease_list:
    search_query = generate_search_query(plant, disease)
    handle1 = Entrez.esearch(db="pmc", term=search_query, retmax="10")
    record1 = Entrez.read(handle1)
    pubmed_ids = record1.get("IdList")
    if len(pubmed_ids)==0:
        print("{}, {}, None".format(plant, disease))
    else:
        for pubmed_id in pubmed_ids:
            handle2 = Entrez.esummary(db="pmc", id=pubmed_id)
            records = Entrez.read(handle2)
            for record in records:
                doi = record.get("DOI")
                if doi is None:
                    print(("{}, {}".format(plant, disease)))
                else:
                    doi_main = doi.split()
                    string = "http://doi.org/"
                    to_add = (",").join((string + x) for x in doi_main)
                    print("{}, {},".format(plant, disease), to_add, sep="")
where generate_search_query was previously defined as:
def generate_search_query(plant, disease):
    search_query = '"{}" AND "{}"'.format(plant, disease)
    return search_query
This is the output I get:
Asystasia salicifalia, Puerperal illness, None
Asystasia salicifalia, Puerperium, None
Asystasia salicifalia, Puerperal disorder, None
Barleria strigosa, Tonic
Justicia procumbens, Lumbago, None
Justicia procumbens, Itching,http://doi.org/10.1673/031.012.0501
Strobilanthes auriculata, Malnutrition, None
Thunbergia laurifolia, Detoxificant, None
Thunbergia similis, Tonic, None
Lannea coromandelica, Dizziness,http://doi.org/10.3897/phytokeys.102.24380
Lannea coromandelica, Dizziness,http://doi.org/10.1186/s13002-016-0089-8
Lannea coromandelica, Dizziness,http://doi.org/10.1186/s13002-015-0033-3
Spondias pinnata, Flatulence,http://doi.org/10.1016/j.heliyon.2019.e02768
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-019-0287-2
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-018-0248-1
Spondias pinnata, Flatulence,http://doi.org/10.3897/phytokeys.102.24380
Spondias pinnata, Flatulence,http://doi.org/10.1155/2018/5382904
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-016-0089-8
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-015-0033-3
Spondias pinnata, Flatulence,http://doi.org/10.1186/1472-6882-13-243
Spondias pinnata, Flatulence,http://doi.org/10.1186/1472-6882-10-77
Holarrhena pubescens, Diarrhoea,http://doi.org/10.5455/javar.2019.f379
Holarrhena pubescens, Diarrhoea,http://doi.org/10.1155/2019/2321961
Holarrhena pubescens, Diarrhoea,http://doi.org/10.1186/s12906-018-2348-9
Traceback (most recent call last):
File "scraperscript_python.py", line 33, in <module>
handle2 = Entrez.esummary(db="pmc", id=pubmed_id)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\site-packages\Bio\Entrez\__init__.py", line 334, in esummary
return _open(cgi, variables)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\site-packages\Bio\Entrez\__init__.py", line 569, in _open
handle = _urlopen(cgi)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 543, in _open
'_open', req)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1362, in https_open
context=self._context, check_hostname=self._check_hostname)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1319, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1026, in _send_output
self.send(msg)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 966, in send
self.connect()
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1422, in connect
server_hostname=server_hostname)
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 423, in wrap_socket
session=session
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 870, in _create
self.do_handshake()
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
KeyboardInterrupt
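I interrupted the rest of the output because I did not want it to run over the whole dataset while it was printing in the wrong format. As you can see in the example of Spondias pinnata and flatulence, it prints each DOI URL on a separate line. I don't want it printed like that, because putting it back into the original data would be very difficult. This CSV file has only 65 entries, but there are datasets with more than 8,000 entries, which would make that a very laborious job. For the plant-disease pair above, the output I hope to get should look like this: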
Spondias pinnata, Flatulence, http://doi.org/10.1016/j.heliyon.2019.e02768, http://doi.org/10.1186/s13002-019-0287-2, http://doi.org/10.1186/s13002-018-0248-1, http://doi.org/10.3897/phytokeys.102.24380, http://doi.org/10.1155/2018/5382904, http://doi.org/10.1186/s13002-016-0089-8, http://doi.org/10.1186/s13002-015-0033-3, http://doi.org/10.1186/1472-6882-13-243, http://doi.org/10.1186/1472-6882-10-77
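My family suggested that I use nested dictionaries, but I don't see how that would help, and I don't know where to put them in the code or what changes to make to the already heavily nested loops. Any help with this would be greatly appreciated. Thank you.
Solution
The following code: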
from Bio import Entrez
import csv
Entrez.email = "theofficialvelocifaptor@gmail.com"
botanical_names = ['Asystasia salicifalia', 'Asystasia salicifalia', 'Asystasia salicifalia', 'Barleria strigosa', 'Justicia procumbens', 'Justicia procumbens', 'Strobilanthes auriculata', 'Thunbergia laurifolia', 'Thunbergia similis', 'Lannea coromandelica', 'Spondias pinnata']
diseases = ['Puerperal illness', 'Puerperium', 'Puerperal disorder', 'Tonic', 'Lumbago', 'Itching', 'Malnutrition', 'Detoxificant', 'Tonic', 'Dizziness', 'Flatulence']
assert len(botanical_names) == len(diseases)
plant_disease_list = zip(botanical_names, diseases)
with open('plant_diseases.csv', 'w', newline='') as csvfile:
    fieldnames = ['plant', 'disease', 'dois']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for plant, disease in plant_disease_list:
        result = {'plant': plant,
                  'disease': disease}
        search_query = '"{}" AND "{}"'.format(plant, disease)
        handle1 = Entrez.esearch(db="pmc", term=search_query, retmax="10")
        record1 = Entrez.read(handle1)
        pubmed_ids = record1.get("IdList")
        if pubmed_ids:
            handle2 = Entrez.esummary(db="pmc", id=','.join(pubmed_ids))
            records = Entrez.read(handle2)
            dois = [record.get("DOI") for record in records if record.get("DOI") is not None]
            prefix = "http://doi.org/"
            dois = ','.join([prefix + doi for doi in dois])
            result['dois'] = dois
        writer.writerow(result)
It writes the following output to the file plant_diseases.csv:
plant,disease,dois
Asystasia salicifalia,Puerperal illness,
Asystasia salicifalia,Puerperium,
Asystasia salicifalia,Puerperal disorder,
Barleria strigosa,Tonic,
Justicia procumbens,Lumbago,
Justicia procumbens,Itching,http://doi.org/10.1673/031.012.0501
Strobilanthes auriculata,Malnutrition,
Thunbergia laurifolia,Detoxificant,
Thunbergia similis,Tonic,
Lannea coromandelica,Dizziness,"http://doi.org/10.3897/phytokeys.102.24380,http://doi.org/10.1186/s13002-016-0089-8,http://doi.org/10.1186/s13002-015-0033-3"
Spondias pinnata,Flatulence,"http://doi.org/10.1016/j.heliyon.2019.e02768,http://doi.org/10.1186/s13002-019-0287-2,http://doi.org/10.1186/s13002-018-0248-1,http://doi.org/10.3897/phytokeys.102.24380,http://doi.org/10.1155/2018/5382904,http://doi.org/10.1186/s13002-016-0089-8,http://doi.org/10.1186/s13002-015-0033-3,http://doi.org/10.1186/1472-6882-13-243,http://doi.org/10.1186/1472-6882-10-77"
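The dictionary idea raised in the question leads to the same one-line-per-pair result. What follows is only a minimal sketch and not part of the original answer: the `rows` list is made-up sample data standing in for the (plant, disease, DOI) triples that the original loop printed one per line, and `group_dois` is a hypothetical helper name. No Entrez call is made here.

```python
from collections import defaultdict

# Hypothetical sample data: one (plant, disease, doi) triple per printed line.
# A doi of None stands for a pair that has no citation.
rows = [
    ("Justicia procumbens", "Lumbago", None),
    ("Lannea coromandelica", "Dizziness", "10.3897/phytokeys.102.24380"),
    ("Lannea coromandelica", "Dizziness", "10.1186/s13002-016-0089-8"),
    ("Lannea coromandelica", "Dizziness", "10.1186/s13002-015-0033-3"),
]

def group_dois(rows, prefix="http://doi.org/"):
    """Collect every DOI URL under its (plant, disease) key."""
    grouped = defaultdict(list)
    for plant, disease, doi in rows:
        urls = grouped[(plant, disease)]  # creates the key even when doi is None
        if doi is not None:
            urls.append(prefix + doi)
    return grouped

grouped = group_dois(rows)
for (plant, disease), urls in grouped.items():
    # One output line per pair, with all of its DOI URLs appended.
    print(", ".join([plant, disease] + urls))
```

A single dictionary keyed by the (plant, disease) tuple is enough for this; a nested dictionary (plant, then disease) would work the same way but adds a level of lookup.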
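Note that I have used the csv module to create a valid CSV file. This includes adding double quotes around the comma-separated list of DOIs, to keep them apart from the commas that delimit the plant and the disease. Also, there is no need to add a None placeholder when there are no DOIs. Since the first line is a header naming three columns, a reader such as csv.DictReader knows to look for three fields on each row; a row with no DOIs simply has an empty third field.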
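To see the quoting behaviour in isolation, here is a small sketch that writes two rows to an in-memory buffer instead of a file and then reads them back. The `io.StringIO` buffer is an assumption made for illustration (the code above writes to plant_diseases.csv); the rows reuse values from the output shown earlier.

```python
import csv
import io

# In-memory stand-in for plant_diseases.csv, so nothing is written to disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["plant", "disease", "dois"])
writer.writeheader()
# No 'dois' key: DictWriter fills the missing column with an empty string.
writer.writerow({"plant": "Justicia procumbens", "disease": "Lumbago"})
writer.writerow({
    "plant": "Lannea coromandelica",
    "disease": "Dizziness",
    "dois": "http://doi.org/10.3897/phytokeys.102.24380,"
            "http://doi.org/10.1186/s13002-016-0089-8",
})

text = buffer.getvalue()
print(text)  # the third field of the last row is wrapped in double quotes

# Reading back: the quoted field arrives as one string and can be split again.
parsed = list(csv.DictReader(io.StringIO(text, newline="")))
for row in parsed:
    dois = row["dois"].split(",") if row["dois"] else []
    print(row["plant"], row["disease"], dois)
```

The writer only quotes fields that need it (the default `csv.QUOTE_MINIMAL` behaviour), which is why rows with zero or one DOI appear without quotes.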
Also, don't use string as a variable name, because it is the name of a Python module in the standard library.
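A short illustration of what goes wrong, as a sketch only: once the name is rebound to a str, the module can no longer be reached through it in that scope. The variable `shadowed` exists purely for this demonstration.

```python
import string

print(string.ascii_lowercase[:3])  # the standard-library module is visible here

string = "http://doi.org/"  # rebinding the name hides the module from now on

try:
    string.ascii_lowercase
    shadowed = False
except AttributeError:
    # A str object has no attribute 'ascii_lowercase'.
    shadowed = True

print("module shadowed:", shadowed)

prefix = "http://doi.org/"  # a descriptive name avoids the clash
print(prefix + "10.1673/031.012.0501")
```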