pandas read_html: why does it show NaN when the row contains data?

Problem description

I am trying to scrape a page.

I wrote this code:

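import pandas as pd
output_file = open('neuropep.txt', 'a')  # opened for later use; not written to here
for i in range(1, 2):
    # Zero-pad the ID to five digits, e.g. NP00001
    number = '{:05}'.format(i)
    url = 'http://isyslab.info/NeuroPep/search_info?pepNum=NP' + number
    # read_html returns a list of DataFrames, one per <table> on the page
    tables = pd.read_html(url)
    print(tables[0][1])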

The output is:

0                           NP00001
1     7B2 C-terminal peptide (5-13)
2                 Rattus norvegicus
3                             10116
4                               NaN
5                               7B2
6                               NaN
7                                 9
8                               NaN
9                               NaN
10                        FSEEEKEPE
11                             View
12                              NaN
13                              NaN
Name: 1, dtype: object

But I can see from the link that row 13 should say:

Karlsson O, Kultima K, Wadensten H, Nilsson A, Roman E, Andrén PE, Brittebo EB Neurotoxin-induced neuropeptide perturbations in striatum of neonatal rats J Proteome Res 2013 Apr 5;12(4):1678-90
PMID: 23410195

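I can't work out where the discrepancy comes from. I tried printing different parts of the table, but I'm not sure how to find where the missing data is. I don't actually need the whole reference, only the PubMed ID.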

Edit 1: I tried using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

for i in range(1, 2):
    number = '{:05}'.format(i)
    url = 'http://isyslab.info/NeuroPep/search_info?pepNum=NP' + number
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    # find_all('li') returns every <li> element on the page, not a table
    table = soup.find_all('li')
    print(table)

Tags: python, pandas

Solution
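
Since only the PubMed ID is needed, one possible workaround is to skip the table parsing and search the raw HTML for the "PMID: <digits>" text directly. The sketch below is not a confirmed fix for the NaN cells. It assumes the ID appears in the HTML the server returns in roughly the form shown on the page ("PMID: 23410195"), possibly with a tag between the label and the number. The function name extract_pmids and the sample fragment are hypothetical, made up for illustration; the real markup of the page may differ.

```python
import re

# Matches "PMID: 23410195" and tolerates tags or whitespace
# between the "PMID:" label and the digits (an assumption about the markup).
PMID_PATTERN = re.compile(r"PMID\s*:?\s*(?:<[^>]+>\s*)*(\d+)")


def extract_pmids(html):
    """Return the PubMed IDs found in raw HTML, in order, without duplicates."""
    seen = []
    for pmid in PMID_PATTERN.findall(html):
        if pmid not in seen:
            seen.append(pmid)
    return seen


# Hypothetical fragment for illustration only, not the real page source.
sample_html = """
<tr><td>Reference</td>
<td><ul><li>Karlsson O, Kultima K, et al. J Proteome Res 2013
<br>PMID: <a href="#">23410195</a></li></ul></td></tr>
"""

print(extract_pmids(sample_html))  # ['23410195']
```

In the loop from the question, the html argument would be requests.get(url).text. If the list comes back empty, the reference is probably not in the HTML delivered by the server at all, which would also be consistent with read_html reporting NaN for that row.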

