python-3.x - 之前工作的 Scrape 现在失败了,Data-frame 显示 NAN 为什么会发生这种情况?
问题描述
这是早期工作的,这是遗留代码。我在进行网络抓取方面的经验非常少这是早期工作的遗留代码,最近我遇到了错误。该表位于以下链接中。
这是完整的代码和错误行,这是有效的
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from nsepy import get_history
from datetime import date
from datetime import datetime
Base_url =("https://www.nseindia.com/live_market/dynaContent/"+
"live_watch/option_chain/optionKeys.jsp?symbolCode=2541&symbol=ITC&"+
"symbol=UBL&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17")
page = requests.get(Base_url)
page.status_code
page.content
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
#Added this code for checking the underlying value
table_1 =soup.find_all(style="float:right; font-size:1.2em;")
for table in table_1:
underlysingscript = table.select('span')
table_it = soup.find_all(class_="opttbldata")
table_cls_1 = soup.find_all(id="octable")
col_list = []
# The code given below will pull the headers of the Option Chain table
for mytable in table_cls_1:
table_head = mytable.find('thead')
try:
rows = table_head.find_all('tr')
for tr in rows:
cols = tr.find_all('th')
for th in cols:
er = th.text
ee = er.encode('utf8')
ee = str(ee, 'utf-8')
col_list.append(ee)
except:
print ("no thead")
col_list_fnl = [e for e in col_list if e not in ('CALLS','PUTS','Chart','\xc2\xa0','\xa0')]
table_cls_2 = soup.find(id="octable")
all_trs = table_cls_2.find_all('tr')
req_row = table_cls_2.find_all('tr')
new_table = pd.DataFrame(index=range(0,len(req_row)-3) , columns=col_list_fnl)
在这一行中,数据变为 NAN。为什么和什么问题?可以更好地编写此代码以避免此问题吗?
解决方案
您正在创建一个空数据框。创建数据框时,您可以通过index=
指定行数,通过 指定列数columns=
,但绝不会将任何值/数据放入其中。因此,行和列的数据框,以及每个单元格中带有 NAN 的值。
我还要说你工作太努力了才能得到那张桌子。Pandas 可以为您完成这项工作。然后只需要操作数据框以适应您想要的:
import pandas as pd
import requests
Base_url =("https://www.nseindia.com/live_market/dynaContent/"+
"live_watch/option_chain/optionKeys.jsp?symbolCode=2541&symbol=ITC&"+
"symbol=UBL&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17")
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
response = requests.get(Base_url, headers = headers)
tables = pd.read_html(response.text)
df = tables[1][2:].reset_index(drop=True)
df.columns = tables[1].iloc[1]
df = df.drop('Chart', axis=1)
输出:
print (df)
1 OI Chng in OI Volume IV ... IV Volume Chng in OI OI
0 - - - - ... - - - -
1 - - - - ... - - - -
2 - - - - ... - - - -
3 - - - - ... - - - -
4 - - - - ... - - - -
5 - - - - ... - - - -
6 - - - - ... - - - -
7 4800 - - - ... 43.15 93 57600 141600
8 - - - - ... 37.31 20 26400 48000
9 2400 - 1 62.34 ... 33.31 996 770400 3427200
10 4800 - 1 35.32 ... 30.48 705 621600 1003200
11 88800 31200 47 32.52 ... 28.13 4476 1572000 2510400
12 184800 117600 116 30.78 ... 25.50 3284 873600 1545600
13 3165600 2085600 2481 29.05 ... 24.67 7621 979200 2769600
14 3184800 2265600 3308 29.73 ... 24.77 4472 424800 1497600
15 7579200 5167200 10875 32.14 ... 25.98 4858 1142400 2409600
16 5640000 3117600 9499 35.38 ... 33.36 1029 96000 302400
17 10680000 4572000 16342 38.06 ... 46.34 128 -4800 182400
18 2848800 1161600 3523 41.92 ... 47.42 4 -2400 24000
19 6751200 2724000 5638 44.97 ... - 1 - 45600
20 405600 -26400 562 48.56 ... - - - 2400
21 1077600 319200 1113 47.61 ... - 21 -2400 249600
22 21600 4800 17 54.15 ... - - - 4800
23 252000 170400 196 55.95 ... - 2 -4800 57600
24 - - - - ... - - - -
25 - - 1 60.47 ... - - - 7200
26 - - - - ... - - - -
27 4800 - - - ... - - - 40800
28 - - - - ... - - - -
29 - - - - ... - - - -
30 - - - - ... - - - -
31 - - - - ... - 7 -14400 4800
32 41896800 NaN 53720 NaN ... NaN NaN NaN NaN
[33 rows x 21 columns]
BeautifulSoup 选项:
我不得不更改一些代码,以确保列数和排队的夜晚/相等。我也不以任何方式清理它(即替换"-"
、删除空列或行、重置索引等)。但希望能给你一些工作:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
#from nsepy import get_history
from datetime import date
from datetime import datetime
Base_url =("https://www.nseindia.com/live_market/dynaContent/"+
"live_watch/option_chain/optionKeys.jsp?symbolCode=2541&symbol=ITC&"+
"symbol=UBL&instrument=OPTSTK&date=-&segmentLink=17&segmentLink=17")
page = requests.get(Base_url)
page.status_code
page.content
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
#Added this code for checking the underlying value
table_1 =soup.find_all(style="float:right; font-size:1.2em;")
for table in table_1:
underlysingscript = table.select('span')
table_it = soup.find_all(class_="opttbldata")
table_cls_1 = soup.find_all(id="octable")
col_list = []
# The code given below will pull the headers of the Option Chain table
for mytable in table_cls_1:
table_head = mytable.find('thead')
try:
rows = table_head.find_all('tr')
for tr in rows:
cols = tr.find_all('th')
for th in cols:
er = th.text
ee = er.encode('utf8')
ee = str(ee, 'utf-8')
col_list.append(ee)
except:
print ("no thead")
col_list_fnl = [e for e in col_list if e not in ('CALLS','PUTS','\xc2\xa0','\xa0')]
table_cls_2 = soup.find(id="octable")
req_row = table_cls_2.find_all('tr')
new_table = pd.DataFrame()
for row in req_row:
td = row.find_all('td')
data = [ ele.text for ele in td ]
temp_df = pd.DataFrame([data])
new_table = new_table.append(temp_df)
new_table.columns = col_list_fnl
输出:
print (new_table)
Chart OI Chng in OI ... Chng in OI OI Chart
0 NaN NaN NaN ... NaN NaN NaN
0 NaN NaN NaN ... NaN NaN NaN
0 - - ... - -
0 - - ... - -
0 - - ... - -
0 - - ... - -
0 - - ... - -
0 - - ... 2,400 2,400
0 - - ... - -
0 4,800 - ... 110,400 194,400
0 - - ... 40,800 62,400
0 4,800 2,400 ... 165,600 2,822,400
0 4,800 - ... 576,000 957,600
0 86,400 28,800 ... 1,276,800 2,215,200
0 220,800 153,600 ... 664,800 1,336,800
0 3,151,200 2,071,200 ... 708,000 2,498,400
0 3,204,000 2,284,800 ... 314,400 1,387,200
0 7,231,200 4,819,200 ... 770,400 2,037,600
0 5,469,600 2,947,200 ... 81,600 288,000
0 10,612,800 4,504,800 ... -2,400 184,800
0 2,632,800 945,600 ... -4,800 21,600
0 6,523,200 2,496,000 ... - 45,600
0 391,200 -40,800 ... - 2,400
0 1,072,800 314,400 ... -2,400 249,600
0 21,600 4,800 ... - 4,800
0 249,600 168,000 ... -4,800 57,600
0 - - ... - -
0 2,400 2,400 ... - 7,200
0 - - ... - -
0 4,800 - ... - 40,800
0 - - ... - -
0 - - ... - -
0 - - ... - -
0 - - ... -14,400 4,800
0 Total 40,888,800 ... NaN NaN NaN
[35 rows x 23 columns]
推荐阅读
- jsx - 在 Oak 中使用 JSX 作为视图引擎
- php - 我没有获得所需的 URLRewrite
- python - 更新到 VS Code 1.55.0 但没有改进 python notebook
- q - Merlin - 使用 dnsmasq 将 *.localdomain 等同于 *.mydomain.com
- windows - Azure DevOps Windows 与 Ubuntu 托管的跨平台应用程序代理?
- java - 如何在项目中使用一些 libGDX 库
- python - Django:带有用于创建相关对象的字段的表单
- angular - 在 Angular 中使用守卫的更好方法
- r - 如何使用一个变量的值作为另一个变量的名称
- python-3.x - 龙卷风中的多处理