Common pandas functions

wutongyuhou 2017-05-17 22:54

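1. This note mainly covers indexes that contain duplicate values. Use obj.index.is_unique to check for duplicates. Selecting a duplicated label, for example obj['a'], returns every value stored under that label as a Series, whereas a unique label returns a single scalar.
2. Common arithmetic and descriptive statistics functions for a DataFrame: https://chrisalbon.com/python/pandas_dataframe_descriptive_stats.html
For the full list of functions, see Python for Data Analysis, p. 139, Table 5-10.
3. import pandas_datareader as web can download stock data to use as a statistical sample. For the supported web sources and how to use them, see the link below.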
https://pandas-datareader.readthedocs.io/en/latest/
(1) Series with Series
returns.MSFT.corr(returns.IBM)  correlation coefficient
returns.MSFT.cov(returns.IBM)  covariance

(2) Within a DataFrame (pairwise between its columns)
returns.corr()
returns.cov()
(3) DataFrame with Series
returns.corrwith(returns.IBM)
(4) DataFrame with DataFrame
returns.corrwith(volume)


import numpy as np
from pandas import DataFrame , Series
print ("Axis indexes with duplicate values")
obj=Series(range(5),index =['a','a','b','b','c'])
print("obj is \n", obj)
print("obj.index.is_unique is ",obj.index.is_unique)
print("obj['a'] is \n", obj['a'])
print("obj['b'] is \n",obj['b'])

df=DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
print("df is \n",df)
# .ix has been removed from pandas; .loc is the label-based replacement
print("df.loc['b'] is \n ", df.loc['b'])
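
The difference between selecting a duplicated label and a unique one can be checked directly. The following is a minimal sketch that reuses the same Series as above. The use of index.duplicated() to keep only the first row per label is an addition for illustration and not part of the original notes.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# A duplicated label selects every matching row and returns a Series.
dup = obj['a']
# A unique label returns a single scalar value.
single = obj['c']

# index.duplicated() marks every occurrence of a label after the first one.
mask = obj.index.duplicated()
# Keep only the first row for each label.
first_only = obj[~mask]

print(type(dup).__name__, list(dup))
print(single)
print(list(mask))
print(first_only.to_dict())
```

Because the return type depends on whether the label is duplicated, code that indexes by label is easier to reason about when is_unique is checked first.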

df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],index=['a', 'b', 'c', 'd'],columns=['one','two'])
print("df is \n",df)
print("Calling DataFrame's sum method returns a Series containing column sums")
print("df.sum() is \n",df.sum())
print("passing axis=1 sums over the rows instead")
print("df.sum(axis=1) \n", df.sum(axis=1))
print("NA values are excluded unless the entire slice is NA. This can be disabled using the skipna option")
print("df.mean(axis=1, skipna=False) \n ", df.mean(axis=1, skipna=False))

print("df.idxmax() returns indirect statistics such as the index label where the maximum value is attained \n", df.idxmax())
print("df.cumsum() returns the cumulative sum of values \n", df.cumsum())
print("df.describe() returns multiple summary statistics in one shot \n", df.describe())
obj=Series(['a','a','b','c']*4)
print("obj is \n",obj)
print("obj.describe() returns alternate summary statistics for non-numeric data \n", obj.describe())
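
How missing values are treated differs slightly between sum and mean, so it may help to compare them side by side. The sketch below reuses the same DataFrame and assumes a reasonably recent pandas (0.22 or later), where the sum of an all-NA row is 0.0 by default and the min_count parameter can be used to get NaN instead. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

# NaN values are skipped; the all-NaN row 'c' sums to 0.0.
row_sum = df.sum(axis=1)
# Require at least one valid value, so row 'c' stays NaN.
row_sum_strict = df.sum(axis=1, min_count=1)
# NaN values are skipped; row 'c' has nothing to average and is NaN.
row_mean = df.mean(axis=1)
# With skipna=False, any NaN in a row makes the result NaN.
row_mean_noskip = df.mean(axis=1, skipna=False)

print(pd.DataFrame({
    'sum': row_sum,
    'sum_min_count_1': row_sum_strict,
    'mean': row_mean,
    'mean_skipna_False': row_mean_noskip,
}))
```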

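import pandas_datareader as web

# Documentation: https://pandas-datareader.readthedocs.io/en/latest/

# Note: get_data_google relied on the Google Finance source, which later
# pandas-datareader releases dropped. On a current release, substitute a
# reader that the documentation above lists as supported.
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_google(ticker, '1/1/2016', '1/1/2017')
print("all data is \n ", all_data)

# One column per ticker, built from the closing price and the traded volume
price = DataFrame({tic: data['Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})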

returns = price.pct_change()
print("returns.tail()\n",returns.tail())

print("returns.MSFT.corr(returns.IBM) \n",returns.MSFT.corr(returns.IBM))
print("returns.MSFT.cov(returns.IBM) \n", returns.MSFT.cov(returns.IBM))

print("returns.corr() \n", returns.corr())
print("returns.cov() \n", returns.cov())

print("returns.corrwith(returns.IBM) \n",returns.corrwith(returns.IBM))

print("volume is \n", volume)
print("returns.corrwith(volumn) \n",returns.corrwith(volume))
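
Since the download step depends on a remote data source, the same correlation calls can also be tried offline. The following is a minimal sketch under the assumption that synthetic random-walk prices are an acceptable stand-in. The prices and volumes are made up and are not market data, so the resulting numbers carry no financial meaning.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2016-01-01', periods=250, freq='B')
tickers = ['AAPL', 'IBM', 'MSFT', 'GOOG']

# Synthetic prices: one random walk per ticker (made-up data).
steps = rng.normal(0.0, 0.01, size=(len(dates), len(tickers)))
price = pd.DataFrame(100 * np.exp(steps.cumsum(axis=0)),
                     index=dates, columns=tickers)
# Synthetic volumes: random integers (made-up data).
volume = pd.DataFrame(rng.integers(1_000, 10_000, size=price.shape),
                      index=dates, columns=tickers)

returns = price.pct_change()

# (1) Series with Series
pair_corr = returns['MSFT'].corr(returns['IBM'])
pair_cov = returns['MSFT'].cov(returns['IBM'])
# (2) Pairwise between the columns of one DataFrame
corr_matrix = returns.corr()
# (3) Each column of the DataFrame against one Series
with_ibm = returns.corrwith(returns['IBM'])
# (4) Matching columns of two DataFrames
with_volume = returns.corrwith(volume)

print(pair_corr, pair_cov)
print(corr_matrix)
print(with_ibm)
print(with_volume)
```

Note that corrwith between two DataFrames pairs columns by name, so it returns one value per ticker rather than a full matrix.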
