python - (efficiently) Iterating recursively over multiple dataframes in pandas
Question
So basically I'm working with large datasets of market data (100,000 rows). A simplified version of the dataset columns looks like this:
[Timestamp] [Price]  [Shares] [Orders] [Side]
111.239     $23.28   200      2        B
111.240     $23.59   200      1        S
Etc etc. This data is what comes out of our market parsing software after feeding it a pcap file. Now I need to compare the output of the same market data from two different sources to make sure our market data parser is working correctly and not dropping orders or behaving inconsistently. The only problem is that the timestamps are slightly different because the data is from two different sources.
So my current approach is to implement these datasets as lists of dictionaries, each dictionary representing one of these orders. I have dictionary A and dictionary B, each representing one of the two sources, and each ordered by timestamp. Then I choose a 'fuzz factor' of time; in this example I will use 2 seconds. Here is how I do my comparison, in pseudo-code:
for item1 in dictionaryA:
    for item2 in dictionaryB:
        if item2[timestamp] is over 2 seconds before item1[timestamp]:
            remove item2 from dictionaryB
        elif item2[timestamp] is over 2 seconds after item1[timestamp]:
            mark item1 as not matched
            break
        else:  # we are within the 2-second fuzz factor
            compare the items; if a match is found:
                mark item1 in dictionaryA as matched
                remove item2 from dictionaryB
                break
So as you can see, I speed up processing by constantly removing items from dictionary B as I cycle through dictionary A. Since most of the items match, this speeds things up considerably. However, I'm not sure how to do something like this in pandas. The apply() function seems to be the fastest way to iterate through a dataset, but it applies to the entire dataset; it can't stop early once a condition is met, the way my timestamp fuzzing does above. Furthermore, I'm not sure how fast dropping rows is in pandas.
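For reference, the pseudo-code above can be turned into a directly runnable sketch. The `orders_match` helper here is hypothetical (in practice it would compare price, shares, order count, and side, per the sample columns), and a deque with an early-exit scan replaces the remove-while-iterating step, which is unsafe on a plain Python list:

```python
from collections import deque

def orders_match(a, b):
    # Hypothetical comparison: everything except the timestamp must agree.
    return (a["price"], a["shares"], a["orders"], a["side"]) == \
           (b["price"], b["shares"], b["orders"], b["side"])

def match_orders(source_a, source_b, fuzz=2.0):
    """source_a, source_b: lists of dicts sorted by 'timestamp'.
    Returns a list of (item, matched_flag) pairs for source_a."""
    pending = deque(source_b)  # popping expired items from the left is O(1)
    results = []
    for item1 in source_a:
        matched = False
        # Drop B items too old to ever match again (the 'remove' branch).
        while pending and pending[0]["timestamp"] < item1["timestamp"] - fuzz:
            pending.popleft()
        for i, item2 in enumerate(pending):
            if item2["timestamp"] > item1["timestamp"] + fuzz:
                break  # everything further on is even later; stop early
            if orders_match(item1, item2):
                del pending[i]  # consume the match so duplicates pair separately
                matched = True
                break
        results.append((item1, matched))
    return results
```

Because matched B items are consumed, two identical orders at the same timestamp will each claim a different B entry, which is the behavior the notes below require.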
Something to note:
- Timestamps are at nanosecond precision in floating-point format, but several orders may have the same timestamp.
- Several orders may look exactly the same but appear with different timestamps.
- Several orders may look exactly the same AND appear with the same timestamp; both orders will need to find separate matches.
So what do you guys think? What functions would I use to re-implement this algorithm in pandas? And since I'm shifting to pandas, should I retool the algorithm itself? I've been playing with iterrows(), but that seems slow, and I was wondering whether there was some way I could apply vectorized operations here.
Thanks for your help and let me know if you have any questions.
Solution
Pandas applies Series and DataFrame operations elementwise (vectorized) by default. You can corral all the necessary data into a single DataFrame, then use boolean indexing with the desired filters to keep only the data you'd like. This should simplify your operations a bit, because you won't have to write loops to iterate through your market data. On another note, when you do use pandas, don't use the 'Timestamp' column as the index: since several orders can share a timestamp, you'd end up with duplicate index labels, which can cause errors with some datasets. If you have any more questions, feel free to ask and I'll help to the best of my ability.
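One concrete way to express the fuzzy timestamp matching without explicit loops (a sketch, not part of the original answer) is `pandas.merge_asof`, which pairs rows whose keys fall within a tolerance. The column names mirror the sample data above, and the `in_b` marker column is something added here to flag matches. Note that `merge_asof` pairs each left row with at most one nearest right row but can reuse a right row, so the "identical orders at the same timestamp need separate matches" case from the question still needs extra handling:

```python
import pandas as pd

# Two small frames standing in for the two market-data sources.
df_a = pd.DataFrame({
    "Timestamp": [111.239, 111.240],
    "Price":     [23.28, 23.59],
    "Shares":    [200, 200],
    "Orders":    [2, 1],
    "Side":      ["B", "S"],
})
df_b = pd.DataFrame({
    "Timestamp": [111.900, 120.000],   # first is ~0.7 s off, second ~8.8 s off
    "Price":     [23.28, 23.59],
    "Shares":    [200, 200],
    "Orders":    [2, 1],
    "Side":      ["B", "S"],
}).assign(in_b=True)                   # marker: unmatched rows show up as NaN

# Both frames must be sorted on the key. The order fields must agree
# exactly ('by'); the timestamps only within a 2-second window ('tolerance').
matched = pd.merge_asof(
    df_a.sort_values("Timestamp"),
    df_b.sort_values("Timestamp"),
    on="Timestamp",
    by=["Price", "Shares", "Orders", "Side"],
    tolerance=2.0,
    direction="nearest",
)
found = matched["in_b"].fillna(False)  # True where a fuzzy match exists
```

Here the first order matches (0.661 s apart, within the 2-second fuzz factor) while the second does not, so `found` comes out `[True, False]`.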