(Efficiently) Iterating recursively over multiple dataframes in pandas

Problem Description

So basically I'm working with large datasets of market data (100,000 rows). A simplified version of the dataset's columns looks like this:

[Timestamp] [Price] [Shares] [Orders] [Side]
111.239     $23.28  200      2        B
111.240     $23.59  200      1        S

Etc etc. This data is what comes out of our market parsing software after feeding it a pcap file. Now I need to compare the output of the same market data from two different sources to make sure our market data parser is working correctly and not dropping orders or behaving inconsistently. The only problem is that the timestamps are slightly different because the data is from two different sources.

So my current approach is to store each dataset as a list of dictionaries, each dictionary representing one of these orders. I have dictionaryA and dictionaryB, each representing one of the two sources, and each ordered by timestamp. Then I choose a 'fuzz factor' of time; in this example I will use 2 seconds. Here is how I do my comparison in pseudo-code:

for item1 in dictionaryA:
    # Iterate over a copy so items can be removed from dictionaryB mid-loop.
    for item2 in list(dictionaryB):
        if item2['timestamp'] < item1['timestamp'] - 2:
            # item2 is over 2 seconds before item1: it can never match
            # this or any later item in dictionaryA, so drop it.
            dictionaryB.remove(item2)

        elif item2['timestamp'] > item1['timestamp'] + 2:
            # item2 is over 2 seconds after item1: item1 has no match.
            item1['matched'] = False
            break

        else:  # We are inside the 2-second fuzz factor: compare the items.
            if all(item1[k] == item2[k] for k in ('price', 'shares', 'orders', 'side')):
                item1['matched'] = True
                dictionaryB.remove(item2)
                break

So as you can see, I speed up processing by constantly removing items from dictionaryB as I cycle through dictionaryA. Since most of the items match, this speeds things up considerably. However, I'm not sure how to do something like this in pandas. The apply() function seems to be the fastest way to iterate through a dataset, but it processes the entire dataset rather than stopping when a condition is met, like the timestamp fuzzing I do above. Furthermore, I'm not sure how fast dropping rows is in pandas.
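For what it's worth, pandas has a built-in primitive for this kind of fuzzy, timestamp-tolerant matching: `pd.merge_asof`. Below is a minimal sketch under assumptions: the frame names (`a`, `b`), their values, and the `Side` column are made up for illustration; only the column names from the question are kept. Note that `merge_asof` lets one row of `b` match several rows of `a`, so it does not by itself enforce the one-to-one matching required for duplicate orders.

```python
import pandas as pd

# Toy frames standing in for the two parsed feeds (hypothetical values).
# Both frames must be sorted by the merge key.
a = pd.DataFrame({'Timestamp': [111.239, 111.240, 200.000],
                  'Price':     [23.28,   23.59,   10.00],
                  'Shares':    [200,     200,     100]})
b = pd.DataFrame({'Timestamp': [111.241, 111.242],
                  'Price':     [23.28,   23.59],
                  'Shares':    [200,     200],
                  'Side':      ['B',     'S']})

# For each row of `a`, find the nearest-in-time row of `b` with identical
# Price/Shares whose timestamp lies within the 2-second fuzz factor.
matched = pd.merge_asof(a, b,
                        on='Timestamp',
                        by=['Price', 'Shares'],
                        tolerance=2.0,
                        direction='nearest')

# Rows of `a` with no counterpart in `b` get NaN in b's extra columns.
matched['found'] = matched['Side'].notna()
```

This replaces both nested loops with a single sorted-merge pass, which is typically far faster than row-by-row Python iteration on 100,000 rows.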

Something to note:

  1. Timestamps are at nanosecond precision in floating-point format, but several orders may have the same timestamp.
  2. Several orders may look exactly the same but appear with different timestamps.
  3. Several orders may look exactly the same AND appear with the same timestamp; both orders will need to find separate matches.
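On point 3, one common trick (a sketch, not necessarily how you'd want to structure it) is to number otherwise-identical rows with `groupby(...).cumcount()`, so each duplicate carries a distinct key and must find its own separate match on the other side. The frame and its values here are hypothetical:

```python
import pandas as pd

# Hypothetical frame with two byte-identical orders at the same timestamp.
df = pd.DataFrame({'Timestamp': [111.240, 111.240, 111.241],
                   'Price':     [23.59,   23.59,   23.59],
                   'Shares':    [200,     200,     200]})

# Number identical orders 0, 1, 2, ... within each duplicate group, so
# the pair (row values, dup_id) is unique and can be matched one-to-one.
df['dup_id'] = df.groupby(['Timestamp', 'Price', 'Shares']).cumcount()
```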

So what do you guys think? What functions would I use to re-implement this algorithm in pandas? And since I'm shifting to pandas, should I retool the algorithm itself? I've been playing with iterrows(), but that seems slow, and I was wondering if there was some way I could apply vectorized operations here.

Thanks for your help and let me know if you have any questions.

Tags: python, pandas, dataframe, recursion, optimization

Solution


Pandas applies Series and DataFrame operations element-wise (vectorized) by default. You can corral all the necessary data into a single DataFrame, then use boolean indexing with the desired filters to keep only the rows you want. This should simplify your operations a bit because you won't have to write explicit loops over your market data. On another note, when you do use pandas, be careful about using the 'Timestamp' column as the index: since several of your orders can share a timestamp, the index will not be unique, and some operations can fail or behave unexpectedly on such datasets. If you have any more questions feel free to ask and I'll help to the best of my ability.
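The boolean-indexing idea above can be sketched as follows; the frame, its values, and the reference timestamp `t0` are made up for illustration:

```python
import pandas as pd

# Hypothetical single-frame version of the market data.
df = pd.DataFrame({'Timestamp': [111.239, 111.240, 115.000],
                   'Price':     [23.28,   23.59,   23.59]})

t0 = 111.240  # reference timestamp of the order being matched

# Boolean indexing: keep only rows within the 2-second fuzz window of t0,
# with no explicit Python loop over the rows.
window = df[(df['Timestamp'] >= t0 - 2) & (df['Timestamp'] <= t0 + 2)]
```

The comparison operators produce boolean Series over the whole column at once, so the filter is a single vectorized pass rather than a per-row check.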

