Skipping x rows in an iterable of a subset of a dataframe

Problem description

I have a dataframe with 154529 rows in total, which I iterate over by grouping it on one of its columns.

During the iteration I look for a specific row y that correlates with the current row x (the one the iterator is currently at). As soon as I have found row y, I want to skip ahead so that the iteration resumes at the row right after row y.

To do so, I'm using next(islice(...)). However, islice always skips to the wrong index. My assumption is that this happens because I iterate over subsets only, while the indices are still relative to the whole dataframe.

I already tried to solve the problem by applying reset_index() to the sub-dataframe, but since I need the original indices for some assignments made during the loop, this approach doesn't work. Can anybody help me find the correct start parameter for islice()?

Here are some example indices for deeper investigation. (I wasn't able to find a pattern in the offsets of the actual new indices.) (image: deep dive on indices)

And here is my code

from itertools import islice

case_started = False

for session_id, session_df in labeled_data.groupby('SessionId'):

    session_iterations = session_df.iterrows()

    start_end_pairs = []  # store all start-end pairs for each session
    next_start_index = ''

    for index, row in session_iterations:
        # doing stuff to find row y
        # doing some assignments with row y index and current row index

        start_end_pairs.append((index, row_y))
        next_start_index = case_end + 1
        if next_start_index < session_df.index[-1]:
            skip = case_end - index  # skipping relative to current index
            next(islice(session_iterations, skip, None), 'Stop')  # skipping to next start index
        else:
            break

Thanks in advance for any kind of help or hints!

Tags: python, loops, indexing, iterator, itertools

Solution


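The problem seems to be the second argument of islice: try setting it to skip. With islice(session_iterations, skip, None), next() advances past skip rows and then also consumes the row it returns, so the loop resumes one row too late. With islice(session_iterations, skip, skip), exactly skip rows are consumed and nothing more.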
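To see the effect of the stop argument in isolation, here is a minimal sketch on a plain range iterator instead of a dataframe. The helper names resume_after_open_stop and resume_after_equal_stop are made up for this illustration; only islice and next come from the standard library.

```python
from itertools import islice


def resume_after_open_stop(n):
    """Skip with islice(it, n, None): returns the value the loop would see next."""
    it = iter(range(10))
    next(it)  # the loop is "currently at" 0
    # Advances past n items, then next() also pulls (and discards) one more.
    next(islice(it, n, None), 'Stop')
    return next(it)


def resume_after_equal_stop(n):
    """Skip with islice(it, n, n): returns the value the loop would see next."""
    it = iter(range(10))
    next(it)  # the loop is "currently at" 0
    # Consumes exactly n items; the slice itself is empty, so next() returns None.
    next(islice(it, n, n), None)
    return next(it)


print(resume_after_open_stop(3))   # one item further than intended
print(resume_after_equal_stop(3))  # exactly 3 items skipped
```

The second form is the "consume" idiom from the itertools recipes, which is probably why it behaves the way the question expects.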

Example:

from itertools import islice
import numpy as np

# dataset is assumed to be an existing DataFrame with a default integer index
dataset['C'] = np.arange(len(dataset))  # just to validate the iterator does not break
rowiter = dataset.iterrows()
for a, b in rowiter:
    print("idx", a, "row number", b.C)
    if a % 5 == 0:
        next(islice(rowiter, 4, 4), None)  # skipping the next four rows
    if a > 10:
        break

The result is:

idx 0 row number 0.0
idx 5 row number 5.0
idx 10 row number 10.0
idx 15 row number 15.0

This is the expected output.
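The question also suspects that the index labels of a group are still those of the whole dataframe. If that is the case, case_end - index counts labels rather than rows, and the two differ whenever the group's labels have gaps. The following sketch shows one possible way to count rows by position instead. It uses a plain list of (label, row) pairs as a stand-in for session_df.iterrows(); the data, the helper rows_to_skip, and the choice of case_end are all assumptions made for illustration. On a real sub-dataframe with unique labels, session_df.index.get_loc(label) could play the role of labels.index(label).

```python
from itertools import islice

# Hypothetical stand-in for the (label, row) pairs of one session.
# The labels come from the full dataframe, so they are not contiguous.
session_rows = [(3, "a"), (7, "b"), (8, "c"), (20, "d"), (21, "e"), (40, "f")]
labels = [label for label, _ in session_rows]


def rows_to_skip(current_label, end_label):
    """Rows to consume so iteration resumes right after end_label, counted by position."""
    return labels.index(end_label) - labels.index(current_label)


visited = []
rows = iter(session_rows)
for label, row in rows:
    visited.append(label)
    if label == 3:
        case_end = 20  # assumed: row y was found at label 20
        skip = rows_to_skip(label, case_end)  # 3 rows, whereas 20 - 3 = 17 labels
        next(islice(rows, skip, skip), None)  # consume exactly `skip` rows

print(visited)
```

The original labels stay available for any assignments, so reset_index() is not needed in this variant.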

