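python - How can I (efficiently) find the minimum sequence length needed to get enough valid values?
Problem description
I have a time-series DataFrame with several columns, each of which contains NaNs independently of the others.
I also have a given length, "LEN", which is the minimum number of valid elements that every sequence must contain in each column. (A "sequence" here means the values collected over the preceding index entries.)
Iterating is extremely time-inefficient, but it would look something like this: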
LEN = 100

def get_seq_start_where_every_col_has_enough_valids(df, seq_end_ix, LEN):
    # determine the index where every column contains at least "LEN" valid elements
    # (returns None if some column has fewer than LEN valid elements up to seq_end_ix)
    first_SEQ_LEN_Sample_start_ix = seq_end_ix
    for col in df.columns:
        col_df = df[col].dropna()
        valid_ix = col_df[col_df.index <= seq_end_ix].index
        if len(valid_ix) < LEN:
            return None
        temp = valid_ix[-LEN]
        if temp < first_SEQ_LEN_Sample_start_ix:
            first_SEQ_LEN_Sample_start_ix = temp
    return first_SEQ_LEN_Sample_start_ix

# df is the time-series DataFrame described above
maximum_sequence_len = 0
for seq_end_ix in df.index:  # for every index
    seq_start_ix = get_seq_start_where_every_col_has_enough_valids(
        df, seq_end_ix, LEN)
    if seq_start_ix is None:  # not enough valid elements yet
        continue
    necessary_len = len(df.loc[seq_start_ix:seq_end_ix])
    if maximum_sequence_len < necessary_len:
        maximum_sequence_len = necessary_len
An example:
LEN = 6 # in this example we have to have at least 6 valid elements in the frame of rows before
print(df)
>>>>
         A    B    C    D    E       F
index
0        1    1    1    1    1       1
1        1    1    1    1    1       1
2        1    1    1    1    1 |     1
3      NaN    1    1  NaN    1 |     1
4      NaN    1    1  NaN    1 |     1
5        1    1    1    1    1 |     1
6        1    1    1    1  NaN |     1
7      NaN    1    1  NaN    1 |     1
8      NaN    1    1    1    1 |     1
9        1    1    1    1  NaN |     1
10       1    1    1    1  NaN |     1
11       1    1    1  NaN  NaN |     1
12       1    1    1    1  NaN |     1
13       1    1    1    1  NaN |     1
14       1  NaN    1    1  NaN |*    1
16       1    1    1    1    1     NaN
17     NaN    1    1    1    1       1
18     NaN    1    1    1    1     NaN
19       1    1    1    1    1       1
# ==> Result: 13
# * Here, the longest sequence needed to get at least 6 valid values in EVERY column has a length of 13 (rows 2 to 14). Note that if the other columns contained more NaNs in the marked rows, it would probably have taken more than 13.
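The problem is that I want to create sequence samples, but I don't know how long they need to be so that each sample has at least "LEN" valid elements in every column.
Solution
Essentially, you need to maintain a vector of counters, one counter per column.
Once every counter has reached at least 6, the counter vector should signal "window ready". When the window (start_index, end_index) is ready, you can emit all rows in the window, then reset the window's start_index and end_index so that a new window begins, and reset all counters to zero.
Repeat until the end of the data.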
Algorithm get_windows(data[][])
    counters: array of integers of length = data.cols, values initialized to 0
Begin
    window_start_index = 0
    window_end_index = 0
    for each row in data
        for each col in row
            if (value(col) != NaN)
                counters[index(col)]++
            end if
        next // col
        // check if row causes window to continue
        continue_flag = false
        for each counter in counters
            if (counter < 6) // this column still lacks valid values
                continue_flag = true
                exit for loop
            end if
        next // counter
        if (continue_flag)
            window_end_index++
        else
            // we have a window (window_start_index, window_end_index)
            // both inclusive
            // do something with the window
            // reset counters
            for each counter in counters
                counter = 0
            next
            // start the next window at the following row
            window_end_index++
            window_start_index = window_end_index
        end if
    next // row
End Algorithm
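A minimal Python sketch of the pseudocode above, assuming pandas and NumPy are available. The name `get_windows` follows the pseudocode; the `min_valid` parameter, the returned list of (start label, end label) pairs, and the choice to start each new window at the row after the previous one are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def get_windows(df, min_valid):
    """Split df into consecutive, non-overlapping windows.

    A window is closed as soon as every column has collected at least
    `min_valid` non-NaN values since the window started.
    Returns a list of (start_label, end_label) pairs, both inclusive.
    """
    valid = df.notna().to_numpy()                # boolean mask, rows x columns
    counters = np.zeros(valid.shape[1], dtype=int)
    windows = []
    start_pos = 0
    for pos, row in enumerate(valid):
        counters += row                          # count valid values per column
        if (counters >= min_valid).all():        # "window ready"
            windows.append((df.index[start_pos], df.index[pos]))
            counters[:] = 0                      # reset all counters
            start_pos = pos + 1                  # next window starts at the next row
    return windows
```

Trailing rows that never fill a complete window are simply not reported in this sketch.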
Is this single-pass algorithm what you need?
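Note that the algorithm above resets after every completed window, so it produces non-overlapping windows. If what is wanted is instead the quantity from the question (for every end row, how far back one has to look so that each column has enough valid values, and then the maximum of those lengths), one possible sketch uses per-column cumulative counts. The function name `max_lookback_len` and its `min_valid` parameter are hypothetical, and lengths are counted in rows, as `len(df.loc[start:end])` does in the question.

```python
import numpy as np
import pandas as pd


def max_lookback_len(df, min_valid):
    """Longest look-back (in rows) needed so that, at some end row, every
    column has at least `min_valid` non-NaN values inside the window.
    End rows where some column does not yet have enough values are skipped.
    """
    mask = df.notna().to_numpy()
    n_rows, n_cols = mask.shape
    starts = np.full((n_rows, n_cols), -1)       # -1 marks "not enough yet"
    for j in range(n_cols):
        valid_pos = np.flatnonzero(mask[:, j])   # row positions of valid values
        counts = np.cumsum(mask[:, j])           # valid values seen up to each row
        enough = counts >= min_valid
        # position of the min_valid-th most recent valid value for each end row
        starts[enough, j] = valid_pos[counts[enough] - min_valid]
    usable = (starts >= 0).all(axis=1)           # every column has enough values
    if not usable.any():
        return 0
    lengths = np.arange(n_rows) - starts.min(axis=1) + 1
    return int(lengths[usable].max())
```

On the example frame from the question with `min_valid=6`, this should return 13. The per-column loop stays in Python, but each step inside it is vectorized, so the cost grows with the number of columns rather than with rows times columns.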