Accessing only the rows of a DataFrame where the contents of one column intersect another set

Problem description

I am often confused about how to access a dataframe. I have a dataframe like this (let's call it df):

id_one id_two
1 123 {234, 345, 546, ...}
2 -234 {123, 234, 645, ...}
... ... ...

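... with about 25 million rows.

I want to filter the dataframe so that it shows only the rows where the set in 'sets' intersects another set, call it reference_set = {542345, 423, 64564, 435, ...}. Later I want to quantify this intersection, which is why I need the length of the intersection.

This does not work: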

df.loc[
    len(
        df['sets'].intersection(reference_set) 
    ) > 0]

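It gives "AttributeError: 'Series' object has no attribute 'intersection'".

Shouldn't it produce a boolean list to select from? Am I not following this correctly?

Thanks for your advice!

Tags: python, pandas, dataframe, set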

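Solution

You can use pandas apply to compute the intersection between the reference set and the set on each row of the dataframe. Then apply len to the newly created column (Intersect) to quantify each intersection. This lets you filter the dataframe and show only the rows where the set in sets intersects the other set (df['Len'] > 0).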
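As for why the attempt in the question fails: a pandas Series does not expose set methods, and len() called on a Series returns a single integer (the number of rows) rather than one value per row. The following is a minimal sketch of both points; the three-row frame and the values in reference_set are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({"sets": [{1, 2, 3}, {4, 5}, {6}]})
reference_set = {2, 5, 9}

# A Series has no set methods, so df["sets"].intersection(...) cannot work.
has_method = hasattr(df["sets"], "intersection")

# len() on a Series is the row count: one int, not a per-row result.
n_rows = len(df["sets"])

# Element-wise work needs apply (or map): one value per row.
lengths = df["sets"].apply(lambda s: len(s & reference_set))
mask = lengths > 0

print(has_method, n_rows, lengths.tolist(), mask.tolist())
```

The boolean Series mask is what df.loc[...] expects as a row selector.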

df used as input

   id_one  id_two                                                                                         sets
0    7575     527          {1, 4, 6, 7, 8, 13, 16, 20, 24, 31, 40, 47, 50, 52, 57, 61, 64, 69, 80, 88, 91, 96}
1     574    1555      {7, 18, 19, 22, 23, 24, 30, 39, 43, 47, 50, 58, 62, 64, 72, 76, 77, 83, 84, 86, 87, 96}
2    7831    8823  {5, 14, 15, 20, 23, 28, 30, 32, 35, 36, 40, 41, 44, 52, 54, 59, 60, 62, 63, 84, 87, 90, 96}
3     688    6860           {2, 9, 20, 24, 27, 28, 30, 38, 46, 57, 59, 60, 64, 65, 69, 71, 80, 84, 88, 91, 95}
4    8843     596    {6, 7, 8, 24, 25, 27, 30, 33, 47, 50, 54, 56, 57, 61, 64, 66, 69, 74, 78, 81, 85, 88, 99}
5    1269    7546             {11, 22, 24, 25, 33, 35, 45, 48, 49, 54, 57, 59, 61, 68, 70, 75, 86, 87, 94, 95}
6    1362    4860           {2, 5, 14, 19, 23, 32, 37, 38, 47, 48, 58, 62, 65, 68, 70, 72, 73, 77, 82, 88, 91}
7    7994    7192      {2, 3, 4, 7, 9, 11, 12, 13, 15, 17, 20, 24, 25, 29, 40, 50, 57, 64, 71, 78, 89, 95, 99}
8     748    6271  {1, 12, 19, 26, 30, 34, 45, 48, 52, 60, 67, 72, 73, 74, 76, 80, 82, 84, 89, 94, 96, 98, 99}
9     553    4068  {9, 12, 15, 20, 35, 39, 40, 41, 44, 45, 50, 57, 65, 67, 68, 69, 72, 73, 79, 87, 88, 97, 98}
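
REFERENCE_SET = {10, 18, 23, 55, 90, 92}

# Intersect each row's set with the reference set (one Python call per row).
df['Intersect'] = df['sets'].apply(lambda row: REFERENCE_SET.intersection(row))
# Size of each intersection; 0 means no element in common.
df['Len'] = df['Intersect'].apply(len)

# Keep only the rows whose set shares at least one element with REFERENCE_SET.
df_filtered = df[df['Len'] > 0]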

Output of df_filtered

   id_one  id_two                                                                                         sets Intersect  Len
1     574    1555      {7, 18, 19, 22, 23, 24, 30, 39, 43, 47, 50, 58, 62, 64, 72, 76, 77, 83, 84, 86, 87, 96}  {18, 23}    2
2    7831    8823  {5, 14, 15, 20, 23, 28, 30, 32, 35, 36, 40, 41, 44, 52, 54, 59, 60, 62, 63, 84, 87, 90, 96}  {90, 23}    2
6    1362    4860           {2, 5, 14, 19, 23, 32, 37, 38, 47, 48, 58, 62, 65, 68, 70, 72, 73, 77, 82, 88, 91}      {23}    1
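
If only the filter matters and the intersection sets themselves are not needed for every row, the mask can be built without the extra columns. The sketch below is an alternative, not part of the answer above: it uses set.isdisjoint, which does not build an intersection set, and computes the overlap length only for the rows that pass the filter. The three-row frame uses shortened, made-up sets. It still makes one Python call per row, and its speed on 25 million rows has not been measured.

```python
import pandas as pd

REFERENCE_SET = {10, 18, 23, 55, 90, 92}

# Toy frame for illustration; the sets are shortened, made-up examples.
df = pd.DataFrame({
    "id_one": [7575, 574, 7831],
    "sets": [{1, 4, 6}, {7, 18, 23}, {5, 90, 96}],
})

# isdisjoint is True when the two sets share no element, so negate it.
mask = df["sets"].apply(lambda s: not REFERENCE_SET.isdisjoint(s))

# Keep matching rows, then measure the overlap only for those rows.
df_filtered = df[mask].copy()
df_filtered["Len"] = df_filtered["sets"].apply(lambda s: len(REFERENCE_SET & s))

print(df_filtered)
```

The .copy() call avoids assigning a new column to a view of df.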
