Pattern matching in a column in Python

Problem description

I have two DataFrames, df and df1. I want to search df for patterns based on the values given in df1. The DataFrames are as follows:

    import pandas as pd

    data = {
        "id": ["I983", "I873", "I526", "I721", "I536", "I327", "I626", "I213", "I625", "I524"],
        "coltext": [
            "I could take my comment back, I would do so in a second. I have addressed my teammates and coaches and while many understand my actions were totall",
            "We’re just trying to see if he can get on the field as a football player, and then we’ll make decision",
            "TextNow offers low-cost, international calling to over 230 countries. Stay connected longer with rates starting at less than",
            "Wi-Fi can provide you with added coverage in places where cell networks don't always work - like basements and apartments. No roaming fees for Wi-Fi connection",
            "Send messages and make calls on your compute",
            "even have a free, Wi-Fi only version of TextNow, available for download on you",
            "the rest of the players accepted apologies this spring and are welcoming him back",
            "was really looking at him and watching how much this really means to him and how much he really missed us",
            "I’ll deal with the problem and I’ll remedy the problem",
            "The first step was for him to be able to complete what we call our bottom line program which has been completed",
        ],
    }
    df = pd.DataFrame(data=data)

    data1 = {
        "col1": ["addressed teammates coaches", "football player decision", "watching really missed", "bottom line program", "meassges make calls"],
        "col2": ["international calling over", "download on you", "rest players accepted", "deal problem remedy", "understand actions totall"],
        "col3": ["first step him", "Wi-Fi only version", "cell network works", "accepted apologies", "stay connected longer"],
    }
    df1 = pd.DataFrame(data=data1)

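For example, the first element of df1['col1'], "addressed teammates coaches", is found in the first element of df['coltext']. In the same way, I want to search df['coltext'] for every element of every column in df1. If a pattern is found, a third column should be created in df that names the matching df1 column(s).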

Expected output:

id    coltext                              patternMatch
I983  I could take my comment back,        col1, col2
I873  We’re just trying to see if he can   col1
I526  TextNow offers low-cost,             col3, col2
I721  Wi-Fi can provide you with           col3
I536  Send messages and make calls         col1

Tags: python, regex, string, dataframe, search

Solution


There may be other, more efficient approaches, but one way is as follows:

    # Invert data1 so that each phrase maps to the name of the column it came from
    my_dict = {item: k for k, v in data1.items() for item in v}

    # For each text in 'coltext', collect the column names of every phrase whose
    # words all occur in that text (case-sensitive substring test).
    # The result is a set of column names, empty when nothing matches.
    df['patternMatch'] = df['coltext'].apply(
        lambda row: {v for k, v in my_dict.items()
                     if all(word in row for word in k.split())})
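The approach above returns a set per row and compares words case-sensitively, so "stay connected longer" does not match text that contains "Stay connected longer". A minimal sketch of one possible variant follows, assuming a case-insensitive comparison and a comma-separated string are wanted, as in the expected output. The helper name `matching_columns` and the `patterns` dictionary are illustrative and not part of the original answer, and the texts are shortened excerpts of the question's data.

```python
import pandas as pd

# Shortened excerpts of three rows from the question's data (illustrative only)
df = pd.DataFrame({
    "id": ["I983", "I526", "I536"],
    "coltext": [
        "I have addressed my teammates and coaches and while many understand my actions were totall",
        "TextNow offers low-cost, international calling to over 230 countries. Stay connected longer with rates",
        "Send messages and make calls on your compute",
    ],
})

# A subset of the phrases from data1, keyed by column name
patterns = {
    "col1": ["addressed teammates coaches", "meassges make calls"],
    "col2": ["international calling over", "understand actions totall"],
    "col3": ["stay connected longer"],
}


def matching_columns(text, patterns):
    """Return a comma-separated list of the columns that have at least one
    phrase whose words all occur in `text` (case-insensitive substring test)."""
    lowered = text.lower()
    hits = [
        col
        for col, phrases in patterns.items()
        if any(all(word in lowered for word in phrase.lower().split())
               for phrase in phrases)
    ]
    return ", ".join(hits)


df["patternMatch"] = df["coltext"].apply(lambda text: matching_columns(text, patterns))
print(df[["id", "patternMatch"]])
```

Column names appear in the order of the dictionary keys, and rows without a match get an empty string. The third row stays empty here because the phrase is spelled "meassges" in the question's data, so it does not occur in "Send messages and make calls". Like the original, this is a substring test, so "works" also matches inside "networks"; matching whole words only would need a tokenizing step instead.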
