Match strings in one column against another column even when they occur in different patterns

Problem description

A lookup string can appear in a text in several forms, for example with spaces inside it. How can we match all of these forms?

import pandas as pd
import re

lookup = ['100050', '123456', '100045']
lookup_df = pd.DataFrame(lookup, columns = ['Lookup'])

   Lookup
0  100050
1  123456
2  100045

text = ['abc 100 050', 'abc 123456','def 100045 ghij','abfgdgh100050','adcb 100 0 50','adc b100050']
text_df = pd.DataFrame(text, columns = ['Text'])


              Text
0      abc 100 050
1       abc 123456
2  def 100045 ghij
3    abfgdgh100050
4    adcb 100 0 50
5      adc b100050

pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in lookup_df['Lookup'])
text_df['Match'] = text_df['Text'].str.findall(pat).apply(lambda x : x[0] if (len(x)>0) else '' )

Current Output:
Out: 
              Text      Match
0      abc 100 050      
1       abc 123456      123456
2  def 100045 ghij      100045
3    abfgdgh100050       
4    adcb 100 0 50
5    adc b100050

Note that 100050 appears with a space inside it in Text[0], and it should still be matched there. In Text[3] it is embedded in a longer token, so it should not be matched, because it is not a whole word. If "abfgdgh100050" itself were present in the Lookup column, it should be matched; otherwise it should not.

Expected Output:
Out: 
              Text      Match
0      abc 100 050      100050
1       abc 123456      123456
2  def 100045 ghij      100045
3    abfgdgh100050      NA/BLANK
4    adcb 100 0 50      100050
5    adc b100050        NA/BLANK

Tags: python, regex, pandas, dataframe

Solution


Make a few changes to your code:

  1. Change the pattern to remove the word boundaries (\b)
  2. Strip all spaces from the text before pattern matching

pat = '|'.join(r"{}".format(re.escape(x)) for x in lookup_df['Lookup'])

text_df['matching_text'] = text_df['Text'].apply(lambda x: x.replace(" ", ""))
text_df['Match'] = text_df['matching_text'].str.findall(pat).apply(lambda x : x[0] if (len(x)>0) else '' )

Output:

              Text  matching_text   Match
0      abc 100 050      abc100050  100050
1       abc 123456      abc123456  123456
2  def 100045 ghij  def100045ghij  100045
3    abfgdgh100050  abfgdgh100050  100050

Edit

After the OP revised the question, no fancy regular expression is needed.

Here is the working code:

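import pandas as pd
import numpy as np

lookup = {'100050', '123456', '100045'}

# lookup - (lookup - other) is simply the set intersection lookup & other
print(lookup & {'asd', 'sdkj', '100050'})
# Recent pandas versions raise TypeError for an unordered set, so pass a sorted list
lookup_df = pd.DataFrame(sorted(lookup), columns=['Lookup'])

text = ['abc 100 050', 'abc 123456','def 100045 ghij','abfgdgh100050 89 289s']
text_df = pd.DataFrame(text, columns = ['Text'])

# Intersect the whitespace-separated tokens of each text with the lookup set;
# keep one matching value, or NaN when there is none (np.NaN was removed in NumPy 2.0)
text_df['Match'] = text_df['Text'].apply(lambda x: sorted(lookup & set(x.split()))).apply(lambda x: x[0] if len(x) > 0 else np.nan)

print(text_df)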

Output:

  
                    Text   Match
0            abc 100 050     NaN
1             abc 123456  123456
2        def 100045 ghij  100045
3  abfgdgh100050 89 289s     NaN
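
Neither version above produces the expected output from the question: stripping all spaces also matches "abfgdgh100050", while splitting on whitespace misses "100 050". The following is a minimal sketch of one possible way to satisfy both requirements, not part of the original answer. It assumes that whitespace may appear between any two characters of a lookup value, and that a valid match must be surrounded by whitespace or the string edges. The helper names build_pattern and find_lookup are hypothetical.

```python
import re

def build_pattern(lookup):
    """Compile one regex matching any lookup value as a whole token,
    tolerating whitespace between its characters (an assumption)."""
    alternatives = []
    for value in lookup:
        # Allow optional whitespace between consecutive characters: 1\s*0\s*0...
        alternatives.append(r"\s*".join(re.escape(ch) for ch in value))
    # (?<!\S) and (?!\S): no non-space character directly before or after the match
    return re.compile(r"(?<!\S)(?:{})(?!\S)".format("|".join(alternatives)))

def find_lookup(text, pattern):
    """Return the first matched lookup value with inner whitespace removed, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return re.sub(r"\s+", "", match.group(0))

lookup = ['100050', '123456', '100045']
text = ['abc 100 050', 'abc 123456', 'def 100045 ghij',
        'abfgdgh100050', 'adcb 100 0 50', 'adc b100050']

pattern = build_pattern(lookup)
results = [find_lookup(t, pattern) for t in text]
print(results)
```

With the DataFrame from the question, the same function could be applied as text_df['Text'].apply(lambda t: find_lookup(t, pattern)). One trade-off of this sketch: because whitespace is tolerated anywhere inside a value, two adjacent tokens that happen to concatenate into a lookup value (for example "100 050" meant as two separate numbers) are also reported as a match.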
