python - Match strings in a particular column to another column even though it occurs in different patterns
Problem description
A string can appear in a text in several patterns, for example with spaces inserted between its characters. How can we match all of them?
import pandas as pd
import re
lookup = ['100050', '123456', '100045']
lookup_df = pd.DataFrame(lookup, columns = ['Lookup'])
Lookup
0 100050
1 123456
2 100045
text = ['abc 100 050', 'abc 123456','def 100045 ghij','abfgdgh100050','adcb 100 0 50','adc b100050']
text_df = pd.DataFrame(text, columns = ['Text'])
Text
0 abc 100 050
1 abc 123456
2 def 100045 ghij
3 abfgdgh100050
4 adcb 100 0 50
5 adc b100050
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in lookup_df['Lookup'])
text_df['Match'] = text_df['Text'].str.findall(pat).apply(lambda x : x[0] if (len(x)>0) else '' )
Current Output:
Out:
Text Match
0 abc 100 050
1 abc 123456 123456
2 def 100045 ghij 100045
3 abfgdgh100050
4 adcb 100 0 50
5 adc b100050
Note that 100050 appears in Text[0] with a space in the middle; it should still be matched.
In Text[3] it is fused into a longer string,
which should not be matched because it is not a whole word. If "abfgdgh100050" itself were present in the Lookup column, it should be matched; otherwise it should not.
Expected Output:
Out:
Text Match
0 abc 100 050 100050
1 abc 123456 123456
2 def 100045 ghij 100045
3 abfgdgh100050 NA/BLANK
4 adcb 100 0 50 100050
5 adc b100050 NA/BLANK
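Solution
A few modifications to your code:
- Change the pattern to remove the word boundaries (\b)
- Strip all spaces from the text before pattern matching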
pat = '|'.join(r"{}".format(re.escape(x)) for x in lookup_df['Lookup'])
text_df['matching_text'] = text_df['Text'].apply(lambda x: x.replace(" ", ""))
text_df['Match'] = text_df['matching_text'].str.findall(pat).apply(lambda x : x[0] if (len(x)>0) else '' )
Output:
Text matching_text Match
0 abc 100 050 abc100050 100050
1 abc 123456 abc123456 123456
2 def 100045 ghij def100045ghij 100045
3 abfgdgh100050 abfgdgh100050 100050
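Removing every space before matching also drops the word boundaries, which is why row 3 (abfgdgh100050) is reported as a match in the output above even though the question wants it left blank. A minimal sketch of one alternative follows. It is an assumption, not part of the original answer: keep the text as it is, allow optional whitespace between the characters of each lookup code, and require that the match is not attached to other word characters. The helper names spaced and first_match are hypothetical.

```python
import re

lookup = ['100050', '123456', '100045']
text = ['abc 100 050', 'abc 123456', 'def 100045 ghij',
        'abfgdgh100050', 'adcb 100 0 50', 'adc b100050']

def spaced(code):
    # Hypothetical helper: allow optional whitespace between adjacent characters.
    return r"\s*".join(re.escape(ch) for ch in code)

# (?<!\w) and (?!\w) reject matches that are glued to other word characters.
pat = re.compile(
    "|".join(r"(?<!\w)(?:{})(?!\w)".format(spaced(c)) for c in lookup)
)

def first_match(s):
    # Hypothetical helper: return the first matched code with its
    # internal whitespace removed, or None when nothing matches.
    m = pat.search(s)
    return re.sub(r"\s+", "", m.group(0)) if m else None

for t in text:
    print(repr(t), '->', first_match(t))
```

With the DataFrame from the question, this could be applied as text_df['Match'] = text_df['Text'].apply(first_match).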
Edit
After the OP modified the question, no fancy regular expression is needed.
Here is the working code:
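import pandas as pd
import numpy as np
lookup = {'100050', '123456', '100045'}
# Demo: lookup - (lookup - s) is the intersection of lookup and s
print(lookup - (lookup - {'asd', 'sdkj', '100050'}))
# pandas does not accept an unordered set as DataFrame data, so pass a sorted list
lookup_df = pd.DataFrame(sorted(lookup), columns=['Lookup'])
text = ['abc 100 050', 'abc 123456','def 100045 ghij','abfgdgh100050 89 289s']
text_df = pd.DataFrame(text, columns = ['Text'])
# Split each text on whitespace and keep the first token found in lookup (np.nan if none)
text_df['Match'] = text_df['Text'].apply(lambda x: list(lookup - (lookup - set(x.split())))).apply(lambda x: x[0] if len(x) > 0 else np.nan)
print(text_df)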
Output:
Text Match
0 abc 100 050 NaN
1 abc 123456 123456
2 def 100045 ghij 100045
3 abfgdgh100050 89 289s NaN
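The set-based version only sees whitespace-separated tokens, so 'abc 100 050' yields NaN above although the question expects 100050. One possible extension, sketched here as an assumption rather than as part of the original answer, is to glue consecutive all-digit tokens together before the lookup. The helper names merged_tokens and token_match are hypothetical, and the merging is greedy: two distinct codes written next to each other would be fused into one token and missed.

```python
from itertools import groupby

lookup = {'100050', '123456', '100045'}

def merged_tokens(s):
    # Hypothetical helper: split on whitespace, then join each run of
    # consecutive all-digit tokens into a single token.
    tokens = []
    for is_digit, group in groupby(s.split(), key=str.isdigit):
        group = list(group)
        if is_digit:
            tokens.append("".join(group))
        else:
            tokens.extend(group)
    return tokens

def token_match(s):
    # Hypothetical helper: first merged token present in lookup, else None.
    hits = [t for t in merged_tokens(s) if t in lookup]
    return hits[0] if hits else None

for t in ['abc 100 050', 'abc 123456', 'def 100045 ghij',
          'abfgdgh100050 89 289s', 'adcb 100 0 50', 'adc b100050']:
    print(repr(t), '->', token_match(t))
```

Because this works on whole tokens, a code embedded in a longer word such as abfgdgh100050 is never matched unless that exact token is itself in the lookup set.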