Comparing pandas column values against words in a text file using regular expressions

Problem description

I have a DataFrame df like this:

product      name     description0 description1 description2 description3
  A          plane         flies        air      passengers     wings
  B          car           rolls        road        NaN          NaN
  C          boat          floats       sea      passengers      NaN

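What I want to do is check, for each row, whether any value in its description columns appears in a txt file.

Suppose my test.txt file is:

He flies to London then crosses the sea to reach New-York.

The result would look like this: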

product      name     description0 description1 description2 description3 Match
  A          plane         flies        air      passengers     wings     Match
  B          car           rolls        road        NaN          NaN      No match
  C          boat          floats       sea      passengers      NaN      Match

I know the main structure, but I'm a bit lost on the rest:

with open ("test.txt", 'r') as searchfile:
    for line in searchfile:
        print line
        if re.search() in line:
            print(match)

Tags: python, pandas, match, re

Solution


You can search the input text with str.find(), since you are searching for string literals. re.search() seems like overkill here.
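To make the comparison concrete, here is a minimal sketch using the sentence from the question. The variable names are illustrative only and are not part of the original answer.

```python
import re

text = "He flies to London then crosses the sea to reach New-York."

# str.find() returns the start index of the substring, or -1 if it is absent.
found_sea = text.find("sea") >= 0
found_road = text.find("road") >= 0

# The `in` operator expresses the same yes/no test without the index comparison.
same_as_in = ("sea" in text) == found_sea

# re.search() also works, but for a literal word it needs re.escape() and adds
# nothing unless pattern features (such as word boundaries) are required.
found_with_re = re.search(re.escape("sea"), text) is not None
```

All three approaches agree on literal substrings, which is why the plain string methods are enough for the question as asked.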

A quick and dirty solution, using .apply(axis=1):

Data

# df as given
input_text = "He flies to London then crosses the sea to reach New-York."
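The string literal above stands in for the content of test.txt. If the text really lives in a file, it could be loaded along these lines. This is only a sketch: the temporary directory and the UTF-8 encoding are assumptions made so the example runs on its own.

```python
import tempfile
from pathlib import Path

# Stand-in for the question's test.txt, written to a temporary directory
# (assumption: the real file is UTF-8 encoded).
path = Path(tempfile.mkdtemp()) / "test.txt"
path.write_text(
    "He flies to London then crosses the sea to reach New-York.\n",
    encoding="utf-8",
)

# Read the whole file into one string; for a small file this is simpler than
# looping over lines as in the question.
input_text = path.read_text(encoding="utf-8")
```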

Code

input_text_lower = input_text.lower()

def search(row):
    for el in row:  # description 0,1,2,3
        # skip non-string values (such as NaN); return True on the first hit
        if isinstance(el, str) and (input_text_lower.find(el.lower()) >= 0):
            return True
    return False

df["Match"] = df[[f"description{i}" for i in range(4)]].apply(search, axis=1)

Result

print(df)
  product   name description0 description1 description2 description3  Match
0       A  plane        flies          air   passengers        wings   True
1       B    car        rolls         road          NaN          NaN  False
2       C   boat       floats          sea   passengers          NaN   True
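The question asked for the labels "Match" and "No match" rather than booleans. One possible way to get them, sketched here with numpy.where, is shown below. The DataFrame is rebuilt from the question's table so the block is self-contained, and the helper is a compact variant of the search function above, not the answer's exact code.

```python
import numpy as np
import pandas as pd

# DataFrame rebuilt from the table in the question
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "name": ["plane", "car", "boat"],
    "description0": ["flies", "rolls", "floats"],
    "description1": ["air", "road", "sea"],
    "description2": ["passengers", np.nan, "passengers"],
    "description3": ["wings", np.nan, np.nan],
})
input_text_lower = "He flies to London then crosses the sea to reach New-York.".lower()

def search(row):
    # True as soon as one string cell occurs in the text; NaN cells are skipped
    return any(isinstance(el, str) and el.lower() in input_text_lower for el in row)

description_cols = [f"description{i}" for i in range(4)]
found = df[description_cols].apply(search, axis=1)

# Map the boolean result onto the labels used in the question
df["Match"] = np.where(found, "Match", "No match")
```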

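Notes

The original question does not consider word boundaries, punctuation, or hyphens. In real-world cases, additional preprocessing steps may be needed, but that is beyond the scope of the original question.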
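If whole-word matching does matter, this is where re.search() becomes useful. The sketch below is one possible approach, not part of the original answer: the helper name matches_whole_word and the "seaside airport" sentence are made up for illustration. It wraps the escaped words in \b anchors so that "sea" no longer matches inside "seaside".

```python
import re
import numpy as np
import pandas as pd

# DataFrame rebuilt from the table in the question
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "name": ["plane", "car", "boat"],
    "description0": ["flies", "rolls", "floats"],
    "description1": ["air", "road", "sea"],
    "description2": ["passengers", np.nan, "passengers"],
    "description3": ["wings", np.nan, np.nan],
})
input_text = "He flies to London then crosses the sea to reach New-York."

def matches_whole_word(row, text):
    # Hypothetical helper: keep only string cells (NaN is a float) and escape them
    words = [re.escape(el) for el in row if isinstance(el, str)]
    if not words:
        return False  # an empty alternation would match at any word boundary
    pattern = r"\b(?:" + "|".join(words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

description_cols = [f"description{i}" for i in range(4)]
df["Match"] = df[description_cols].apply(matches_whole_word, axis=1, args=(input_text,))

# Made-up sentence showing the difference from plain substring search
tricky_text = "The seaside airport was busy."
substring_hit = "sea" in tricky_text.lower()
whole_word_hit = matches_whole_word(pd.Series(["sea", "air"]), tricky_text)
```

Note that \b only behaves as expected when the search words start and end with letters, digits, or underscores; words with leading or trailing punctuation would need different handling.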

