How to build an array in Python that searches a string for specific tags and returns the matches

Question

I currently have the following code in Python:

def get(x):
    up, up1, up2, up3, up4 = "" ,"" ,"","" , ""


    x = x.split(", ")
    for i in x:
        if "Up_" in i:
            # print(i)
            up = str(i) + ', '
        if "Up1_" in i:
            # print(i)
            up1 = str(i) + ', '
        if "Up2_" in i:
            # print(i)
            up2 = str(i) + ', '
        if "Up3_" in i:
            # print(i)
            up3 = str(i) + ', '
        if "Up4_" in i:
            # print(i)
            up4 = str(i) + ', '

    return (str(up) + str(up1) + str(up2) + str(up3) + str(up4))[:-2]

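This works for what I have right now, but the function will stop working as soon as any of the tags to be added fall in the range Up5_ through Up10_.

What I would like to do is put together a function that searches the "tags" column for any tag containing "Up_" or "Up*_" (in SQL terms, * would return anything that has a value between Up and _; I am not sure whether Python has an equivalent), puts whatever it finds into another array that contains only the Up_ and Up*_ tags, and then applies that to another column.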

+------------+-------+------------+-----------+--------------+
| product_id |  sku  | total_sold |   tags    | total_images |
+------------+-------+------------+-----------+--------------+
| geggre     | rgerg |        456 | Up1_, Up2 |            5 |
+------------+-------+------------+-----------+--------------+

I would like it to look like this:

+------------+-------+------------+-----------+--------------+-------+
| product_id |  sku  | total_sold |   tags    | total_images | Count |
+------------+-------+------------+-----------+--------------+-------+
| ggeggre    | rgerg |        456 | Up1_, Up2 |            5 |     2 |
+------------+-------+------------+-----------+--------------+-------+

Thanks to another user, I already have the tag count:

data["total_tags"] = data["tags"].apply(lambda x : len(x.split(',')))

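I just need to know how to build the array described above so that I can simplify my if statements and cover tags up to Up10_.

Here is the Python that uses get to overwrite the "tags" column so that it contains only the Up tags: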

data['tags'] = data['tags'].apply(get)

Full script for context:


# importing the pandas module with an alias of pd
import pandas as pd


# get takes a comma-separated tag string x and keeps only the Up_, Up1_ ... Up4_ tags
def get(x):
    up, up1, up2, up3, up4 = "" ,"" ,"","" , ""


    x = x.split(", ")
    for i in x:
        if "Up_" in i:
            # print(i)
            up = str(i) + ', '
        if "Up1_" in i:
            # print(i)
            up1 = str(i) + ', '
        if "Up2_" in i:
            # print(i)
            up2 = str(i) + ', '
        if "Up3_" in i:
            # print(i)
            up3 = str(i) + ', '
        if "Up4_" in i:
            # print(i)
            up4 = str(i) + ', '
    # joins the matched tags into one string and strips the trailing ', ' (the last 2 characters)
    return (str(up) + str(up1) + str(up2) + str(up3) + str(up4))[:-2]
# data contains the content of the data200.csv file using pandas read_csv function
data = pd.read_csv('data200.csv')

# overwrites the tags column with only the Up tags found by the get function
data['tags'] = data['tags'].apply(get)

# drops the rows whose tags column is empty after filtering
data = data[data['tags'] != ""]

#creates a new column called total_tags and returns a count of how many elements are between commas
data["total_tags"] = data["tags"].apply(lambda x : len(x.split(',')))

# prints first 5 lines of csv
print(data.head())
# exports everything to test.csv and removes the index column
data.to_csv("test.csv", index = False)

Tags: python, arrays, function, lambda

Answer


You can use a regular expression for this:

import re

def get(x):
    x = x.split(", ")
    out_str = ''
    for tag in x:
        # keep tags that start with "Up", optional digits, then "_"
        if re.search(r"^Up\d*_", tag):
            t = re.match(r"^Up\d*_", tag)
            t = t.group(0)
            out_str += t + ','
    # drop the trailing comma
    return out_str[:-1]

print(get("Up1_, AS3_, Up2_, Up_, AS_"))

Output:

Up1_,Up2_,Up_

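Is this what you are looking for? If you only want at most a single digit (0-9) in the tag, you can change the * in the regular expression to a ?: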

if re.search(r"^Up\d?_", tag):
     t = re.match(r"^Up\d?_", tag)
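
To make the difference between the two quantifiers concrete, here is a minimal sketch. The sample tag list is made up for illustration, and re.match is used because it already anchors at the start of each tag.

```python
import re

# Hypothetical sample tags, not taken from the asker's CSV.
tags = ["Up_", "Up1_", "Up10_", "AS3_"]

# \d* allows any number of digits between "Up" and "_".
any_digits = [t for t in tags if re.match(r"Up\d*_", t)]

# \d? allows at most one digit between "Up" and "_".
one_digit = [t for t in tags if re.match(r"Up\d?_", t)]

print(any_digits)  # ['Up_', 'Up1_', 'Up10_']
print(one_digit)   # ['Up_', 'Up1_']
```

Since the question asks for tags up to Up10_, the \d* form is probably the one that fits, because Up10_ has two digits.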

Edit:

After your edit I understand better what you mean. You can simply do the following:

data['tags'] = data['tags'].apply(lambda x : ",".join(re.findall(r"Up\d*_", x)))

Or:

data['tags'] = data['tags'].apply(lambda x : ",".join(re.findall(r"Up\d?_", x)))

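Which one to use depends on whether you want at most one digit between Up and _, or whether any number of digits is allowed. Note that the ^ was removed for the findall() call, because we are no longer searching only at the start of the string but for every occurrence throughout the string.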
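
The effect of dropping ^ can be checked on a single tags string. The following is only a sketch: the sample row is an assumption, and the count at the end is one possible way to fill the Count column from the question.

```python
import re

# Hypothetical contents of one cell in the "tags" column.
row = "AS3_, Up1_, Up10_, Up_"

# With ^ the pattern can only match at the very start of the string,
# so nothing is found here because the string starts with "AS3_".
anchored = re.findall(r"^Up\d*_", row)

# Without ^ every occurrence anywhere in the string is returned.
unanchored = re.findall(r"Up\d*_", row)

joined = ",".join(unanchored)  # value for the filtered tags column
count = len(unanchored)        # number of Up tags in this row

print(anchored)  # []
print(joined)    # Up1_,Up10_,Up_
print(count)     # 3
```

Counting the list returned by findall() rather than splitting the joined string also avoids one edge case: "".split(',') returns a list with one empty string, so a row with no Up tags would otherwise be counted as 1.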

Edit 2:

OK, summing up the comments and the other information gathered there, you probably want something like this:

data['tags'] = data['tags'].apply(lambda x : ",".join(re.findall(r"[Uu]p\d?_\S*(?=,)", x)))
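
It may help to see what this last pattern keeps on a sample string. In the sketch below the sample row is invented, and the second pattern is a hypothetical variant that is not part of the original answer. The lookahead (?=,) only accepts a tag that is directly followed by a comma, so a tag at the very end of the string appears to be skipped.

```python
import re

# Hypothetical tags cell where the Up tags carry a suffix after the underscore.
row = "Up1_red, up2_blue, AS_x, Up3_last"

# Pattern from Edit 2: the tag must be followed directly by a comma.
edit2 = re.findall(r"[Uu]p\d?_\S*(?=,)", row)

# Hypothetical variant (an assumption, not the answerer's code):
# consume characters up to the next comma or whitespace instead.
variant = re.findall(r"[Uu]p\d?_[^\s,]*", row)

print(edit2)    # ['Up1_red', 'up2_blue']
print(variant)  # ['Up1_red', 'up2_blue', 'Up3_last']
```

Whether the final tag should be kept depends on how the real data is laid out, so the variant is only worth considering if the last tag in a cell has no trailing comma.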
