首页 > 解决方案 > 如何使用 pandas .apply() 方法更改列值

问题描述

我在尝试从 CSV 文件中获取数据、组装密钥然后创建一个仅包含必要数据的新 CSV 文件时偶然发现了一个问题。

示例数据:

ID1 数据1 数据2 价格1 颜色键 ID2 数据3 价格2
12345/6 950/000 Pd950 996 G 4/20017/6 4/20017/6 950/000 1108
12345/6 333/000 Pd333 402 G 4/20017/6 4/20017/6 333/000 501
12345/6 500/000 Pd500 550 G 4/20017/6 4/20017/6 500/000 577
12345/6 950/000 PT950 690 PT 4/20017/6 4/20017/6 950/000 779
12345/6 600/000 Pt600 424 Pt 4/20017/6 4/20017/6 600/000 482

我的代码:

file_original = pd.read_csv(path_to_file, sep=";")

def getNum(itemPair):

    # Uses data to construct the output
    # Returns a string e.g. "4-12345-6_12345-6_XX-333"


def processItems(row):
    
    first_nr = row["ID1"].replace("/", "-")
    second_nr = row["ID2"].replace("/", "-")

    color = str(row["Color"]).lower()
    alloy = row["Data2"]

    Price1 = row["Price1"]
    Price2 = row["Price2"]

    alloys = []

    '''
    Performs a RegEx search on the alloy information. The pattern checks for
    two alloy data sets within one string.
    If result is none, there is only one data set and the next RegEx pattern is used.
    '''
    regEx = re.search(r"(\w{0,2}\d{3}).*?(\w{0,2}\d{3})", legierung)
    if regEx:
        if regEx.group(1):
            alloys.append(regEx.group(1))
        if regEx.group(2):
            alloys.append(regEx.group(2))
    else:
        regEx = re.search(r"(\d{3})", legierung)
        if regEx.group(1):
            alloys.append(regEx.group(1))


    alloys = sorted(alloys, key=sortOrderAlloy)

    # Strips away any letters from the very first alloy entry
    if len(alloys[0]) > 3:
        alloys[0] = alloys[0][2:]


    colors = []
    i = 0
    while i < len(color):
        if color[i] == "p" and (i < (len(color) - 1) and (color[i + 1] == "t" or color[i + 1] == "d")):
            # If the current letter is "p" and the next letter is "t" or "d" consider them one color
            result = color[i] + color[i + 1]
            i += 1

            if result not in colors:
                # If the color isn't yet part of the list, append it
                colors.append(result)

        else:
            if color[i] not in colors:
                # If the color isn't yet part of the list, append it
                colors += color[i]
        i += 1


    colors = sorted(colors, key=sortOrderColor)


    # Check if there are more colors than alloys which means there are multiple colors per alloy
    if len(colors) > len(alloys):
        # If there is no color starting with "p", the alloy can be applied to every color
        if ("pt" not in colors) and ("pd" not in colors):
            # Counter variable for while loop
            c = len(alloys)

            while c < len(colors):
                alloys.append(alloys[0])
                c = len(alloys)
        
        else:
            # The amount of missing alloys for while loop
            count_diff = len(colors) - len(alloys)
            
            while count_diff != 0:
                alloys.insert(0, alloys[0])
                count_diff -= 1


    materials = []

    # Count variable for while loop
    i = 0

    while i < len(colors):
        materials.append(getMaterial(colors[i]))
        i += 1

    # Do some stuff to get the necessary data while assigning the following variables:
    # first_nr, second_nr, materials, alloys, price1, price2

    row["idNum"] = getNum(first_nr, second_nr, materials, alloys, price1, price2)

    row["price"] = (price1 + price2)

file_original["idNum"] = ""
file_original["price"] = 0

file_original.apply(processItems, axis=1)

file_processed = file_original[["idNum", "price"]]

file_processed.to_csv(path_output, sep=";", index=False, encoding="utf-8-sig")

预期的结果将是一个只有两列“idNum”和“price”的 CSV 文件,并且这些列按预期进行了限制,但是这些行都以我用来创建这两个新列的空数据结束,所以是一个空字符串和数字 0。

通过谷歌搜索后,我发现使用 .apply() 不允许您直接更改数据,而是需要重新分配它,但是当我更改相应的行时,我得到一个错误。

# Old line
file_original.apply(processItems, axis=1)

# New line
file_original = file_original.apply(processItems, axis=1)

错误如下:

回溯(最近一次通话最后):
  文件“prepareImport.py”,第 278 行,在
    file_processed = file_original[["idNum", "price"]]
  文件“C:\Users\MY-USER\anaconda3\lib\site-packages\pandas\core\series.py”,第 910 行,在 __getitem__
    返回 self._get_with(key)
  文件“C:\Users\MY-USER\anaconda3\lib\site-packages\pandas\core\series.py”,第 958 行,在 _get_with
    返回 self.loc[key]
  文件“C:\Users\MY-USER\anaconda3\lib\site-packages\pandas\core\indexing.py”,第 1768 行,在 __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  文件“C:\Users\MY-USER\anaconda3\lib\site-packages\pandas\core\indexing.py”,第 1954 行,在 _getitem_axis
    返回self._getitem_iterable(键,轴=轴)
  _getitem_iterable 中的文件“C:\Users\MY-USER\anaconda3\lib\site-packages\pandas\core\indexing.py”,第 1595 行
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  _get_listlike_indexer 中的文件“C:\Users\MY-USER\anaconda3\lib\site-packages\pandas\core\indexing.py”,第 1552 行
    self._validate_read_indexer(
  _validate_read_indexer 中的文件“C:\Users\MY-USER\anaconda3\lib\site-packages\pandas\core\indexing.py”,第 1640 行
    raise KeyError(f"[{key}] 中没有 [{axis_name}]")
KeyError:“[Index(['idNum', 'price'], dtype='object')] 中没有一个在 [index] 中”
Line 278 => file_processed = file_original[["idNum", "price"]]

据我所知,数据框以某种方式丢失了我在重新分配之前添加的两列。
所以我决定将它打印到控制台而不是尝试提取两列,结果表明数据框显然已转换为系列,因此它只有一列填充了数据类型“对象”的“无”,尽管它保持全长10550。

我发现了一些其他可能的解决方案来解决我的原始问题,但它们通常解决了用户犯的错误。当然,我检查了我的代码是否有类似的错误,但我自己找不到,所以我希望有人能帮助我解决这个问题。

非常感谢您!如果我忘记提供任何重要信息,请告诉我,以便我添加。

最好的问候
Eisman333

编辑:预期输出:

idNum 价格
4-12345-6_12345-6_XX-333 250
4-12345-6_12345-6_XX-585_YY-950 2130
4-12345-6_12345-6_XX-333_ZZ-500 1560

标签: pythonpython-3.xpandas

解决方案


您的逻辑/程序不足以让我复制您的 processItems 函数的所有方面,但我建议您考虑分离各个列更新。您可以按照以下方式实现一些东西,其中使用原始数据框调用此方法:

def groom_data(idf):
    idf['ID1'] = idf.apply(lambda row: row.ID1.replace("/", '-'), axis= 1)
    idf['ID2'] = idf.apply(lambda row: row.ID2.replace("/", '-'), axis= 1)
    idf['Color'] = idf.apply(lambda row: row.Color.lower(), axis= 1)
    idf['Alloys'] = idf.apply(lambda row: process_alloys(row), axis= 1)
    idf['Materials'] = idf.apply(lambda row: process_items(row), axis= 1)
    idf['IdNum']= idf.apply(lambda row: getNum(row.ID1, row.ID2, row.Materials, row.Alloys, row.Price1, row.Price2))
    idf['Total_Price'] = idf.apply(lambda row: row.Price1 + row.Price2)
    return idf 

这需要将识别合金和材料的逻辑分成两个通道,我建议这样做,因为稍后生成 IdNum 列时需要同时访问材料和合金。一旦您拥有原始数据框中的所有数据,您就可以创建一个新的输出帧,其中仅包含最终结果所需的信息。


推荐阅读