Pandas merge of DataFrames results in x and y suffixes

Problem description

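I am creating my own dataset for a uni project. I use merge a lot and it has always worked perfectly. This time I get x and y suffixes that I cannot explain. I know pandas does this because -> rows in the two dataframes that match on the specified columns are extracted and joined together. If there is more than one match, all possible matches contribute one row each. But I really don't understand why it happens here. I think it has something to do with a warning I received earlier: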

A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]

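I am trying to merge the dataframes on the "CustomerID" column, where they clearly match. I really don't understand the error. First I want to drop duplicate rows, where the relevant columns are CustomerID and WebsiteID. Then I want to apply a function that randomly returns true or false as a string. Up to this point the resulting dataframe looks fine. The only warning I get is the one I described above. Finally I want to merge them, and that produces a dataframe that is larger than the original one. I really don't understand this. Here is my code: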

import numpy as np
import pandas as pd
from numpy.random import choice

df = pd.DataFrame()

df['AdID'] = np.random.randint(1,1000001, size=100000)

df['CustomerID'] = np.random.randint(1,1001, size=len(df))

df["Datetime"] = choice(pd.date_range('2015-01-01', '2020-12-31'), len(df))


def check_weekday(date):

    res = len(pd.bdate_range(date, date))

    if res == 0:
        result = "Weekend"
    else:
        result = "Working Day"

    return result


df["Weekend"] = df["Datetime"].apply(check_weekday)

def apply_age():

    age = choice([16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36],
    p=[.00009, .00159, .02908, .06829, .09102, .10043, .10609, .10072, .09223, .08018, .06836, .05552,
    .04549,.03864, .03009, .02439, .01939, .01586, .01280, .01069, .00905])
    return age

def apply_income_class():

    income_class = choice([np.random.randint(50,501),np.random.randint(502,1001), np.random.randint(1002,1501),np.random.randint(1502,2001)],
                   p=[.442, .387, .148, .023])
    return income_class

def apply_gender():

    gender = choice(['male', 'female'], p=[.537, .463])
    return gender

unique_customers = df[['CustomerID']].drop_duplicates(keep="first")


unique_customers['Age'] = [apply_age() for x in unique_customers.index]

unique_customers['Gender'] = [apply_gender() for x in unique_customers.index]

unique_customers['Monthly Income'] = [apply_income_class() for x in unique_customers.index]

unique_customers['Spending Score'] = [np.random.randint(1,101) for x in unique_customers.index]

df = df.merge(unique_customers, on=['CustomerID'], how='left')

df['WebsiteID'] = np.random.randint(1,31, len(df))

df['OfferID'] = np.random.randint(1,2001, len(df))

df['BrandID'] = np.random.randint(1,10, len(df))


unique_offers = df[['OfferID']].drop_duplicates(keep="first")
print(len(unique_offers))


unique_offers['CategoryID'] = [np.random.randint(1,501) for x in unique_offers.index]

unique_offers['NPS'] = [np.random.randint(1, 101) for x in unique_offers.index]

df = df.merge(unique_offers, on=['OfferID'], how='left')

def apply_website_user():

    purchase = np.random.choice(['True', 'False'])
    return purchase

unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
print(unique_website_user.head())
df = df.merge(unique_website_user[['CustomerID','PurchaseOnWebsite']], on='CustomerID', how='left')

#df['PurchaseOnWebsite']= df.groupby(['CustomerID', 'WebsiteID']).apply(apply_website_user)


print(df.head())

# Create the CSV file
#df.to_csv(r'/Users/alina/Desktop/trainingsdaten.csv', sep=',', index=False)

Dataframe df -> 1000000 rows x 13 columns

Dataframe unique_website_user

Resulting dataframe -> 2897120 rows x 26 columns

Tags: python, pandas, dataframe

Solution


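It would be better to paste the data rather than provide images, so this is only guidance, since I can't test it. You have several issues, and I don't think they are related.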

  1. The copy-or-slice warning. You may be able to get rid of it in one of two ways. One is to rewrite the line:

    unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
    

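    into the format the warning suggests. The other, possibly simpler, way is to call .copy() before it. You are dropping duplicates and then modifying the result, and pandas is just warning you that you are modifying a slice or view of the original. Try this: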

    unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
    
  2. If you only want to merge that one column and keep the number of columns down, try this:

    df = df.merge(unique_website_user[['CustomerID','PurchaseOnWebsite']], on='CustomerID', how='left')
    
  3. Another alternative is to use groupby() and apply() with your True/False function. Something like:

    df.groupby(['CustomerID']).apply(yourfunctionhere)
    

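    This gets rid of creating and merging the extra dataframe. If you post all of your code and the actual dataframe, we can be more specific.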
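To make points 2 and 3 concrete, here is a minimal sketch on made-up toy data. The values, the names `pairs`, `too_many`, `fixed` and `direct`, and the seeded `rng` are assumptions for illustration, not the asker's data. It suggests why the merge grows: the lookup table has one row per (CustomerID, WebsiteID) pair, so merging on CustomerID alone matches every row against all pairs of that customer, and any non-key columns present on both sides get _x/_y suffixes. Merging on both keys keeps the row count. The last part is one possible way to skip the merge altogether with groupby(...).ngroup().

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two customers, a few ad rows each.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "WebsiteID":  [10, 10, 20, 10, 30],
    "AdID":       [101, 102, 103, 104, 105],
})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# One row per (CustomerID, WebsiteID) pair; .copy() avoids the slice warning.
pairs = df.drop_duplicates(subset=["CustomerID", "WebsiteID"]).copy()
pairs["PurchaseOnWebsite"] = rng.choice(["True", "False"], size=len(pairs))

# Merging on CustomerID alone: every left row matches ALL pairs of that
# customer, so rows multiply, and shared non-key columns get _x/_y suffixes.
too_many = df.merge(pairs, on="CustomerID", how="left")
print(too_many.shape, list(too_many.columns))

# Merging on both keys, with only the needed columns: one match per row.
# validate="many_to_one" raises if the right side has duplicate keys.
fixed = df.merge(
    pairs[["CustomerID", "WebsiteID", "PurchaseOnWebsite"]],
    on=["CustomerID", "WebsiteID"],
    how="left",
    validate="many_to_one",
)
print(fixed.shape, list(fixed.columns))

# Merge-free alternative: number each (CustomerID, WebsiteID) group,
# draw one flag per group, then index the flags by group number.
direct = df.copy()
group_ids = direct.groupby(["CustomerID", "WebsiteID"]).ngroup()
flags = rng.choice(["True", "False"], size=group_ids.max() + 1)
direct["PurchaseOnWebsite"] = flags[group_ids.to_numpy()]
```

On the toy data the first merge returns 10 rows for a 5-row input, while the second returns 5. Passing `validate` is optional, but it turns a silent row explosion into an explicit error.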

Update: I see from your comment that you found your own answer. Also, this is much faster than calling your weekday function:

df["Weekend"] = df['Datetime'].apply(lambda x: 'Weekend' if (x.weekday() == 5 or x.weekday() == 6) else 'Working Day')
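The lambda above still runs Python code once per row. A vectorized variant, sketched below on a few hand-picked dates (the `dates` series is an assumption for illustration), uses the `.dt.dayofweek` accessor, where Monday is 0 and Sunday is 6. It is likely to be faster on large frames, though I have not timed it on the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 2020-12-25 was a Friday, so this covers Fri..Mon.
dates = pd.Series(pd.to_datetime(
    ["2020-12-25", "2020-12-26", "2020-12-27", "2020-12-28"]
))

# dayofweek: Monday=0 ... Saturday=5, Sunday=6
is_weekend = dates.dt.dayofweek >= 5

# Pick the label element-wise without a Python-level loop.
labels = np.where(is_weekend, "Weekend", "Working Day")
print(list(labels))
```

In the question's code the same idea would be applied to `df["Datetime"]` and the result assigned to `df["Weekend"]`.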
