I am working on a Python problem to optimize a script

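Problem description

  1. Turn the row values of the option_labels column into column headers.
  2. If a given user_id has a particular option_labels value, put the corresponding option_values value in the newly created column; otherwise put 0.

Sample data (data.csv):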

 user_id       country        option_values        option_labels

 abc456         Germany        256gb                  SSD
 abc123         Brazil         i5                    intel 
 xyz456         France         128gb                  SSD
 xyz123         Turkey         i7                    intel 
 abc123         Brazil         2gb                   nvidia
 abc456         Germany        32gb                   RAM
 xyz123         Turkey         4gb                   nvidia
 xyz456         France         16gb                   RAM

The expected output is:

 user_id       country        option_values     option_labels     intel         nvidia       SSD        RAM 

 abc456         Germany        256gb             SSD                0              0        256gb        0
 abc123         Brazil         i5                intel              i5             0          0          0
 xyz456         France         128gb             SSD                0              0        128gb        0
 xyz123         Turkey         i7                intel              i7             0          0          0
 abc123         Brazil         2gb               nvidia             0              2gb        0          0  
 abc456         Germany        32gb              RAM                0              0          0          32gb
 xyz123         Turkey         4gb               nvidia             0              4gb        0          0
 xyz456         France         16gb              RAM                0              0          0          16gb

I did this with the sample code below:

 import pandas as pd
 import numpy as np

 data_sample = pd.read_csv("data.csv")
 feature_list = data_sample["option_labels"].unique().tolist()
 user_list = data_sample["user_id"].unique().tolist()
 country_list = data_sample["country"].unique().tolist()
 opt_val_list = data_sample["option_values"].unique().tolist()

 def filterd_id(check_id):
     single_id_data= data_sample[data_sample['user_id'] == check_id]
     return single_id_data

 def finding_features(single_id_data):
     user_features = single_id_data["option_labels"].unique().tolist()
     return user_features

 def check_feature(feature_list, user_features, single_id_data):
     # For each known label, take this user's value if present, otherwise 0.
     feature_prs_not = []
     for i in feature_list:
         if i in user_features:
             matches = single_id_data[single_id_data["option_labels"] == i]
             result = matches["option_values"].iloc[0]
         else:
             result = 0
         feature_prs_not.append(result)
     return feature_prs_not

 user_id = []
 country = []
 df_feature = pd.DataFrame()  # accumulates one row per user

 for i in user_list:
     check_id = i
     user_id.append(i)
     single_id_data = filterd_id(check_id)
     c = single_id_data["country"].unique().tolist()
     country.append(c)
     user_features = finding_features(single_id_data)
     feature_prst_not = check_feature(feature_list, user_features, single_id_data)
     df = pd.DataFrame([feature_prst_not], columns=feature_list)
     # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
     df_feature = pd.concat([df_feature, df])
 df_user_id = pd.DataFrame(user_id, columns=['all_user_id'])
 df_country = pd.DataFrame(country, columns=['country_name'])

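On my real data, which has nearly 100k ids, this takes a very long time to run (around 8-9 hours). I am still learning Python and am now trying to optimize the script to reduce its run time.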

Tags: python-3.x, jupyterhub

Solution


If you want it to be faster, you need to vectorize. I believe this code produces the same output as yours:

import numpy as np

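# For each distinct label, add a column that holds the row's option_values
# where the label matches and 0 everywhere else.
for val in df['option_labels'].unique():
    df[val] = np.where(df['option_labels'] == val, df['option_values'], 0)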

This is how I reproduced your data:

from io import StringIO

import pandas as pd

# Trailing spaces after "intel" are removed so the label does not become "intel ".
df = pd.read_csv(StringIO('''"user_id","country","option_values","option_labels"
"abc456","Germany","256gb","SSD"
"abc123","Brazil","i5","intel"
"xyz456","France","128gb","SSD"
"xyz123","Turkey","i7","intel"
"abc123","Brazil","2gb","nvidia"
"abc456","Germany","32gb","RAM"
"xyz123","Turkey","4gb","nvidia"
"xyz456","France","16gb","RAM"'''))
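
Putting the two snippets above together, here is a minimal end-to-end sketch that can be run as-is. The function names `add_label_columns` and `one_row_per_user` are hypothetical, chosen only for this illustration. The first one is the same `np.where` loop, wrapped so it does not modify the input. The second one is a variant based on an assumption: the loop in the question builds one row per user, so if that is the real goal, a pivot may be closer to it. It is not something the question states explicitly.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Same sample rows as in the question, inlined so the sketch is self-contained.
CSV_TEXT = """user_id,country,option_values,option_labels
abc456,Germany,256gb,SSD
abc123,Brazil,i5,intel
xyz456,France,128gb,SSD
xyz123,Turkey,i7,intel
abc123,Brazil,2gb,nvidia
abc456,Germany,32gb,RAM
xyz123,Turkey,4gb,nvidia
xyz456,France,16gb,RAM
"""


def add_label_columns(frame):
    """Return a copy with one new column per distinct option_labels value."""
    out = frame.copy()
    for label in out["option_labels"].unique():
        # Keep the row's value where the label matches, otherwise 0.
        out[label] = np.where(out["option_labels"] == label, out["option_values"], 0)
    return out


def one_row_per_user(frame):
    """Assumed variant: one row per (user_id, country), one column per label."""
    wide = frame.astype({"option_values": object}).pivot_table(
        index=["user_id", "country"],
        columns="option_labels",
        values="option_values",
        aggfunc="first",  # values are strings, so take the first one instead of a mean
    )
    # Labels a user does not have come out as NaN; replace them with 0.
    return wide.astype(object).fillna(0).reset_index()


df = pd.read_csv(StringIO(CSV_TEXT))
row_wise = add_label_columns(df)   # same number of rows as the input
per_user = one_row_per_user(df)    # one row per user

print(row_wise)
print(per_user)
```

Both versions avoid the per-user Python loop and the repeated DataFrame concatenation, which is where most of the time goes in the original script. Note that `aggfunc="first"` silently keeps only one value if a user has the same label twice, so check for duplicates first if that can happen in the real data.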
