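python-3.x - I am working on a Python problem to optimize a script
Problem description
- Turn the row values of the option_labels column into column headers.
- If an option_labels value exists for a given user_id, fill the newly created column with the matching option_values value, otherwise with 0.
Sample data: (data.csv)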
user_id country option_values option_labels
abc456 Germany 256gb SSD
abc123 Brazil i5 intel
xyz456 France 128gb SSD
xyz123 Turkey i7 intel
abc123 Brazil 2gb nvidia
abc456 Germany 32gb RAM
xyz123 Turkey 4gb nvidia
xyz456 France 16gb RAM
The sample output would be:
user_id country option_values option_labels intel nvidia SSD RAM
abc456 Germany 256gb SSD 0 0 256gb 0
abc123 Brazil i5 intel i5 0 0 0
xyz456 France 128gb SSD 0 0 128gb 0
xyz123 Turkey i7 intel i7 0 0 0
abc123 Brazil 2gb nvidia 0 2gb 0 0
abc456 Germany 32gb RAM 0 0 0 32gb
xyz123 Turkey 4gb nvidia 0 4gb 0 0
xyz456 France 16gb RAM 0 0 0 16gb
I completed this process with the sample code below:
import pandas as pd
import numpy as np

data_sample = pd.read_csv("data.csv")
# Column names must match the CSV header: option_labels / option_values
feature_list = data_sample["option_labels"].unique().tolist()
user_list = data_sample["user_id"].unique().tolist()
country_list = data_sample["country"].unique().tolist()
opt_val_list = data_sample["option_values"].unique().tolist()

def filterd_id(check_id):
    # Rows that belong to a single user
    single_id_data = data_sample[data_sample['user_id'] == check_id]
    return single_id_data

def finding_features(single_id_data):
    # Labels present for this user
    user_features = single_id_data["option_labels"].unique().tolist()
    return user_features

def check_feature(feature_list, user_features):
    feature_prs_not = []
    for i in feature_list:
        if i in user_features:
            # NOTE: as written, this stores the whole list of unique option
            # values, not this user's value for the label
            result = opt_val_list
        else:
            result = 0
        feature_prs_not.append(result)
    return feature_prs_not

user_id = []
country = []
# df_feature has to exist before the loop can add rows to it
df_feature = pd.DataFrame(columns=feature_list)
for i in user_list:
    check_id = i
    user_id.append(i)
    single_id_data = filterd_id(check_id)
    c = single_id_data["country"].unique().tolist()
    country.append(c)
    user_features = finding_features(single_id_data)
    feature_prst_not = check_feature(feature_list, user_features)
    df = pd.DataFrame([feature_prst_not], columns=feature_list)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df_feature = pd.concat([df_feature, df])

df_user_id = pd.DataFrame(user_id, columns=['all_user_id'])
df_country = pd.DataFrame(country, columns=['country_name'])
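On my original data, with nearly 100k ids, it takes a very long time to run (around 8-9 hours). I am still learning Python, and I am now trying to optimize the script to reduce its running time.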
Solution
If you want it to be faster, you need to vectorize. I believe this code produces the same output as yours:
import numpy as np

# One pass per distinct label: copy option_values where the label matches, else 0
for val in df['option_labels'].unique():
    df[val] = np.where(df['option_labels'] == val, df['option_values'], 0)
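The loop above runs once per distinct label rather than once per user, and each pass is a single array operation. A loop-free variant is sketched below. It uses `DataFrame.pivot`, which is not part of the answer above, assumes the row index is unique, and rebuilds the sample rows inline so that it runs on its own.

```python
import pandas as pd

# Sample rows from the question, rebuilt inline so the sketch is self-contained.
df = pd.DataFrame({
    "user_id": ["abc456", "abc123", "xyz456", "xyz123",
                "abc123", "abc456", "xyz123", "xyz456"],
    "country": ["Germany", "Brazil", "France", "Turkey",
                "Brazil", "Germany", "Turkey", "France"],
    "option_values": ["256gb", "i5", "128gb", "i7",
                      "2gb", "32gb", "4gb", "16gb"],
    "option_labels": ["SSD", "intel", "SSD", "intel",
                      "nvidia", "RAM", "nvidia", "RAM"],
})

# Spread option_values into one column per label, keeping the original row index.
# Cells with no value for a label come back as NaN; cast to object so that
# strings and the integer 0 can share a column, then fill the gaps with 0.
wide = (
    df.pivot(columns="option_labels", values="option_values")
      .astype(object)
      .fillna(0)
)

# pivot sorts the new columns; restore first-appearance order and attach them.
labels = df["option_labels"].unique()
result = df.join(wide[labels])
print(result)
```

On the sample data this should give the same columns as the `np.where` loop. Which of the two is faster on 100k ids would have to be measured.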
This is how I reproduced your data:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO('''
"user_id","country","option_values","option_labels"
"abc456","Germany","256gb","SSD"
"abc123","Brazil","i5","intel"
"xyz456","France","128gb","SSD"
"xyz123","Turkey","i7","intel"
"abc123","Brazil","2gb","nvidia"
"abc456","Germany","32gb","RAM"
"xyz123","Turkey","4gb","nvidia"
"xyz456","France","16gb","RAM"'''))
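The loop in the question builds one row per user_id, whereas the `np.where` version keeps one row per input row. If the per-user shape is what is wanted, something along the following lines may help. It is only a sketch: `pivot_table` with `aggfunc="first"` is a choice made here, not something the answer uses, and it assumes that keeping the first value is acceptable when a (user_id, option_labels) pair occurs more than once.

```python
import pandas as pd
from io import StringIO

# Same sample rows as in the question.
df = pd.read_csv(StringIO('''user_id,country,option_values,option_labels
abc456,Germany,256gb,SSD
abc123,Brazil,i5,intel
xyz456,France,128gb,SSD
xyz123,Turkey,i7,intel
abc123,Brazil,2gb,nvidia
abc456,Germany,32gb,RAM
xyz123,Turkey,4gb,nvidia
xyz456,France,16gb,RAM'''))

# One row per (user_id, country), one column per label.
# aggfunc="first" keeps the first value seen for each user/label pair.
per_user = (
    df.pivot_table(index=["user_id", "country"],
                   columns="option_labels",
                   values="option_values",
                   aggfunc="first")
      .astype(object)   # let strings and the integer 0 share a column
      .fillna(0)        # labels the user does not have become 0
      .reset_index()
)
per_user.columns.name = None  # drop the leftover "option_labels" axis name
print(per_user)
```

The grouping happens in a single call, so no Python-level loop over the ids is needed.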