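python - Split a column into a CSV using pandas
Problem description
Just a quick question.
I have a CSV with many columns. One column, named Cuisine, holds many values per row.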
name,Cuisine
Real Talent Cafe,"Italian, American, Pizza, Mediterranean, European, Fusion"
Dogma,"International, Mediterranean, Barbecue, Spanish, Fusion"
Taberna El Callejon,"Mediterranean, European, Spanish"
Astor,"International, Mediterranean, European, Fusion"
La Gaditana Castellana,"Spanish, Seafood, International, Diner, Wine Bar"
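From this CSV I want to create a new CSV with two columns: name, and Cuisine (one value per row, obtained by splitting the Cuisine column of the first CSV).
This is the script I wrote; I select only the two columns I am interested in, name and Cuisine: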
# -*- coding: utf-8 -*-
from itertools import chain
import numpy as np
import pandas as pd
df = pd.read_csv('res_madrid.csv', usecols=['name','Cuisine'])
items_count = df["Cuisine"].str.count(",") + 1
pd.DataFrame({"name": np.repeat(df["name"], items_count),
              "Cuisine": list(chain.from_iterable(df["Cuisine"].str.split(",")))})
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 471, in repeat
return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
File "/usr/lib64/python3.6/site-packages/pandas/core/series.py", line 1157, in repeat
new_index = self.index.repeat(repeats)
File "/usr/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 862, in repeat
return self._shallow_copy(self._values.repeat(repeats))
ValueError: count < 0
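Note that if you test this by copying the data I shared, it works. The problem occurs when I load the CSV file that contains more columns and use the usecols parameter.
The expected result is: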
name Cuisine
0 Real Talent Cafe Italian
0 Real Talent Cafe American
0 Real Talent Cafe Pizza
0 Real Talent Cafe Mediterranean
0 Real Talent Cafe European
0 Real Talent Cafe Fusion
1 Dogma International
1 Dogma Mediterranean
1 Dogma Barbecue
1 Dogma Spanish
1 Dogma Fusion
2 Taberna El Callejon Mediterranean
2 Taberna El Callejon European
2 Taberna El Callejon Spanish
3 Astor International
3 Astor Mediterranean
3 Astor European
3 Astor Fusion
4 La Gaditana Castellana Spanish
4 La Gaditana Castellana Seafood
4 La Gaditana Castellana International
4 La Gaditana Castellana Diner
4 La Gaditana Castellana Wine Bar
Edit: the error occurs because I have null values in the Cuisine column. How can I avoid it?
Thanks for your help :) Regards, Alexander
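The null-value explanation in the edit can be checked with a small sketch. It assumes a made-up two-row sample (the restaurant "No Cuisine Place" and the inline CSV text are hypothetical, not taken from the original file) and shows that an empty Cuisine cell is read as a missing value, so the per-row item count is missing too. Passing such counts to np.repeat is a likely source of the error above, though the exact message may differ between NumPy and pandas versions.

```python
import io

import pandas as pd

# Hypothetical sample: the second row has an empty Cuisine cell.
csv_text = '''name,Cuisine
Dogma,"International, Mediterranean"
No Cuisine Place,
'''
df = pd.read_csv(io.StringIO(csv_text), usecols=["name", "Cuisine"])

# read_csv turns the empty cell into a missing value, and the .str
# accessor propagates it, so the count for that row is missing as well.
items_count = df["Cuisine"].str.count(",") + 1
null_rows = df["Cuisine"].isna()

print(items_count)
print(df.loc[null_rows, "name"].tolist())
```

Running the same isna() check on the full file should list the rows that need to be dropped or filled before splitting.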
Solution
import pandas as pd
data = pd.read_csv('res_madrid.csv', usecols=['name', 'Cuisine'])  # path to your file
data
name Cuisine
0 Real Talent Cafe Italian, American, Pizza, Mediterranean, Europ...
1 Dogma International, Mediterranean, Barbecue, Spanis...
2 Taberna El Callejon Mediterranean, European, Spanish
3 Astor International, Mediterranean, European, Fusion
4 La Gaditana Castellana Spanish, Seafood, International, Diner, Wine Bar
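Use
# Split each comma-separated string into a list, spread the pieces into columns,
# then stack them into one row per (name, cuisine) pair.
# Splitting on ', ' avoids leading spaces; this assumes Cuisine has no null values.
data = data.set_index('name')['Cuisine'].apply(lambda x: x.split(', ')).apply(pd.Series).stack().reset_index().drop('level_1', axis=1)
data.columns = ['name', 'cuisine']
Output
data.head()
name cuisine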
0 Real Talent Cafe Italian
1 Real Talent Cafe American
2 Real Talent Cafe Pizza
3 Real Talent Cafe Mediterranean
4 Real Talent Cafe European
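Since the question's edit mentions null values in the Cuisine column, here is a minimal sketch of one way to skip them. It is not the answer's method: it swaps in DataFrame.explode (which requires pandas 0.25 or later) and drops rows with a missing Cuisine first. The inline sample data and the "No Cuisine Place" row are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV; one row has no Cuisine.
csv_text = '''name,Cuisine
Taberna El Callejon,"Mediterranean, European, Spanish"
No Cuisine Place,
Astor,"International, Fusion"
'''
df = pd.read_csv(io.StringIO(csv_text), usecols=["name", "Cuisine"])

result = (
    df.dropna(subset=["Cuisine"])                                   # skip rows with no cuisines
      .assign(Cuisine=lambda d: d["Cuisine"].str.split(","))        # string -> list of items
      .explode("Cuisine")                                           # one row per list item
      .assign(Cuisine=lambda d: d["Cuisine"].str.strip())           # remove spaces after commas
)

# Serialize to CSV text; pass a file path instead to write to disk.
csv_out = result.to_csv(index=False)
print(result)
```

The repeated index values mirror the expected output in the question, where each row keeps the index of the restaurant it came from. If rows without cuisines should be kept rather than dropped, replacing dropna with fillna on the Cuisine column would be one option.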