Splitting a column into rows with pandas and writing a new CSV

Question

Just a quick question.

I have a CSV with many columns. One of them, named Cuisine, holds several comma-separated values per row.

name,Cuisine
Real Talent Cafe,"Italian, American, Pizza, Mediterranean, European, Fusion"
Dogma,"International, Mediterranean, Barbecue, Spanish, Fusion"
Taberna El Callejon,"Mediterranean, European, Spanish"
Astor,"International, Mediterranean, European, Fusion"
La Gaditana Castellana,"Spanish, Seafood, International, Diner, Wine Bar"

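From this CSV I want to build a new CSV with 2 columns: name and Cuisine, where Cuisine holds a single value per row (obtained by splitting the column from the first CSV).

Here is the script I wrote. I select only the 2 columns I am interested in, name and Cuisine: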

# -*- coding: utf-8 -*-
from itertools import chain
import numpy as np
import pandas as pd

df = pd.read_csv('res_madrid.csv', usecols=['name','Cuisine'])
items_count = df["Cuisine"].str.count(",") +1

pd.DataFrame({"name": np.repeat(df["name"], items_count),
    "Cuisine": list(chain.from_iterable(df["Cuisine"].str.split(",")))})

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 471, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
  File "/usr/lib64/python3.6/site-packages/pandas/core/series.py", line 1157, in repeat
    new_index = self.index.repeat(repeats)
  File "/usr/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 862, in repeat
    return self._shallow_copy(self._values.repeat(repeats))
ValueError: count < 0

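Note that if you test this by copying the data I shared, it works. The problem appears when I load the full CSV file, which has more columns, and use the usecols parameter.

The expected result is: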

                     name         Cuisine
0        Real Talent Cafe         Italian
0        Real Talent Cafe        American
0        Real Talent Cafe           Pizza
0        Real Talent Cafe   Mediterranean
0        Real Talent Cafe        European
0        Real Talent Cafe          Fusion
1                   Dogma   International
1                   Dogma   Mediterranean
1                   Dogma        Barbecue
1                   Dogma         Spanish
1                   Dogma          Fusion
2     Taberna El Callejon   Mediterranean
2     Taberna El Callejon        European
2     Taberna El Callejon         Spanish
3                   Astor   International
3                   Astor   Mediterranean
3                   Astor        European
3                   Astor          Fusion
4  La Gaditana Castellana         Spanish
4  La Gaditana Castellana         Seafood
4  La Gaditana Castellana   International
4  La Gaditana Castellana           Diner
4  La Gaditana Castellana        Wine Bar

Edit: the error occurs because I have null values in the Cuisine column. How can I avoid it?

Thanks for your help :) Regards, Alexander

Tags: python, pandas, csv

Answer


import pandas as pd
data = pd.read_csv('res_madrid.csv', usecols=['name', 'Cuisine'])  # file name and columns taken from the question

data

                     name                                            Cuisine
0        Real Talent Cafe  Italian, American, Pizza, Mediterranean, Europ...
1                   Dogma  International, Mediterranean, Barbecue, Spanis...
2     Taberna El Callejon                   Mediterranean, European, Spanish
3                   Astor     International, Mediterranean, European, Fusion
4  La Gaditana Castellana   Spanish, Seafood, International, Diner, Wine Bar

Use

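# Drop rows whose Cuisine is null, split each value on commas (trimming spaces),
# expand to one column per item, then stack into one row per (name, cuisine) pair.
data = data.dropna(subset=['Cuisine']).set_index('name')['Cuisine'].apply(lambda x: [c.strip() for c in x.split(',')]).apply(pd.Series).stack().reset_index().drop('level_1', axis=1)
data.columns = ['name', 'cuisine']

Output

 data.head()


               name        cuisine
0  Real Talent Cafe        Italian
1  Real Talent Cafe       American
2  Real Talent Cafe          Pizza
3  Real Talent Cafe  Mediterranean
4  Real Talent Cafe       European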
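
For the null values mentioned in the question's edit, and for producing the new CSV itself, a shorter route may be `DataFrame.explode`. The following is only a sketch, not the answer's original method: it assumes pandas 0.25 or later, and the sample data (including the "No Cuisine Bar" row and the extra `city` column) is hypothetical, standing in for res_madrid.csv.

```python
import io

import pandas as pd

# Hypothetical sample standing in for res_madrid.csv. It has an extra column
# and one row with an empty Cuisine field to reproduce the null-value case.
csv_text = '''name,Cuisine,city
Real Talent Cafe,"Italian, American, Pizza",Madrid
Taberna El Callejon,"Mediterranean, European, Spanish",Madrid
No Cuisine Bar,,Madrid
'''

df = pd.read_csv(io.StringIO(csv_text), usecols=["name", "Cuisine"])

# Remove rows with a missing Cuisine before splitting.
df = df.dropna(subset=["Cuisine"])

# Turn each Cuisine string into a list, then emit one row per list item.
result = (
    df.assign(Cuisine=df["Cuisine"].str.split(","))
      .explode("Cuisine")
      .reset_index(drop=True)
)

# Trim the space left after each comma.
result["Cuisine"] = result["Cuisine"].str.strip()

# With no path, to_csv returns the CSV text; pass a file name to write a file.
csv_out = result.to_csv(index=False)
print(csv_out)
```

If restaurants without a cuisine should stay in the output, the `dropna` step could be replaced with something like `fillna("")`, at the cost of rows with an empty Cuisine value.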
