Python: handling strings

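Problem description

I'm working on a data analysis project that analyzes some Spotify data. I'm at the data-cleaning stage and am working on a column of string values. Basically, I have a series of song names structured like this.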

df_spot['name'].head(10)

0    Piano Concerto No. 3 in D Minor, Op. 30: III. ...
1                              Clancy Lowered the Boom
2                                            Gati Bali
3                                            Danny Boy
4                          When Irish Eyes Are Smiling
5                                         Gati Mardika
6                             The Wearing of the Green
7    Morceaux de fantaisie, Op. 3: No. 2, Prélude i...
8                          La Mañanita - Remasterizado
9                                    Il Etait Syndiqué
Name: name, dtype: object

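What I want to do with this column is split each title into individual words, one per row, find the 10 most frequent words, and replace the column with a binary value indicating whether the song name contains any of those top 10 words. (Of course, I won't count words such as "a", "are", or "is", or numbers.) First, I created a new data frame containing only the track names, removed some unnecessary words, and inserted commas as separators. This is what I did.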

df_words = df_words.str.replace(' ', ',')
df_words = df_words.str.replace('  ', ',')
df_words = df_words.str.replace('.', ',')
df_words = df_words.str.replace(':', ',')
df_words = df_words.str.replace('-', ',')
df_words = df_words.str.replace("'", ',')
df_words = df_words.str.replace('"', ',')
df_words = df_words.str.replace('?', ',')
df_words = df_words.str.replace('!', ',')
df_words = df_words.str.replace('(', ',')
df_words = df_words.str.replace(')', ',')
df_words = df_words.str.replace('[', ',')
df_words = df_words.str.replace(']', ',')
df_words = df_words.str.replace('&', ',')
df_words = df_words.str.replace('/', ',')
df_words = df_words.str.replace('1', ',')
df_words = df_words.str.replace('2', ',')
df_words = df_words.str.replace('3', ',')
df_words = df_words.str.replace('4', ',')
df_words = df_words.str.replace('5', ',')
df_words = df_words.str.replace('6', ',')
df_words = df_words.str.replace('7', ',')
df_words = df_words.str.replace('8', ',')
df_words = df_words.str.replace('9', ',')
df_words = df_words.str.replace('0', ',')

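This approach ends up inserting several consecutive commas within each row, because a song name can contain several of the replaced characters in a row. So here is my first question: is there a better way to do what I'm trying to achieve? Also, is there a way to do it that is less repetitive and more reproducible than the code above? My second question: once all the words are cleanly separated by commas, what are some ways to expand each word into a separate list/vector element, so that I can count how many times each word occurs in the data?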

Tags: python, string, dataframe, replace, data-cleaning

Solution


You can use .isalpha() to check whether a character is a letter (.isalnum() checks whether a character is a letter or a digit).
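As a quick illustration of the difference between the two methods, here is a minimal sketch. The sample strings are made up for demonstration only.

```python
# str.isalpha() is True only when every character is a letter;
# str.isalnum() also accepts digits. Punctuation fails both checks.
samples = ["Op", "30", "No3", "III."]

for s in samples:
    print(s, s.isalpha(), s.isalnum())

# Applied character by character, isalpha() strips punctuation and digits.
cleaned = "".join(ch for ch in "Op. 30:" if ch.isalpha())
print(cleaned)
```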

That way, you can split the string into words, iterate over the words, and keep only the alphabetic characters:

import pandas as pd

df = pd.DataFrame({'song': ['Clancy Lowered the Boom', 'Gati Bali', 'Danny Boy',
                   'Piano Concerto No. 3 in D Minor, Op. 30: III.']})

# collect one list of cleaned words per song title
allWords = []

# iterate through each song title
for song in df['song']:

  # this returns the song title as a list of words (the items in between the spaces)
  words = song.split(" ")

  lettersOnly = []

  # iterate through each word in the song title
  for word in words:
    # keep only the characters that are letters
    # (a word with no letters, such as "3", becomes an empty string)
    lettersOnly.append("".join(char for char in word if char.isalpha()))

  allWords.append(lettersOnly)

# add the lists as a new column, one list per row
df['words'] = allWords


# print(df)
'''
    song                                            words
0   Clancy Lowered the Boom                         [Clancy, Lowered, the, Boom]
1   Gati Bali                                       [Gati, Bali]
2   Danny Boy                                       [Danny, Boy]
3   Piano Concerto No. 3 in D Minor, Op. 30: III.   [Piano, Concerto, No, , in, D, Minor, Op, , III]
'''
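
To get from word lists like these to the top-word flag described in the question, one possible continuation is sketched below. It swaps the explicit loop for pandas string methods (`str.findall`, `explode`, `value_counts`), so it is a different technique from the loop above, not the answerer's method. The stop-word set, the extra sample title, the column name `has_top_word`, and the small `top_n` value are all illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'song': ['Clancy Lowered the Boom', 'Gati Bali', 'Danny Boy',
                            'Piano Concerto No. 3 in D Minor, Op. 30: III.',
                            'Danny Boy - Remastered']})

# Assumed, illustrative stop-word list; extend it for real data.
stop_words = {'a', 'are', 'is', 'the', 'in', 'of'}

# [^\W\d_]+ matches runs of letters only, so digits and punctuation are
# skipped and never produce empty strings or repeated separators.
words = df['song'].str.lower().str.findall(r'[^\W\d_]+')

# Drop stop words from each title's word list.
words = words.apply(lambda ws: [w for w in ws if w not in stop_words])

# One row per word, then count occurrences across all titles.
counts = words.explode().dropna().value_counts()

# The question uses 10; 2 keeps this tiny sample meaningful.
top_n = 2
top_words = set(counts.head(top_n).index)

# 1 if the title contains at least one of the most frequent words, else 0.
df['has_top_word'] = words.apply(lambda ws: int(any(w in top_words for w in ws)))

print(counts.head(top_n))
print(df[['song', 'has_top_word']])
```

Matching the words directly, rather than replacing unwanted characters with commas and splitting afterwards, avoids the runs of consecutive separators mentioned in the question.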
