Replace a word or set of letters from a string in a dataframe only if the string starts with that word

Problem description

Assuming I have the following toy model df:

Line          Sentence

1             A MAN TAUGHT ME HOW TO DANCE.
2             WE HAVE TO CHOOSE A CAKE. 
3             X RAYS CAN BE HARMFUL.
4             MY HERO IS MALCOLM X FROM THE USA.
5             THE BEST ACTOR IS JENNIFER A FULTON. 
6             A SOUND THAT HAS A BIG IMPACT. 

If I were to do the following:

df['Sentence'] = df['Sentence'].str.replace('A ',' ')

This would replace every occurrence of 'A ' in every sentence, not just the one at the start. However, I only need the 'A ' removed from sentences that start with 'A '. Similarly, I would like to remove the 'X ' from Line 3, but not from Malcolm X in Line 4.

The final output df should look like the following:

Line          Sentence

1             MAN TAUGHT ME HOW TO DANCE.
2             WE HAVE TO CHOOSE A CAKE. 
3             RAYS CAN BE HARMFUL.
4             MY HERO IS MALCOLM X FROM THE USA.
5             THE BEST ACTOR IS JENNIFER A FULTON. 
6             SOUND THAT HAS A BIG IMPACT. 

Tags: python, python-3.x, pandas, dataframe

Solution


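You can use a regular expression anchored to the start of the string, so that only a leading 'A ' or 'X ' is removed:


# ^ anchors the match at the start of each string; \s+ also consumes the
# whitespace after the word, so no leading space is left behind.
df["Sentence"] = df["Sentence"].str.replace(r"^(?:A|X)\s+", "", regex=True)
print(df)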

Prints:

   Line                              Sentence
0     1           MAN TAUGHT ME HOW TO DANCE.
1     2             WE HAVE TO CHOOSE A CAKE.
2     3                  RAYS CAN BE HARMFUL.
3     4    MY HERO IS MALCOLM X FROM THE USA.
4     5  THE BEST ACTOR IS JENNIFER A FULTON.
5     6          SOUND THAT HAS A BIG IMPACT.
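
If the list of leading words is likely to grow, the same idea can be wrapped in a small helper. The following is only a sketch: the function name `strip_leading_words` and the rebuilt toy DataFrame are assumptions for illustration, not part of the original answer. It escapes each word with `re.escape` so that characters with a special meaning in regular expressions are matched literally.

```python
import re

import pandas as pd

# Toy data rebuilt from the question.
df = pd.DataFrame({
    "Line": [1, 2, 3, 4, 5, 6],
    "Sentence": [
        "A MAN TAUGHT ME HOW TO DANCE.",
        "WE HAVE TO CHOOSE A CAKE.",
        "X RAYS CAN BE HARMFUL.",
        "MY HERO IS MALCOLM X FROM THE USA.",
        "THE BEST ACTOR IS JENNIFER A FULTON.",
        "A SOUND THAT HAS A BIG IMPACT.",
    ],
})


def strip_leading_words(series, words):
    """Remove any of `words` (hypothetical helper) when it starts the string.

    The word must be followed by whitespace, which is removed as well,
    so "A MAN" becomes "MAN" but "AN APPLE" is left untouched.
    """
    alternation = "|".join(re.escape(word) for word in words)
    pattern = rf"^(?:{alternation})\s+"
    return series.str.replace(pattern, "", regex=True)


cleaned = strip_leading_words(df["Sentence"], ["A", "X"])
print(cleaned.tolist())
```

Assigning the result back with `df["Sentence"] = cleaned` gives the DataFrame shown above. Sentences that merely contain ' A ' or ' X ' further along, such as Lines 4 and 5, are unchanged because the pattern can only match at position zero.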
