Repeated vowels and consonants in words in pandas

Problem description

I have the following dataset:

import pandas as pd
a_df = pd.DataFrame({'id':[1,2,3,4,5],'text':['This was fuuuuun','aaaawesome','Hiiigh altitude','Oops','See you']})

a_df
    id  text
0   1   This was fuuuuun
1   2   aaaawesome
2   3   Hiiigh altitude
3   4   Oops
4   5   See you

Some words are misspelled. One rule I want to apply: if the same vowel or consonant appears three or more times in a row, I can be fairly sure the word is misspelled, so I replace that repetition with ''.

So I have tried this:

a_df['corrected_text'] = a_df['text'].str.replace(r'([a-zA-Z])\\3+','')

But there is no change. My logic was to try to capture letters that were repeated, but I must be doing something wrong. Please, any help will be greatly appreciated.

Tags: python, regex, pandas

Solution


You can use

a_df['text'] = a_df['text'].str.replace(r'([a-zA-Z])\1{2,}', r'\1', regex=True)

Details:

  • ([a-zA-Z]) - capturing group with ID 1
  • \1{2,} - two or more further occurrences of the Group 1 value, so three or more identical letters in total together with the previous pattern. In the pattern, \1 is a backreference to the Group 1 value; in the replacement, r'\1' puts back a single copy of that letter. Use raw string literals for both, otherwise you would have to double the backslashes.
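
To see why the pattern from the question changes nothing while the one above works, both can be checked outside pandas with Python's re module. This is a minimal sketch; the variable names (texts, attempt, fixed, unchanged, corrected) are made up for illustration, and it assumes that str.replace with regex=True treats the pattern and replacement the way re.sub does.

```python
import re

texts = ['This was fuuuuun', 'aaaawesome', 'Hiiigh altitude', 'Oops', 'See you']

# Pattern from the question: inside a raw string, \\3 means a literal
# backslash followed by the digit 3, so it is not a backreference and
# nothing in these texts matches.
attempt = re.compile(r'([a-zA-Z])\\3+')
unchanged = [attempt.sub('', t) for t in texts]

# Pattern from the answer: one letter (Group 1), then the same letter
# repeated two or more times. The run is replaced by a single copy.
fixed = re.compile(r'([a-zA-Z])\1{2,}')
corrected = [fixed.sub(r'\1', t) for t in texts]

print(unchanged)
print(corrected)
```

The match is case-sensitive, so 'Oops' and 'See you' are left alone: neither contains three identical letters in a row. In pandas itself, pass regex=True explicitly as the answer does, since from pandas 2.0 onward str.replace defaults to regex=False and would treat the pattern as a literal string.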
