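python - Removing emoji that start with '\x' in pandas/Python when reading a CSV file
Problem description
When reading a CSV file with pandas in Python, how do I remove emoji that start with '\x'? The text in the CSV file contains many emoji, and I want to remove them. However, the usual pattern-matching regular expression for emoji does not work on them. Here is an example: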
Thx WP for performing key democratic function. Trump wants to live in post truth world where words don't matter. D\xe2\x80\xa6 |\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3|\n ME LA PELAS \n DONALD TRUMP \n|\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf| \n (\\__/) ||\n (\xe2\x80\xa2\xe3\x85\x85\xe2\x80\xa2) ||\n / \xe3\x80\x80 \xe3\x81\xa5
Here is sample code that works for ordinary emoji but not for these:
import re
text = u'This dog \xe2\x80\x9d \xe2\x80\x9c'
print(text) # with emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji
The following code, however, does work:
def deEmojify(inputString):
    # Keep only the characters that can be encoded as ASCII.
    returnString = ""
    for character in inputString:
        try:
            character.encode("ascii")
            returnString += character
        except UnicodeEncodeError:
            # Non-ASCII character (for example, part of an emoji): drop it.
            returnString += ''
    return returnString

print(deEmojify("I'm loving all the trump hate on Twitter right now \xf0\x9f\x99\x8c"))
However, when I read the data from a CSV file with pandas, it does not work and the emoji are not removed:
import pandas as pd

df = pd.read_csv("Trump834.csv", encoding="utf-8")

def deEmojify(inputString):
    # Keep only the characters that can be encoded as ASCII.
    returnString = ""
    for character in inputString:
        try:
            character.encode("ascii")
            returnString += character
        except UnicodeEncodeError:
            returnString += ''
    return returnString

for i in range(df.shape[0]):
    print(df.iloc[i]['Tweet'])
    print(deEmojify(df.iloc[i]['Tweet']))
    print("****************************************")
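Solution
The main problem is that your source file was decoded incorrectly. Re-encode the strings using the incorrect encoding (probably latin1 or cp1252), and then decode them correctly as utf8.
For example: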
>>> s = u'This dog \xe2\x80\x9d \xe2\x80\x9c'
>>> s.encode('latin1').decode('utf8')
'This dog ” “'
>>> s = u'''Thx WP for performing key democratic function. Trump wants to live in post truth world where words don't matter. D\xe2\x80\xa6 |\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3\xef\xbf\xa3|\n ME LA PELAS \n DONALD TRUMP \n|\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf\xef\xbc\xbf| \n (\\__/) ||\n (\xe2\x80\xa2\xe3\x85\x85\xe2\x80\xa2) ||\n / \xe3\x80\x80 \xe3\x81\xa5'''
>>> print(s.encode('latin1').decode('utf8'))
Thx WP for performing key democratic function. Trump wants to live in post truth world where words don't matter. D… | ̄ ̄ ̄ ̄ ̄ ̄ ̄ ̄ ̄ ̄|
ME LA PELAS
DONALD TRUMP
|__________|
(\__/) ||
(•ㅅ•) ||
/ づ
>>> s="I'm loving all the trump hate on Twitter right now \xf0\x9f\x99\x8c"
>>> s.encode('latin1').decode('utf8')
"I'm loving all the trump hate on Twitter right now 🙌"
After that, your emoji removal should work.
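As a rough sketch of how the two steps could be combined, the block below first undoes the wrong decoding and then applies the emoji regular expression from the question. The names `fix_mojibake` and `clean_tweet` are hypothetical helpers, not part of the original answer. Trying cp1252 as a fallback, and returning the text unchanged when neither round trip succeeds, are assumptions about how one might want to handle mixed data.

```python
import re

# Same character ranges as the pattern in the question.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)


def fix_mojibake(text):
    """Undo a wrong latin1/cp1252 decode (hypothetical helper).

    Re-encodes with the presumed wrong encoding and decodes as UTF-8.
    If neither round trip succeeds, the text is returned unchanged
    (an assumption: such text is treated as already correct).
    """
    for wrong_encoding in ("latin1", "cp1252"):
        try:
            return text.encode(wrong_encoding).decode("utf8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


def clean_tweet(text):
    """Repair the encoding first, then strip emoji with the regex."""
    return EMOJI_PATTERN.sub("", fix_mojibake(text))


if __name__ == "__main__":
    print(clean_tweet("I'm loving all the trump hate on Twitter right now \xf0\x9f\x99\x8c"))
```

With pandas, and assuming the 'Tweet' column from the question holds only strings, the helper could be applied with df['Tweet'].apply(clean_tweet). Missing values would need to be handled separately.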