Remove the part of a string that starts with \ud

Problem description

I am trying to remove anything that starts with \ud.

My text:
onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons

The answer I am looking for:
onceuponadollhouse: "Iconic apart and better together â€â™€ï¸The  Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code We stand for one another by sharing our lessons

Tags: python, r, symbols, emoji

Solution


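Ideally, you would take a step back, find out where the encoding got broken, and fix it there. Somehow you have ended up with (a) surrogate pairs, which are the pairs of code units starting with \ud, and (b) UTF-8 that was interpreted as Latin-1 or a similar encoding, for example the â„¢ after "Barbie".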

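Stepping back and making sure your input text is interpreted correctly is the ideal fix. Here you are only losing the emoji "woman with bunny ears" and "ribbon"; another time it could be someone's name or other important information.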
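As a rough illustration of what repairing the text (rather than stripping it) could look like, here is a minimal sketch. The helper names `join_surrogate_pairs` and `undo_cp1252_mojibake` are made up for this example, and the second one assumes the garbling came from reading UTF-8 bytes as Windows-1252, which is only a guess about this particular input. The garbled trademark sign is written with escapes so the snippet does not depend on how this page itself is encoded.

```python
def join_surrogate_pairs(text):
    """Combine UTF-16 surrogate pairs into the characters they encode."""
    # 'surrogatepass' lets the surrogate code points through the encoder;
    # decoding the resulting UTF-16 bytes then pairs them up again.
    # An unpaired surrogate would make the decode step raise UnicodeDecodeError.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def undo_cp1252_mojibake(text):
    """Reverse UTF-8 bytes that were decoded as Windows-1252 (assumed cause)."""
    # Raises UnicodeEncodeError if the text holds characters cp1252 cannot
    # represent, so it is best applied to small, known-garbled fragments.
    return text.encode("cp1252").decode("utf-8")


emoji = join_surrogate_pairs("Doll Code \ud83c\udf80 We stand")
# "\u00e2\u201e\u00a2" is the garbled trademark sign from the sample text.
trademark = undo_cp1252_mojibake("Barbie\u00e2\u201e\u00a2")

print(emoji == "Doll Code \U0001F380 We stand")  # True: the ribbon emoji is restored
print(trademark == "Barbie\u2122")               # True: the trademark sign is restored
```

The two repairs are kept separate on purpose: the sample mixes both problems, and the whole string may not round-trip through `cp1252` because some garbled bytes have no Windows-1252 character.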


If you are in a situation where you cannot fix this properly and need to strip the surrogate pairs, you can use re.sub:

import re

text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'

# Remove every run of surrogate code points (U+D800 through U+DFFF).
stripped = re.sub('[\ud800-\udfff]+', '', text)

print(stripped)

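Depending on your purpose, it may be useful to replace these characters with a placeholder instead. In well-formed text they always come in pairs, so you can do the following: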

import re

text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'

# Replace each pair of surrogate code points with a visible placeholder.
stripped = re.sub('[\ud800-\udfff]{2}', '<unknown character>', text)

print(stripped)
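If a more descriptive placeholder would help, one possible variation is to label each pair with the Unicode name of the character it encodes. This is only a sketch: `label_surrogates` and `name_placeholder` are hypothetical names, and the pattern here is stricter than the one above in that it requires a high surrogate followed by a low surrogate, with any leftover lone surrogate falling back to the generic placeholder.

```python
import re
import unicodedata

# A high surrogate (U+D800..U+DBFF) followed by a low surrogate (U+DC00..U+DFFF).
PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')
# Any surrogate still present once the well-formed pairs have been handled.
LONE = re.compile('[\ud800-\udfff]')


def name_placeholder(match):
    # Join the pair into a single character, then look up its Unicode name.
    char = match.group(0).encode('utf-16', 'surrogatepass').decode('utf-16')
    return '<' + unicodedata.name(char, 'unknown character') + '>'


def label_surrogates(text):
    labelled = PAIR.sub(name_placeholder, text)
    return LONE.sub('<unknown character>', labelled)


print(label_surrogates('Doll Code \ud83c\udf80 We stand'))
# Doll Code <RIBBON> We stand
```

Keeping the name in the output preserves some of the meaning that plain stripping throws away, while the result stays free of surrogates.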
