python - Remove parts of a string that start with \ud

Problem description

I am trying to remove anything that starts with \ud

My text:
onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons

The answer I am looking for:
onceuponadollhouse: "Iconic apart and better together â€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code We stand for one another by sharing our lessons

Tags: python, r, symbols, emoji

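The ideal approach is to take a step back, find out where the encoding got broken, and fix it there. Somehow you have ended up with (a) surrogate pairs, which are the pairs of characters starting with \ud, and (b) UTF-8 that has been interpreted as Latin-1 or some similar encoding, such as the â„¢ after "Barbie".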
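To illustrate point (b), here is a minimal sketch of how that kind of garbling arises and how it can sometimes be undone. It assumes the wrong encoding was Windows-1252 (`cp1252`), which is a guess; the sample string is made up for the demonstration and is not taken from your data.

```python
# Hypothetical sample: the text as it was presumably meant to look.
original = 'Caboodles\u00ae x Barbie\u2122'

# Reading UTF-8 bytes as cp1252 produces the garbling seen in the question.
garbled = original.encode('utf-8').decode('cp1252')
print(garbled)   # CaboodlesÂ® x Barbieâ„¢

# Reversing the two steps recovers the text, provided no bytes were lost.
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired)  # Caboodles® x Barbie™
```

This round trip only works if every garbled character survived. If some bytes were dropped along the way (cp1252 leaves a few byte values undefined), the final decode may raise UnicodeDecodeError, which is another reason to fix the problem at the source.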

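Making sure your input text is interpreted correctly in the first place is the ideal fix. Here you are only losing the emoji "woman with bunny ears" and "ribbon"; another time it might be someone's name or some other important information.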
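If the surrogates in your string really are well-formed pairs, one way to keep the emoji rather than lose them is to round-trip the string through UTF-16. This is only a sketch: `join_surrogates` is a helper name invented here, and it assumes every surrogate has its partner (an unpaired one makes the decode step raise UnicodeDecodeError).

```python
def join_surrogates(text: str) -> str:
    """Recombine UTF-16 surrogate pairs into the characters they encode."""
    # 'surrogatepass' lets the encoder write the surrogates out as raw UTF-16
    # code units; decoding as UTF-16 then reads each pair as one code point.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


sample = 'Doll Code \ud83c\udf80 We stand'
fixed = join_surrogates(sample)
print(ascii(fixed))  # 'Doll Code \U0001f380 We stand'
```

U+1F380 is the ribbon emoji, so the pair \ud83c\udf80 becomes a single printable character instead of being thrown away.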

If you are in a situation where you cannot fix this properly and just need to strip out the surrogate pairs, you can use re.sub:

import re

text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'

# Remove every run of surrogate code points (U+D800 through U+DFFF).
stripped = re.sub('[\ud800-\udfff]+', '', text)

print(stripped)
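One caveat: the character class above matches actual surrogate code points. If your text instead contains the escapes as literal characters (a backslash followed by u, d and three more hex digits, as can happen when escaped JSON is read as raw text), that pattern will not match anything. A sketch for that case, assuming lowercase or uppercase hex digits; the name `ESCAPED_SURROGATE` and the sample string are made up for the example:

```python
import re

# Raw string: the backslashes here are literal characters, not escapes.
raw = r'better together \ud83d\udc6f collection \ud83c\udf80 We stand'

# A literal backslash, 'u', then a hex value in the range D800 to DFFF.
ESCAPED_SURROGATE = re.compile(r'\\u[dD][89a-fA-F][0-9a-fA-F]{2}')

stripped = ESCAPED_SURROGATE.sub('', raw)
print(stripped)
```

Note that this leaves the spaces on both sides of each removed escape in place, so you may want to collapse double spaces afterwards.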

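Depending on your purposes, it may be useful to replace these characters with a placeholder instead; since they always come in pairs, you can do the following: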

import re

text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The  CaboodlesÂ® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'

# Replace each pair of surrogate code points with a visible placeholder.
stripped = re.sub('[\ud800-\udfff]{2}', '<unknown character>', text)

print(stripped)
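The {2} version relies on the surrogates really coming in pairs. If your data might also contain a stray, unpaired surrogate, a slightly stricter pattern can handle both cases. This is a sketch under that assumption; `replace_surrogates` and the sample string are invented for illustration:

```python
import re

# First alternative: a high surrogate followed by a low surrogate (a real pair).
# Second alternative: any single surrogate left over without a partner.
SURROGATE_PAIR_OR_LONE = re.compile('[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]')


def replace_surrogates(text, placeholder='<unknown character>'):
    """Replace each surrogate pair, or lone surrogate, with one placeholder."""
    return SURROGATE_PAIR_OR_LONE.sub(placeholder, text)


sample = 'Code \ud83c\udf80 and a stray \ud83d here'
result = replace_surrogates(sample)
print(result)  # Code <unknown character> and a stray <unknown character> here
```

Because the pair alternative is tried first, a proper pair yields one placeholder rather than two.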
