Anonymize a pandas name column with random "nicknames"

Problem description

Suppose I have a pandas DataFrame with a "name" column. I want to anonymize the column to hide identities. I can do something like this:

df['nickname'] = 'P ' + pd.Series(pd.factorize(df['name'])[0] + 1).astype(str)

But it gives me this:

name          nickname
frank miller  P 1
john cena     P 2
john cena     P 2
rock          P 3

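The above is an acceptable anonymization, but not what I need. Is there a way to get the desired table below? Perhaps a built-in Python function, or something similar that someone has already implemented?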

Desired table (random nicknames, but the same output for the same input):

name          nickname
frank miller  Tiko
john cena     Bozo
john cena     Bozo
the rock      Hana

Tags: python, pandas, dataframe, encryption

Solution


You can use the Faker package for this; it generates dummy names for you.

Installation:

# pip
pip install Faker

# anaconda
conda install -c conda-forge faker

Example:

from faker import Faker
faker = Faker()
# seed the random generator to produce the same results
Faker.seed(4321)

dict_names = {name: faker.name() for name in df['name'].unique()}
df['nickname'] = df['name'].map(dict_names)

Output

           name     nickname
0  frank miller  Jason Brown
1     john cena  Jacob Stein
2     john cena  Jacob Stein
3          rock   Cody Brown

You can also initialize Faker with one or more locales to get names from specific countries:

faker = Faker(['it_IT', 'de_DE', 'sv_SE'])

dict_names = {name: faker.name() for name in df['name'].unique()}
df['nickname'] = df['name'].map(dict_names)

Output

           name           nickname
0  frank miller    Nadeschda Finke
1     john cena      Marcus Warmer
2     john cena      Marcus Warmer
3          rock  Sophia Squarcione
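
If installing an extra package is not an option, the same mapping idea can be sketched with only the standard library's `random` module. This is an alternative to Faker, not part of the answer above, and the nickname pool below is a made-up placeholder that must contain at least as many entries as there are unique names:

```python
import random

import pandas as pd

# Example data mirroring the question.
df = pd.DataFrame({"name": ["frank miller", "john cena", "john cena", "rock"]})

# Hypothetical nickname pool; swap in any list of your own.
pool = ["Tiko", "Bozo", "Hana", "Miro", "Zula", "Pip"]

# A dedicated, seeded generator keeps the assignment repeatable
# without touching the global random state.
rng = random.Random(4321)

unique_names = df["name"].unique()
# sample() draws without replacement, so every name gets a distinct nickname.
# It raises ValueError if the pool is smaller than the number of unique names.
picked = rng.sample(pool, k=len(unique_names))

nicknames = dict(zip(unique_names, picked))
df["nickname"] = df["name"].map(nicknames)
print(df)
```

Note that the assignment depends on the order in which names first appear, so reordering the rows or adding new names can change who receives which nickname.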
