How to create a new column in pandas and set its values according to whether a second column includes a string from various lists of strings

Problem description

I have a dataframe with values for Turkish provinces:

df['province']
2078982        Adana
2078983        Adana
2078984        Adana
2078985        Adana
2078986        Adana
   
2210113    Zonguldak
2210114    Zonguldak
2210115    Zonguldak
2210116    Zonguldak
2210117    Zonguldak

I want to write a loop or a function that creates a new column categorizing each province by region. To do that, I created 7 lists containing the provinces that belong to each of the 7 regions:

aegean = ['Izmir', 'Aydin', 'Manisa', 'Uşak', 'Afyonkarahisar', 'Denizli', 'Kütahya', 'Muğla']
blacksea = ['Amasya', 'Gümüşhane', 'Bartın', 'Bolu', 'Giresun', 'Kastamonu', 'Karabük','Ordu', 'Rize', 'Samsun',
            'Sinop', 'Tokat', 'Trabzon', 'Zonguldak', 'Artvin', 'Bayburt', 'Çorum', 'Düzce']
cen_ana= ['Aksaray', 'Kırıkkale', 'Kırşehir', 'Nevşehir', 'Ankara', 'Çankırı', 'Eskisehir', 'Karaman', 'Kayseri', 'Konya', 'Sivas', 'Yozgat']
eas_ana= ['Ağrı', 'Bingöl', 'Elazığ', 'Hakkari', 'Iğdır', 'Kars', 'Tunceli', 'Van', 'Ardahan', 'Erzurum','Şırnak']
marmara=['Edirne', 'Istanbul', 'Kırklareli', 'Kocaeli', 'Tekirdağ', 'Yalova', 'Balıkesir', 'Bilecik', 'Bursa', 'Çanakkale', 'Sakarya']
medite=['Adana', 'Antalya', 'Mersin', 'Burdur', 'Hatay', 'Isparta', 'Osmaniye','Kahramanmaraş' ]
sou_ana=['Adiyaman', 'Batman','Diyarbakır', 'Gaziantep', 'Siirt', 'Mardin',  'Şanlıurfa']

After that, I loop through the dataset with a for loop and an if/elif chain:


for i, row in df.iterrows():
    df['Region']='something'
    if any(e in df["province"] for e in aegean):
        df['Region']=="Aegean Region"
    elif any(q in df["province"] for q in blacksea):
        df['Region']=="Black Sea Region"
    elif any(s in df["province"] for s in cen_ana):
        df['Region']=="Central Anatolia"
    elif any(c in df["province"] for c in eas_ana):
        df['Region']=="Eastern Anatolia"
    elif any(v in df["province"] for v in sou_ana):
        df['Region']=="Southern Anatolia"
    elif any(g in df["province"] for g in marmara):
       df['Region']=="Marmara"
    elif any(h in df["province"] for h in medite):
        df['Region']=="Mediterranean"
    else:
        df['Region']=="Other"

But every row of the Region column ends up with the value "something":


df['Region']
Out[148]: 
2078982    something
2078983    something
2078984    something
2078985    something
2078986    something
   
2210113    something
2210114    something
2210115    something
2210116    something
2210117    something
Name: Region, Length: 15901, dtype: object

I also tried an approach from some examples that uses a function instead:

def regionaler(x):
    if any(e in df["province"] for e in aegean):
        return "Aegean Region"
    elif any(e in df["province"] for e in blacksea):
        return "Black Sea Region"
    elif any(e in df["province"] for e in cen_ana):
        return "Central Anatolia"
    elif any(e in df["province"] for e in eas_ana):
        return "Eastern Anatolia"
    elif any(e in df["province"] for e in sou_ana):
        return "Southern Anatolia"
    elif any(e in df["province"] for e in marmara):
        return "Marmara"
    elif any(e in df["province"] for e in medite):
        return "Mediterranean"
    else:
        return "Other"

But the result is wrong in a similar way:



df['Region'] = df.apply(regionaler,axis=1)
df['Region']
Out[151]: 
2078982    Other
2078983    Other
2078984    Other
2078985    Other
2078986    Other
 
2210113    Other
2210114    Other
2210115    Other
2210116    Other
2210117    Other
Name: Region, Length: 15901, dtype: object

I have the feeling that I am making a silly mistake that can be easily fixed, but I can't figure it out. I would be very grateful to anyone who could help!

Tags: python, python-3.x, pandas, string, dataframe

Solution
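
For context on why both attempts in the question fail, here is a minimal sketch on a two-row toy frame (the index values are borrowed from the question; everything else is made up for illustration). As far as the posted code shows, three things go wrong: `in` applied to a Series tests the index labels rather than the values, `==` compares instead of assigning, and `regionaler(x)` never uses its argument `x`, so every row gets the same answer.

```python
import pandas as pd

# Toy frame with a non-default integer index, mimicking the question's data.
df = pd.DataFrame({"province": ["Adana", "Zonguldak"]}, index=[2078982, 2210113])

# `in` on a Series looks at the index labels, not at the stored values.
in_index = 2078982 in df["province"]            # an index label, so True
in_labels = "Adana" in df["province"]           # not an index label, so False
in_values = "Adana" in df["province"].tolist()  # membership among the values

# `==` builds a new boolean Series and throws it away; nothing is assigned.
df["Region"] = "something"
comparison = df["Region"] == "Mediterranean"
unchanged = df["Region"].tolist()               # still all "something"
```

So every `any(e in df["province"] ...)` test is False, the loop only ever runs the `df['Region']='something'` assignment, and the function always falls through to "Other".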


A simpler way to do this is with Series.map.

First, create a dict that holds the region lists, like below (I am using only a sample):

In [2511]: medite=['Adana', 'Antalya', 'Mersin']
In [2508]: blacksea = ['Amasya', 'Gümüşhane', 'Bartın','Zonguldak']

In [2512]: province_map = {'medite': medite, 'blacksea':blacksea}

In [2513]: province_map
Out[2513]: 
{'medite': ['Adana', 'Antalya', 'Mersin'],
 'blacksea': ['Amasya', 'Gümüşhane', 'Bartın', 'Zonguldak']}

Now invert province_map so that each province becomes a key and its region becomes the value:

In [2514]: d = {i: k for k,v in province_map.items() for i in v}

In [2515]: d
Out[2515]: 
{'Adana': 'medite',
 'Antalya': 'medite',
 'Mersin': 'medite',
 'Amasya': 'blacksea',
 'Gümüşhane': 'blacksea',
 'Bartın': 'blacksea',
 'Zonguldak': 'blacksea'}
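
The one-line comprehension silently keeps the last region if a province appears in two lists, and a name with stray whitespace will never match. The following is a more defensive sketch of the same inversion, not part of the original answer; the sample lists and the deliberately mistyped `' Bursa'` entry are assumptions for illustration.

```python
# Hypothetical sample lists; ' Bursa' carries an accidental leading space.
province_map = {
    "marmara": ["Edirne", " Bursa"],
    "medite": ["Adana", "Antalya"],
}

d = {}
for region, provinces in province_map.items():
    for name in provinces:
        key = name.strip()  # drop accidental leading/trailing spaces
        if key in d and d[key] != region:
            # The same province listed under two regions is probably a typo.
            raise ValueError(f"{key!r} is listed under both {d[key]!r} and {region!r}")
        d[key] = region
```

The resulting `d` can be passed to Series.map exactly like the dict built by the comprehension.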

Now use Series.map to create the new column in the dataframe:

In [2518]: df['Region'] = df.province.map(d)

In [2519]: df
Out[2519]: 
          province    Region
2078982      Adana    medite
2078983      Adana    medite
2078984      Adana    medite
2078985      Adana    medite
2078986      Adana    medite
2210113  Zonguldak  blacksea
2210114  Zonguldak  blacksea
2210115  Zonguldak  blacksea
2210116  Zonguldak  blacksea
2210117  Zonguldak  blacksea
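
Putting the steps together, here is a self-contained sketch that also reproduces the "Other" fallback from the question. Series.map returns NaN for provinces that are missing from the dict, so `fillna` is used to label them. The shortened region lists, the display names used as keys, and the made-up province "Atlantis" are assumptions for illustration only.

```python
import pandas as pd

# Small stand-in for the real data; "Atlantis" is a placeholder that matches no region.
df = pd.DataFrame({"province": ["Adana", "Izmir", "Zonguldak", "Atlantis"]})

# Region display name -> provinces (shortened sample lists).
regions = {
    "Aegean Region": ["Izmir", "Manisa"],
    "Black Sea Region": ["Amasya", "Zonguldak"],
    "Mediterranean": ["Adana", "Antalya"],
}

# Invert to province -> region so that map can look up each row directly.
province_to_region = {p: r for r, ps in regions.items() for p in ps}

# Unmatched provinces come back as NaN; replace those with "Other".
df["Region"] = df["province"].map(province_to_region).fillna("Other")
```

Using the display names as dict keys means no second renaming step is needed after the lookup.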
