Extracting country information from description using geograpy

Question

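Problem: I want to extract country information from a user description. So far I have been trying the geograpy package. I like how it behaves when the input is ambiguous, as with Evesham or Rochdale. However, it interprets some strings, such as Zaragoza, Spain, as two mentions even though the user is clearly saying the location is in Spain. I also don't understand why amsterdam does not produce the Netherlands as output. How can I improve the output? Am I missing something important? Is there a better alternative for achieving this?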

Data: a sample of my data:

                   user_location
2  Socialist Republic of Alachua
3                Hérault, France
4                 Gwalior, India
5                Zaragoza,España
7                     amsterdam 
8                        Evesham
9                       Rochdale

I would like to get something like this:

                   user_location country
2  Socialist Republic of Alachua ['USSR', 'United States']
3                Hérault, France ['France']
4                 Gwalior, India ['India'] 
5                Zaragoza,España ['Spain']
7                     amsterdam  ['Holland']
8                        Evesham ['United Kingdom']
9                       Rochdale ['United Kingdom', 'United States']

Reprex:

import pandas as pd
import geograpy  # installed with `pip install geograpy3`; the module itself is named geograpy

df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})

df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)

print(df)
#>                    user_location                                            country
#> 2  Socialist Republic of Alachua  [USSR, Union of Soviet Socialist Republics, Al...
#> 3                Hérault, France                                  [France, Hérault]
#> 4                 Gwalior, India   [British Indian Ocean Territory, Gwalior, India]
#> 5                Zaragoza,España             [Zaragoza, España, Spain, El Salvador]
#> 7                     amsterdam                                                  []
#> 8                        Evesham                          [Evesham, United Kingdom]
#> 9                       Rochdale          [Rochdale, United Kingdom, United States]

Created on 2020-06-02 by the reprexpy package

Tags: python, geolocation, country, input-sanitization, geograpy

Answer


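geograpy3 was no longer behaving correctly for country lookup because it did not check whether pycountry returned None. As a committer, I have just fixed this. I added a slightly modified version of your example (to avoid importing pandas) as a unit test case: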

# Test method; it belongs inside a test class and needs `import geograpy`.
def testStackoverflow62152428(self):
    '''
    see https://stackoverflow.com/questions/62152428/extracting-country-information-from-description-using-geograpy?noredirect=1#comment112899776_62152428
    '''
    examples = {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}
    for index, text in examples.items():
        places = geograpy.get_geoPlace_context(text=text)
        print("example %d: %s" % (index, places.countries))

The result is now:

example 2: ['United States']
example 3: ['France']
example 4: ['British Indian Ocean Territory', 'India']
example 5: ['Spain', 'El Salvador']
example 7: []
example 8: ['United Kingdom']
example 9: ['United Kingdom', 'United States']
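The kind of None check described in this answer can be illustrated with a minimal sketch. This is not geograpy3's actual code: `lookup_country` and the small `_KNOWN_COUNTRIES` table are hypothetical stand-ins for a pycountry-style lookup that returns None when a name is not a country.

```python
from typing import List, Optional

# Hypothetical stand-in for a pycountry-style lookup (assumption, not geograpy3 code).
_KNOWN_COUNTRIES = {"france": "France", "india": "India", "spain": "Spain"}


def lookup_country(name: str) -> Optional[str]:
    """Return the canonical country name, or None when the name is unknown."""
    return _KNOWN_COUNTRIES.get(name.strip().lower())


def countries_in(candidates: List[str]) -> List[str]:
    """Keep only candidates that resolve to a country, skipping None results."""
    found: List[str] = []
    for candidate in candidates:
        country = lookup_country(candidate)
        if country is None:
            # Unknown name: skip it instead of treating it as a country.
            continue
        if country not in found:
            found.append(country)
    return found


print(countries_in(["France", "Hérault"]))  # ['France']
print(countries_in(["amsterdam "]))         # []
```

The point of the sketch is only the explicit `is None` branch: a lookup miss is dropped rather than passed along as if it were a country.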

There is certainly room for improvement, for instance in example 5. I have opened an issue at https://github.com/somnathrakshit/geograpy3/issues/7 - stay tuned...
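Until the library handles such cases itself, one possible workaround is to post-process the candidate list on the caller's side. The sketch below is an assumption of mine, not something geograpy3 provides: `prefer_explicit_country` and the tiny `ALIASES` table are hypothetical, and the heuristic simply trusts a country named after the last comma in the text.

```python
from typing import Dict, List

# Hypothetical alias table for illustration; a real one would need many more entries.
ALIASES: Dict[str, str] = {
    "españa": "Spain",
    "spain": "Spain",
    "france": "France",
    "india": "India",
}


def prefer_explicit_country(text: str, candidates: List[str]) -> List[str]:
    """If the text ends with ', <country>', narrow the candidates to that country."""
    if "," not in text:
        return candidates  # no explicit hint, so keep the ambiguity
    hint = text.rsplit(",", 1)[1].strip().lower()
    explicit = ALIASES.get(hint)
    if explicit is not None and explicit in candidates:
        return [explicit]
    return candidates


print(prefer_explicit_country("Zaragoza,España", ["Spain", "El Salvador"]))
print(prefer_explicit_country("Rochdale", ["United Kingdom", "United States"]))
```

With the candidate lists from the results above, the first call narrows example 5 to Spain, while the second leaves the ambiguous Rochdale result untouched, which matches the behaviour the question asks for.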

