Pandas: merging values from rows with similar values

Problem description

I extracted this table by using pandas to read the HTML from this URL:


+---------+---------------------+-----------+
|         |       country       | accidents |
+---------+---------------------+-----------+
|       0 | Afghanistan         |         3 |
|       1 | Airport             |         2 |
|       2 | Algeria             |         2 |
|       3 | Angola              |         5 |
|       4 | Antigua and Barbuda |         1 |
|     ... | ...                 |       ... |
|     117 | United Kingdom      |         7 |
|     118 | United Knigdom      |         1 |
|     119 | USA                 |        66 |
|     120 | Venezuela           |         3 |
|     121 | Yemen               |         3 |
+---------+---------------------+-----------+

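It needs some work to clean it up:

  1. Remove non-countries such as Airport, Indian Ocean, and Pacific Ocean.
  2. Merge any rows for duplicate countries whose names differ slightly, and sum their values. For example, United Kingdom and United Knigdom should be merged into one row with 7 + 1 accidents, and likewise for the other duplicates, e.g. Congo/Congo Democratic Republic and Tanzania/Tanzanie.

What is the simplest way to do this? I know that for the first part I can use df.loc to filter the df, but I am struggling with the second part. Any help?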

Dictionary to generate the df:

{'country': {0: 'Afghanistan', 1: 'Airport', 2: 'Algeria', 3: 'Angola', 4: 'Antigua and Barbuda', 5: 'Argentina', 6: 'Armenia', 7: 'Atlantic Ocean', 8: 'Australia', 9: 'Austria', 10: 'Bahamas', 11: 'Bahrain', 12: 'Bangladesh', 13: 'Belgium', 14: 'Botswana', 15: 'Brazil', 16: 'Cameroon', 17: 'Canada', 18: 'Chile', 19: 'China', 20: 'Colombia', 21: 'Comoros', 22: 'Congo', 23: 'Congo Democratic Republic', 24: 'Costa Rica', 25: 'Croatia', 26: 'Cuba', 27: 'Czech Republic', 28: 'Dominican Republic', 29: 'Ecuador', 30: 'Egypt', 31: 'Ethiopia', 32: 'Finland', 33: 'France', 34: 'Gabon', 35: 'Germany', 36: 'Ghana', 37: 'Greece', 38: 'Greenland', 39: 'Guam', 40: 'Guatemala', 41: 'Guinea', 42: 'Guyana', 43: 'Honduras', 44: 'Hong Kong', 45: 'India', 46: 'Indian Ocean', 47: 'Indonesia', 48: 'Iran', 49: 'Iraq', 50: 'Ireland', 51: 'Israel', 52: 'Italy', 53: 'Ivory Coast', 54: 'Jamaica', 55: 'Japan', 56: 'Kazakhstan', 57: 'Kenya', 58: 'Kuwait', 59: 'Kyrgyzstan', 60: 'Laos', 61: 'Lebanon', 62: 'Libya', 63: 'Lybia', 64: 'Macedonia', 65: 'Malaysia', 66: 'Mali', 67: 'Malta', 68: 'Mexico', 69: 'Morocco', 70: 'Mozambique', 71: 'Myanmar', 72: 'Namibia', 73: 'Nepal', 74: 'Netherlands', 75: 'Nicaragua', 76: 'Niger', 77: 'Nigeria', 78: 'Norway', 79: 'Oman', 80: 'Pacific Ocean', 81: 'Pakistan', 82: 'Panama', 83: 'Papua New Guinea', 84: 'Peru', 85: 'Philippines', 86: 'Poland', 87: 'Portugal', 88: 'Puerto Rico', 89: 'Qatar', 90: 'Romania', 91: 'Russia', 92: 'Rwanda', 93: 'San Vicente', 94: 'Saudi Arabia', 95: 'Senegal', 96: 'Serbia and Montenegro', 97: 'Sierra Leone', 98: 'Slovakia', 99: 'Solomon Islands', 100: 'Somalia', 101: 'South Africa', 102: 'South Korea', 103: 'Spain', 104: 'Sri Lanka', 105: 'Sudan', 106: 'Sweden', 107: 'Switzerland', 108: 'Taiwan', 109: 'Tanzania', 110: 'Tanzanie', 111: 'Thailand', 112: 'Tunisia', 113: 'Turkey', 114: 'Uganda', 115: 'Ukraine', 116: 'United Arab Emirates', 117: 'United Kingdom', 118: 'United Knigdom', 119: 'USA', 120: 'Venezuela', 121: 'Yemen'}, 
'accidents': {0: 3, 1: 2, 2: 2, 3: 5, 4: 1, 5: 7, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 1, 12: 1, 13: 5, 14: 3, 15: 18, 16: 4, 17: 12, 18: 3, 19: 13, 20: 7, 21: 3, 22: 2, 23: 3, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 4, 30: 5, 31: 1, 32: 1, 33: 15, 34: 3, 35: 4, 36: 1, 37: 2, 38: 1, 39: 1, 40: 2, 41: 1, 42: 1, 43: 1, 44: 2, 45: 19, 46: 1, 47: 18, 48: 5, 49: 5, 50: 2, 51: 1, 52: 8, 53: 1, 54: 1, 55: 8, 56: 3, 57: 2, 58: 1, 59: 3, 60: 1, 61: 1, 62: 10, 63: 2, 64: 1, 65: 6, 66: 1, 67: 1, 68: 5, 69: 2, 70: 3, 71: 2, 72: 1, 73: 4, 74: 3, 75: 1, 76: 1, 77: 8, 78: 1, 79: 1, 80: 1, 81: 6, 82: 2, 83: 2, 84: 7, 85: 4, 86: 3, 87: 1, 88: 1, 89: 1, 90: 2, 91: 6, 92: 1, 93: 1, 94: 4, 95: 1, 96: 1, 97: 1, 98: 1, 99: 1, 100: 3, 101: 1, 102: 7, 103: 8, 104: 4, 105: 3, 106: 1, 107: 1, 108: 13, 109: 1, 110: 1, 111: 9, 112: 1, 113: 9, 114: 1, 115: 1, 116: 7, 117: 7, 118: 1, 119: 66, 120: 3, 121: 3}}

Tags: python, pandas

Solution


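Do you know all of the changes that are needed? It is manual work, but the simplest approach is to replace every incorrect value with the correct one and then use groupby.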

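# put all corrections here: variant or misspelled name -> canonical name
replace_dict = {'United Knigdom': 'United Kingdom', 'Congo': 'Congo Democratic Republic'}

# normalize the country names in place
df['country'] = df['country'].replace(replace_dict)
# rows that now share a name are summed into a single row
df.groupby('country').sum().reset_index()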
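
For reference, here is a minimal end-to-end sketch that covers both cleanup steps from the question. It uses only a small sample of the question's rows. The `non_countries` list and the `'Tanzanie'` mapping are assumptions added for illustration and are not part of the answer above, so both lists would need to be extended by hand for the full table.

```python
import pandas as pd

# Small sample of rows taken from the question's data (not the full table).
df = pd.DataFrame({
    'country': ['Airport', 'Congo', 'Congo Democratic Republic', 'Indian Ocean',
                'Tanzania', 'Tanzanie', 'United Kingdom', 'United Knigdom', 'USA'],
    'accidents': [2, 2, 3, 1, 1, 1, 7, 1, 66],
})

# Assumption: a hand-maintained list of entries that are not countries.
non_countries = ['Airport', 'Atlantic Ocean', 'Indian Ocean', 'Pacific Ocean']

# Assumption: a hand-maintained map from variant spelling to canonical name.
replace_dict = {
    'United Knigdom': 'United Kingdom',
    'Congo': 'Congo Democratic Republic',
    'Tanzanie': 'Tanzania',
}

# Step 1: drop the non-country rows.
cleaned = df.loc[~df['country'].isin(non_countries)].copy()

# Step 2: normalize the names, then sum accidents per country.
cleaned['country'] = cleaned['country'].replace(replace_dict)
result = cleaned.groupby('country', as_index=False)['accidents'].sum()

print(result)
```

Selecting the `accidents` column before `sum()` keeps the aggregation explicit if the frame later gains other columns, and `as_index=False` returns `country` as a regular column without a separate `reset_index()` call.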
