python - Pandas: combine values in rows with similar values
Question
I extracted this table with pandas by reading the HTML from this URL:
+---------+---------------------+-----------+
| | country | accidents |
+---------+---------------------+-----------+
| 0 | Afghanistan | 3 |
| 1 | Airport | 2 |
| 2 | Algeria | 2 |
| 3 | Angola | 5 |
| 4 | Antigua and Barbuda | 1 |
| ... | ... | ... |
| 117 | United Kingdom | 7 |
| 118 | United Knigdom | 1 |
| 119 | USA | 66 |
| 120 | Venezuela | 3 |
| 121 | Yemen | 3 |
+---------+---------------------+-----------+
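It needs some manipulation to clean it up:
- Remove non-countries such as Airport, Indian Ocean, Pacific Ocean, etc.
- Merge any rows for duplicate countries whose names differ slightly, and sum their values. For example, United Kingdom and United Knigdom should be merged into one row with 7 + 1 accidents, and likewise for other duplicates such as Congo/Congo Democratic Republic and Tanzania/Tanzanie.
What is the simplest way to do this? I know I can use df.loc to filter the df for the first part, but I am struggling with the second part. Any help?
Dict to generate the df: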
{'country': {0: 'Afghanistan', 1: 'Airport', 2: 'Algeria', 3: 'Angola', 4: 'Antigua and Barbuda', 5: 'Argentina', 6: 'Armenia', 7: 'Atlantic Ocean', 8: 'Australia', 9: 'Austria', 10: 'Bahamas', 11: 'Bahrain', 12: 'Bangladesh', 13: 'Belgium', 14: 'Botswana', 15: 'Brazil', 16: 'Cameroon', 17: 'Canada', 18: 'Chile', 19: 'China', 20: 'Colombia', 21: 'Comoros', 22: 'Congo', 23: 'Congo Democratic Republic', 24: 'Costa Rica', 25: 'Croatia', 26: 'Cuba', 27: 'Czech Republic', 28: 'Dominican Republic', 29: 'Ecuador', 30: 'Egypt', 31: 'Ethiopia', 32: 'Finland', 33: 'France', 34: 'Gabon', 35: 'Germany', 36: 'Ghana', 37: 'Greece', 38: 'Greenland', 39: 'Guam', 40: 'Guatemala', 41: 'Guinea', 42: 'Guyana', 43: 'Honduras', 44: 'Hong Kong', 45: 'India', 46: 'Indian Ocean', 47: 'Indonesia', 48: 'Iran', 49: 'Iraq', 50: 'Ireland', 51: 'Israel', 52: 'Italy', 53: 'Ivory Coast', 54: 'Jamaica', 55: 'Japan', 56: 'Kazakhstan', 57: 'Kenya', 58: 'Kuwait', 59: 'Kyrgyzstan', 60: 'Laos', 61: 'Lebanon', 62: 'Libya', 63: 'Lybia', 64: 'Macedonia', 65: 'Malaysia', 66: 'Mali', 67: 'Malta', 68: 'Mexico', 69: 'Morocco', 70: 'Mozambique', 71: 'Myanmar', 72: 'Namibia', 73: 'Nepal', 74: 'Netherlands', 75: 'Nicaragua', 76: 'Niger', 77: 'Nigeria', 78: 'Norway', 79: 'Oman', 80: 'Pacific Ocean', 81: 'Pakistan', 82: 'Panama', 83: 'Papua New Guinea', 84: 'Peru', 85: 'Philippines', 86: 'Poland', 87: 'Portugal', 88: 'Puerto Rico', 89: 'Qatar', 90: 'Romania', 91: 'Russia', 92: 'Rwanda', 93: 'San Vicente', 94: 'Saudi Arabia', 95: 'Senegal', 96: 'Serbia and Montenegro', 97: 'Sierra Leone', 98: 'Slovakia', 99: 'Solomon Islands', 100: 'Somalia', 101: 'South Africa', 102: 'South Korea', 103: 'Spain', 104: 'Sri Lanka', 105: 'Sudan', 106: 'Sweden', 107: 'Switzerland', 108: 'Taiwan', 109: 'Tanzania', 110: 'Tanzanie', 111: 'Thailand', 112: 'Tunisia', 113: 'Turkey', 114: 'Uganda', 115: 'Ukraine', 116: 'United Arab Emirates', 117: 'United Kingdom', 118: 'United Knigdom', 119: 'USA', 120: 'Venezuela', 121: 'Yemen'}, 
'accidents': {0: 3, 1: 2, 2: 2, 3: 5, 4: 1, 5: 7, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 1, 12: 1, 13: 5, 14: 3, 15: 18, 16: 4, 17: 12, 18: 3, 19: 13, 20: 7, 21: 3, 22: 2, 23: 3, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 4, 30: 5, 31: 1, 32: 1, 33: 15, 34: 3, 35: 4, 36: 1, 37: 2, 38: 1, 39: 1, 40: 2, 41: 1, 42: 1, 43: 1, 44: 2, 45: 19, 46: 1, 47: 18, 48: 5, 49: 5, 50: 2, 51: 1, 52: 8, 53: 1, 54: 1, 55: 8, 56: 3, 57: 2, 58: 1, 59: 3, 60: 1, 61: 1, 62: 10, 63: 2, 64: 1, 65: 6, 66: 1, 67: 1, 68: 5, 69: 2, 70: 3, 71: 2, 72: 1, 73: 4, 74: 3, 75: 1, 76: 1, 77: 8, 78: 1, 79: 1, 80: 1, 81: 6, 82: 2, 83: 2, 84: 7, 85: 4, 86: 3, 87: 1, 88: 1, 89: 1, 90: 2, 91: 6, 92: 1, 93: 1, 94: 4, 95: 1, 96: 1, 97: 1, 98: 1, 99: 1, 100: 3, 101: 1, 102: 7, 103: 8, 104: 4, 105: 3, 106: 1, 107: 1, 108: 13, 109: 1, 110: 1, 111: 9, 112: 1, 113: 9, 114: 1, 115: 1, 116: 7, 117: 7, 118: 1, 119: 66, 120: 3, 121: 3}}
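Solution
Do you know all the changes that are needed? It is manual work, but the simplest approach is to replace every wrong value with the correct one and then use groupby.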
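# Put all corrections here: map each wrong spelling to the correct one
replace_dict = {'United Knigdom': 'United Kingdom', 'Congo': 'Congo Democratic Republic'}
df['country'] = df['country'].replace(replace_dict)
# Sum the accidents of rows that now share the same country name
df.groupby('country').sum().reset_index()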
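For completeness, here is a minimal end-to-end sketch that also covers the first step from the question (dropping non-countries). It runs on a small sample taken from the question's data rather than the full table. The non_countries list and the extra Tanzanie entry in replace_dict are assumptions for illustration; extend both after inspecting your own data.

```python
import pandas as pd

# Small sample taken from the question's data (not the full table).
df = pd.DataFrame({
    "country": ["Airport", "Congo", "Congo Democratic Republic",
                "Indian Ocean", "Tanzania", "Tanzanie",
                "United Kingdom", "United Knigdom", "USA"],
    "accidents": [2, 2, 3, 1, 1, 1, 7, 1, 66],
})

# Assumption: labels that are not countries. Extend as needed.
non_countries = ["Airport", "Atlantic Ocean", "Indian Ocean", "Pacific Ocean"]

# Assumption: variant spelling -> canonical name. Extend as needed.
replace_dict = {
    "United Knigdom": "United Kingdom",
    "Congo": "Congo Democratic Republic",
    "Tanzanie": "Tanzania",
}

# Step 1: keep only the rows whose country is not in the exclusion list.
# Step 2: rewrite variant spellings to their canonical name.
cleaned = (
    df.loc[~df["country"].isin(non_countries)]
      .assign(country=lambda d: d["country"].replace(replace_dict))
)

# Step 3: sum accidents over rows that now share the same name.
result = cleaned.groupby("country", as_index=False)["accidents"].sum()
print(result)
```

On this sample, United Kingdom ends up with 8 accidents (7 + 1). Note that merging Congo into Congo Democratic Republic follows the question's request; they are two different countries, so drop that entry from replace_dict if you want to keep them apart.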