pandas - str.contains and str.find give different results
Problem description
As far as I can tell, both of these should give the same answer:
import pandas as pd
train = pd.read_csv('https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv')
train.name.str.contains('Mr.').sum()
(train.name.str.find('Mr.')>0).sum()
But the output is:
647
517
What is the reason for the different results?
Solution
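The difference is that str.contains also matches Mrs., because str.contains treats its pattern as a regular expression by default, and . is a special regex character (it matches any single character except a newline). So the pattern 'Mr.' matches the Mrs in Mrs. as well as Mr. itself, whereas str.find searches for the literal substring 'Mr.'. To get the same result, either escape the dot or pass regex=False: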
print(train.name.str.contains(r'Mr\.').sum())
517
print(train.name.str.contains('Mr.', regex=False).sum())
517
print((train.name.str.find('Mr.')>0).sum())
517
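The same behaviour can be reproduced without downloading the dataset. The following is a minimal sketch on a handful of made-up names (the sample Series is an assumption for illustration, not part of the Titanic data). It also shows one caveat of the comparison used above: str.find returns -1 when the substring is absent and 0 when it sits at the very start of the string, so `> 0` would miss a name that begins with 'Mr.', and `>= 0` is the safer test.

```python
import re

import pandas as pd

# Hypothetical sample names, made up for illustration (not the Titanic data).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Mr. Smith",
])

# Default regex=True: "." matches any character, so "Mrs" also matches.
regex_hits = names.str.contains("Mr.")

# re.escape turns "Mr." into the pattern Mr\. so the dot is literal.
escaped_hits = names.str.contains(re.escape("Mr."))

# regex=False performs a plain substring test.
literal_hits = names.str.contains("Mr.", regex=False)

# str.find returns -1 when the substring is absent, 0 when it is at the start.
find_ge0 = names.str.find("Mr.") >= 0
find_gt0 = names.str.find("Mr.") > 0  # misses "Mr. Smith" (position 0)

print(regex_hits.sum(), escaped_hits.sum(), literal_hits.sum(),
      find_ge0.sum(), find_gt0.sum())
```

In the Titanic names the title always follows the surname, so `> 0` and `>= 0` happen to agree there.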
Inspecting the rows where the two differ:
a = train.loc[train.name.str.contains('Mr.'), 'name']
b = train.loc[(train.name.str.find('Mr.')>0), 'name']
c = pd.concat([a, b], axis=1, keys=('contains','find'))
c = c[c.isnull().any(axis=1)]
print(c)
                                             contains  find
1   Cumings, Mrs. John Bradley (Florence Briggs Th...   NaN
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)   NaN
8   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   NaN
9                 Nasser, Mrs. Nicholas (Adele Achem)   NaN
15                    Hewlett, Mrs. (Mary D Kingcome)   NaN
18  Vander Planke, Mrs. Julius (Emelia Maria Vande...   NaN
19                            Masselmani, Mrs. Fatima   NaN
25  Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...   NaN
31     Spencer, Mrs. William Augustus (Marie Eugenie)   NaN
40     Ahlin, Mrs. Johan (Johanna Persdotter Larsson)   NaN
41  Turpin, Mrs. William John Robert (Dorothy Ann ...   NaN
49      Arnold-Franchi, Mrs. Josef (Josefine Franchi)   NaN
52           Harper, Mrs. Henry Sleeper (Myna Haxtun)   NaN
53  Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin...   NaN
66                       Nye, Mrs. (Elizabeth Ramell)   NaN
85  Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu...   NaN
...
...
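An alternative way to isolate the disagreeing rows is to compare the two boolean masks directly instead of concatenating the selections. This is only a sketch on made-up sample names (the Series below is an assumption, not the Titanic data); `na=False` is passed to str.contains so that any missing names would count as non-matches.

```python
import pandas as pd

# Hypothetical sample names, made up for illustration (not the Titanic data).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Mr. Smith",
], name="name")

by_contains = names.str.contains("Mr.", na=False)  # regex: "." is a wildcard
by_find = names.str.find("Mr.") >= 0               # literal substring search

# XOR keeps the rows where exactly one of the two tests is True.
only_one = names[by_contains ^ by_find]
print(only_one)
```

On this sample the only row selected is the one containing "Mrs.", which the regex pattern matches and the literal search does not.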