python - Regex in Python / pandas produces strange end-of-line characters
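Question
I've just started using pandas and am building a domain-cleaning tool. Essentially, I want to strip all subdomains and keep only the main domain + TLD.
The code below works in ipython for a single domain, but I'm struggling to apply it to a dataframe of multiple domains.
The script appears to run, but the regex leaves end-of-line characters (shown below) at the end of the domain parts (e.g. com\n1').
I'm not sure what these characters are. I tried rstrip, but it didn't help. Can anyone explain what they are and how to get rid of them so the script works properly?
Output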
['0 graph', 'facebook', 'com\n1 news', 'bbc', 'co', 'uk\n2
Expected output
When I run the same split in ipython on a single string, I get the following. I need the same split when working on the df column.
In [12]: re.split(r'\.(?!\d)', (str('domain.domain.com')))
Out[12]: ['domain', 'domain', 'com']
Input
In [1]: import pandas as pd
In [2]: import re
In [3]: path = "Desktop/domains.csv"
In [4]: df = pd.read_csv(path, delimiter=',', header='infer')
In [5]: df
Out[5]:
Domain
0 graph.facebook.com
1 news.bbc.co.uk
2 news.more.news.bbc.co.uk
3 profile.username.co
4 offers.o2.co.uk
5 subdomain.pyspark.org
6 uds.data.domain.net
In [7]: for index, row in df.iterrows():
   ...:     tld = ['co.uk', 'com', 'org', 'co', 'net']
   ...:     index = re.split(r'\.(?!\d)', (str(df.Domain)))
   ...:     print(index)
   ...:     if str(index[len(index)-2]).rstrip()+'.'+ str(index[len(index)-1]).rstrip() in tld:
   ...:         print(str(index[len(index)-3])+'.'+str(index[len(index)-2])+'.'+ str(index[len(index)-1]))
   ...:     elif str(index[len(index)-1]) in tld:
   ...:         print(str(index[len(index)-2])+'.'+ str(index[len(index)-1]))
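The stray `\n1` fragments most likely come from `str(df.Domain)`: calling `str()` on a whole column returns its printed representation (index labels, values, and a newline between rows), so the regex splits that entire block of text. The newline then sits in the middle of an element such as `'com\n1 news'`, which is why `rstrip` has no visible effect. The following is a minimal sketch of that behaviour; the two-row dataframe is a stand-in for the CSV, not the original file.

```python
import re
import pandas as pd

# Small stand-in for the CSV used in the question (assumed data).
df = pd.DataFrame({"Domain": ["graph.facebook.com", "news.bbc.co.uk"]})

# str() of the whole column is its printed form: index labels, values,
# newlines between rows, and a trailing "Name: ..., dtype: ..." line.
whole_column = str(df.Domain)
print("\n" in whole_column)

# Splitting each cell value on its own gives clean parts.
split_rows = [re.split(r"\.(?!\d)", value.strip()) for value in df["Domain"]]
print(split_rows)
```

Inside the original loop, the same effect could be had by splitting `row.Domain` (the value of the current row) rather than `df.Domain` (the whole column).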
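Update
Thanks to everyone for the help so far.
The code below now works as expected, except that the output is repeated several times.
For example, facebook.com is printed twice in the output below, and I don't understand why. Can anyone advise?
Input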
In [38]: for row in df.iterrows():
    ...:     tld = ['co.uk', 'com', 'org', 'co', 'net']
    ...:     index = df.Domain[df.Domain.str.strip().str.endswith(tuple(tld))].str.split('.').tolist()
    ...:     for x in index:
    ...:         if str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
    ...:             print(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    ...:         elif str(x[len(x)-1]) in tld:
    ...:             print(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    ...:
    ...:
facebook.com
bbc.co.uk
bbc.co.uk
username.co
o2.co.uk
pyspark.org
domain.net
facebook.com
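The repetition appears to come from the nesting: the outer `for row in df.iterrows()` runs once per row, but its body never uses `row` and instead processes the entire `Domain` column each time, so the full list is printed once for every row. A rough sketch of the effect, using a three-row stand-in dataframe and a hypothetical helper `clean` that mirrors the if/elif logic above:

```python
import pandas as pd

# Assumed sample data; the real input is the CSV from the question.
df = pd.DataFrame(
    {"Domain": ["graph.facebook.com", "news.bbc.co.uk", "offers.o2.co.uk"]}
)
tld = ["co.uk", "com", "org", "co", "net"]


def clean(parts):
    """Hypothetical helper: join the labels that form main domain + TLD."""
    if len(parts) >= 3 and ".".join(parts[-2:]) in tld:
        return ".".join(parts[-3:])
    if len(parts) >= 2 and parts[-1] in tld:
        return ".".join(parts[-2:])
    return None


# Nested version: one outer pass per row, each pass covering ALL rows.
repeated = []
for _ in df.iterrows():
    for parts in df.Domain.str.strip().str.split("."):
        repeated.append(clean(parts))

# Single pass over the column: each domain is handled exactly once.
once = [clean(parts) for parts in df.Domain.str.strip().str.split(".")]

print(len(repeated), len(once))
```

With three rows the nested version yields nine results and the single pass yields three, which matches the pattern of the list starting over again in the output above.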
Solution
Here is the working code:
In [5]: import pandas as pd
In [6]: import re
#Define the path of the file & generate the dataframe from it
In [7]: path = "Desktop/domains.csv"
In [8]: df = pd.read_csv(path, delimiter=',', header='infer')
#Show the dataframe to validate input is correct
In [9]: df
Out[10]:
Domain
0 graph.facebook.com
1 news.bbc.co.uk
2 news.more.news.bbc.co.uk
3 profile.username.co
4 offers.o2.co.uk
5 subdomain.pyspark.org
6 uds.data.domain.net
#Iterate through the rows of the dataframe. If the domain ends with anything in the TLD list, then split the domains at the '.' into a list
In [11]: for row in df.iterrows():
    ...:     tld = ['co.uk', 'com', 'org', 'co', 'net']
    ...:     index = df.Domain[df.Domain.str.strip().str.endswith(tuple(tld))].str.split('.').tolist()
    ...:
#Now, let's look at the output of index. It's a list of lists, so if we select index[1] we will get ['news', 'bbc', 'co', 'uk']
In [12]: index
Out[12]:
[['graph', 'facebook', 'com'],
['news', 'bbc', 'co', 'uk'],
['news', 'more', 'news', 'bbc', 'co', 'uk'],
['profile', 'username', 'co'],
['offers', 'o2', 'co', 'uk'],
['subdomain', 'pyspark', 'org'],
['uds', 'data', 'domain', 'net']]
#We therefore need to iterate over each sub-list in index (each one is the breakdown of one domain)
#Take the length of the list, e.g. ['graph', 'facebook', 'com'] has a length of 3
#Lists are zero-indexed, so a list of length 3 has indices 0-2
#First, check whether the second-to-last element + '.' + the last element is in the TLD list (e.g. co.uk)
#If yes, take the last 3 elements of the list to produce 'bbc.co.uk'
#If no, check whether the last element alone is in the TLD list (e.g. com)
#If it is, print the last 2 elements (e.g. facebook.com)
    ...: for x in index:
    ...:     if str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
    ...:         print(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    ...:     elif str(x[len(x)-1]) in tld:
    ...:         print(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    ...:
#and here is the output
facebook.com
bbc.co.uk
bbc.co.uk
username.co
o2.co.uk
pyspark.org
domain.net
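The same result can probably be reached without index arithmetic by checking which known suffix a domain ends with and keeping one extra label in front of it. The sketch below is one possible way to do that, not the answer's own method; `main_domain` and `TLDS` are hypothetical names, and the suffix list is the hardcoded one from the question (a real tool would likely need a fuller list such as the Public Suffix List).

```python
# Suffix list copied from the question; assumed to be complete for this data.
TLDS = ("co.uk", "com", "org", "co", "net")


def main_domain(domain: str) -> str:
    """Return the last label before a known suffix, plus that suffix."""
    domain = domain.strip()
    # Check longer suffixes first so a multi-part suffix such as "co.uk"
    # is matched before any shorter one.
    for suffix in sorted(TLDS, key=len, reverse=True):
        if domain.endswith("." + suffix):
            labels = domain[: -len(suffix) - 1].split(".")
            return labels[-1] + "." + suffix
    # No known suffix: return the input unchanged.
    return domain


domains = [
    "graph.facebook.com",
    "news.bbc.co.uk",
    "news.more.news.bbc.co.uk",
    "profile.username.co",
    "offers.o2.co.uk",
    "subdomain.pyspark.org",
    "uds.data.domain.net",
]
cleaned = [main_domain(d) for d in domains]
print(cleaned)
```

On the dataframe itself, the function could be applied per cell with `df["Domain"].apply(main_domain)`, which avoids the explicit `iterrows` loop and stores the result as a column instead of printing it.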