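python - Check whether a value in a CSV file is a URL
Problem description
I want to remove the values that are not URLs from a CSV file. Our df['url'] contains values such as 'https://stackoverflow.com/questions/ask', 'https://www.linkedin.com/feed/' and '345', and I want to remove 345.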
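import re
import pandas as pd

def Find_url(string):
    # Return every http(s) URL found in the string (an empty list if there is none)
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(string))
    return url

if __name__ == "__main__":
    # read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed
    df = pd.read_csv('url_file.csv')
    # Keep only the rows whose 'url' value contains at least one URL
    has_url = df['url'].apply(lambda value: len(Find_url(value)) > 0)
    df = df[has_url]
    df.to_csv('clean_url.csv')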
Sample input:
'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365'
'https://en.wikipedia.org/wiki/Railway_Board'
'https://en.wikipedia.org/wiki/Railway_Board#History'
This is the output I want. Sample output:
'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
'https://en.wikipedia.org/wiki/Railway_Board'
'https://en.wikipedia.org/wiki/Railway_Board#History'
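Solution
You can use urllib.parse from the standard library: try to parse each string as a URL and check that the result has the necessary attributes.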
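As a quick illustration of what that check relies on, the sketch below (not part of the original answer) prints the components that urlparse returns for one of the sample URLs and for a bare number. The variable names full and bare are only illustrative.

```python
from urllib.parse import urlparse

# A full URL: scheme, netloc and path are all non-empty
full = urlparse("https://en.wikipedia.org/wiki/Railway_Board#History")
print(full.scheme, full.netloc, full.path, full.fragment)

# A bare number parses without raising, but only the path is filled in
bare = urlparse("345")
print(repr(bare.scheme), repr(bare.netloc), repr(bare.path))
```

A non-URL value such as 345 parses without raising an error, so a validator has to inspect the scheme and netloc attributes rather than rely on an exception.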
from io import StringIO
from urllib.parse import urlparse
import pandas as pd

def url_validator(x):
    try:
        result = urlparse(x)
        # check non-empty attributes
        return all((result.scheme, result.netloc, result.path))
    except AttributeError:
        return False

mystr = StringIO("""https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560
http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365
https://en.wikipedia.org/wiki/Railway_Board
https://en.wikipedia.org/wiki/Railway_Board#History""")

# replace mystr with 'file.csv'
df = pd.read_csv(mystr, header=None, names=['values'])

# apply filter based on checker function
df = df[df['values'].apply(url_validator)]
print(df)
                                              values
0  https://www.zaubacorp.com/company/HINDUSTAN-CA...
1  http://www.indianrailways.gov.in/railwayboard/...
5        https://en.wikipedia.org/wiki/Railway_Board
6  https://en.wikipedia.org/wiki/Railway_Board#Hi...
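The validator above requires a non-empty path, so a value such as https://example.com (no trailing slash) would be dropped. If that is not what you want, one possible variation is sketched below. It applies the same idea to a column named url, as in the question, but only checks for an http or https scheme and a host. The function name is_http_url, the in-memory stand-in for url_file.csv, and its contents are assumptions made for illustration.

```python
from io import StringIO
from urllib.parse import urlparse

import pandas as pd


def is_http_url(value):
    """Return True if value parses as an http(s) URL with a host."""
    if not isinstance(value, str):
        return False  # NaN and numeric cells are never URLs
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False  # urlparse rejects some malformed inputs, e.g. bad IPv6 literals
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Stand-in for 'url_file.csv' (hypothetical contents with a 'url' header)
raw = StringIO(
    "url\n"
    "https://stackoverflow.com/questions/ask\n"
    "https://www.linkedin.com/feed/\n"
    "345\n"
    "https://example.com\n"
)

# Read every cell as text so that numeric-looking values stay strings
df = pd.read_csv(raw, dtype={"url": str})
clean = df[df["url"].apply(is_http_url)].reset_index(drop=True)

# Written to an in-memory buffer here; pass a path such as 'clean_url.csv' instead
out = StringIO()
clean.to_csv(out, index=False)
print(out.getvalue())
```

Restricting the scheme also rejects strings such as mailto: or ftp: links, which may or may not suit your data. Adjust the tuple of accepted schemes as needed.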