Check whether values in a CSV file are URLs

Problem description

I want to remove values that are not URLs from a CSV file. Our df['url'] contains values such as 'https://stackoverflow.com/questions/ask', 'https://www.linkedin.com/feed/' and '345', and I want to remove 345.

import re

import pandas as pd


def Find_url(string):
    # Return every http(s) URL found in the string (an empty list if there is none)
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
    return url


if __name__ == "__main__":
    file = pd.read_csv('url_file.csv')
    df = pd.DataFrame(file)
    for i in range(len(df)):
        url = Find_url(df.loc[i]['url'])
        df.loc[i]['url'] = url
    df.to_csv('clean_url.csv')

Sample input:

 'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365'
 'https://en.wikipedia.org/wiki/Railway_Board'
 'https://en.wikipedia.org/wiki/Railway_Board#History'

Desired sample output:

 'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
 'https://en.wikipedia.org/wiki/Railway_Board'
 'https://en.wikipedia.org/wiki/Railway_Board#History'

Tags: python

Solution


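You can use urllib.parse from the standard library to try parsing each string as a URL, then check that the result has the necessary attributes.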

from io import StringIO
from urllib.parse import urlparse
import pandas as pd

def url_validator(x):
    try:
        result = urlparse(x)
        # check non-empty attributes
        return all((result.scheme, result.netloc, result.path))
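    except (AttributeError, ValueError):
        # AttributeError: non-string input; ValueError: malformed URL (e.g. an unclosed IPv6 bracket)
        return False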

mystr = StringIO("""https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560
http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365
https://en.wikipedia.org/wiki/Railway_Board
https://en.wikipedia.org/wiki/Railway_Board#History""")

# replace mystr with 'file.csv'
df = pd.read_csv(mystr, header=None, names=['values'])

# apply filter based on checker function
df = df[df['values'].apply(url_validator)]

print(df)

                                              values
0  https://www.zaubacorp.com/company/HINDUSTAN-CA...
1  http://www.indianrailways.gov.in/railwayboard/...
5        https://en.wikipedia.org/wiki/Railway_Board
6  https://en.wikipedia.org/wiki/Railway_Board#Hi...
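
The same idea can be applied directly to the url column from the question. The following is a minimal sketch rather than a definitive implementation. The helper names is_http_url and keep_url_rows are hypothetical, and the small in-memory DataFrame stands in for pd.read_csv('url_file.csv'). Unlike url_validator above, this variant only accepts the http and https schemes and does not require a non-empty path, so a bare 'https://example.com' would also be kept; whether that is desirable depends on your data.

```python
from urllib.parse import urlparse

import pandas as pd


def is_http_url(value):
    """Return True if value parses as an http(s) URL with a host (hypothetical helper)."""
    if not isinstance(value, str):
        # NaN cells and numeric cells are never URLs
        return False
    try:
        parts = urlparse(value.strip())
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. an unclosed IPv6 bracket
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def keep_url_rows(df, column="url"):
    """Return only the rows of df whose `column` holds an http(s) URL."""
    mask = df[column].apply(is_http_url)
    return df[mask].reset_index(drop=True)


# In-memory stand-in for: df = pd.read_csv('url_file.csv')
df = pd.DataFrame({"url": [
    "https://stackoverflow.com/questions/ask",
    "https://www.linkedin.com/feed/",
    "345",
]})

clean = keep_url_rows(df)
print(clean)
```

To save the result as in the question, clean.to_csv('clean_url.csv', index=False) writes the filtered rows without the extra index column.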
