Extract one URL from each group of similar URLs

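Problem description

I have a CSV file containing thousands of URLs. How can I randomly select one URL for each base URL (that is, one per site)? The order in which the URLs are returned does not matter, but the selection must be random.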

import pandas as pd

# initialise data of lists.
data = {'url':['https://alabamasymphony.org/event/shamrocks-strings', 
               'https://alabamasymphony.org/event/emperor', 
               'https://mobilesymphony.org/event/fanfare',
               'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
               'https://www.hso.org/concerts/liszt-fantasy/',
               'https://www.juneausymphony.org/apr2019/']}

# Create DataFrame
df = pd.DataFrame(data)
df

Expected output

['https://alabamasymphony.org/event/emperor','https://mobilesymphony.org/event/fanfare','https://www.hso.org/concerts/liszt-fantasy/','https://www.juneausymphony.org/apr2019/']

Tags: python, pandas

Solution


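The first thing you need to do is extract the base URL of each entry, which can be done with urllib.parse.urlparse from the standard library.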
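To make this step concrete, here is a minimal sketch of what urlparse returns for one of the URLs from the question. The variable names are only illustrative; netloc is the component used as the base URL in this answer.

```python
from urllib.parse import urlparse

# One of the sample URLs from the question.
url = 'https://www.hso.org/concerts/liszt-fantasy/'

# urlparse splits the URL into named components.
parts = urlparse(url)

print(parts.scheme)  # https
print(parts.netloc)  # www.hso.org
print(parts.path)    # /concerts/liszt-fantasy/
```

Note that netloc keeps any "www." prefix, so www.example.org and example.org would be treated as two different base URLs unless the prefix is normalised first.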

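Then you can use groupby together with sample to pick one random URL for each base_url. Note that sample on a groupby object requires pandas 1.1 or later.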

import urllib.parse
import pandas as pd


# initialise data of lists.
data = {'url':['https://alabamasymphony.org/event/shamrocks-strings', 
               'https://alabamasymphony.org/event/emperor', 
               'https://mobilesymphony.org/event/fanfare',
               'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
               'https://www.hso.org/concerts/liszt-fantasy/',
               'https://www.juneausymphony.org/apr2019/']}

# Create DataFrame
df = pd.DataFrame(data)

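# Use the network location (host name) of each URL as its base URL.
df['base_url'] = df['url'].apply(lambda url: urllib.parse.urlparse(url).netloc)

# Draw one random row from each base_url group.
# The result is named `sampled` so it does not shadow the stdlib module `random`.
sampled = df.groupby('base_url').sample(n=1)

print(sampled)

Output (one possible result; the rows picked can change between runs):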
                                           url                base_url
1    https://alabamasymphony.org/event/emperor     alabamasymphony.org
2     https://mobilesymphony.org/event/fanfare      mobilesymphony.org
4  https://www.hso.org/concerts/liszt-fantasy/             www.hso.org
5      https://www.juneausymphony.org/apr2019/  www.juneausymphony.org
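The question asks for a plain list rather than a DataFrame. One way to get there, sketched under the assumption that pandas 1.1 or later is installed, is to wrap the steps above in a small helper. The function name pick_one_per_host and its seed parameter are made up for this example and are not part of pandas; seed is simply forwarded to random_state so that a run can be repeated.

```python
from urllib.parse import urlparse

import pandas as pd


def pick_one_per_host(urls, seed=None):
    """Return one randomly chosen URL per host (netloc) as a list.

    Hypothetical helper for illustration only. `seed` is passed to
    pandas' `random_state`; leave it as None for a fresh random pick.
    """
    df = pd.DataFrame({'url': list(urls)})
    # Group key: the host part of each URL.
    df['base_url'] = df['url'].apply(lambda u: urlparse(u).netloc)
    # One random row per group, then keep only the URL column.
    picked = df.groupby('base_url').sample(n=1, random_state=seed)
    return picked['url'].tolist()


urls = [
    'https://alabamasymphony.org/event/shamrocks-strings',
    'https://alabamasymphony.org/event/emperor',
    'https://mobilesymphony.org/event/fanfare',
    'https://mobilesymphony.org/event/the-fireworks-of-jupiter/',
    'https://www.hso.org/concerts/liszt-fantasy/',
    'https://www.juneausymphony.org/apr2019/',
]

result = pick_one_per_host(urls)
print(result)  # a list of four URLs, one per host
```

For the real data, the URL column could be read with pd.read_csv and passed to the helper, for example pick_one_per_host(pd.read_csv('urls.csv')['url']), where the file name and the column name are placeholders for whatever the actual CSV uses.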
