Grouping the results of a function into a pandas DataFrame in Python

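Problem description

I have a list of queries that I search for in Google News.

The output gives me all the links to the matching news items in a single list.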

import requests
from bs4 import BeautifulSoup

rqsts_catdogtiger = ['Cat', 'Dog', 'Tiger']

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page=0 #first page of google news (10 first news)
url_list = []
for term in rqsts_catdogtiger[0:3]:      
    url = 'https://www.google.com/search?q={}&tbm=nws&start={}'.format(term,page) #url of request
    print(url)
    url_list.append(url)       

soups = []

for link in url_list:
    response = requests.get(link, headers=headers,verify=False)
    soup = BeautifulSoup(response.text, 'html.parser')
    soups.append(soup)

def find_links():
    for soup in soups:
        results = soup.findAll("div", {'class': 'g'}) #class of google news
        for result in results:
            result_link = result.find('a').get('href') #getting links
            yield result_link

list_of_links = list(find_links())

list_of_links

The output is a list of 30 links: 10 for Cat, 10 for Dog, and 10 for Tiger.

How can I combine this result into a pd.DataFrame like this:

    Request Name                   Links
0            Cat      'https://www.polygon.com/2020/3/19/21187025/cats-2019-tom-hooper-mr-mistoffelees-broadway-musical',...
1            Dog      'https://nypost.com/2020/03/19/second-dog-in-hong-kong-tests-positive-for-coronavirus/',...
2          Tiger      'https://tvrain.ru/teleshow/doma_pogovorim/tiger_cave-504935/',...

list_of_links currently looks like this:

['https://www.polygon.com/2020/3/19/21187025/cats-2019-tom-hooper-mr-mistoffelees-broadway-musical',
 'https://pagesix.com/2020/03/19/anthony-hopkins-plays-piano-for-cat-while-at-home-amid-coronavirus-pandemic/',
 'https://www.snopes.com/fact-check/butthole-cut-of-the-movie-cats/',
 'https://www.vox.com/culture/2020/3/18/21185255/cats-movie-twitter-release-the-butthole-cut-meme',
 'https://mashable.com/video/cat-domino-video-coronavirus/',
 'https://www.newyorker.com/humor/daily-shouts/quarantine-tips-from-my-cat',
 'https://www.nydailynews.com/coronavirus/ny-coronavirus-cat-angry-family-home-20200318-4w2v624fpzggdco5eh4afydsiq-story.html',
 'https://santaclaritafree.com/gazette/news/the-cat-is-out-of-the-bag',
 'https://www.huffpost.com/entry/dog-cat-coronavirus-what-to-do_l_5e7156cbc5b63c0231e42a4c',
 'https://www.kxl.com/9-quarantine-tips-from-your-cat/',
 'https://nypost.com/2020/03/19/second-dog-in-hong-kong-tests-positive-for-coronavirus/',
 'https://www.thecut.com/2020/03/walking-the-dog-is-the-only-time-i-feel-sane.html',
 'https://wtop.com/coronavirus/2020/03/curbside-dog-drop-off-emerges-in-pandemic/',
 'https://www.wsaw.com/content/news/2-charged-with-outdoor-dogs-death-not-providing-proper-food-or-shelter-for-others-568933521.html',
 'https://www.theguardian.com/lifeandstyle/2020/mar/18/working-like-a-dog-an-instagram-account-capturing-the-bright-side-of-social-distance',
 'https://www.nytimes.com/2020/03/17/smarter-living/dog-pets-quarantine-coronavirus-tips.html',
 'https://www.wnep.com/video/weather/accuweather/this-dog-is-not-ready-for-winter-to-go-just-yet/607-c8915a58-b8ff-49d1-9e91-1ffd6d8fd175',
 'https://www.thelocal.es/20200319/why-everyone-in-spain-wishes-they-had-a-dog-during-the-coronavirus-lockdown',
 'https://www.washingtonpost.com/science/2020/03/18/coronavirus-dogs-pets/',
 'https://time.com/5806617/law-and-order-dog/',
 'https://tvrain.ru/teleshow/doma_pogovorim/tiger_cave-504935/',
 'https://nypost.com/2020/03/19/everything-you-need-to-know-about-netflixs-new-joe-exotic-doc-tiger-king/',
 'https://www.golfchannel.com/news/day-golf-tiger-woods-wins-first-bay-hill-title',
 'https://www.racingtv.com/news/national-duty-could-still-be-on-the-agenda-for-tiger-roll',
 'https://tvline.com/2020/03/19/coronavirus-homeschool-resources-tips-daniel-tigers-neighborhood/',
 'https://www.pnj.com/story/news/2020/03/18/netflix-series-tiger-king-joe-exotic-released-friday/5062782002/',
 'https://www.sen.com.au/news/2020/03/19/didnt-realise-how-much-rubbish-we-talk-tigers-reaction-to-strange-night',
 'https://www.memphisflyer.com/NewsBlog/archives/2020/03/19/city-preparing-covid-19-drive-thru-testing-site-at-tiger-lane',
 'https://www.dailyexaminer.com.au/news/grafton-tiger-named-captain-of-afl-north-coast-tea/3976977/',
 'https://www.myrtlebeachonline.com/news/local/article241326116.html']

Tags: python, pandas

Solution


If I understand you correctly, you should first prepare the data by splitting list_of_links into sublists of equal length, one per query:

import pandas as pd

rqsts_catdogtiger = ['Cat' , 'Dog', 'Tiger']
list_of_links = [...] # your list of links

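# Number of links per query; this assumes every query returned the same number of links
n = len(list_of_links) // len(rqsts_catdogtiger)
# Consecutive slices of length n: the first n links belong to 'Cat', the next n to 'Dog', and so on
list_of_list_of_links = [list_of_links[i:i + n] for i in range(0, len(list_of_links), n)]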
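
The split above only lines up with the queries when every query returned the same number of links. As a minimal sketch, the same slicing can be wrapped in a helper that refuses to split unevenly. The name split_evenly and the short placeholder strings are assumptions made for illustration and are not part of the original answer:

```python
def split_evenly(items, n_groups):
    """Split items into n_groups consecutive sublists of equal length."""
    # Refuse inputs that cannot be divided into equal, non-empty groups.
    if n_groups <= 0 or len(items) == 0 or len(items) % n_groups != 0:
        raise ValueError("items cannot be split into equal non-empty groups")
    size = len(items) // n_groups
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder strings standing in for the scraped links (hypothetical data).
sample_links = ['cat-1', 'cat-2', 'dog-1', 'dog-2', 'tiger-1', 'tiger-2']
chunks = split_evenly(sample_links, 3)
print(chunks)
```

If one query returns fewer results than the others, the helper raises a ValueError instead of silently assigning links to the wrong query.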

After that, you can easily build a pandas.DataFrame. If you want each cell of the Links column to hold a list, use the following code:

>>> df = pd.DataFrame({'Request Name': rqsts_catdogtiger, 'Links': list_of_list_of_links})
>>> print(df)
  Request Name                                              Links
0          Cat  [https://www.polygon.com/2020/3/19/21187025/ca...
1          Dog  [https://nypost.com/2020/03/19/second-dog-in-h...
2        Tiger  [https://tvrain.ru/teleshow/doma_pogovorim/tig...

If you want the links as one long string, with the links separated by commas, use the following code:

>>> df = pd.DataFrame({'Request Name': rqsts_catdogtiger, 'Links': [', '.join(links) for links in list_of_list_of_links]})
>>> print(df)
  Request Name                                              Links
0          Cat  https://www.polygon.com/2020/3/19/21187025/cat...
1          Dog  https://nypost.com/2020/03/19/second-dog-in-ho...
2        Tiger  https://tvrain.ru/teleshow/doma_pogovorim/tige...
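
Both variants above depend on the links arriving in equal-sized blocks. A possible alternative, sketched here under the assumption that the scraper is changed to record the search term together with each link, is to build a long table of (term, link) pairs and let pandas group it. The pairs list and the example.com URLs below are hypothetical placeholders, not data from the question:

```python
import pandas as pd

# Hypothetical (term, link) pairs, as they might be collected if the
# generator yielded the search term alongside each link.
pairs = [
    ('Cat', 'https://example.com/cat-1'),
    ('Cat', 'https://example.com/cat-2'),
    ('Dog', 'https://example.com/dog-1'),
    ('Tiger', 'https://example.com/tiger-1'),
    ('Tiger', 'https://example.com/tiger-2'),
    ('Tiger', 'https://example.com/tiger-3'),
]

# One row per link.
long_df = pd.DataFrame(pairs, columns=['Request Name', 'Links'])

# Collect the links of each term into a list; sort=False keeps the
# terms in the order in which they first appear.
grouped = (
    long_df.groupby('Request Name', sort=False)['Links']
    .agg(list)
    .reset_index()
)
print(grouped)
```

This layout does not require every query to return the same number of results. Replacing agg(list) with agg(', '.join) would give the comma-separated string variant instead.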
