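python - Grouping the results of a function into a DataFrame in Python
Question
I have a list of requests that I search for in Google News.
The output gives me all the links to the matching news items in a single list.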
import requests
from bs4 import BeautifulSoup

rqsts_catdogtiger = ['Cat', 'Dog', 'Tiger']
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = 0  # first page of Google News (first 10 news items)

# Build one search URL per request term
url_list = []
for term in rqsts_catdogtiger[0:3]:
    url = 'https://www.google.com/search?q={}&tbm=nws&start={}'.format(term, page)  # URL of the request
    print(url)
    url_list.append(url)

# Download and parse each results page
soups = []
for link in url_list:
    response = requests.get(link, headers=headers, verify=False)
    soup = BeautifulSoup(response.text, 'html.parser')
    soups.append(soup)

def find_links():
    for soup in soups:
        results = soup.findAll("div", {'class': 'g'})  # class of Google News results
        for result in results:
            result_link = result.find('a').get('href')  # getting links
            yield result_link

list_of_links = list(find_links())
list_of_links
The output is a list of 30 links: 10 for Cat, 10 for Dog, and 10 for Tiger.
How can I combine this result into a pd.DataFrame like this:
  Request Name  Links
0 Cat           'https://www.polygon.com/2020/3/19/21187025/cats-2019-tom-hooper-mr-mistoffelees-broadway-musical',...
1 Dog           'https://nypost.com/2020/03/19/second-dog-in-hong-kong-tests-positive-for-coronavirus/',...
2 Tiger         'https://tvrain.ru/teleshow/doma_pogovorim/tiger_cave-504935/',...
list_of_links currently looks like this:
['https://www.polygon.com/2020/3/19/21187025/cats-2019-tom-hooper-mr-mistoffelees-broadway-musical',
'https://pagesix.com/2020/03/19/anthony-hopkins-plays-piano-for-cat-while-at-home-amid-coronavirus-pandemic/',
'https://www.snopes.com/fact-check/butthole-cut-of-the-movie-cats/',
'https://www.vox.com/culture/2020/3/18/21185255/cats-movie-twitter-release-the-butthole-cut-meme',
'https://mashable.com/video/cat-domino-video-coronavirus/',
'https://www.newyorker.com/humor/daily-shouts/quarantine-tips-from-my-cat',
'https://www.nydailynews.com/coronavirus/ny-coronavirus-cat-angry-family-home-20200318-4w2v624fpzggdco5eh4afydsiq-story.html',
'https://santaclaritafree.com/gazette/news/the-cat-is-out-of-the-bag',
'https://www.huffpost.com/entry/dog-cat-coronavirus-what-to-do_l_5e7156cbc5b63c0231e42a4c',
'https://www.kxl.com/9-quarantine-tips-from-your-cat/',
'https://nypost.com/2020/03/19/second-dog-in-hong-kong-tests-positive-for-coronavirus/',
'https://www.thecut.com/2020/03/walking-the-dog-is-the-only-time-i-feel-sane.html',
'https://wtop.com/coronavirus/2020/03/curbside-dog-drop-off-emerges-in-pandemic/',
'https://www.wsaw.com/content/news/2-charged-with-outdoor-dogs-death-not-providing-proper-food-or-shelter-for-others-568933521.html',
'https://www.theguardian.com/lifeandstyle/2020/mar/18/working-like-a-dog-an-instagram-account-capturing-the-bright-side-of-social-distance',
'https://www.nytimes.com/2020/03/17/smarter-living/dog-pets-quarantine-coronavirus-tips.html',
'https://www.wnep.com/video/weather/accuweather/this-dog-is-not-ready-for-winter-to-go-just-yet/607-c8915a58-b8ff-49d1-9e91-1ffd6d8fd175',
'https://www.thelocal.es/20200319/why-everyone-in-spain-wishes-they-had-a-dog-during-the-coronavirus-lockdown',
'https://www.washingtonpost.com/science/2020/03/18/coronavirus-dogs-pets/',
'https://time.com/5806617/law-and-order-dog/',
'https://tvrain.ru/teleshow/doma_pogovorim/tiger_cave-504935/',
'https://nypost.com/2020/03/19/everything-you-need-to-know-about-netflixs-new-joe-exotic-doc-tiger-king/',
'https://www.golfchannel.com/news/day-golf-tiger-woods-wins-first-bay-hill-title',
'https://www.racingtv.com/news/national-duty-could-still-be-on-the-agenda-for-tiger-roll',
'https://tvline.com/2020/03/19/coronavirus-homeschool-resources-tips-daniel-tigers-neighborhood/',
'https://www.pnj.com/story/news/2020/03/18/netflix-series-tiger-king-joe-exotic-released-friday/5062782002/',
'https://www.sen.com.au/news/2020/03/19/didnt-realise-how-much-rubbish-we-talk-tigers-reaction-to-strange-night',
'https://www.memphisflyer.com/NewsBlog/archives/2020/03/19/city-preparing-covid-19-drive-thru-testing-site-at-tiger-lane',
'https://www.dailyexaminer.com.au/news/grafton-tiger-named-captain-of-afl-north-coast-tea/3976977/',
'https://www.myrtlebeachonline.com/news/local/article241326116.html']
Solution
If I understand you correctly, you should first prepare the data by splitting list_of_links into sublists of equal length:
import pandas as pd
rqsts_catdogtiger = ['Cat', 'Dog', 'Tiger']
list_of_links = [...] # your list of links
# Links per request; this assumes every request returned the same number of links
n = len(list_of_links) // len(rqsts_catdogtiger)
list_of_list_of_links = [list_of_links[i:i + n] for i in range(0, len(list_of_links), n)]
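The split only lines up with the request names when every request returned the same number of links. As a minimal sketch of the same slicing idea with that assumption checked explicitly, the helper below (the name `split_evenly` and the toy link strings are made up for illustration) raises an error instead of silently producing misaligned groups:

```python
def split_evenly(items, n_groups):
    """Split items into n_groups consecutive chunks of equal length."""
    if n_groups <= 0 or len(items) % n_groups != 0:
        raise ValueError("items cannot be split evenly into n_groups")
    if not items:
        # Nothing to slice; return one empty chunk per group
        return [[] for _ in range(n_groups)]
    size = len(items) // n_groups
    return [items[i:i + size] for i in range(0, len(items), size)]


# Toy data standing in for the scraped links (hypothetical values)
links = ['cat-1', 'cat-2', 'dog-1', 'dog-2', 'tiger-1', 'tiger-2']
chunks = split_evenly(links, 3)
print(chunks)
```

If one search returns fewer results than the others, the helper fails loudly, which is usually easier to debug than links attached to the wrong request name.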
After that, you can easily build a pandas.DataFrame. If you want the Links column to hold lists, use the following code:
>>> df = pd.DataFrame({'Request Name': rqsts_catdogtiger, 'Links': list_of_list_of_links})
>>> print(df)
Request Name Links
0 Cat [https://www.polygon.com/2020/3/19/21187025/ca...
1 Dog [https://nypost.com/2020/03/19/second-dog-in-h...
2 Tiger [https://tvrain.ru/teleshow/doma_pogovorim/tig...
If you want the links in one long string, with each link separated by a comma, use the following code:
>>> df = pd.DataFrame({'Request Name': rqsts_catdogtiger, 'Links': [', '.join(l_of_urls) for l_of_urls in list_of_list_of_links]})
>>> print(df)
Request Name Links
0 Cat https://www.polygon.com/2020/3/19/21187025/cat...
1 Dog https://nypost.com/2020/03/19/second-dog-in-ho...
2 Tiger https://tvrain.ru/teleshow/doma_pogovorim/tige...
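An alternative that does not depend on every request returning the same number of links is to keep the request name next to each link and let pandas do the grouping. This is only a sketch under assumptions: the `pairs` list and its example.com URLs are invented for illustration, and in the scraper they would have to come from yielding `(term, link)` tuples instead of bare links.

```python
import pandas as pd

# Hypothetical (request name, link) pairs; note the uneven counts per name
pairs = [
    ('Cat', 'https://example.com/cat-1'),
    ('Cat', 'https://example.com/cat-2'),
    ('Dog', 'https://example.com/dog-1'),
    ('Tiger', 'https://example.com/tiger-1'),
    ('Tiger', 'https://example.com/tiger-2'),
]

# One row per link, then collect the links of each request into a list
long_df = pd.DataFrame(pairs, columns=['Request Name', 'Links'])
df = (
    long_df.groupby('Request Name', sort=False)['Links']
    .agg(list)
    .reset_index()
)
print(df)
```

Passing `sort=False` keeps the request names in the order they first appear rather than sorting them alphabetically.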