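python - Identify the most common words in a specific column, but only for the top 10 songs (another column)
Problem description
I'm having some trouble with this code. I'm supposed to retrieve the 20 most common words from the top 10 songs of each year (1965-2015). There is a rank column, so I think I can identify the top 10 with rank <= 10, but I'm lost on how to get started. This is what I have so far. I haven't handled the top-10 filter yet. Also, the 20 most common words should come from the lyrics column (i.e. column index 4).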
import collections
import csv
import re
words = re.findall(r'\w+', open('billboard_songs.csv').read().lower())
reader = csv.reader(words, delimiter=',')
csvrow = [row[4] for row in reader]
most_common = collections.Counter(words[4]).most_common(20)
print(most_common)
The first lines of my file look like this:
"Rank","Song","Artist","Year","Lyrics","Source"
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs .....,3
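When the rank reaches 100, it starts again from 1 for the next year, and so on.
Solution
You can use csv.DictReader to parse the file and turn it into a list of Python dictionaries that is easy to work with. Then you can use list comprehensions and itertools.groupby() to extract the song information you need. Finally, you can use collections.Counter to find the most common words in the songs.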
#!/usr/bin/env python
import collections
import csv
import itertools


def analyze_songs(songs):
    # Grouping songs by year (groupby can only be used with a sorted list)
    sorted_songs = sorted(songs, key=lambda s: s["year"])
    for year, songs_iter in itertools.groupby(sorted_songs, key=lambda s: s["year"]):
        # Extract lyrics of top 10 songs
        top_10_songs_lyrics = [
            song["lyrics"] for song in songs_iter if song["rank"] <= 10
        ]
        # Join all lyrics together from all songs, and then split them into
        # a big list of words.
        top_10_songs_words = (" ".join(top_10_songs_lyrics)).split()
        # Using Counter to find the top 20 words
        most_common_words = collections.Counter(top_10_songs_words).most_common(20)
        print(f"Year {year}, most common words: {most_common_words}")


with open("billboard_songs.csv") as songs_file:
    reader = csv.DictReader(songs_file)
    # Transform the entries to a simpler format with appropriate types
    songs = [
        {
            "rank": int(row["Rank"]),
            "song": row["Song"],
            "artist": row["Artist"],
            "year": int(row["Year"]),
            "lyrics": row["Lyrics"],
            "source": row["Source"],
        }
        for row in reader
    ]

analyze_songs(songs)
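The code above splits lyrics on whitespace only, so differently cased words or words with attached punctuation would be counted as distinct entries. If that matters for your data, one possible variation is to lowercase the text and extract words with the same r'\w+' pattern used in the question. This is only a sketch: the helper name count_words and the sample lyrics are hypothetical, not part of the original answer.

```python
import collections
import re


def count_words(lyrics_list, top_n=20):
    """Count words across several lyrics strings, ignoring case and punctuation.

    Hypothetical helper: lowercases each string and extracts runs of word
    characters before counting.
    """
    words = []
    for lyrics in lyrics_list:
        words.extend(re.findall(r"\w+", lyrics.lower()))
    return collections.Counter(words).most_common(top_n)


# Made-up sample lyrics, for illustration only
sample_lyrics = ["Wooly bully, wooly BULLY!", "sam the sham"]
top_words = count_words(sample_lyrics, top_n=2)
print(top_words)
```

Inside analyze_songs, such a helper could take the place of the join/split and Counter lines by being called on top_10_songs_lyrics.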
In this answer, I assume the following format for billboard_songs.csv:
"Rank","Song","Artist","Year","Lyrics","Source"
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs","Source Evian"
I assume the dataset covers 1965 to 2015, as stated in the question. If it does not, the list of songs should be filtered accordingly first.
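If the file does contain other years, a filtering step could look roughly like the sketch below. The function name filter_by_year and the sample entries are hypothetical; the dictionaries are assumed to have the same shape as the songs list built in the answer (with "year" already converted to int).

```python
def filter_by_year(songs, first_year=1965, last_year=2015):
    """Keep only songs whose "year" lies in [first_year, last_year] (inclusive)."""
    return [song for song in songs if first_year <= song["year"] <= last_year]


# Made-up entries, for illustration only
sample_songs = [
    {"rank": 1, "year": 1964, "lyrics": "too early"},
    {"rank": 1, "year": 1965, "lyrics": "wooly bully"},
    {"rank": 1, "year": 2016, "lyrics": "too late"},
]
filtered = filter_by_year(sample_songs)
print(filtered)
```

The filtered list can then be passed on as analyze_songs(filter_by_year(songs)).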