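python - Python 3 equivalent of code involving a tuple passed to a function
Question
I have code that computes movie similarity for a given dataset, namely ratings.dat and movies.dat. However, the code is written in Python 2.7.
I tried to convert the code to Python 3 but could not get the desired result. I need some expert help to check whether there are any errors in the code.
The code below is the part I need to convert to Python 3: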
def makePairs((user, ratings)):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return ((movie1, movie2), (rating1, rating2))

def filterDuplicates( (userID, ratings) ):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2
And this:
# Filter for movies with this sim that are "good" as defined by
# our quality thresholds above
filteredResults = moviePairSimilarities.filter(lambda((pair,sim)): \
(pair[0] == movieID or pair[1] == movieID) \
and sim[0] > scoreThreshold and sim[1] > coOccurenceThreshold)
# Sort by quality score.
results = filteredResults.map(lambda((pair,sim)): (sim, pair)).sortByKey(ascending = False).take(10)
The full code is below. It is run with:
spark-submit mycodefile.py 50
This is the code in Python 2.7:
import sys
from pyspark import SparkConf, SparkContext
from math import sqrt
def loadMovieNames():
    movieNames = {}
    with open("movies.dat") as f:
        for line in f:
            fields = line.split("::")
            movieNames[int(fields[0])] = fields[1].decode('ascii', 'ignore')
    return movieNames

def makePairs((user, ratings)):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return ((movie1, movie2), (rating1, rating2))

def filterDuplicates( (userID, ratings) ):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2

def computeCosineSimilarity(ratingPairs):
    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1
    numerator = sum_xy
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = 0
    if (denominator):
        score = (numerator / (float(denominator)))
    return (score, numPairs)

conf = SparkConf()
sc = SparkContext(conf = conf)
print("\nLoading movie names...")
nameDict = loadMovieNames()
data = sc.textFile("ratings.dat")
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split("::")).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
ratingsPartitioned = ratings.partitionBy(100)
joinedRatings = ratingsPartitioned.join(ratingsPartitioned)
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(100)
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) = > (rating1, rating2), (rating1, rating2) ...
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).persist()
# Save the results if desired
moviePairSimilarities.sortByKey()
moviePairSimilarities.saveAsTextFile("movie-sims")
# Extract similarities for the movie we care about that are "good".
if (len(sys.argv) > 1):
    scoreThreshold = 0.97
    coOccurenceThreshold = 1000
    movieID = int(sys.argv[1])
    # Filter for movies with this sim that are "good" as defined by
    # our quality thresholds above
    filteredResults = moviePairSimilarities.filter(lambda((pair,sim)): \
        (pair[0] == movieID or pair[1] == movieID) \
        and sim[0] > scoreThreshold and sim[1] > coOccurenceThreshold)
    # Sort by quality score.
    results = filteredResults.map(lambda((pair,sim)): (sim, pair)).sortByKey(ascending = False).take(10)
    print("Top 10 similar movies for " + nameDict[movieID])
    for result in results:
        (sim, pair) = result
        # Display the similarity result that isn't the movie we're looking at
        similarMovieID = pair[0]
        if (similarMovieID == movieID):
            similarMovieID = pair[1]
        print(nameDict[similarMovieID] + "\tscore: " + str(sim[0]) + "\tstrength: " + str(sim[1]))
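Any help is greatly appreciated.
Regards
What I have done so far is convert this code to the Python 3 equivalent shown below, but I cannot get the desired result.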
import sys
from pyspark import SparkConf, SparkContext
from math import sqrt
def loadMovieNames():
    movieNames = {}
    with open("movies.dat") as f:
        for line in f:
            fields = line.split("::")
            movieNames[int(fields[0])] = fields[1] #.decode('ascii', 'ignore')
    return movieNames

def makePairs(*ratings):
    for t in ratings:
        (movie1, rating1) = t[1][0]
        (movie2, rating2) = t[1][1]
    return ((movie1, movie2), (rating1, rating2))

def filterDuplicates(*ratings):
    for t in ratings:
        (movie1, rating1) = t[1][0]
        (movie2, rating2) = t[1][1]
    return movie1 < movie2

def computeCosineSimilarity(ratingPairs):
    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1
    numerator = sum_xy
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = 0
    if (denominator):
        score = (numerator / (float(denominator)))
    return (score, numPairs)

conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")
sc = SparkContext(conf = conf)
print("\nLoading movie names...")
nameDict = loadMovieNames()
print("\nLoading movie ratings...")
data = sc.textFile("ratings100.dat")
print("\nDone..")
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split("::")).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
ratingsPartitioned = ratings.partitionBy(100)
joinedRatings = ratingsPartitioned.join(ratingsPartitioned)
#joinedRatings = ratings.join(ratings)
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs).partitionBy(100)
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) = > (rating1, rating2), (rating1, rating2) ...
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).persist()
# Save the results if desired
moviePairSimilarities.sortByKey()
moviePairSimilarities.saveAsTextFile("movie-sims")
# Extract similarities for the movie we care about that are "good".
if (len(sys.argv) > 1):
    scoreThreshold = 0.9
    coOccurenceThreshold = 1000
    movieID = int(sys.argv[1])
    # Filter for movies with this sim that are "good" as defined by
    # our quality thresholds above
    filteredResults = moviePairSimilarities.filter(lambda pairSim: (pairSim[0][0] == movieID or pairSim[0][1] == movieID) and pairSim[1][0] > scoreThreshold and pairSim[1][1] > coOccurenceThreshold)
    # Sort by quality score.
    results = filteredResults.map(lambda pairSim: (pairSim[1], pairSim[0])).sortByKey(ascending = False).take(10)
    print("Top 10 similar movies for " + str(nameDict[movieID]))
    for result in results:
        (sim, pair) = result
        # Display the similarity result that isn't the movie we're looking at
        similarMovieID = pair[0]
        if (similarMovieID == movieID):
            similarMovieID = pair[1]
        print(nameDict[similarMovieID] + "\tscore: " + str(sim[0]) + "\tstrength: " + str(sim[1]))
Below is the expected result, which should show the top 10 similar movies.
Top 10 similar movies for Wizard of Oz, The (1939)
Toy Story (1995) score: 661 strength: 1545
Some Other Movie score: 594 strength: 720
Another Movie score: 2018 strength: 2804
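Answer
def f(*tuplex) is not the same as def f((x, y)); it is (more or less) the same as def f(x, y). That is, the first (Python 3) function receives a list of non-keyword arguments, while the second (Python 2) function receives a single tuple argument. Since you are passing a single element (which happens to be a tuple), tuplex will be a tuple of one element (and the resulting for t in tuplex will iterate only once). You should make it def f(xy), where xy is your (x, y) tuple.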
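The difference can be checked in plain Python 3 without Spark. The following is only a small sketch; f_star, f_single and the sample record are names made up for this illustration.

```python
def f_star(*tuplex):
    # Collects every positional argument into one tuple.
    return tuplex


def f_single(xy):
    # Receives the single argument as-is and unpacks it in the body.
    x, y = xy
    return x, y


record = (1, 2)  # hypothetical stand-in for one (user, ratings) element

star_result = f_star(record)      # the tuple arrives wrapped in a 1-tuple
single_result = f_single(record)  # the tuple arrives unchanged

print(star_result)    # ((1, 2),)
print(single_result)  # (1, 2)
```

This is why the loop in the *ratings version runs exactly once: the whole (user, ratings) record is the only item inside the collected tuple.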
Your Python 2 code:
def makePairs((user, ratings)):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return ((movie1, movie2), (rating1, rating2))
The actually compatible Python 3 code:
def makePairs(user_ratings):
    _, ratings = user_ratings
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return ((movie1, movie2), (rating1, rating2))
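The same rewrite would presumably apply to filterDuplicates from the question: accept one argument and unpack it inside the body. This is a sketch only; the sample records are invented and assume the userID => ((movieID, rating), (movieID, rating)) shape described in the question's comments.

```python
def filterDuplicates(user_ratings):
    # Unpack in the body, since Python 3 no longer allows tuple parameters.
    _, ratings = user_ratings
    (movie1, _rating1) = ratings[0]
    (movie2, _rating2) = ratings[1]
    # Keep each unordered movie pair only once, and drop self-pairs.
    return movie1 < movie2


# Hypothetical joined records: userID => ((movieID, rating), (movieID, rating))
ordered = (7, ((50, 4.0), (172, 5.0)))
reversed_pair = (7, ((172, 5.0), (50, 4.0)))
self_pair = (7, ((50, 4.0), (50, 4.0)))

print(filterDuplicates(ordered))        # True
print(filterDuplicates(reversed_pair))  # False
print(filterDuplicates(self_pair))      # False
```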
As mentioned somewhere in the comments, you can replace the whole makePairs function with a simple call to zip, for example:
>>> a = (('movie1', 'rating1'), ('movie2', 'rating2'))
>>> list(zip(*a))
[('movie1', 'movie2'), ('rating1', 'rating2')]
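(The list(...) call is not needed if you only need to return an iterator, but without it the interactive prompt would not show the actual contents. So do not add the call to list(...) unless your actual code raises an error about a "zip object".)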
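The unfortunate parts here are:
- the map method you use makePairs with is only passed a function, not a call, so you cannot specify the asterisk there;
- you need to get rid of the first element, user.
You could probably use something like the following: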
moviePairs = uniqueJoinedRatings.map(lambda x: zip(*x[1])).partitionBy(100)
(Untested.) This gets rid of the makePairs function entirely, at the cost of some clarity.
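The zip variant can be compared against the explicit function on a single record, outside Spark. This is a sketch under assumptions: the sample record is invented, and tuple(...) is used here only so the two results can be printed and compared. Whether the bare zip object is accepted by partitionBy is not verified here.

```python
def makePairs(user_ratings):
    # Explicit Python 3 version: drop the user, regroup movies and ratings.
    _, ratings = user_ratings
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return ((movie1, movie2), (rating1, rating2))


# Hypothetical joined record: userID => ((movieID, rating), (movieID, rating))
record = (7, ((50, 4.0), (172, 5.0)))

explicit = makePairs(record)
# zip(*x[1]) transposes the two (movie, rating) tuples; tuple() materializes it.
zipped = tuple(zip(*record[1]))

print(explicit)            # ((50, 172), (4.0, 5.0))
print(zipped)              # ((50, 172), (4.0, 5.0))
print(explicit == zipped)  # True
```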
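One last remark: make_pairs follows the style guide; makePairs is not Pythonic. The same goes for all the other names in the code. I mention this because you asked for a check of the code at the top of your question (but that may be more of a question for Code Review).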
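The two lambda((pair, sim)) expressions in the question need the same treatment, because Python 3 dropped tuple parameters for lambdas as well. One option is to move them into small named functions that unpack in the body. This is a sketch only: is_good_match, swap_key_value and the sample element are names and data made up here, written in snake_case per the remark above.

```python
def is_good_match(pair_sim, movie_id, score_threshold, co_occurrence_threshold):
    # pair_sim is one ((movie1, movie2), (score, numPairs)) element.
    pair, sim = pair_sim
    return ((pair[0] == movie_id or pair[1] == movie_id)
            and sim[0] > score_threshold
            and sim[1] > co_occurrence_threshold)


def swap_key_value(pair_sim):
    # Turn (pair, sim) into (sim, pair) so sorting by key sorts by similarity.
    pair, sim = pair_sim
    return (sim, pair)


# Hypothetical element: (movie1, movie2) => (score, numPairs)
sample = ((50, 172), (0.98, 1500))

print(is_good_match(sample, 50, 0.97, 1000))  # True
print(is_good_match(sample, 99, 0.97, 1000))  # False
print(swap_key_value(sample))                 # ((0.98, 1500), (50, 172))
```

In the Spark job these would presumably be wired in as filter(lambda ps: is_good_match(ps, movieID, scoreThreshold, coOccurenceThreshold)) and map(swap_key_value); that wiring is untested here.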