python-3.x - Applying a function to a pandas Series with different arguments
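Problem description
Original question
I want to compute the Levenshtein distance between multiple strings, one set stored in a Series and the other in a list. I tried map, zip, and so on, but I only got the desired result with a for loop and apply. Is there a way to improve the style and, above all, the speed?
Here is what I tried. It does what it should, but it is slow on a large Series.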
import pandas as pd
import stringdist
strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']
# one column per word in c, filled by applying the distance to every string in s
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: stringdist.levenshtein(x, w))
## Result: ##
        me  mine  Friend
Hello    4     5       6
my       1     3       6
Friend   5     4       0
I        2     4       6
am       2     4       6
Solution
Thanks to @Dames and @molybdenum42, I can post the solution I ended up using directly below the question. For more insight, see their excellent answers below.
import numpy as np
import pandas as pd
import stringdist
from itertools import product
strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']
# every (series word, list word) pair, as an array of shape (len(s) * len(c), 2)
word_combinations = np.array(list(product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0],
                                word_combinations[:, 1])
# product() varies c fastest, so the flat result reshapes row by row
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s.index)
This produces the desired DataFrame.
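For anyone who wants to try the product-and-reshape steps above without installing stringdist, here is a minimal sketch. The `levenshtein` helper is a hypothetical pure-Python stand-in written for this example (it is not the stringdist implementation and will be much slower); everything else follows the solution as posted.

```python
from itertools import product

import numpy as np
import pandas as pd


def levenshtein(a, b):
    """Hypothetical stand-in for stringdist.levenshtein (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep on a match)
        prev = curr
    return prev[-1]


strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']

# All (series word, list word) pairs; the list word changes fastest.
word_combinations = np.array(list(product(s.values, c)))
vectorized_levenshtein = np.vectorize(levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0], word_combinations[:, 1])

# Because c varies fastest, each consecutive run of len(c) values is one row.
df = pd.DataFrame(result.reshape((len(s), len(c))), columns=c, index=s.index)
print(df)
```

The reshape only lines up with the labels because `product` iterates its last argument fastest; swapping the arguments would require reshaping to `(len(c), len(s))` and transposing.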
Answer
Setup:
import stringdist
import pandas as pd
import numpy as np
import itertools
s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']
Options
- Option 1: a simple one-liner
# a dict comprehension keeps the words in c as column labels
df = pd.DataFrame({w: s.apply(lambda x: stringdist.levenshtein(x, w)) for w in c})
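- Option 2: np.fromfunction (thanks to @baccandr)
@np.vectorize
def lavdist(a, b):
    # a is a row position in s, b is a column position in c
    return stringdist.levenshtein(s.iloc[a], c[b])
# the shape is (rows, columns), so it has to match index=s.index and columns=c
df = pd.DataFrame(np.fromfunction(lavdist, (len(s), len(c)), dtype=int),
                  columns=c, index=s.index)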
- Option 3: see @molybdenum42's answer
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
# one row per (series word, list word) pair, then pivot the pairs into a matrix
df = pd.DataFrame({0: word_combinations[:,0], 1: word_combinations[:,1], 2: result})
df = df.set_index([0,1])[2].unstack()
- Option 4 (best): a modified version of option 3
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
# reshape the flat result directly instead of building and unstacking a MultiIndex
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s.index)
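A possible variation on the product-based options, offered only as a sketch: np.vectorize follows NumPy's broadcasting rules, so passing a column-shaped array and a row-shaped array should give the full distance matrix without building the list of pairs first. This is not one of the benchmarked options, and the `levenshtein` helper below is a hypothetical pure-Python stand-in so the snippet runs without stringdist.

```python
import numpy as np
import pandas as pd


def levenshtein(a, b):
    """Hypothetical stand-in for stringdist.levenshtein, for illustration only."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


s = pd.Series(data=['Hello', 'my', 'Friend'], index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']

vectorized_levenshtein = np.vectorize(levenshtein, otypes=[int])
rows = s.to_numpy()[:, None]   # shape (len(s), 1)
cols = np.array(c)[None, :]    # shape (1, len(c))

# Broadcasting (len(s), 1) against (1, len(c)) yields a (len(s), len(c)) matrix.
result = vectorized_levenshtein(rows, cols)
df = pd.DataFrame(result, index=s.index, columns=c)
print(df)
```

Whether this is any faster than option 4 is untested here; the Python-level call per pair is the same, so any difference would come only from skipping the intermediate pair array.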
Performance test:
import timeit
from Levenshtein import distance
import pandas as pd
import numpy as np
import itertools
s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']
test_code0 = """
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: distance(x, w))
"""
test_code1 = """
df = pd.DataFrame({w:s.apply(lambda x: distance(x, w)) for w in c})
"""
test_code2 = """
@np.vectorize
def lavdist(a, b):
    return distance(c[a], s[b])
df = pd.DataFrame(np.fromfunction(lavdist, (len(c), len(s)), dtype = int),
                  columns=c, index=s)
"""
test_code3 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,1], word_combinations[:,1], result])
df = df.set_index([0,1])[2] #.unstack() produces error
"""
test_code4 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
"""
test_setup = "from __main__ import distance, s, c, pd, np, itertools"
print("test0", timeit.timeit(test_code0, number = 1000, setup = test_setup))
print("test1", timeit.timeit(test_code1, number = 1000, setup = test_setup))
print("test2", timeit.timeit(test_code2, number = 1000, setup = test_setup))
print("test3", timeit.timeit(test_code3, number = 1000, setup = test_setup))
print("test4", timeit.timeit(test_code4, number = 1000, setup = test_setup))
Results
# results
# test0 1.3671939949999796
# test1 0.5982696900009614
# test2 0.3246431229999871
# test3 2.0100400850005826
# test4 0.23796007100099814
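The timings above come from a 3 x 3 input, where fixed pandas and NumPy overhead may matter more than the distance calls themselves. Keep in mind that the NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so it is worth re-measuring on data closer to the real size. The sketch below is one way to do that with callables instead of code strings. The word list is made-up test data, the function names `loop_apply` and `product_reshape` are invented for this example, and the pure-Python `distance` fallback is only an assumption for machines without the Levenshtein package.

```python
import itertools
import timeit

import numpy as np
import pandas as pd

try:
    from Levenshtein import distance  # the function used in the benchmark above
except ImportError:
    def distance(a, b):
        """Slow pure-Python fallback so the sketch still runs."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]


def loop_apply(s, c):
    # Same idea as test_code1: one apply per word in c.
    return pd.DataFrame({w: s.apply(lambda x: distance(x, w)) for w in c})


def product_reshape(s, c):
    # Same idea as test_code4: all pairs, one vectorized call, then reshape.
    pairs = np.array(list(itertools.product(s.to_numpy(), c)))
    result = np.vectorize(distance)(pairs[:, 0], pairs[:, 1])
    return pd.DataFrame(result.reshape(len(s), len(c)), index=s.index, columns=c)


words = [f"word{i}" for i in range(200)]  # made-up test data
s = pd.Series(words, index=words)
c = ['me', 'mine', 'Friend']

for fn in (loop_apply, product_reshape):
    seconds = timeit.timeit(lambda: fn(s, c), number=5)
    print(fn.__name__, seconds)
```

No timings are quoted here on purpose: the numbers depend on the machine, the input size, and which `distance` implementation ends up being imported.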