Run a function for each element of two lists in Pandas DataFrame columns

Problem description

df

col1
['aa', 'bb', 'cc', 'dd']
['this', 'is', 'a', 'list', '2']
['this', 'list', '3']

col2
[['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']]
[['list', 'a', 'not', '1'], ['not', 'is', 'this', '2']]
[['this', 'is', 'list', 'not'], ['a', 'not', 'list', '2']]

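What I'm trying to do:

I am trying to run the code below on each element (word) in df col1 against each corresponding element in each sublist in col2, and put the scores in a new column.

So, for the first row in col1, run the get_top_matches function on these pairs: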

`col1` "aa" and `col2` "ee" and "qq"
`col1` "bb" and `col2` "ff" and "ww"
`col1` "cc" and `col2` "gg" and "ee"
`col1` "dd" and `col2` "hh" and "rr"
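
The pairing listed above can be sketched in plain Python before any scoring is involved. This is only an illustration of the position-wise matching, using the first row of the sample data; the helper name pair_by_position is hypothetical and not part of the original code.

```python
# First row of the sample data from the question.
col1_row = ['aa', 'bb', 'cc', 'dd']
col2_row = [['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']]


def pair_by_position(words, sublists):
    """Pair the i-th word with the i-th element of every sublist.

    Hypothetical helper: zip(*sublists) transposes the sublists so that
    each tuple holds the elements sharing one position.
    """
    return [(word, list(candidates))
            for word, candidates in zip(words, zip(*sublists))]


pairs = pair_by_position(col1_row, col2_row)
for word, candidates in pairs:
    print(word, candidates)
# aa ['ee', 'qq']
# bb ['ff', 'ww']
# cc ['gg', 'ee']
# dd ['hh', 'rr']
```

Note that zip stops at the shortest input, so in the second sample row (five words in col1, four words per sublist) the last word would be left without a partner.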

What the new column should look like:

I'm not sure what the scores for rows 2 and 3 should be.

score_col
[1.0, 1.0, 1.0, 1.0]
[.34, .33, .27, .24, .23] #not sure
[.23, .13, .26] #not sure

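What I've tried before:

I have done this when col1 is just a string, run against each list element in col2, like so, but I have no idea how to run it for each list element against the corresponding sublist elements: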

df.agg(lambda x: get_top_matches(*x), axis=1)

. . . .

Function code

Here is the get_top_matches function. Just run the whole thing; I only call the last function for this question:

# jaro version
import math
import re

def sort_token_alphabetically(word):
    token = re.split('[,. ]', word)
    sorted_token = sorted(token)
    return ' '.join(sorted_token)

def get_jaro_distance(first, second, winkler=True, winkler_ajustment=True,
                      scaling=0.1, sort_tokens=True):
    """
    :param first: word to calculate distance for
    :param second: word to calculate distance with
    :param winkler: same as winkler_ajustment
    :param winkler_ajustment: add an adjustment factor to the Jaro of the distance
    :param scaling: scaling factor for the Winkler adjustment
    :return: Jaro distance adjusted (or not)
    """
    if sort_tokens:
        first = sort_token_alphabetically(first)
        second = sort_token_alphabetically(second)

    if not first or not second:
        raise JaroDistanceException(
            "Cannot calculate distance from NoneType ({0}, {1})".format(
                first.__class__.__name__,
                second.__class__.__name__))

    jaro = _score(first, second)
    cl = min(len(_get_prefix(first, second)), 4)

    if all([winkler, winkler_ajustment]):  # 0.1 as scaling factor
        return round((jaro + (scaling * cl * (1.0 - jaro))) * 100.0) / 100.0

    return jaro

def _score(first, second):
    shorter, longer = first.lower(), second.lower()

    if len(first) > len(second):
        longer, shorter = shorter, longer

    m1 = _get_matching_characters(shorter, longer)
    m2 = _get_matching_characters(longer, shorter)

    if len(m1) == 0 or len(m2) == 0:
        return 0.0

    return (float(len(m1)) / len(shorter) +
            float(len(m2)) / len(longer) +
            float(len(m1) - _transpositions(m1, m2)) / len(m1)) / 3.0

def _get_diff_index(first, second):
    if first == second:
        return -1  # identical strings; _get_prefix treats -1 as "whole string is the prefix"

    if not first or not second:
        return 0

    max_len = min(len(first), len(second))
    for i in range(0, max_len):
        if not first[i] == second[i]:
            return i

    return max_len

def _get_prefix(first, second):
    if not first or not second:
        return ""

    index = _get_diff_index(first, second)
    if index == -1:
        return first

    elif index == 0:
        return ""

    else:
        return first[0:index]

def _get_matching_characters(first, second):
    common = []
    limit = math.floor(min(len(first), len(second)) / 2)

    for i, l in enumerate(first):
        left, right = int(max(0, i - limit)), int(
            min(i + limit + 1, len(second)))
        if l in second[left:right]:
            common.append(l)
            second = second[0:second.index(l)] + '*' + second[
                                                       second.index(l) + 1:]

    return ''.join(common)

def _transpositions(first, second):
    return math.floor(
        len([(f, s) for f, s in zip(first, second) if not f == s]) / 2.0)

def get_top_matches(reference, value_list, max_results=None):
    scores = []
    if not max_results:
        max_results = len(value_list)
    for val in value_list:
        score_sorted = get_jaro_distance(reference, val)
        score_unsorted = get_jaro_distance(reference, val, sort_tokens=False)
        scores.append((val, max(score_sorted, score_unsorted)))
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores[:max_results]

class JaroDistanceException(Exception):
    def __init__(self, message):
        super().__init__(message)
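
For reference, the score that _score and get_jaro_distance compute can be written out as below. This is read off the code above rather than taken from any external definition: s and l are the shorter and longer lowercased strings, m_1 and m_2 are the matched-character strings returned by _get_matching_characters, t is the value returned by _transpositions, p is the scaling argument (0.1 by default), and the prefix length is capped at 4.

```latex
\[
J = \frac{1}{3}\left(
      \frac{|m_1|}{|s|} + \frac{|m_2|}{|l|} + \frac{|m_1| - t}{|m_1|}
    \right),
\qquad
t = \left\lfloor \frac{\#\{\, i : m_1[i] \neq m_2[i] \,\}}{2} \right\rfloor
\]
\[
J_W = J + p \,\ell\, (1 - J),
\qquad
\ell = \min(\text{length of common prefix},\, 4)
\]
```

The code returns J_W rounded to two decimals when both winkler flags are true, returns J otherwise, and returns 0.0 from _score when either matched string is empty.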

. . .


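Attempt 1: just trying to get it to compare against each word in the list rather than each letter:

[[[df1.agg(lambda x: get_top_matches(u, w), axis=1) for u, w in zip(x, v)]
  for v in y]
 for x, y in zip(df1['parent_org_name_list'], df1['children_org_name_sublists'])]

Result

Attempt 2: changing the get_top_matches function to say `for val in value_list.split():` results in the following. It grabs the first word and compares it against the first word in each sublist in col2 5 times (not sure why 5 times):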

[
  [0    [(myalyk, 0.73)]
   1    [(myalyk, 0.73)]
   2    [(myalyk, 0.73)]
   3    [(myalyk, 0.73)]
   4    [(myalyk, 0.73)]
   dtype: object]
, [0    [(myliu, 0.79)]
   1    [(myliu, 0.79)]
   2    [(myliu, 0.79)]
   3    [(myliu, 0.79)]
   4    [(myliu, 0.79)]
   dtype: object]
, [0    [(myllc, 0.97)]
   1    [(myllc, 0.97)]
   2    [(myllc, 0.97)]
   3    [(myllc, 0.97)]
   4    [(myllc, 0.97)]
   dtype: object]
, [0    [(myloc, 0.88)]
   1    [(myloc, 0.88)]
   2    [(myloc, 0.88)]
   3    [(myloc, 0.88)]
   4    [(myloc, 0.88)]
   dtype: object]
]

I just need the function to run on each word in the sublists.
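
A plausible reason for the repetition in the output above is that a row-wise call on the whole DataFrame produces one result per row, and the lambda ignores the row it is given. The sketch below assumes the real frame has five rows (the printed index runs from 0 to 4) and uses DataFrame.apply with axis=1, which the row-wise df.agg call in the attempts is assumed to reduce to. The frame df_demo, the function fake_score, and the word 'myalik' are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical five-row frame; the real df1 is not shown in the question.
df_demo = pd.DataFrame({
    'col1': [['a']] * 5,
    'col2': [[['b']]] * 5,
})


def fake_score(u, w):
    """Stand-in for get_top_matches that returns a fixed result."""
    return [(w, 0.73)]


# u and w are fixed by the surrounding comprehension in the attempts.
u, w = 'myalik', 'myalyk'

# The lambda never looks at `row`, so each of the five rows yields the
# same value and the result is a Series with five identical entries.
repeated = df_demo.apply(lambda row: fake_score(u, w), axis=1)

# Calling the function directly gives the single result that is wanted.
direct = fake_score(u, w)

print(len(repeated))  # 5
print(direct)         # [('myalyk', 0.73)]
```

If that is what is happening, dropping the DataFrame call inside the comprehension and calling the scoring function directly removes the duplication.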

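Attempt 3: removed the attempt 2 code from the get_top_matches function and modified the attempt 1 list comprehension to the code below, which takes the first word in the first 3 sublists in col2; it needs to compare each word in the col1 list against each word in the col2 sublists:

[[[df.agg(lambda x: get_top_matches(u, v), axis=1) for u in x]
    for v in zip(*y)]
        for x, y in zip(df['col1'], df['col2'])
]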

Result of attempt 3

[[0    [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79), ...
  1    [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79), ...
  2    [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79), ...
  3    [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79), ...
  4    [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79), ...
  dtype: object]]

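Expected (in this example, row 1 has 4 sublists and row 2 has 2 sublists. The function runs each word in column 1 against each word in each sublist in column 2 and puts the results into sublists in a new column.)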

[[['myalyk',.97], ['oleksandr',.54], ['nychyporovych',.3], ['pp',0]], [['myliu',.88], ['srl',.43]], [['myllc',1.0]], [['myloc',1.0], ['manag',.45], ['IT',.1], ['ag',0]]], 
[[['ltd',.34], ['yuriapharm',.76]], [['yuriypra',.65], ['law',.54], ['offic',.45], ['pc',.34]]],
...

Tags: python, pandas

Solution


This works:

import pandas as pd

# Generate DataFrame (data is assumed to hold the rows as [col1, col2] pairs)
df = pd.DataFrame(data, columns=['col1', 'col2'])

# Clean Data (strip out trailing commas on some words)
df['col1'] = df['col1'].map(lambda lst: [x.rstrip(',') for x in lst])

# 1. List comprehension Technique
# zip provides pairs of col1, col2 rows
result = [[get_top_matches(u, [v]) for u in x for w in y for v in w] for x, y in zip(df['col1'], df['col2'])]

# 2. DataFrame Apply Technique
def func(x, y):
    # Score every word in x against every word of every sublist in y
    return [get_top_matches(u, [v]) for u in x for w in y for v in w]

df['func_scores'] = df.apply(lambda row: func(row['col1'], row['col2']), axis = 1)

# Verify two methods are equal
print(df['func_scores'].equals(pd.Series(result)))  # True

print(df['func_scores'].to_string(index=False))
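
The comprehension above flattens every (col1 word, col2 word) pair of a row into one list. If the nested shape from the "Expected" output is wanted instead, one possible variant keeps a sublist per col1 word and per col2 sublist. The sketch below is self-contained, so it uses a hypothetical stand_in_top_matches built on difflib in place of get_top_matches; the scores therefore differ from the Jaro scores, and only the shapes are the point.

```python
from difflib import SequenceMatcher


def stand_in_top_matches(reference, value_list):
    """Hypothetical stand-in for get_top_matches.

    Same return shape (list of (value, score) sorted by score, best
    first), but scored with difflib instead of the Jaro code.
    """
    scores = [(val, round(SequenceMatcher(None, reference, val).ratio(), 2))
              for val in value_list]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores


# Third row of the sample data from the question.
words = ['this', 'list', '3']
sublists = [['this', 'is', 'list', 'not'], ['a', 'not', 'list', '2']]

# Flat shape, as in the solution: one single-item result per word pair.
flat = [stand_in_top_matches(u, [v])
        for u in words for w in sublists for v in w]

# Nested shape: for each col1 word, one ranked list per col2 sublist.
nested = [[stand_in_top_matches(u, w) for w in sublists] for u in words]

print(len(flat))        # 24
print(nested[0][0][0])  # ('this', 1.0)
```

Swapping stand_in_top_matches for the real get_top_matches should give the same shapes, since both take a reference word and a list of candidate words.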

Thanks to everyone who helped.

