Why map works but apply raises a ValueError

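Problem description

I'm trying to get more comfortable with the various ways of using Pandas, and I'm having trouble understanding why map, apply, and vectorization are more or less interchangeable for functions that return a non-boolean value, while apply and vectorization sometimes fail when the function being applied returns a boolean. This question focuses on apply.

Specifically, I wrote a very simple piece of code to illustrate the problem: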

import numpy as np
import pandas as pd

# make dataframe
x = range(1000)
df = pd.DataFrame(data = x, columns = ['Number']) 

# simple function to test if a number is a prime number
def is_prime(num):
    if num < 2:
        return False
    elif num == 2: 
        return True
    else: 
        for i in range(2,num):
            if num % i == 0:
                return False
    return True

# test if every number in the dataframe is prime using Map
df['map prime'] = list(map(is_prime, df['Number']))
df.head()

This gives the output I expect.

So here's where I no longer understand what's going on: when I try to use apply, I get a ValueError.

in: df['apply prime'] = df.apply(func = is_prime, args = df['Number'], axis=1)
out: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

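What am I missing?

Thanks!

P.S. I know there are more efficient ways to test for primes. I deliberately wrote an inefficient function so that I could test how much faster apply and vectorization actually are than map, but then I ran into this problem. Thank you.

Tags: python, pandas, vectorization, apply

Solution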


So here's where I no longer understand what's going on: when I try to use apply, I get a ValueError.

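With df.apply(..., axis=1), the function receives each row as a pd.Series, not a scalar, so is_prime never gets a single number to test. The args parameter is also meant for extra positional arguments, not for the data itself.

Applying the function to the column instead, i.e. df['apply prime'] = df['Number'].apply(func=is_prime), should work.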
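To make the difference concrete, here is a minimal sketch that reuses is_prime and the Number column from the question on a smaller range. The lambda wrapper and the names first_row, via_series and via_rows are only illustrative and are not part of the original post.

```python
import pandas as pd

def is_prime(num):
    # deliberately naive primality test, as in the question
    if num < 2:
        return False
    for i in range(2, num):
        if num % i == 0:
            return False
    return True

df = pd.DataFrame({'Number': range(20)})

# With axis=1 the applied function is handed a whole row, which is a Series
first_row = df.iloc[0]
print(type(first_row).__name__)

# Series.apply hands the function one scalar at a time
via_series = df['Number'].apply(is_prime)

# Row-wise apply also works once the scalar is pulled out of the row explicitly
via_rows = df.apply(lambda row: is_prime(row['Number']), axis=1)

print(via_series.tolist() == via_rows.tolist())
```

Both calls give the same booleans. The row-wise version just does extra work building a Series for every row.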

On the assumption that apply is ostensibly faster than map, and vectorization faster still:

pd.DataFrame.apply(...) does not use any kind of vectorization. It is essentially a loop (parts of it run in compiled code, e.g. Cython) that still calls the Python function once per row or column, so I would expect map(...) to be about as fast or faster.
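One way to check the speed claim on your own machine is a rough timeit comparison. This is only a sketch: the candidates dictionary and its labels are made up for illustration, and np.vectorize is included as one guess at what "vectorization" meant in the question (the NumPy documentation describes it as essentially a for loop). No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def is_prime(num):
    # deliberately naive primality test, as in the question
    if num < 2:
        return False
    for i in range(2, num):
        if num % i == 0:
            return False
    return True

df = pd.DataFrame({'Number': range(1000)})

# Four ways of calling the same Python function once per element
candidates = {
    'map': lambda: list(map(is_prime, df['Number'])),
    'Series.apply': lambda: df['Number'].apply(is_prime),
    'DataFrame.apply(axis=1)': lambda: df.apply(
        lambda row: is_prime(row['Number']), axis=1),
    'np.vectorize': lambda: np.vectorize(is_prime)(df['Number']),
}

# Confirm that every approach produces the same answers before timing them
results = {name: [bool(v) for v in fn()] for name, fn in candidates.items()}

timings = {name: timeit.timeit(fn, number=3) for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} s for 3 runs')
```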


Update

Note that .apply(...) passes the values along the given axis to the function and returns whatever the function returns, which can be any data type. For a pd.DataFrame, each call receives a Series with multiple keys.

Suppose that df.shape == (1000, 4). If we move along axis=1, the function is called 1000 times (df.shape[0]), and each call receives a pd.Series of shape (4,). You can use the keys inside the function itself, or pass the keys as arguments with pd.DataFrame.apply(..., args=[...]).


import numpy as np
import pandas as pd

x = np.random.randn(1000, 4)
df = pd.DataFrame(data=x, columns=['a', 'b', 'c', 'd'])

print(df.shape)

df.head()

def func(x, key1, key2):
    # x is one row of df, passed as a pd.Series; key1 and key2 are column labels
    # print(x.shape)
    return x[key1] > x[key2]

df.apply(func, axis=1, args=['a', 'b'])
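For a simple comparison like this one, the same result can be computed on whole columns without calling a Python function per row. The sketch below reuses the column names from the example above. The seeded generator and the names row_wise and vectorized are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=['a', 'b', 'c', 'd'])

def func(x, key1, key2):
    # x is one row (a Series); compare two of its entries
    return x[key1] > x[key2]

# Row-wise: func is called once per row
row_wise = df.apply(func, axis=1, args=('a', 'b'))

# Column-wise: one comparison over the whole columns
vectorized = df['a'] > df['b']

print((row_wise == vectorized).all())
```

The column-wise form only works when the logic can be written with array operations. A function with Python control flow such as is_prime still needs an element-by-element call.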
