python - 'str' object has no attribute 'values' - Object Does Not Appear to be String
Problem Description
I am attempting to multiprocess a pandas read_sql() import with chunking. The end goal is to find the distance between two lats/lons. Since I am working in a Jupyter Notebook, the functions for multiprocessing need to be in a separate file. That file looks like this:
import pandas as pd
from sqlalchemy import event, create_engine
from math import radians, cos, sin, asin, sqrt
import numpy as np

engine = create_engine('engine-path')
data = pd.read_sql("SELECT * from SCHEMA.TABLE", engine)

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    m = 3956.269 * c
    return m

def getDistance(chunk):
    df = cartesian_product_simplified(chunk, data)
    df = df.rename(columns={1:'lat1',2:'lon1',6:'lat2',7:'lon2'})
    df = df.astype({"lat1": float,"lon1": float,"lat2": float,"lon2": float})
    m = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
    dist = pd.DataFrame(m.values)
    result = df.join(dist)
    result = result.rename(columns={0:'dist'})
    result = result[result['dist']<=3]
    return result
The main notebook looks like this:
import pandas as pd
from dist_func import getDistance
from multiprocessing import Pool

if __name__ == '__main__':
    global result
    p = Pool(20)
    for chunk in pd.read_sql("select top 10 * from SCHEMA.SecondTable", engine, chunksize=1):
        result = p.map(getDistance, chunk)
    p.terminate()
    p.join()
This results in this traceback:
Traceback (most recent call last):
  File "C:\Users\filepath\anaconda\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\filepath\anaconda\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\Users\filepath\dist_func.py", line 30, in getDistance
    df = cartesian_product_simplified(chunk, vendor_geo)
  File "C:\Users\filepath\dist_func.py", line 18, in cartesian_product_simplified
    return pd.DataFrame(np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
AttributeError: 'str' object has no attribute 'values'
This is pointing to the cartesian_product_simplified function that feeds into the getDistance function. However, when I remove multiprocessing and simply chunk through the read_sql() query like this...
for chunk in pd.read_sql("select top 100 * from SCHEMA.SecondTable", engine, chunksize=10):
    df = cartesian_product_simplified(chunk, data)
    df = df.rename(columns={1:'lat1',2:'lon1',6:'lat2',7:'lon2'})
    df = df.astype({"lat1": float,"lon1": float,"lat2": float,"lon2": float})
    m = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
    dist = pd.DataFrame(m.values)
    result = df.join(dist)
    result = result.rename(columns={0:'dist'})
    result = result[result['dist']<=3]
    df_list.append(result)
...no such error is thrown. This is with using the exact same functions. Why is this error occurring when it seems like the function is being fed two DataFrames, and it works without multiprocessing involved?
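One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the `mapstar` frame runs `list(map(*args))`, so the function is called once per item of whatever iterable was passed to `p.map`, and iterating over a DataFrame yields its column labels rather than its rows. The toy frame, its column names, and the `describe` helper below are made up for illustration, and the built-in `map` stands in for `Pool.map` so the example runs without worker processes.

```python
import pandas as pd

# Hypothetical stand-in for one chunk returned by pd.read_sql(..., chunksize=1).
chunk = pd.DataFrame({"id": [1], "lat": [40.0], "lon": [-75.0]})

def describe(item):
    # Report the type and value of whatever map() hands to the function.
    return type(item).__name__, item

# Iterating a DataFrame yields its column labels, so each call receives a str.
received = list(map(describe, chunk))
print(received)
```

If that reading is right, `getDistance` would receive a column name such as `'id'` instead of a DataFrame, which would be consistent with the `'str' object has no attribute 'values'` message.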
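Solution
I don't know the root cause, but choosing a smaller number of partitions on my own dataset solved the same problem for me. So the error may be directly or indirectly related to the number of partitions you choose, or to the ratio of partitions to rows in your dataset. I did not have this problem with larger datasets.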
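Independently of the partition count, it may be worth checking what each worker actually receives. The following is a minimal sketch, assuming the intent is for every call to get a whole DataFrame chunk: the iterable handed to `map` is the collection of chunks, not a single chunk. The data is made up, `rows_in_chunk` is a hypothetical stand-in for `getDistance`, and `ThreadPool` is swapped in for `Pool` only so the example runs inside a notebook without a separate module.

```python
from multiprocessing.pool import ThreadPool

import pandas as pd

# Hypothetical data; in the question the chunks would come from
# pd.read_sql(..., chunksize=...), which yields DataFrames.
table = pd.DataFrame({"lat": [40.0, 41.0, 42.0, 43.0],
                      "lon": [-75.0, -74.0, -73.0, -72.0]})
chunks = [table.iloc[i:i + 2] for i in range(0, len(table), 2)]

def rows_in_chunk(chunk):
    # Stand-in for getDistance: .values works because chunk is a DataFrame.
    return chunk.values.shape[0]

# Each item of the iterable passed to map() is one whole DataFrame chunk.
with ThreadPool(2) as pool:
    counts = pool.map(rows_in_chunk, chunks)

print(counts)
```

With a process `Pool`, the analogous call would pass the chunk iterator itself as the second argument to `p.map`, with the worker function still imported from a separate file as in the question.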