'str' object has no attribute 'values' - Object Does Not Appear to be String

Problem description

I am attempting to multiprocess a pandas read_sql() import with chunking. The end goal is to find the distance between two lat/lon pairs. Since I am working in a Jupyter Notebook, the functions for multiprocessing need to be in a separate file. That file looks like this:

import pandas as pd
from sqlalchemy import event, create_engine
from math import radians, cos, sin, asin, sqrt
import numpy as np

engine = create_engine('engine-path')

data = pd.read_sql("SELECT * from SCHEMA.TABLE", engine)  

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    m = 3956.269 * c
    return m

def getDistance(chunk):
    df = cartesian_product_simplified(chunk, data)
    df = df.rename(columns={1:'lat1',2:'lon1',6:'lat2',7:'lon2'})
    df = df.astype({"lat1": float,"lon1": float,"lat2": float,"lon2": float})
    m = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
    dist = pd.DataFrame(m.values)
    result = df.join(dist)
    result = result.rename(columns={0:'dist'})
    result = result[result['dist']<=3]
    return result

The main notebook looks like this:

import pandas as pd
from dist_func import getDistance

from multiprocessing import Pool

if __name__ == '__main__':
    global result
    p = Pool(20)
    for chunk in pd.read_sql("select top 10 * from SCHEMA.SecondTable", engine, chunksize=1):
        result = p.map(getDistance, chunk)
    p.terminate()
    p.join()

This results in this traceback:

Traceback (most recent call last):
  File "C:\Users\filepath\anaconda\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\filepath\anaconda\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\Users\filepath\dist_func.py", line 30, in getDistance
    df = cartesian_product_simplified(chunk, vendor_geo)
  File "C:\Users\filepath\dist_func.py", line 18, in cartesian_product_simplified
    return pd.DataFrame(np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
AttributeError: 'str' object has no attribute 'values'

This is pointing to the cartesian_product_simplified function that feeds into the getDistance function. However, when I remove multiprocessing and simply chunk through the read_sql() query like this...

for chunk in pd.read_sql("select top 100 * from SCHEMA.SecondTable", engine, chunksize=10):
    df = cartesian_product_simplified(chunk, data)
    df = df.rename(columns={1:'lat1',2:'lon1',6:'lat2',7:'lon2'})
    df = df.astype({"lat1": float,"lon1": float,"lat2": float,"lon2": float})
    m = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
    dist = pd.DataFrame(m.values)
    result = df.join(dist)
    result = result.rename(columns={0:'dist'})
    result = result[result['dist']<=3]
    df_list.append(result)

...no such error is thrown. This is with using the exact same functions. Why is this error occurring when it seems like the function is being fed two DataFrames, and it works without multiprocessing involved?

Tags: python, python-3.x, pandas, multiprocessing, chunking

Solution


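I don't know the root cause, but choosing a smaller number of partitions on my own dataset solved the same problem for me. So the error may be related, directly or indirectly, to the number of partitions you choose or to the ratio of partitions to rows in your dataset. I did not have this problem with larger datasets.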
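One explanation that fits the traceback: Pool.map iterates over its second argument, and iterating over a pandas DataFrame yields its column labels, not its rows. In the question, each chunk is itself a DataFrame, so p.map(getDistance, chunk) would call getDistance once per column name, which is where a 'str' with no 'values' attribute could come from. The sketch below uses the built-in map and a made-up three-column frame (the column names and the describe helper are hypothetical, not from the question) to show the difference between mapping over a DataFrame and mapping over a list of DataFrames.

```python
import pandas as pd

# Hypothetical stand-in for one chunk returned by read_sql(..., chunksize=1)
chunk = pd.DataFrame({"id": [1], "lat": [40.0], "lon": [-75.0]})


def describe(arg):
    # Report the type of object each mapped call receives
    return type(arg).__name__


# Mapping over the DataFrame itself iterates its column labels (strings here)
over_frame = list(map(describe, chunk))

# Mapping over a list of chunks hands each call a whole DataFrame
over_list = list(map(describe, [chunk]))

print(over_frame)  # ['str', 'str', 'str']
print(over_list)   # ['DataFrame']
```

If this is the cause, one possible fix is to pass the iterator of chunks to the pool once, for example p.map(getDistance, pd.read_sql(query, engine, chunksize=1)), where query stands for the question's SQL string, so that each worker call receives a full DataFrame. This is an assumption based on the code shown in the question and has not been run against the original database.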

