IndexError multiprocessing.Pool

Problem description

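I get an IndexError when I use multiprocessing to process parts of a pandas DataFrame in parallel. vacancies is a pandas DataFrame containing multiple job vacancies, and one of its columns holds the raw text.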

def addSkillRelevance(vacancies):
    skills = pickle.load(open("skills.pkl", "rb"))

    vacancies['skill'] = ''
    vacancies['skillcount'] = 0
    vacancies['all_skills_in_vacancy'] = ''
    new_vacancies = pd.DataFrame(columns=vacancies.columns)

    for vacancy_index, vacancy_row in vacancies.iterrows():

        #Create a df for which each row is a found skill (with the other attributes of the vacancy)
        per_vacancy_df = pd.DataFrame(columns=vacancies.columns)
        all_skills_in_vacancy = []
        skillcount = 0

        for skill_index, skill_row in skills.iterrows():

            #Making the search for the skill in the text body a bit smarter
            spaceafter = ' ' + skill_row['txn_skill_name'] + ' '
            newlineafter = ' ' + skill_row['txn_skill_name'] + '\n'
            tabafter = ' ' + skill_row['txn_skill_name'] + '\t'

            #Statement that returns true if we find a variation of the skill in the text body
            if((spaceafter in vacancies.at[vacancy_index,'body']) or (newlineafter in vacancies.at[vacancy_index,'body']) or (tabafter in vacancies.at[vacancy_index,'body'])):
                #Adding the skill to the list of skills found in the vacancy
                all_skills_in_vacancy.append(skill_row['txn_skill_name'])

                #Increasing the skillcount
                skillcount += 1

                #Adding the skill to the row
                vacancies.at[vacancy_index,'skill'] = skill_row['txn_skill_name']

                #Add a row to the vacancy df where 1 row, means 1 skill
                per_vacancy_df = per_vacancy_df.append(vacancies.iloc[vacancy_index])

        #Adding the list of all found skills in the vacancy to each (skill) row
        per_vacancy_df['all_skills_in_vacancy'] = str(all_skills_in_vacancy)
        per_vacancy_df['skillcount'] = skillcount

        #Adds the individual vacancy df to a new vacancy df
        new_vacancies = new_vacancies.append(per_vacancy_df)  
    return(new_vacancies)

def executeSkillScript(vacancies):
        from multiprocessing import Pool

        vacancies = vacancies.head(100298)

        num_workers = 47
        pool = Pool(num_workers)

        vacancy_splits = np.array_split(vacancies, num_workers)
        results_list = pool.map(addSkillRelevance,vacancy_splits)
        new_vacancies = pd.concat(results_list, axis=0)

        pool.close()
        pool.join()

executeSkillScript(vacancies)

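The function addSkillRelevance() takes a pandas DataFrame and returns a pandas DataFrame (with more columns). For some reason, once all the multiprocessing work has finished, I get an IndexError at results_list = pool.map(addSkillRelevance,vacancy_splits). I am stuck because I do not know how to handle the error. Does anyone have a hint as to why the IndexError occurs?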

The error:

    IndexError                                Traceback (most recent call last)
<ipython-input-11-7cb04a51c051> in <module>()
----> 1 executeSkillScript(vacancies)

<ipython-input-9-5195d46f223f> in executeSkillScript(vacancies)
     14 
     15     vacancy_splits = np.array_split(vacancies, num_workers)
---> 16     results_list = pool.map(addSkillRelevance,vacancy_splits)
     17     new_vacancies = pd.concat(results_list, axis=0)
     18 

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267 
    268     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

IndexError: single positional indexer is out-of-bounds


Tags: python-3.x, pandas, multiprocessing

Solution


The error comes from this line:

per_vacancy_df = per_vacancy_df.append(vacancies.iloc[vacancy_index])

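The error occurs because vacancy_index is an index label yielded by iterrows(), while .iloc expects an integer position. np.array_split keeps the original index labels in every chunk, so (assuming vacancies has a default integer index) every chunk after the first carries labels that are larger than the chunk's length, and vacancies.iloc[vacancy_index] is out of bounds. Looking the row up by label with vacancies.loc[vacancy_index], or calling reset_index(drop=True) on each chunk before iterating, avoids the mismatch.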
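The mismatch between labels and positions can be reproduced on a small scale. The following is only a sketch: the six-row vacancies DataFrame and the variable names are made up for illustration, and the chunks are built with position slices to mimic what np.array_split does in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real vacancies DataFrame (default integer index 0..5).
vacancies = pd.DataFrame({"body": ["a", "b", "c", "d", "e", "f"]})

# Split into 3 chunks by position; each chunk keeps its original index labels.
positions = np.array_split(np.arange(len(vacancies)), 3)
chunks = [vacancies.iloc[p] for p in positions]

second = chunks[1]       # index labels are 2 and 3, but valid positions are only 0 and 1
label = second.index[1]  # this is what iterrows() would yield as vacancy_index

# .iloc treats the label as a position, which is past the end of the chunk.
try:
    second.iloc[label]
    iloc_failed = False
except IndexError:
    iloc_failed = True

# Option 1: look the row up by label instead of by position.
row_by_label = second.loc[label]

# Option 2: reset the index so that labels and positions coincide again.
reset = second.reset_index(drop=True)
row_by_position = reset.iloc[1]
```

In the first chunk the labels happen to equal the positions, which is presumably why the function works on an unsplit DataFrame and only fails once the data is split across workers.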

