How do I apply executor.map to a for loop without a function?

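Problem description

I have a list of XML documents and a for loop that flattens each one into a pandas DataFrame.

The for loop works perfectly well, but flattening the XML takes a very long time, and the XML keeps getting larger over time.

How can I wrap the for loop below in executor.map to spread the workload across different cores? I am following this article: https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a

The for loop that flattens the XML: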

df1 = pd.DataFrame()
for i in lst:
    print('i am working')
    soup = BeautifulSoup(i, "xml")
    # Get Attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)

    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list= [x for x in attrs if 'Id' in x.keys()]
    other_attribute_list = [x for x in attrs if 'Id' not in x.keys() and x != {}]

    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():  
            attribute_dict.setdefault(k, v)

    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)

    # Make Dataframe
    df = pd.DataFrame(full_list)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df1 = pd.concat([df1, df])

Does the for loop need to be converted into a function?

Tags: python, python-3.x, concurrent.futures

Solution


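Yes, you do need to convert the loop body into a function. The function must be callable with a single argument. That one argument can be anything: a list, a tuple, a dictionary, or any other object. Passing a function that takes several arguments to the concurrent.futures.*Executor methods is a little more involved.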
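For the multi-argument case mentioned above, here is a minimal sketch of two common approaches: freezing the extra arguments with functools.partial, or giving executor.map one iterable per parameter. The worker tag_document, its source and version parameters, and the sample docs list are hypothetical names made up for illustration; they are not part of the question.

```python
from concurrent import futures
from functools import partial


def tag_document(xml, source, version):
    """Hypothetical worker that needs two extra arguments besides the XML."""
    return f"{source}/v{version}: {len(xml)} chars"


docs = ["<a/>", "<bb/>", "<ccc/>"]  # stand-in for the real list of XML strings

with futures.ThreadPoolExecutor() as executor:
    # Option 1: freeze the extra arguments so the worker takes one argument.
    worker = partial(tag_document, source="feed", version=2)
    frozen = list(executor.map(worker, docs))

    # Option 2: pass one iterable per parameter; map pairs them up like zip.
    sources = ["feed"] * len(docs)
    versions = [2] * len(docs)
    zipped = list(executor.map(tag_document, docs, sources, versions))
```

Both calls should produce the same list. The partial form is usually the simpler choice when the extra arguments are the same for every item.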

The example below should work for you.

from bs4 import BeautifulSoup
import pandas as pd
from concurrent import futures


def create_dataframe(xml):
    soup = BeautifulSoup(xml, "xml")
    # Get Attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)

    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list = [x for x in attrs if 'FieldId' in x.keys()]
    other_attribute_list = [x for x in attrs if 'FieldId' not in x.keys() and x != {}]

    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)

    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    print(len(full_list))
    # Make Dataframe
    df = pd.DataFrame(full_list)
    # print(df)
    return df


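# lst is the list of XML strings from the question.
# Threads share one interpreter lock; use ProcessPoolExecutor to spread
# CPU-bound parsing across cores.
with futures.ThreadPoolExecutor() as executor:
    # executor.map returns the results in the same order as lst
    df_list = list(executor.map(create_dataframe, lst))

full_df = pd.concat(df_list)
print(full_df)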
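If you switch to ProcessPoolExecutor, the script usually needs a bit more structure. The sketch below shows the shape under some assumptions: count_attributes is a hypothetical stand-in for create_dataframe, it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, and run_in_pool is a helper name invented here.

```python
from concurrent import futures
import xml.etree.ElementTree as ET


def count_attributes(xml):
    """Hypothetical stand-in for create_dataframe: parse one XML string
    and return the total number of attributes across all of its nodes."""
    root = ET.fromstring(xml)
    return sum(len(elm.attrib) for elm in root.iter())


def run_in_pool(xml_docs, executor_cls=futures.ProcessPoolExecutor):
    # The worker has to be a module-level function so it can be pickled
    # and sent to the worker processes.
    with executor_cls() as executor:
        # map yields results in the same order as xml_docs
        return list(executor.map(count_attributes, xml_docs))


if __name__ == "__main__":
    # The guard stops worker processes from re-running this block when they
    # import the module (relevant with the "spawn" start method).
    docs = ['<r a="1"><f Id="x" b="2"/></r>', "<r/>"]
    print(run_in_pool(docs))
```

For many small documents, passing a chunksize argument to executor.map can reduce the overhead of sending items to the worker processes one at a time.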
