How to apply a masked array to a very large JSON fast

Problem description

The Data

I am currently working on very large JSON files formatted as follows:

{key: [1000+ * arrays of length 241],
 key2: [1000+ * arrays of length 241],
 (...repeat 5-8 times...)}

The data is structured so that the nth element in each key's array belongs to the nth entity. Think of each key as a descriptor such as 'height' or 'pressure'. To get an entity's 'height' and 'pressure', you access index n in all the arrays. All the keys' arrays therefore have the same length Z.

This, as you can imagine, is a pain to work with as a whole. Therefore, whenever I perform any data manipulation I return a masked array of length Z populated with 1's and 0's (1 means the data at that index in every key is to be kept and 0 means it should be omitted).
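To make the layout concrete, a tiny made-up example might look like the following. The key names 'height' and 'pressure' come from the description above; the values, the inner array length of 3 (instead of 241), and the variable names are purely illustrative assumptions.

```python
# Toy stand-in for the real file: 4 entities, 2 descriptors,
# inner arrays shortened from 241 values to 3 for readability.
data = {
    "height": [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    "pressure": [[9, 9, 9], [8, 8, 8], [7, 7, 7], [6, 6, 6]],
}

# One flag per entity: 1 = keep, 0 = omit.
mask = [1, 0, 1, 0]

# Every key's list and the mask share the same length Z.
Z = len(mask)
assert all(len(arrays) == Z for arrays in data.values())

# Entity n is described by the nth element of every key.
n = 2
entity = {key: arrays[n] for key, arrays in data.items()}
print(entity)  # {'height': [7, 8, 9], 'pressure': [7, 7, 7]}
```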


The Problem

Once all of my data manipulation has been performed, I need to apply the masked array to the data and return a copy of the original JSON data in which the length Z of each key's array equals the number of 1's in the masked array. (If the element of the masked array at index n is 0, the element at index n is removed from all of the JSON keys' arrays; if it is 1, the element is kept.)


My attempt

# mask: masked array
# d: data to apply the mask to
def apply_mask(mask, d):
    keys = d.keys()
    print(keys)
    rem = [] #List of index to remove
    for i in range(len(mask)):
        if mask[i] == 0:
            rem.append(i) #Populate 'rem'

        for k in keys:
            d[k] = [elem for elem in d[k] if not d[k].index(elem) in rem]

    return d

This works as intended but takes a while on such large JSON data.


Question

I hope everything above was clear and helps you to understand my question:

Is there a more efficient way to apply a masked array to data such as that shown above?

Cheers

Tags: python, arrays, json, list, performance

Solution


This is going to be slow because

d[k] = [elem for elem in d[k] if not d[k].index(elem) in rem]

rebuilds the inner list completely every time it runs, and each d[k].index(elem) call is itself a linear search through that list.
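Beyond speed, list.index returns the position of the first equal element, so entities whose arrays happen to be identical are treated as the same index. A minimal sketch with made-up values (the names values, rem, and kept are illustrative only):

```python
# Entities 0 and 2 happen to hold equal arrays.
values = [[1, 2], [3, 4], [1, 2]]

# Only index 0 is supposed to be removed.
rem = [0]

# Same filtering expression as in the question, applied to one list.
kept = [elem for elem in values if values.index(elem) not in rem]

# values.index([1, 2]) is 0 for both copies, so index 2 is dropped as well.
print(kept)  # [[3, 4]]
```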

Since you're already modifying d in-place, you could just delete the respective elements:

def apply_mask(mask, d):
    for i, keep in enumerate(mask):
        if not keep:
            for key in d:
                del d[key][i - len(mask)]
    return d

(The negative index i - len(mask) is used because positive indices no longer point at the right element once earlier deletions have shortened the list. Counting from the end is unaffected, since nothing after position i has been removed yet.)
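Since the question asks for a copy rather than an in-place change, one possible alternative is to build new lists in a single pass with itertools.compress from the standard library. This is only a sketch and not the approach given in the answer above: the function name apply_mask_copy and the toy data are made up for illustration, and it assumes the mask is a plain sequence of 1's and 0's with the same length as every list.

```python
from itertools import compress


def apply_mask_copy(mask, d):
    """Return a new dict whose lists keep only the entries flagged truthy in mask.

    The outer lists are new; the inner arrays are shared with d (shallow copy).
    """
    return {key: list(compress(values, mask)) for key, values in d.items()}


# Hypothetical toy data: 4 entities, inner arrays shortened for readability.
data = {
    "height": [[1, 2], [3, 4], [5, 6], [7, 8]],
    "pressure": [[9, 9], [8, 8], [7, 7], [6, 6]],
}
mask = [1, 0, 1, 0]

filtered = apply_mask_copy(mask, data)
print(filtered)
# {'height': [[1, 2], [5, 6]], 'pressure': [[9, 9], [7, 7]]}
print(len(data["height"]))  # 4, the original is left untouched
```

Each list is traversed once here, so the work per key grows linearly with Z, whereas every del in the middle of a list has to shift all the elements that follow it.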

