Splitting list of nested json to multiple columns

Problem description

This is something of an extension of a previous question I asked, but with a different scope and approach.

I have a dataframe with a column in which each row holds a list of dictionaries:

0    [{"date":"0 1 0" firstBoxerRating:[null null] ...
1    [{"date":"2 2 1" firstBoxerRating:[null null] ...
2    [{"date":"2013-10-05" firstBoxerRating:[null n...

This is a short sample of some of the data in a given row:

[{"date":"2 2 1" firstBoxerRating:[null null] firstBoxerWeight:201.75 judges:[{"id":404749 name:"David Hudson" scorecard:[]} {"id":477070 name:"Mark Philips" scorecard:[]} {"id":404277 name:"Oren Shellenberger" scorecard:[]}] links:{"bio":1346666 bout:"558867/1346666" event:558867 other:[]} location:"Vanderbilt University Memorial Gymnasium Nashville" metadata:" time: 2:54\n | <span>referee:</span> <a href=\"/en/referee/403887\">Anthony Bryant</a><span> | </span><a href=\"/en/judge/404749\">David Hudson</a> | <a href=\"/en/judge/477070\">Mark Philips</a> 

I would like to create a clean dataframe where each key in the dictionary becomes a column and each value becomes the entry in that column for the row.

So here is an example of my desired output using the short sample as the input data:

date   firstBoxerRating  firstBoxerWeight judges  id.......
2 2 1    [null null]          201.75              404749.....

I do not believe the question is a duplicate of this

I have tried every solution in this question, but my data also contains lists of nested dictionaries and, if anything, resembles JSON.

For example, this solution:

pd.DataFrame.from_dict({(i,j): df[i][j] 
                           for i in df.keys() 
                           for j in df[i].keys()},
                       orient='index')

produces exactly the same output I already have.

I have also tried unpacking the dicts in the column:

df[0].apply(pd.Series)

However, this again produces the same output.
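One possible reason both attempts return the input unchanged is that the cells hold plain text rather than Python lists of dicts. The printed sample (unquoted keys, no commas between fields) hints at this, but it is only an assumption. A minimal sketch for checking it, using a hypothetical one-row frame built from a shortened copy of the sample and a hypothetical helper named is_valid_json:

```python
import json
import pandas as pd

# Hypothetical stand-in for the question's frame: a single column labelled 0
# whose cells are text copied (and shortened) from the sample in the question.
df = pd.DataFrame(
    {0: ['[{"date":"2 2 1" firstBoxerRating:[null null] firstBoxerWeight:201.75}]']}
)

# Which Python types do the cells actually hold?
cell_types = df[0].map(type).value_counts()
print(cell_types)


def is_valid_json(text):
    """Return True if json.loads can parse text, False otherwise."""
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):  # JSONDecodeError is a subclass of ValueError
        return False


# Can the standard JSON parser read each cell?
parseable = df[0].map(is_valid_json)
print(parseable)
```

If the cells turn out to be strings that json.loads rejects, dictionary-based reshaping has nothing to unpack, and the text has to be parsed some other way first.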

Tags: python, json, pandas

Solution


I managed to solve this using a regex and str.extract.

I extract the text between two strings and assign that text to its corresponding column.

Example:

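# The last group is greedy (.*): a lazy .*? at the very end of a pattern always matches the empty string.
df[0].str.extract(r'date(?P<date>.*?)firstBoxerRating(?P<firstBoxerRating>.*?)firstBoxerWeight(?P<firstBoxerWeight>.*?)judges(?P<JudgeID>.*?)links(?P<Links>.*?)location(?P<location>.*?)metadata(?P<metadata>.*)')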
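To see the approach above end to end, here is a small self-contained sketch. The two sample rows are hypothetical, shortened strings modelled on the question's data, and the pattern differs slightly from the one in the answer: it also consumes the separators (`"date":`, ` firstBoxerRating:` and so on) so that the captured values come out cleaner. It assumes every row lists the keys in the same order and ends with `}]`.

```python
import re
import pandas as pd

# Hypothetical two-row sample shaped like the strings in the question (shortened).
raw = pd.Series([
    '[{"date":"2 2 1" firstBoxerRating:[null null] firstBoxerWeight:201.75 '
    'judges:[{"id":404749 name:"David Hudson" scorecard:[]}] '
    'links:{"bio":1346666 other:[]} location:"Nashville" metadata:" time: 2:54"}]',
    '[{"date":"2013-10-05" firstBoxerRating:[null null] firstBoxerWeight:199.5 '
    'judges:[] links:{"bio":1 other:[]} location:"Memphis" metadata:" time: 1:10"}]',
])

# One named group per key; each lazy group stops at the next key name.
# The last group is greedy and is closed off by the trailing "}]".
pattern = (
    r'"date":(?P<date>.*?) firstBoxerRating:(?P<firstBoxerRating>.*?)'
    r' firstBoxerWeight:(?P<firstBoxerWeight>.*?) judges:(?P<judges>.*?)'
    r' links:(?P<links>.*?) location:(?P<location>.*?) metadata:(?P<metadata>.*)\}\]'
)

# Named groups become the column names; DOTALL lets "." span line breaks.
out = raw.str.extract(pattern, flags=re.DOTALL)

# Light clean-up: drop surrounding quotes and convert the weight to a number.
out["date"] = out["date"].str.strip('"')
out["location"] = out["location"].str.strip('"')
out["firstBoxerWeight"] = out["firstBoxerWeight"].astype(float)

print(out[["date", "firstBoxerWeight", "location"]])
```

Note that str.extract keeps only the first match per row. If a row holds several dictionaries, str.extractall returns one output row per match instead.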
