Faster Way To Make Pandas Multiindex Dataframe Than Append

I am looking for a faster way to load data from my json object into a multiindex dataframe. My JSON is like: { '1990-1991': { 'Cleveland': {

Solution 1:

You can adapt the answer to a very similar question as follows:

import json
import pandas as pd

# json_data is the JSON string from the question
z = json.loads(json_data)

out = pd.Series({
    (i,j,m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ['players']
    for m in z[i][j][k]
}).to_frame('salary').rename_axis('year team player'.split())

# out:

                                           salary
year      team      player                       
1990-1991 Cleveland Hot Rod Williams   $3,785,000
                    Danny Ferry        $2,640,000
                    Mark Price         $1,400,000
                    Brad Daugherty     $1,320,000
                    Larry Nance        $1,260,000
                    Chucky Brown         $630,000
                    Steve Kerr           $548,000
                    Derrick Chievous     $525,000
                    Winston Bennett      $525,000
                    John Morton          $350,000
                    Milos Babic          $200,000
                    Gerald Paddio        $120,000
                    Darnell Valentine    $100,000
                    Henry James           $75,000

Also, if you intend to do some numerical analysis with those salaries, you probably want them as numbers, not strings. If so, also consider:

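# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))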
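
As a quick check of that conversion, here is a minimal sketch on two salary strings copied from the output above. The variable names `salary` and `numeric` are only illustrative.

```python
import pandas as pd

# Two salary strings copied from the output above.
salary = pd.Series(["$3,785,000", "$75,000"], name="salary")

# Drop every non-digit character ("$" and ","), then parse the rest as integers.
# regex=True is spelled out because pandas 2.0+ defaults to a literal match.
numeric = pd.to_numeric(salary.str.replace(r"\D", "", regex=True))

print(numeric.tolist())  # [3785000, 75000]
```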

PS: Explanation:

The for lines are just one big comprehension to flatten your nested dict. To understand how it works, try first:

[
    (i,j)
    for i in z
    for j in z[i]
]

The third for would normally iterate over all keys of z[i][j], which are ['salary', 'players', 'url']. We are only interested in 'players', so we iterate over a one-element list containing just that key.

The final bit is, instead of a list, we want a dict. Try the expression without surrounding with pd.Series() and you'll see exactly what's going on.
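
To make the two steps concrete, here is a small sketch on a trimmed stand-in for the parsed JSON. It assumes the nesting described above (year, then team, then a 'players' dict); the two players come from the output shown earlier, and the 'url' value is a placeholder rather than data from the question.

```python
import pandas as pd

# Trimmed stand-in for the parsed JSON (assumed shape: year -> team -> keys).
# The "url" value is a placeholder, not taken from the question.
z = {
    "1990-1991": {
        "Cleveland": {
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            },
            "url": "placeholder",
        }
    }
}

# Step 1: flatten the nested dict into one dict keyed by (year, team, player).
flat = {
    (year, team, player): salary
    for year, teams in z.items()
    for team, info in teams.items()
    for player, salary in info["players"].items()
}
print(flat)

# Step 2: pandas turns the tuple keys into the levels of a MultiIndex.
out = pd.Series(flat).to_frame("salary").rename_axis(["year", "team", "player"])
print(out)
```

Iterating with .items() is just a readability choice here; it builds the same dict as indexing with z[i][j][k][m].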

Solution 2:

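We can build one small dataframe per team inside a for loop, append each one to a list, and concatenate only once at the end. Delaying the concatenation until the end is much better than appending dataframes within the loop, since every dataframe append copies all the data accumulated so far.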

box = []
# data refers to the shared json in the question
for year, value in data.items():
    for team, players in value.items():
        content = players["players"]
        content = pd.DataFrame.from_dict(
            content, orient="index", columns=["salary"]
        ).rename_axis(index="player")
        content = content.assign(year=year, team=team)
        box.append(content)

box

[                       salary       year       team
 player                                             
 Hot Rod Williams   $3,785,000  1990-1991  Cleveland
 Danny Ferry        $2,640,000  1990-1991  Cleveland
 Mark Price         $1,400,000  1990-1991  Cleveland
 Brad Daugherty     $1,320,000  1990-1991  Cleveland
 Larry Nance        $1,260,000  1990-1991  Cleveland
 Chucky Brown         $630,000  1990-1991  Cleveland
 Steve Kerr           $548,000  1990-1991  Cleveland
 Derrick Chievous     $525,000  1990-1991  Cleveland
 Winston Bennett      $525,000  1990-1991  Cleveland
 John Morton          $350,000  1990-1991  Cleveland
 Milos Babic          $200,000  1990-1991  Cleveland
 Gerald Paddio        $120,000  1990-1991  Cleveland
 Darnell Valentine    $100,000  1990-1991  Cleveland
 Henry James           $75,000  1990-1991  Cleveland]

Concatenate and reorder index levels:

(
    pd.concat(box)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
)
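
For reference, the whole of this approach can be run end to end on a trimmed sample. This is only a sketch: `data` here is a two-player stand-in built from the output above, not the full JSON from the question, and the names `frames` and `result` are illustrative.

```python
import pandas as pd

# Two-player stand-in for the question's JSON (assumed shape: year -> team -> "players").
data = {
    "1990-1991": {
        "Cleveland": {
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            }
        }
    }
}

frames = []
for year, teams in data.items():
    for team, info in teams.items():
        # One small frame per team, indexed by player name.
        frame = pd.DataFrame.from_dict(
            info["players"], orient="index", columns=["salary"]
        ).rename_axis(index="player")
        frames.append(frame.assign(year=year, team=team))

# A single concat at the end, then move year and team into the index.
result = (
    pd.concat(frames)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
)
print(result)
```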
