Faster Way To Make Pandas Multiindex Dataframe Than Append
Solution 1:
You can adapt the answer to a very similar question as follows:
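import json
import pandas as pd

# json_data is the JSON string from the question
z = json.loads(json_data)
out = pd.Series({
    (i, j, m): z[i][j][k][m]
    for i in z              # year
    for j in z[i]           # team
    for k in ['players']    # keep only the 'players' entry
    for m in z[i][j][k]     # player name
}).to_frame('salary').rename_axis('year team player'.split())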
# out:
                                           salary
year      team      player
1990-1991 Cleveland Hot Rod Williams   $3,785,000
                    Danny Ferry        $2,640,000
                    Mark Price         $1,400,000
                    Brad Daugherty     $1,320,000
                    Larry Nance        $1,260,000
                    Chucky Brown         $630,000
                    Steve Kerr           $548,000
                    Derrick Chievous     $525,000
                    Winston Bennett      $525,000
                    John Morton          $350,000
                    Milos Babic          $200,000
                    Gerald Paddio        $120,000
                    Darnell Valentine    $100,000
                    Henry James           $75,000
Also, if you intend to do some numerical analysis with those salaries, you probably want them as numbers, not strings. If so, also consider:
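# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))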
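To see what that conversion does in isolation, here is a minimal sketch on three salary strings copied from the output above. The variable names sample and cleaned are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Three salary strings taken from the table above; `sample` is a made-up name.
sample = pd.Series(["$3,785,000", "$630,000", "$75,000"], name="salary")

# Remove every non-digit character ("$" and ","), then parse the rest as numbers.
# regex=True is spelled out so that r"\D" is treated as a regular expression.
cleaned = pd.to_numeric(sample.str.replace(r"\D", "", regex=True))

print(cleaned.tolist())  # [3785000, 630000, 75000]
```

After this step the column has an integer dtype, so sums, means and sorting behave numerically rather than alphabetically.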
PS: Explanation:
The for lines are just one big comprehension to flatten your nested dict. To understand how it works, try first:
[
    (i, j)
    for i in z
    for j in z[i]
]
The 3rd for would list all the keys of z[i][j], which are ['salary', 'players', 'url'], but we are only interested in 'players', so we iterate over that single key only.
The final bit is that, instead of a list, we want a dict. Try the expression without the surrounding pd.Series() and you'll see exactly what's going on.
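As a sketch of that last suggestion, the comprehension can be run on its own against a cut-down stand-in for the question's data. The dict below is an assumption: it keeps only two players, and its 'salary' and 'url' values are placeholders rather than real data.

```python
# Hypothetical, shortened input with the same nesting: year -> team -> keys.
z = {
    "1990-1991": {
        "Cleveland": {
            "salary": "placeholder",  # not used by the comprehension
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            },
            "url": "placeholder",     # not used by the comprehension
        }
    }
}

# Same comprehension as in the answer, without the pd.Series() wrapper.
flat = {
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ["players"]
    for m in z[i][j][k]
}

for key, value in flat.items():
    print(key, value)
```

Each key is a (year, team, player) tuple, which is why pd.Series turns the dict into a three-level MultiIndex.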
Solution 2:
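We can use a for loop to build one small dataframe per team, collect the pieces in a list, and concatenate once at the end. Delaying the concatenation until the end is much better than appending dataframes within the loop, because every in-loop append copies all the data accumulated so far.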
box = []
# data refers to the shared json in the question
for year, value in data.items():
    for team, players in value.items():
        content = players["players"]
        content = pd.DataFrame.from_dict(
            content, orient="index", columns=["salary"]
        ).rename_axis(index="player")
        content = content.assign(year=year, team=team)
        box.append(content)
box
[                       salary       year       team
 player
 Hot Rod Williams   $3,785,000  1990-1991  Cleveland
 Danny Ferry        $2,640,000  1990-1991  Cleveland
 Mark Price         $1,400,000  1990-1991  Cleveland
 Brad Daugherty     $1,320,000  1990-1991  Cleveland
 Larry Nance        $1,260,000  1990-1991  Cleveland
 Chucky Brown         $630,000  1990-1991  Cleveland
 Steve Kerr           $548,000  1990-1991  Cleveland
 Derrick Chievous     $525,000  1990-1991  Cleveland
 Winston Bennett      $525,000  1990-1991  Cleveland
 John Morton          $350,000  1990-1991  Cleveland
 Milos Babic          $200,000  1990-1991  Cleveland
 Gerald Paddio        $120,000  1990-1991  Cleveland
 Darnell Valentine    $100,000  1990-1991  Cleveland
 Henry James           $75,000  1990-1991  Cleveland]
Concatenate and reorder index levels:
(
    pd.concat(box)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
)
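For anyone who wants to run Solution 2 end to end without the original JSON, here is a minimal self-contained sketch. The data dict is a hypothetical two-player stand-in with the same year -> team -> 'players' nesting, and the names frame and result are illustrative.

```python
import pandas as pd

# Hypothetical cut-down input; the real data comes from the question's JSON.
data = {
    "1990-1991": {
        "Cleveland": {
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            },
        }
    }
}

box = []
for year, value in data.items():
    for team, players in value.items():
        # One small frame per team, indexed by player name.
        frame = pd.DataFrame.from_dict(
            players["players"], orient="index", columns=["salary"]
        ).rename_axis(index="player")
        box.append(frame.assign(year=year, team=team))

# Concatenate once, then move year and team into the index.
result = (
    pd.concat(box)
    .set_index(["year", "team"], append=True)
    .reorder_levels(["year", "team", "player"])
)
print(result)
```

The resulting frame has a single salary column and a (year, team, player) MultiIndex, the same shape Solution 1 produces.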