
Aggregations Over Specific Columns Of A Large Dataframe, With Named Output

I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should p

Solution 1:

Not a groupby solution, and it uses a loop, but I think it's nonetheless rather elegant: first collect the unique combinations of the first and last parts of the column names using a set, then compute the sums using filter:

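# unique (first part, last part) pairs of the column names, e.g. ('A', 'E')
cols = sorted({(c.split('.')[0], c.split('.')[-1]) for c in df.columns})
for c0, c1 in cols:
    # raw f-string so \. and \d reach the regex engine unchanged; ^ and $ match the whole name
    df[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'^{c0}\.\d+\.{c1}$').sum(axis=1)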

Result:

            A.1.E  A.1.F  A.1.G  A.2.E  ...  B.SUM.G  C.SUM.E  C.SUM.F  C.SUM.G
2018-08-31    978    746    408    109  ...     4061     5413     4102     4908
2018-09-30    923    649    488    447  ...     5585     3634     3857     4228
2018-10-31    911    359    897    425  ...     5039     2961     5246     4126
2018-11-30     77    479    536    509  ...     4634     4325     2975     4249
2018-12-31    608    995    114    603  ...     5377     5277     4509     3499
2019-01-31    138    612    363    218  ...     4514     5088     4599     4835
2019-02-28    994    148    933    990  ...     3907     4310     3906     3552
2019-03-31    950    931    209    915  ...     4354     5877     4677     5557
2019-04-30    255    168    357    800  ...     5267     5200     3689     5001
2019-05-31    593    594    824    986  ...     4221     2108     4636     3606
2019-06-30    975    396    919    242  ...     3841     4787     4556     3141
2019-07-31    350    312    104    113  ...     4071     5073     4829     3717
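To try the loop on something small, here is a minimal sketch on a toy frame. The data and the variable names `pairs` and `pattern` are made up for illustration; the only assumption taken from the answer is that column names look like `<prefix>.<number>.<suffix>`. `re.escape` is an extra precaution in case a prefix or suffix contains regex metacharacters.

```python
import re

import pandas as pd

# Hypothetical toy data; names follow the <prefix>.<number>.<suffix> pattern
df = pd.DataFrame(
    {
        "A.1.E": [1, 2],
        "A.2.E": [10, 20],
        "A.1.F": [3, 4],
        "B.1.E": [5, 6],
        "B.2.E": [50, 60],
    },
    index=pd.to_datetime(["2018-08-31", "2018-09-30"]),
)

# Unique (prefix, suffix) pairs, computed before any SUM column is added
pairs = sorted({(c.split(".")[0], c.split(".")[-1]) for c in df.columns})

for c0, c1 in pairs:
    # Anchored pattern: only <prefix>.<digits>.<suffix> columns are selected
    pattern = rf"^{re.escape(c0)}\.\d+\.{re.escape(c1)}$"
    df[f"{c0}.SUM.{c1}"] = df.filter(regex=pattern).sum(axis=1)

# Show only the newly added sum columns
print(df.filter(like=".SUM."))
```

Because the pattern requires digits in the middle part, the freshly added `.SUM.` columns are never picked up by a later iteration.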



If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:

result = pd.DataFrame()
for c0, c1 in cols:
    # raw f-string so \. and \d reach the regex engine unchanged
    result[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'^{c0}\.\d+\.{c1}$').sum(axis=1)
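For a frame with many column groups, adding columns one at a time can make pandas emit a PerformanceWarning about a fragmented DataFrame. A possible variant, sketched here on made-up data, computes all the sums first and assembles the new frame with a single `pd.concat` call. The names `pairs` and `pieces` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {"A.1.E": [1, 2], "A.2.E": [10, 20], "B.1.F": [3, 4], "B.2.F": [30, 40]}
)

pairs = sorted({(c.split(".")[0], c.split(".")[-1]) for c in df.columns})

# Compute every row-wise sum first, keyed by its output column name
pieces = {
    f"{c0}.SUM.{c1}": df.filter(regex=rf"^{c0}\.\d+\.{c1}$").sum(axis=1)
    for c0, c1 in pairs
}

# One concat call builds the frame; the dict keys become the column names
result = pd.concat(pieces, axis=1)
print(result)
```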

Update: using a simple groupby (which is even simpler in this particular case):

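def grouper(col):
    # map e.g. 'A.1.E' to 'A.SUM.E'; columns sharing a label are summed together
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

# axis=1 groups the columns instead of the rows (deprecated since pandas 2.1)
df.groupby(grouper, axis=1).sum()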
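On recent pandas versions, grouping along axis=1 raises a deprecation warning. One way around it, sketched below on made-up data, is to transpose, group the rows by the same function, and transpose back. This assumes all columns are numeric; with mixed dtypes the transpose would produce object columns.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {"A.1.E": [1, 2], "A.2.E": [10, 20], "A.1.F": [3, 4], "B.1.E": [5, 6]}
)


def grouper(col):
    # 'A.1.E' -> 'A.SUM.E'
    parts = col.split(".")
    return f"{parts[0]}.SUM.{parts[-1]}"


# After the transpose the column names form the index, so a plain
# row-wise groupby with the same function does the job
grouped = df.T.groupby(grouper).sum().T
print(grouped)
```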
