Python Pandas Summarize Round Trip In Dataframe

December 26, 2023

I have a dataframe (~30 000 rows) with counts of trips by station code.

|station from|station to|count|
|:-----------|:---------|:----|
|20001 |20040 |55 |
|20040 |2000

Solution 1:

This works even in the face of duplicated entries, and is quite fast (<250ms per million rows):

def roundtrip(df):
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    idx = df[a] > df[b]
    df = df.assign(**{d: 0})
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b]).sum()

On your example data (and yes, you can call .reset_index() if you prefer):

>>> roundtrip(df)
                         count  count_back
station from station to                   
20001        20040          54          55
20007        20080         100          50
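To illustrate the claim about duplicated entries, here is a small self-contained sketch. The `trips` frame is hypothetical data made up for this example (it is not from the question): the pair (1, 2) appears twice in each direction, and the trip 3 -> 4 has no return trip.

```python
import pandas as pd


def roundtrip(df):
    a, b, c, d = 'station from', 'station to', 'count', 'count_back'
    idx = df[a] > df[b]                # rows stored in the "backward" direction
    df = df.assign(**{d: 0})           # new frame with a zero-filled count_back
    # For backward rows, swap the stations and move count into count_back.
    df.loc[idx, [a, b, c, d]] = df.loc[idx, [b, a, d, c]].values
    return df.groupby([a, b]).sum()


# Hypothetical data, for illustration only.
trips = pd.DataFrame({
    'station from': [1, 2, 1, 2, 3],
    'station to':   [2, 1, 2, 1, 4],
    'count':        [10, 5, 1, 2, 7],
})

result = roundtrip(trips).reset_index()
print(result)
```

Because every row is first normalized so that the smaller code comes first, the final groupby simply adds up duplicates, and a one-way trip keeps a count_back of 0.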

Timing test:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'station from': np.random.randint(1000, 2000, n),
    'station to': np.random.randint(1000, 2000, n),
    'count': np.random.randint(0, 200, n),
})

%timeit roundtrip(df)
217 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(On 100K rows, it is 32.4 ms ± 333 µs per loop)

Solution 2:

Let's try sorting the stations and then pivoting:

# the two stations
cols = ['station from', 'station to']

# back and forth
df['col'] = np.where(df['station from'] < df['station to'], 'count', 'count_back')

# rearrange the stations
df[cols] = np.sort(df[cols], axis=1)

# pivot
print(df.pivot(index=cols, columns='col', values='count')
   .reset_index()
)

Output:

col  station from  station to  count  count_back
0           20001       20040     55          67
1           20007       20080    100          50
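One caveat: DataFrame.pivot raises a ValueError when the same (index, column) combination occurs more than once, so the pivot approach assumes each directed pair appears only once. A possible variant, sketched here on hypothetical data and not part of the original answer, is to use pivot_table with aggfunc='sum', which adds up duplicates, and fill_value=0 for pairs that have only one direction.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a duplicated (1 -> 2) row and a one-way trip (3 -> 4).
df = pd.DataFrame({
    'station from': [1, 2, 1, 3],
    'station to':   [2, 1, 2, 4],
    'count':        [10, 5, 1, 7],
})

cols = ['station from', 'station to']

# Label each row by direction, then put the smaller station code first.
df['col'] = np.where(df['station from'] < df['station to'], 'count', 'count_back')
df[cols] = np.sort(df[cols], axis=1)

# pivot_table sums duplicated rows; fill_value=0 covers missing directions.
out = (df.pivot_table(index=cols, columns='col', values='count',
                      aggfunc='sum', fill_value=0)
         .reset_index())
print(out)
```

Whether a missing direction should show 0 or stay empty is a design choice; dropping fill_value leaves NaN there instead.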

Solution 3:

Here is a simple solution that also handles trips with no return trip.

import pandas as pd
import numpy as np
df = pd.DataFrame({"station from":[20001,20040,20007,20080, 2, 3],
                   "station to":[20040,20001,20080,20007, 1, 4],
                   "count":[55,67,100,50, 20, 40]})
df

df = df.set_index(["station from", "station to"])
df["count_back"] = df.apply(lambda row: df["count"].get((row.name[::-1])), axis=1)
mask_rows_to_delete = df.apply(lambda row: row.name[0] > row.name[1] and row.name[::-1] in df.index, axis=1)
df = df[~mask_rows_to_delete].reset_index()
df
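The row-wise apply calls above can get slow on tens of thousands of rows. As a rough vectorized sketch of the same idea (an alternative using a self-merge, not the answer's own method), the reverse counts can be attached with a left merge. It assumes, like the code above, that each directed pair appears at most once; the names `back`, `merged` and `is_duplicate` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"station from": [20001, 20040, 20007, 20080, 2, 3],
                   "station to":   [20040, 20001, 20080, 20007, 1, 4],
                   "count":        [55, 67, 100, 50, 20, 40]})

# Reverse view of the same table: swap the station columns, rename the count.
back = df.rename(columns={"station from": "station to",
                          "station to": "station from",
                          "count": "count_back"})

# A left merge attaches the reverse count where one exists (NaN otherwise).
merged = df.merge(back, on=["station from", "station to"], how="left")

# Drop the second row of each round trip: the one with the larger
# 'station from' code whose reverse trip is present.
is_duplicate = ((merged["station from"] > merged["station to"])
                & merged["count_back"].notna())
result = merged[~is_duplicate].reset_index(drop=True)
print(result)
```

One-way trips such as 2 -> 1 and 3 -> 4 are kept in their original direction with a missing count_back, matching the behaviour of the apply-based version.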
