
Split Csv File Thousands Of Times Based On Groupby

(An adaptation of David Erickson's question here.) Given a CSV file with columns a, b, and c and some values:

echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk
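For a self-contained test, one way to fabricate comparable data in Python (a sketch: numpy is assumed, and the 100,000-row size and 16-bit value range are arbitrary choices mimicking od -d output):

import numpy as np
import pandas as pd

# Random unsigned 16-bit values, like the decimal words od -d emits.
rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 65536, size=(100_000, 3)), columns=list("abc"))
df.to_csv("file.csv", index=False)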

Solution 1:

In Python:

import pandas as pd

df = pd.read_csv("file.csv")
# Write one CSV per unique (a, b) pair, each with its own header row.
for (a, b), gb in df.groupby(['a', 'b']):
    gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)
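If file.csv is too large to hold in memory, the same idea works chunk by chunk (a sketch, assuming pandas; the chunk size is an arbitrary choice): append to each output and write the header only when a file is first created.

import os
import pandas as pd

for chunk in pd.read_csv("file.csv", chunksize=1_000_000):
    for (a, b), gb in chunk.groupby(['a', 'b']):
        out = f"{a}_Invoice_{b}.csv"
        # Header only on first creation; later chunks append data rows.
        gb.to_csv(out, mode='a', header=not os.path.exists(out), index=False)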

In awk you can split like so (the header line is skipped, so you will need to put it back on each resulting file):

awk -F',' 'NR>1 { out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv

And with the header line added back to each file:

awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv

The benefit of this last example is that file.csv doesn't need to be sorted and is processed in a single pass; because close(out) runs after every write, it also never holds more than one output file open at a time.
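For comparison, the same unsorted single-pass logic in plain Python with the csv module (a sketch; unlike the awk version it keeps every output file open, so with thousands of distinct keys it can hit the OS open-file limit that close(out) sidesteps):

import csv

with open("file.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    writers = {}  # output filename -> (file handle, csv.writer)
    try:
        for row in reader:
            out = f"{row[0]}_Invoice_{row[1]}.csv"
            if out not in writers:
                fh = open(out, "w", newline="")
                w = csv.writer(fh)
                w.writerow(header)  # header once per new file
                writers[out] = (fh, w)
            writers[out][1].writerow(row)
    finally:
        for fh, _ in writers.values():
            fh.close()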


Solution 2:

If you first sort the input on the key fields (keeping the header line out of the sort), all you need is:

{ head -n 1 file.csv; tail -n +2 file.csv | sort -t',' -k1,1n -k2,2n; } |
awk -F ',' '
NR==1 { hdr=$0; next }            # remember the header line
{ out = $1 "_Invoice_" $2 ".csv" }
out != prev {                     # key changed: start the next output file
    close(prev)
    print hdr > out
    prev = out
}
{ print > out }
'
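The same sorted-input idea translates to Python with itertools.groupby (a sketch; the in-memory sort stands in for the external sort command, and because equal keys are adjacent each output file is opened exactly once):

import csv
from itertools import groupby
from operator import itemgetter

with open("file.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    # Lexical sort is enough here: it only needs to make equal keys adjacent.
    rows = sorted(reader, key=itemgetter(0, 1))

for (a, b), grp in groupby(rows, key=itemgetter(0, 1)):
    with open(f"{a}_Invoice_{b}.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(grp)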
