Convert Multiple Array Of Structs Columns In Pyspark Sql
I have pyspark dataframe with multiple columns (Around 30) of nested structs, that I want to write into csv. (struct In order to do it, I want to stringify all of the struct colum
Solution 1:
As pault has already mentioned in the comments, you need a list comprehension. Such a list comprehension requires a list of columns and a functions which converts this columns to strings. I will use df.columns
and to_json but you can also provide your own python list of column names and a custom function to stringfy your complex columns.
#this converts all columns to json strings
#and writes it as to disk
s_df.select([F.to_json(x) for x in s_df.columns]).coalesce(1).write.csv('/tmp/testcsv')
In case you don't want to apply to_json to all columns, you can simply modify it like that:
list4tojson = ['2_complex_key', '3_complex_key']
s_df.select('1_complex_key', *[F.to_json(x) for x in list4tojson]).coalesce(1).write.csv('/tmp/testcsv')
You can restore dataframe with from_json:
df = spark.read.csv('/tmp/testcsv')
df.printSchema()
#root
# |-- _c0: string (nullable = true)
# |-- _c1: string (nullable = true)
# |-- _c2: string (nullable = true)
#interfering the schema
json_schema = spark.read.json(df.rdd.map(lambda row: row._c0)).schema
df.select([F.from_json(x, json_schema) for x in df.columns] ).printSchema()
#root
# |-- jsontostructs(_c0): struct (nullable = true)
# | |-- s: struct (nullable = true)
# | | |-- n1: boolean (nullable = true)
# | | |-- n2: boolean (nullable = true)
# | | |-- n3: boolean (nullable = true)
# | |-- type: string (nullable = true)
# | |-- x: long (nullable = true)
# | |-- y: long (nullable = true)
# |-- jsontostructs(_c1): struct (nullable = true)
# | |-- s: struct (nullable = true)
# | | |-- n1: boolean (nullable = true)
# | | |-- n2: boolean (nullable = true)
# | | |-- n3: boolean (nullable = true)
# | |-- type: string (nullable = true)
# | |-- x: long (nullable = true)
# | |-- y: long (nullable = true)
# |-- jsontostructs(_c2): struct (nullable = true)
# | |-- s: struct (nullable = true)
# | | |-- n1: boolean (nullable = true)
# | | |-- n2: boolean (nullable = true)
# | | |-- n3: boolean (nullable = true)
# | |-- type: string (nullable = true)
# | |-- x: long (nullable = true)
# | |-- y: long (nullable = true)
In case you just want to store your data in a readable format, you can avoid all of the above code by writing it to json directly:
s_df.coalesce(1).write.json('/tmp/testjson')
df = spark.read.json('/tmp/testjson')
Post a Comment for "Convert Multiple Array Of Structs Columns In Pyspark Sql"