Pyspark Create Dictionary From Data In Two Columns

September 30, 2023 Post a Comment

I have a pyspark dataframe with two columns: [Row(zip_code='58542', dma='MIN'), Row(zip_code='58701', dma='MIN'), Row(zip_code='57632', dma='MIN'), Row(zip_code='58734', dma='MI

Solution 1:

You can avoid using a udf here using pyspark.sql.functions.struct and pyspark.sql.functions.to_json (Spark version 2.1 and above):

import pyspark.sql.functions as f
from pyspark.sql import Row

data = [
    Row(zip_code='58542', dma='MIN'),
    Row(zip_code='58701', dma='MIN'),
    Row(zip_code='57632', dma='MIN'),
    Row(zip_code='58734', dma='MIN')
]

df = spark.createDataFrame(data)

df.withColumn("json", f.to_json(f.struct("dma", "zip_code"))).show(truncate=False)
#+---+--------+--------------------------------+#|dma|zip_code|json                            |#+---+--------+--------------------------------+#|MIN|58542   |{"dma":"MIN","zip_code":"58542"}|#|MIN|58701   |{"dma":"MIN","zip_code":"58701"}|#|MIN|57632   |{"dma":"MIN","zip_code":"57632"}|#|MIN|58734   |{"dma":"MIN","zip_code":"58734"}|#+---+--------+--------------------------------+

If you instead wanted the zip_code to be the key, you can create a MapType directly using pyspark.sql.functions.create_map:

df.withColumn("json", f.create_map(["zip_code", "dma"])).show(truncate=False)
#+---+--------+-----------------+#|dma|zip_code|json             |#+---+--------+-----------------+#|MIN|58542   |Map(58542 -> MIN)|#|MIN|58701   |Map(58701 -> MIN)|#|MIN|57632   |Map(57632 -> MIN)|#|MIN|58734   |Map(58734 -> MIN)|#+---+--------+-----------------+

Solution 2:

I am not sure how performant is this compared to other solutions, but you can simply do this:

dict = {row['zipcode']:row['dma'] for row in df.collect()}
print(dict)
#{'58542': 'MIN', '58701': 'MIN', '57632': 'MIN', '58734': 'MIN'}

Solution 3:

There is one more way to convert your dataframe into dict. for that you need to convert your dataframe into key-value pair rdd as it will be applicable only to key-value pair rdd. since dictionary itself a combination of key value pairs.

data = [
    Row(zip_code='58542', dma='MIN'),
    Row(zip_code='58701', dma='MIN'),
    Row(zip_code='57632', dma='MIN'),
    Row(zip_code='58734', dma='MIN')
]

>>> data.show();
+---+--------+|dma|zip_code|+---+--------+|MIN|58542||MIN|58701||MIN|57632||MIN|58734|+---+--------+

converting your dataframe into rdd.

newrdd = data.rdd

since you want zip_code as your key and dma as value, so have selected rdd element '1' as key and element '0' as value.

keypair_rdd = newrdd.map(lambda x : (x[1],x[0]))

once you have key-pair rdd then simply use collectAsMap to convert it into a dictonary

>>> dict = keypair_rdd.collectAsMap()
>>> printdict
{u'58542': u'MIN', u'57632': u'MIN', u'58734': u'MIN', u'58701': u'MIN'}

>>> dict.keys()
[u'58542', u'57632', u'58734', u'58701']

looks value for specific key:

>>> dict.get('58542')
u'MIN'

Solution 4:

yes you can using

pyspark.sql.types.MapType(keyType, valueType, valueContainsNull=True)

please share the more info like dataframe sample output and the way you want as an output that will help in writing a code snippet for the same.

Solution 5:

As Ankin says, you can use a MapType for this:

import pyspark
from pyspark.sql import Row

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

data = spark.createDataFrame([Row(zip_code='58542', dma='MIN'),
 Row(zip_code='58701', dma='MIN'),
 Row(zip_code='57632', dma='MIN'),
 Row(zip_code='58734', dma='MIN')])

data.show()

Output:

+---+--------+
|dma|zip_code|
+---+--------+
|MIN|   58542|
|MIN|   58701|
|MIN|   57632|
|MIN|   58734|
+---+--------+


from pyspark.sql.functions import udf
from pyspark.sql import types as T

@udf(T.MapType(T.StringType(), T.StringType()))defcreate_struct(zip_code, dma):
    return {zip_code: dma}

data.withColumn('struct', create_struct(data.zip_code, data.dma)).toJSON().collect()