
Pyspark Best Alternative For Using Spark Sql/df Within A Udf?

I'm stuck on a process where I need to perform some action for each column value in my DataFrame, which requires traversing the DataFrame again. The following is a data sample: Row(use

Solution 1:

You are not allowed to use SparkSession/DataFrame objects inside UDFs.

The solution I think will work here is to explode every row by friends, then join on (friend.id == user.id && friend.business_id == user.business_id).

A second solution is possible if the events table fits into memory: collect the events table at the start and broadcast it to all executors. Then you can use that data inside the UDF. This works only because the events table is small enough to fit in memory.
