
Pyspark Best Alternative For Using Spark Sql/df Within A Udf?

I'm stuck on a process where I need to perform some action for each column value in my DataFrame, which requires traversing the DataFrame again. The following is a data sample: Row(use

Solution 1:

You are not allowed to use SparkSession/DataFrame objects inside UDFs.

The solution I think will work here is to explode every row by friends, then join on (friend.id == user.id && friend.business_id == user.business_id).

A second solution is possible if the events table fits into memory: collect the events table at the start and broadcast it to all executors. Then you can use that data inside the UDF. This works only because the events table is small enough to fit in memory.
