Calculate The Minimum Distance To Destinations For Each Origin In Pyspark

I have a list of origins and destinations along with their geo coordinates. I need to calculate the minimum distance for each origin to the destinations. Below is my code: import p

Solution 1:

You are applying the haversine function to a column, whereas it expects plain Python values such as a tuple or an array.

If you want to use this library, you need to create a UDF and install the haversine package on all your Spark nodes.

from haversine import haversine
from pyspark.sql import functions as F, types as T

haversine_udf = F.udf(haversine, T.FloatType())

df.withColumn(
    "Distance", haversine_udf(F.col("Origin_Geo"), F.col("Destination_Geo"))
).groupBy("Origin").agg(F.min("Distance").alias("Min_Distance")).show()
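To make the aggregation step concrete, here is a minimal plain-Python sketch of what groupBy("Origin") followed by F.min("Distance") computes. The rows and distance values are made up for illustration and are not the asker's data, and `min_distance_per_origin` is a hypothetical helper name.

```python
def min_distance_per_origin(rows):
    """Return {origin: smallest distance} for an iterable of (origin, distance) pairs."""
    best = {}
    for origin, distance in rows:
        # Keep the smaller of the stored value and the new one.
        if origin not in best or distance < best[origin]:
            best[origin] = distance
    return best


# Hypothetical (origin, distance in km) pairs, one per origin/destination combination.
rows = [("A", 520.0), ("A", 410.5), ("B", 365.2), ("B", 980.1)]
result = min_distance_per_origin(rows)
print(result)
```

Spark does the same thing in a distributed way: the UDF produces one distance per row, and the aggregation keeps the smallest one per origin.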

If you cannot install the package on every node, you can instead implement the formula yourself in plain Python (cf. Haversine Formula in Python (Bearing and Distance between two GPS points)). Note that the result scales directly with the Earth radius you choose.

from math import radians, cos, sin, asin, sqrt
from pyspark.sql import functions as F, types as T

@F.udf(T.FloatType())
def haversine_udf(point1, point2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # Points are unpacked as (lon, lat); the haversine package expects (lat, lon),
    # so swap the names below if your columns store (lat, lon).
    lon1, lat1 = point1
    lon2, lat2 = point2
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6372.8  # Radius of earth in kilometers. Use 3956 for miles
    return c * r

df.withColumn(
    "Distance", haversine_udf(F.col("Origin_Geo"), F.col("Destination_Geo"))
).groupBy("Origin").agg(F.min("Distance").alias("Min_Distance")).show()

+------+------------+
|Origin|Min_Distance|
+------+------------+
|     B|   351.08905|
|     A|   392.32755|
+------+------------+
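As a rough check of how much the chosen Earth radius matters, the sketch below evaluates the same haversine formula for one pair of points with three radii. It is plain Python, not a UDF. The Lyon and Paris coordinates are illustrative assumptions rather than the asker's data, the helper name `haversine_km` is hypothetical, and the points here are in (lat, lon) order.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(point1, point2, radius_km):
    """Great-circle distance between two (lat, lon) points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, [*point1, *point2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# Illustrative (lat, lon) coordinates, assumed for this example only.
lyon = (45.7597, 4.8422)
paris = (48.8567, 2.3508)

# Earth radii in kilometers: mean radius, the value used in the answer, equatorial radius.
radii_km = {"mean": 6371.0088, "answer_value": 6372.8, "equatorial": 6378.137}
distances = {name: haversine_km(lyon, paris, r) for name, r in radii_km.items()}

for name, d in distances.items():
    print(f"{name:>12}: {d:.3f} km")
```

Because the distance is the radius multiplied by the central angle, changing the radius shifts every distance by the same relative amount. What matters is using the same radius consistently when comparing results.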
