Pandas: Running A Calculation On Rows Of A Data Table Based On Multiple Columns And Storing The Output In A New Column

December 27, 2023 Post a Comment

I am trying to calculate the distance from 2 locations and I've been provided with the longitude and latitude for the both destination. In my CSV, I have 4 columns (lat1, lon1, lat

Solution 1:

You can define a custom function for the distance calculation. Then, use .apply() to call and apply the function on each row to get the distance for each row.

1. Define a custom function for the distance calculation, as follows:

import math
from math import sin, cos, sqrt, atan2, radians

def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(in_lat1)
    lon1 = radians(in_lon1)
    lat2 = radians(in_lat2)
    lon2 = radians(in_lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

2. Use .apply() to call and apply the function on each row to get the distance for each row, as follows:

df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1)

Demo

Input Dataframe

        City  Departure_lat  Departure_lon  Arrival_lat  Arrival_lon
0  CityName1      25.229676      36.012229    51.406374    20.925168

Output

        City  Departure_lat  Departure_lon  Arrival_lat  Arrival_lon    Distance
0  CityName1      25.229676      36.012229    51.406374    20.925168  3181.11039

Solution 2:

You can make your dlon, dlat, a, and c all as some temp columns and then calculate from there (or have it all in one difficult to read line).

Something like:

df['dlon'] = df['Arrival_lon'] - df['Departure_lon']
df['dlat'] = df['Arrival_lat'] - df['Departure_lat']

df['a'] = sin(df['dlat'] / 2)**2 + cos(df['Departure_lat']) * cos(df['Arrival_lat']) * sin(df['dlon'] / 2)**2
df['c'] = 2 * atan2(sqrt(df['a']), sqrt(1 - df['a']))

df['distance'] = R * df['c']

Then you can .drop() all those extra columns if you want, but that should create df['distance'] as a new column calculated for each row.

I wouldn't be surprised if I have a typo in that code, but hopefully you get the idea. Each df[xxx] = line makes that new column.

Solution 3:

You don't provide the data, so I made up my own based on your question; just use the numpy version of those functions on your columns.

import pandas as pd
import numpy as np

row = pd.Series({
    "lat1": 25.2296756,
    "lon1": 36.0122287,
    "lat2": 51.406374,
    "lon2": 20.9251681
})
df = pd.concat([row]*5, axis=1).T.apply(np.radians)

df["dlon"] = df.lon2 - df.lon1
df["dlat"] = df.lat2 - df.lat1

R = 6373
a = np.sin(df.dlat / 2)**2 + np.cos(df.lat1) * np.cos(df.lat2) * np.sin(df.dlon / 2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
df["distance"] = R*c

The resulting dataframe looks like this:

lat1lon1lat2lon2dlondlatdistance00.4403410.6285320.897210.365213-0.2633190.456873181.1103910.4403410.6285320.897210.365213-0.2633190.456873181.1103920.4403410.6285320.897210.365213-0.2633190.456873181.1103930.4403410.6285320.897210.365213-0.2633190.456873181.1103940.4403410.6285320.897210.365213-0.2633190.456873181.11039

Solution 4:

You can put your calculation code in a function:

def calculate_distance(lat1,lon1,lat2,lon2):
  # approximate radius of earth in km
  R = 6373.0

  lat1 = radians(lat1)
  lon1 = radians(lon1)
  lat2 = radians(lat2)
  lon2 = radians(lon2)

  dlon = lon2 - lon1
  dlat = lat2 - lat1

  a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
  c = 2 * atan2(sqrt(a), sqrt(1 - a))

  distance = R * c

  return distance

And then apply it to each row with List Comprehension:

df['distance'] = [calculate_distance(row.lat1, row.lon1, row.lat2, row.lon2) for row in df.itertuples() ]

Solution 5:

Depending on how your data csv file is formatted something similar to the following could be used.

In essence you need to turn your calculation into a callable function and then call it on every row in your data file which can be imported into python using the csv library.

import math
import csv # Added import for importing csv into python.from math import sin, cos, sqrt, atan2, radians

# Import the data from the csv file.withopen('data.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))
    
# Approximate radius of earth in km.
R = 6373.0# Create a distance calculation function.defcalculate_distance(lat1_d, lon1_d, lat2_d, lon2_d):
    
    # Convert from degrees to radians.
    lat1 = radians(lat1_d)
    lon1 = radians(lon1_d)
    lat2 = radians(lat2_d)
    lon2 = radians(lon2_d)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    
    distance = R * c
    return distance
    
# Use list comprehension to run function on every data row.
distances = [calculate_distance(row[0],row[1],row[2],row[3]) for row in data]

# Append distance column to original array to create output.
output = [row + [distance[index]] for index,row inenumerate(data)]

Please be aware that the row[0],row[1],row[2],row[3] refers to the order of the columns in the data array/csv file. These may need to be reordered as need be, to coincide with the function's declared input order, i.e: lat1_d, lon1_d, lat2_d, lon2_d.

# Import the data from the csv file.withopen('data.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

These import parameters will also need to be tweaked to account for the format and name of the csv file.

alezinhacris