Skip to content Skip to sidebar Skip to footer

Combine Two Dataframes Based On Ranges Which May Partially Overlap Using Pandas And Track Multiple Values

I have two big dataframes (100K rows), One has 'values', one has 'types'. I want to assign a 'type' from df2 to a column in df1 based on depth. The depths are given as depth 'From

Solution 1:

From df2 create an auxiliary Series, marking each "starting point" of a unit (a range of length 1):

units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)

The result is:

Type
A0.0A1.0A2.0A3.0B4.0
C    5.0
D    6.0
D    7.0
E    8.0
E    9.0
dtype: float64

Then define a function generating Type for the current row:

defgetType(row):
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit'if siz == 1else'units'
        txt.append(f'{siz}{un}{key}')
    return','.join(txt)

And to generate Type column, apply it to each row:

df1['Type'] = df1.apply(getType, axis=1)

The result is:

FromTo    val                Type
01.03.00.001A13.05.00.0051 unit A,1 unit B25.07.00.0021 unit C,1 unit D
37.010.00.0011 unit D,2 units E

This result is a bit different from your expected result, but I think you created it in a bit inconsequent way.

I think that my solution is correct (at least more consequent), because:

  1. Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the result is just A (like in your post).
  2. Row 3.0 - 5.0 can be "divided" into:
    • 3.0 - 4.0 is within the limits of 0 4 A (1 unit),
    • 4.0 - 5.0 is within the limits of 4 5 B (also 1 unit, but you want 2 units here).
  3. Row 5.0 - 7.0 can be again "divided" into:
    • 5.0 - 6.0 is within the limits of 5 6 C (1 unit),
    • 6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
  4. Row 7.0 - 10.0 can be "divided" into:
    • 7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
    • 8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).

Post a Comment for "Combine Two Dataframes Based On Ranges Which May Partially Overlap Using Pandas And Track Multiple Values"