Pandas How To Find Continuous Values In A Series Whose Differences Are Within A Certain Distance

November 29, 2024

I have a pandas Series that is composed of ints

a = np.array([1,2,3,5,7,10,13,16,20])
pd.Series(a)

0     1
1     2
2     3
3     5
4     7
5    10
6    13
7    16
8    20

now I want to cluster the seri

Solution 1:

Here's one approach -

np.split(a,np.flatnonzero(np.diff(a)>d)+1)
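
To make the one-liner easier to follow, here is a small walkthrough of its three steps on the sample array. It is only an illustrative sketch: the intermediate names gaps and cut_points are made up for this example, and d = 2 is an assumed threshold.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2  # assumed threshold: largest gap still allowed inside one group

# Step 1: difference between each value and the one before it
gaps = np.diff(a)

# Step 2: positions where the gap is too large; +1 turns "index of the gap"
# into "index of the first element of the next group"
cut_points = np.flatnonzero(gaps > d) + 1

# Step 3: cut the array at those positions
groups = np.split(a, cut_points)

print(gaps.tolist())                 # [1, 1, 2, 2, 3, 3, 3, 4]
print(cut_points.tolist())           # [5, 6, 7, 8]
print([g.tolist() for g in groups])  # [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```

Note that the comparison is strict (> d), so a gap exactly equal to d keeps both values in the same group.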

As a function that returns a list of lists -

def splitme(a, d):
    # Split after every position where the gap to the next value exceeds d
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))

For performance, I would suggest using zip to get the start, stop indices and then slicing, thus avoiding np.split which might prove to be the bottleneck -

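def splitme_zip(a, d):
    # True marks the start of a group; the trailing True closes the last group
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    # Slice a plain Python list, which is cheaper than slicing the array
    l = a.tolist()
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]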

If you need the output as a list of arrays, skip the list conversions: drop map(list, ...) in splitme, and slice a directly instead of a.tolist() in splitme_zip.
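
A possible array-returning variant of the zip approach could look like the sketch below. The function name splitme_zip_arrays is hypothetical (it does not appear in the original answer); the mask logic is the same as in splitme_zip, only the slicing target changes.

```python
import numpy as np

def splitme_zip_arrays(a, d):
    # Hypothetical variant of splitme_zip that returns NumPy arrays
    a = np.asarray(a)
    # True at the start of every group, plus a trailing True as the final stop
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    # Consecutive pairs of indices are the (start, stop) bounds of each group;
    # basic slicing returns views of a, not copies
    return [a[i:j] for i, j in zip(idx[:-1], idx[1:])]

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
parts = splitme_zip_arrays(a, 3)
print([p.tolist() for p in parts])  # [[1, 2, 3, 5, 7, 10, 13, 16], [20]]
```

Because the pieces are views, writing into one of them also changes a; call .copy() on each piece if that is not wanted.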

Sample runs -

In [122]: a = np.array([1,2,3,5,7,10,13,16,20])

In [123]: splitme(a,1)
Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]

In [124]: splitme(a,2)
Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]

In [125]: splitme(a,3)
Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]

Runtime test -

In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))

In [181]: s = pd.Series(a)

In [182]: d = 3

In [183]: %timeit pandas_way(s,d) # @cᴏʟᴅsᴘᴇᴇᴅ's soln
10 loops, best of 3: 55.1 ms per loop

In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
     ...: %timeit splitme(a,d)
     ...: %timeit splitme_zip(a,d)
1000 loops, best of 3: 1.47 ms per loop
100 loops, best of 3: 2.87 ms per loop
1000 loops, best of 3: 516 µs per loop

In [185]: a
Out[185]: array([    2,     2,     2, ..., 19992, 19996, 19999])
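
To repeat a comparison like this outside IPython, a rough sketch with the standard timeit module might look as follows. The seed, the repeat count and the helper name per_call are assumptions made for this example, and the timings will differ from machine to machine, so none are shown here.

```python
import timeit
import numpy as np

def splitme(a, d):
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))

def splitme_zip(a, d):
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    l = a.tolist()
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]

# Sorted random ints, same shape as the test above (fixed seed is an assumption)
rng = np.random.default_rng(0)
a = np.sort(rng.integers(1, 10000 * 2, 10000))
d = 3

# Both versions should agree before their speed is compared
same = splitme(a, d) == splitme_zip(a, d)
print("same result:", same)

def per_call(fn, number=20):
    # Average seconds per call over `number` runs
    return timeit.timeit(lambda: fn(a, d), number=number) / number

for fn in (splitme, splitme_zip):
    print(fn.__name__, per_call(fn))
```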

Solution 2:

This is the pandas way, using groupby.

n = 1

s

0     1
1     2
2     3
3     5
4     7
5    10
6    13
7    16
8    20
dtype: int64

m = ~s.diff().fillna(0).le(n)   
v = s.groupby(m.cumsum()).apply(lambda x: x.tolist()).tolist()

v
[[1, 2, 3], [5], [7], [10], [13], [16], [20]]
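
The two lines above can be wrapped in a function and inspected step by step, as in the sketch below. The runtime test in Solution 1 calls a pandas_way(s, d) whose definition is not shown, so treating it as a wrapper around these two lines is an assumption.

```python
import pandas as pd

def pandas_way(s, n):
    # Assumed wrapper around the two lines of the groupby solution
    # True wherever the step from the previous value is larger than n
    m = ~s.diff().fillna(0).le(n)
    # cumsum turns each True into a new group id
    return s.groupby(m.cumsum()).apply(lambda x: x.tolist()).tolist()

s = pd.Series([1, 2, 3, 5, 7, 10, 13, 16, 20])

# Group ids produced by the mask for n = 1
m = ~s.diff().fillna(0).le(1)
group_ids = m.cumsum().tolist()
print(group_ids)         # [0, 0, 0, 1, 2, 3, 4, 5, 6]

print(pandas_way(s, 1))  # [[1, 2, 3], [5], [7], [10], [13], [16], [20]]
```

The fillna(0) matters for the first element: its diff is NaN, and filling it with 0 keeps it from opening a group of its own.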

alezinhacris
