Count Rows By Certain Combination Of Row Values Pandas
I have a dataframe (df) like this:
v1 v2 v3
0 -30 -15
0 -30 -7.5
0 -30 -11.25
0 -30 -13.125
0 -30 -14.0625
0 -30 -13.59375
0 -10 -5
Solution 1:
I assume there are no NaNs in v1 and v2, so use transform + size:
df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
print (df)
v1 v2 v3
0 0 -30 -15.00000
1 0 -30 -7.50000
2 0 -30 -11.25000
3 0 -30 -13.12500
4 0 -30 -14.06250
5 0 -30 -13.59375
6 0 -10 -5.00000
7 0 -10 -7.50000
8 0 -10 -6.25000
9 0 -10 -5.62500
10 0 -10 -5.93750
11 0 -10 -6.09375
Detail:
print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 False
13 False
14 False
Name: v2, dtype: bool
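To make the mask step concrete, here is a minimal sketch on a small made-up frame. The data values and the helper name keep_groups_of_size are illustrative assumptions, not the asker's data or part of pandas. It shows how transform('size') broadcasts each group's row count back onto every row so the result can be used directly as a boolean mask.

```python
import pandas as pd

# Hypothetical data: group (0, -30) has 3 rows, group (0, -10) has 2 rows.
df = pd.DataFrame({
    "v1": [0, 0, 0, 0, 0],
    "v2": [-30, -30, -30, -10, -10],
    "v3": [-15.0, -7.5, -11.25, -5.0, -7.5],
})


def keep_groups_of_size(frame, keys, n):
    """Return the rows whose combination of `keys` occurs exactly n times."""
    # transform('size') returns one value per row (the size of that row's
    # group), aligned with the frame's index, so it works as a row mask.
    sizes = frame.groupby(keys)[keys[-1]].transform("size")
    return frame[sizes == n]


result = keep_groups_of_size(df, ["v1", "v2"], 3)
print(result)
```

With this toy frame only the three rows of the (0, -30) group should remain; the two rows of the (0, -10) group are dropped because their group size is 2.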
Unfortunately, filter is really slow, so if you need better performance, use transform:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list('abcdefghijkl')
df = pd.DataFrame({'v1': np.random.choice(L, N),
                   'v2': np.random.randint(10000, size=N),
                   'value': np.random.randint(1000, size=N),
                   'value2': np.random.randint(5000, size=N)})
df = df.sort_values(['v1','v2']).reset_index(drop=True)
print (df.head(10))
In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
1 loop, best of 3: 12.1 s per loop
In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
1 loop, best of 3: 176 ms per loop
In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
10 loops, best of 3: 175 ms per loop
N = 1000000
ngroups = 1000
df = pd.DataFrame(dict(A = np.random.randint(0,ngroups,size=N),B=np.random.randn(N)))
In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loop, best of 3: 330 ms per loop
In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
10 loops, best of 3: 101 ms per loop
Caveat
These timings do not show how performance changes with the number of groups, which strongly affects the timings of some of these solutions.
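One way to explore the effect of the number of groups is a small timing harness like the sketch below. It is only an illustration under stated assumptions: the helper names by_filter and by_transform, the reduced N, and the threshold choice are all made up here, and the numbers it prints will depend on the machine and the pandas version.

```python
import time

import numpy as np
import pandas as pd


def by_filter(frame, n):
    # filter calls the Python lambda once per group.
    return frame.groupby("A").filter(lambda g: len(g) > n)


def by_transform(frame, n):
    # transform('size') computes every group's size in one pass.
    return frame[frame.groupby("A")["A"].transform("size") > n]


rng = np.random.default_rng(123)
N = 20_000  # kept small so the sketch runs quickly; raise it for real measurements

for ngroups in (10, 1_000):
    df = pd.DataFrame({"A": rng.integers(0, ngroups, size=N),
                       "B": rng.standard_normal(N)})
    threshold = N // ngroups  # roughly the mean group size
    timings = {}
    for name, func in (("filter", by_filter), ("transform", by_transform)):
        start = time.perf_counter()
        func(df, threshold)
        timings[name] = time.perf_counter() - start
    # Both approaches should select exactly the same rows.
    same = by_filter(df, threshold).equals(by_transform(df, threshold))
    print(ngroups, same, {k: round(v, 4) for k, v in timings.items()})
```

For measurements you intend to rely on, %timeit (as used above) or the timeit module is a better choice than a single perf_counter reading, since it repeats each call.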
Solution 2:
You can use the groupby filter method:
In [11]: df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
Out[11]:
v1 v2 v3
0 0 -30 -15.00000
1 0 -30 -7.50000
2 0 -30 -11.25000
3 0 -30 -13.12500
4 0 -30 -14.06250
5 0 -30 -13.59375
6 0 -10 -5.00000
7 0 -10 -7.50000
8 0 -10 -6.25000
9 0 -10 -5.62500
10 0 -10 -5.93750
11 0 -10 -6.09375
Solution 3:
Use groupby + count/size:
df.groupby(['v1', 'v2']).v3.transform('count')
0 6.0
1 6.0
2 6.0
3 6.0
4 6.0
5 6.0
6 6.0
7 6.0
8 6.0
9 6.0
10 6.0
11 6.0
12 3.0
13 3.0
14 3.0
Name: v3, dtype: float64
Use the mask to filter df:
df = df[df.groupby(['v1', 'v2']).v3.transform('count').eq(6)] # == 6
df
v1 v2 v3
0 0 -30 -15.00000
1 0 -30 -7.50000
2 0 -30 -11.25000
3 0 -30 -13.12500
4 0 -30 -14.06250
5 0 -30 -13.59375
6 0 -10 -5.00000
7 0 -10 -7.50000
8 0 -10 -6.25000
9 0 -10 -5.62500
10 0 -10 -5.93750
11 0 -10 -6.09375
count does not count NaNs, while size does. Use whichever is appropriate for you.
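The difference only shows up when the counted column contains missing values. A minimal sketch, using a made-up three-row frame (the data is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a single (v1, v2) group of three rows,
# one of which has NaN in v3.
df = pd.DataFrame({
    "v1": [0, 0, 0],
    "v2": [-30, -30, -30],
    "v3": [-15.0, np.nan, -11.25],
})

grouped = df.groupby(["v1", "v2"])["v3"]
counts = grouped.transform("count")  # non-NaN v3 values per group
sizes = grouped.transform("size")    # all rows per group, NaN included

print(counts.tolist())
print(sizes.tolist())
```

Here count reports 2 for every row of the group while size reports 3, so a mask such as .eq(3) would keep the group with size but drop it with count.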