How To Count The Number Of Elements In A Set Of Rows Selected Based On A Condition?
I have a large DataFrame with many duplicate values. The unique values are stored in List1. I'd like to do the following: Select a few rows that contain each of the values present
Solution 1:
Suppose you have:
df
  EQ1  EQ2  EQ3
0   A  NaN  NaN
1   X    Y  NaN
2   A    X    C
3   D    E    F
4   G    H    B
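To reproduce the example, a frame like the one above can be built as follows. This is only a minimal sketch: the answer does not show how df was created, so building it from a dict of columns is an assumption.

```python
import numpy as np
import pandas as pd

# Assumed construction of the example frame; np.nan marks the missing cells.
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

# Number of non-NaN values in each row; this drives the selection below.
non_nan_per_row = df.count(axis=1)
print(non_nan_per_row.tolist())
```

Only row 0 has fewer than two non-NaN values, so it is the only row whose values get pruned.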
Then, you may proceed as follows:
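dft = df.T  # transpose so that each original row becomes a column
output_set = set()  # values seen in rows with at least 2 non-NaN entries
prune_set = set()   # values seen in rows with fewer than 2 non-NaN entries
for column in dft:
    arr = dft[column].dropna().values
    if len(arr) >= 2:
        output_set |= set(arr)
    else:
        prune_set |= set(arr)
sorted(output_set - prune_set)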
['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
Solution 2:
The approach below identifies the set of unique values that occur in rows with at least 2 non-NaN values, then eliminates those that also occur in rows with fewer than 2 non-NaN values. It avoids an explicit loop over the rows.
First, get the set of unique values in the part of df that does not meet the missing-values restriction (.strip() is added to address a data issue mentioned in the comments):
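na_threshold = 1  # rows with at most this many non-NaN values do not qualify
not_enough_non_nan = df[df.count(axis=1) <= na_threshold].values.flatten().astype(str)
# after astype(str), missing cells show up as the string 'nan'; drop them
not_enough_non_nan = {str(s).strip() for s in not_enough_non_nan if s != 'nan'}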
{'A'}
Next, identify the set of values that do meet your restriction:
enough_non_nan = df[df.count(axis=1) > na_threshold].values.flatten().astype(str)
enough_non_nan = {str(s).strip() for s in enough_non_nan if s != 'nan'}
{'H', 'C', 'E', 'B', 'D', 'X', 'F', 'A', 'Y', 'G'}
Finally, take the set difference between the two to eliminate values that do not always meet the restriction:
result = sorted(enough_non_nan - not_enough_non_nan)
['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
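The steps of Solution 2 can also be wrapped into one reusable function. The following is a sketch under stated assumptions, not the answerer's code: the function name values_only_in_dense_rows and the parameter min_non_nan are hypothetical, and pd.notna is swapped in for the 'nan' string comparison so that a genuine string value "nan" would not be dropped. The final len() call gives the count asked for in the title.

```python
import numpy as np
import pandas as pd


def values_only_in_dense_rows(df, min_non_nan=2):
    """Sorted values that occur only in rows with >= min_non_nan non-NaN cells.

    Hypothetical helper; name and parameter are illustrative.
    """
    dense = df.count(axis=1) >= min_non_nan  # boolean mask, one entry per row

    def unique_values(frame):
        # Flatten the selected rows and keep the stripped non-missing values.
        flat = frame.to_numpy().ravel()
        return {str(v).strip() for v in flat if pd.notna(v)}

    # Values from qualifying rows, minus any that also occur in sparse rows.
    return sorted(unique_values(df[dense]) - unique_values(df[~dense]))


# Same example frame as above (assumed construction).
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

result = values_only_in_dense_rows(df)
print(result)       # ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
print(len(result))  # 9
```

Using a boolean mask and its negation guarantees that every row lands in exactly one of the two groups, whatever threshold is chosen.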