
How To Count The Number Of Elements In A Set Of Rows Selected Based On A Condition?

I have a large DataFrame with many duplicate values. The unique values are stored in List1. I'd like to do the following: Select a few rows that contain each of the values present

Solution 1:

Suppose you have:

df
    EQ1 EQ2 EQ3
0   A   NaN NaN
1   X   Y   NaN
2   A   X   C
3   D   E   F
4   G   H   B
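To reproduce this, the frame above could be built roughly like this. It's only a sketch: the column values are read off the printed table, and `row_counts` is a name I made up to show how many non-NaN entries each row has.

```python
import numpy as np
import pandas as pd

# Sample frame as printed above; np.nan marks a missing value.
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

# Number of non-NaN entries in each row (hypothetical helper variable).
row_counts = df.count(axis=1)
print(row_counts.tolist())  # [1, 2, 3, 3, 3]
```

Only row 0 has fewer than two non-NaN entries, so 'A' is the only value that ends up pruned.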

Then, you may proceed as follows:

dft = df.T
output_set = set()
prune_set = set()
for column in dft:
    arr = dft[column].dropna().values
    if len(arr) >= 2:
        output_set |= set(arr)
    else:
        prune_set |= set(arr)
sorted(output_set - prune_set)

['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
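If you'd rather not transpose, the same idea can be wrapped in a small function that walks the rows with `iterrows()`. This is a variation on the answer, not the answer's own code; the function name `values_only_in_dense_rows` and the `min_count` parameter are my own assumptions.

```python
import numpy as np
import pandas as pd


def values_only_in_dense_rows(df, min_count=2):
    """Values seen in rows with >= min_count non-NaN entries,
    minus any value that also appears in a sparser row."""
    keep, prune = set(), set()
    for _, row in df.iterrows():
        present = row.dropna()  # non-NaN entries of this row
        if len(present) >= min_count:
            keep |= set(present)
        else:
            prune |= set(present)
    return sorted(keep - prune)


# Sample frame from the answer.
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

result = values_only_in_dense_rows(df)
print(result)  # ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
```

Raising `min_count` tightens the condition; with `min_count=3`, row 1 also becomes a "sparse" row, so 'X' and 'Y' would be pruned as well.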

Solution 2:

The approach below identifies the set of unique values that occur in rows with at least 2 non-NaN values, then eliminates those that also occur in rows with fewer than 2 non-NaN values. It avoids explicit loops over the rows.

First, get the set of unique values in the part of df that does not meet the missing-values restriction (adding .strip() to address a data issue mentioned in the comments):

na_threshold = 1
# rows with at most na_threshold non-NaN values
not_enough_non_nan = df[df.count(axis=1) <= na_threshold].values.flatten().astype(str)
not_enough_non_nan = set([str(l).strip() for l in not_enough_non_nan if l != 'nan'])

{'A'}

Next, identify the set of values that do meet your restriction:

# rows with more than na_threshold non-NaN values
enough_non_nan = df[df.count(axis=1) > na_threshold].values.flatten().astype(str)
enough_non_nan = set([str(l).strip() for l in enough_non_nan if l != 'nan'])

{'H', 'C', 'E', 'B', 'D', 'X', 'F', 'A', 'Y', 'G'}

Finally, take the set difference between the two to eliminate values that do not always meet the restriction:

result = sorted(enough_non_nan - not_enough_non_nan)

['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
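Putting the three steps together, a self-contained version might look like the sketch below. The helper names `unique_values` and `values_only_in_full_rows` are hypothetical, and it drops NaN with `dropna()` before converting to strings instead of filtering on the text 'nan', which is a small departure from the code above.

```python
import numpy as np
import pandas as pd


def unique_values(frame):
    """Set of stripped, non-NaN values found anywhere in `frame`."""
    flat = pd.Series(frame.to_numpy().ravel(), dtype=object).dropna()
    return {str(v).strip() for v in flat}


def values_only_in_full_rows(df, na_threshold=1):
    """Values from rows with more than `na_threshold` non-NaN entries
    that never appear in a row at or below the threshold."""
    counts = df.count(axis=1)
    enough = unique_values(df[counts > na_threshold])
    not_enough = unique_values(df[counts <= na_threshold])
    return sorted(enough - not_enough)


# Sample frame from Solution 1.
df = pd.DataFrame({
    "EQ1": ["A", "X", "A", "D", "G"],
    "EQ2": [np.nan, "Y", "X", "E", "H"],
    "EQ3": [np.nan, np.nan, "C", "F", "B"],
})

result = values_only_in_full_rows(df)
print(result)  # ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
```

Dropping NaN before the string conversion means a genuine cell containing the text 'nan' would be kept rather than silently discarded.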
