How To Optimize Time While Converting List To Dataframe?(part Ii)
I didn't get any proper answers to my previous question: How to optimize time while converting list to dataframe? Let me explain the example more: Let's consider the data frame mor
Solution 1:
Use:
print (df)
First Name Last Name Country Address Age Age-Group Photo1 Photo2 \
0 Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg
1 Andy Carnot GE Freiburg 16 Teen 1.jpg NaN
Phototype
0 PP
1 PP
First is defined dictionary for keys by first value in final list with all columns strating with strings:
d = {'N':['First Name','Last Name', 'Country'],
'AG':['Age','Age-Group'],
'AD':['Address'],
'PH':['Photo','Phototype']}
Then filter DataFrames by lists from dictionary:
out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}
For PH
is necessary melting for change format:
out['PH'] = (out['PH'].melt('Phototype',
value_name='Photo',
ignore_index=False)
.drop('variable',1)[['Photo','Phototype']]
.dropna(subset=['Photo']))
Last create same columns and join by concat
with sorting for correct ordering:
out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()}
df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('')
print (df)
level_0 0120 N Mark Shelby US
0 AG 42 Adult
0 AD Petersburg
0 PH 1.jpg PP
0 PH 2.jpg PP
1 N Andy Carnot GE
1 AG 16 Teen
1 AD Freiburg
1 PH 1.jpg PP
Last create lists with different lengths by remove empty strings:
fin = [x[x!= ''].tolist() for x in df.to_numpy() ]
print (fin)
[['N', 'Mark', 'Shelby', 'US'],
['AG', 42, 'Adult'],
['AD', 'Petersburg'],
['PH', '1.jpg', 'PP'],
['PH', '2.jpg', 'PP'],
['N', 'Andy', 'Carnot', 'GE'],
['AG', 16, 'Teen'],
['AD', 'Freiburg'],
['PH', '1.jpg', 'PP']]
EDIT: For match Photo
with digits is use regex, so instead startswith
is used contains
with joined values of lists by |
for regex OR
:
d = {'N':['First Name','Last Name', 'Country'],
'AG':['Age','Age-Group'],
'AD':['Address'],
'PH':['Photo\d+','Phototype']}
out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out)
{'N': First Name Last Name Country
0 Mark Shelby US
1 Andy Carnot GE, 'AG': Age Age-Group
042 Adult
116 Teen, 'AD': Address
0 Petersburg
1 Freiburg, 'PH': Photo1 Photo2 Phototype
01.jpg 2.jpg PP
11.jpg NaN PP}
EDIT: Trick is add ^
to start of strings and $
to end of string for exact match values, then is necessary for correct working Photo
+ 'digit':
print (df)
First Name Last Name Country Address Age Age-Group Photo1 Photo2 \
0 Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg
1 Andy Carnot GE Freiburg 16 Teen 1.jpg NaN
Phototype Age Detail Address Detail
0 PP Young Far
1 PP Too Young Near
d = {'N':['First Name','Last Name', 'Country'],
'AG':['Age','Age-Group'],
'AD':['Address'],
'PH':['Photo\d+','Phototype']}
d = {k: [rf'^{x}$'for x in v] for k, v in d.items()}
print (d)
{'N': ['^First Name$', '^Last Name$', '^Country$'],
'AG': ['^Age$', '^Age-Group$'],
'AD': ['^Address$'],
'PH': ['^Photo\\d+$', '^Phototype$']}
out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out['AG'])
Age Age-Group
0 42 Adult
1 16 Teen
print (out['AD'])
Address
0 Petersburg
1 Freiburg
Post a Comment for "How To Optimize Time While Converting List To Dataframe?(part Ii)"