
Python Pandas Dtypes Detection From Sql

I am quite troubled by the behaviour of the pandas DataFrame regarding dtype detection. I use read_sql_query to retrieve data from a database to build a DataFrame, and then dump it into

Solution 1:

Your to_csv operation does not convert all of the specified date fields because, as you mention, not all datetime columns are read in as datetime; some show up as strings (object dtype) in the current DataFrame. This is an unfortunate side effect of reading from external sources: the importing system (whether Python, SAS, Stata, R, Excel, etc.) usually infers column types from the first few rows unless they are explicitly defined.

Fortunately, pandas' read_sql_query() provides a parse_dates parameter. Consider defining the dates during the read operation; this argument takes a list or a dictionary:

df = pd.read_sql_query('select * from DOMMAGE_INTERET where rownum < 100', engine,
                       parse_dates=['doi_date_decision', 'doi_date_mod', 'doi_date_update'])
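parse_dates also accepts a dictionary mapping column names to a strftime format string (or to a dict of pd.to_datetime keyword arguments), which helps when the stored format is unusual. The format string below is an assumption; adjust it to your schema:

df = pd.read_sql_query('select * from DOMMAGE_INTERET where rownum < 100', engine,
                       parse_dates={'doi_date_decision': '%Y-%m-%d %H:%M:%S',
                                    'doi_date_mod': '%Y-%m-%d %H:%M:%S',
                                    'doi_date_update': '%Y-%m-%d %H:%M:%S'})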

Alternatively, convert with pd.to_datetime() just after reading in and before to_csv:

df['doi_date_decision'] = pd.to_datetime(df['doi_date_decision'])
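If some columns also contain values that cannot be parsed, passing errors='coerce' turns them into NaT instead of raising an error; a short sketch for the three columns named in the question:

# convert the remaining date columns; unparseable values become NaT
for col in ['doi_date_decision', 'doi_date_mod', 'doi_date_update']:
    df[col] = pd.to_datetime(df[col], errors='coerce')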

Also, most RDBMSs store datetimes in YYYY-MM-DD HH:MM:SS format, which aligns with the format pandas expects.

Solution 2:

It is difficult to dig into your issue without some data samples. However, you probably face one of two cases:

  • Some of the rows you select in your second case contain NULL values, which stops pandas from automatically interpreting the column as a datetime
  • Your database uses a different day/month convention (e.g. MDY), so dates whose day is lower than the 13th are parsed as dates while the others are kept as strings until you convert them manually to DMY (see the sketch after this list)
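A minimal sketch of both situations with made-up data (the column values are hypothetical):

import pandas as pd

# NULLs come back from the driver as None and keep the column as object dtype
s = pd.Series(['2017-05-02 10:00:00', None, '2017-05-03 11:30:00'])
print(s.dtype)                    # object
print(pd.to_datetime(s).dtype)    # datetime64[ns], the missing row becomes NaT

# ambiguous day/month order: dayfirst=True forces a DMY interpretation
dmy = pd.Series(['02/05/2017', '13/05/2017'])
print(pd.to_datetime(dmy, dayfirst=True))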

Solution 3:

Thanks to Boud and Parfait. Their answers are right:

All my tests show that missing date fields can make dtype detection fail.

read_sql_query() has a parse_dates parameter to define which fields are dates, which should cure this problem.

Sadly, until now I have been using a completely generic treatment to process about a hundred tables. Using read_sql_query's parse_dates parameter would imply prior metadata definition work (like a JSON file describing each table).
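A rough sketch of what such a metadata-driven approach could look like; the file name, its structure, and the helper function are hypothetical:

import json
import pandas as pd

# hypothetical metadata file: {"DOMMAGE_INTERET": ["doi_date_decision", "doi_date_mod", "doi_date_update"], ...}
with open('tables_metadata.json') as f:
    date_columns_by_table = json.load(f)

def read_table(table, engine, limit=100):
    """Read one table, passing its known date columns to parse_dates."""
    query = f'select * from {table} where rownum < {limit}'
    return pd.read_sql_query(query, engine,
                             parse_dates=date_columns_by_table.get(table, []))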

I also found out that integers are converted to floats when there is a NaN in the column...
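If your pandas version supports it, the nullable 'Int64' extension dtype keeps integer columns intact despite missing values; a small sketch of the difference:

import pandas as pd

# a plain integer column with a missing value is silently upcast to float64
s = pd.Series([1, 2, None])
print(s.dtype)        # float64

# the nullable 'Int64' extension dtype keeps integers and shows the gap as <NA>
s_int = pd.Series([1, 2, None], dtype='Int64')
print(s_int.dtype)    # Int64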

If I were reading CSV flat files, I could understand that data types can be hard to detect... but from a database (read_sql_query)! Pandas has SQLAlchemy as a dependency, and SQLAlchemy (and even any lower-level Python database driver such as cx_Oracle or other DB API implementations) has a reflection mechanism to detect data types. Pandas could use that metadata to preserve data type integrity.
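As a rough illustration of that reflection mechanism, here is a sketch that asks SQLAlchemy for the column types of one table and builds the parse_dates list from them (the table name and engine come from the question; SQLAlchemy 1.4+ syntax is assumed):

import pandas as pd
from sqlalchemy import MetaData, Table, Date, DateTime

# reflect the table definition from the database itself
metadata = MetaData()
table = Table('DOMMAGE_INTERET', metadata, autoload_with=engine)

# collect the columns the database declares as dates or datetimes
date_cols = [col.name for col in table.columns
             if isinstance(col.type, (Date, DateTime))]

df = pd.read_sql_query('select * from DOMMAGE_INTERET where rownum < 100', engine,
                       parse_dates=date_cols)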
