Use all options of pandas DataFrame info()
When working with DataFrames, perhaps the most commonly used function to get information about your data is info(). This method has a few parameters that can be useful, particularly when working with large datasets.
# Import libraries
import pandas as pd
%%bigquery df
# Get large sample data
SELECT *
FROM `bigquery-public-data.hacker_news.stories`
Without parameters
Calling info() without any options reports a memory usage of 179.4+ MB. Note that it doesn’t display the number of non-null values for each column, because the DataFrame has more rows than the display.max_info_rows option allows.
# Basic usage without parameters
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
 #   Column       Dtype
---  ------       -----
 0   id           int64
 1   by           object
 2   score        float64
 3   time         float64
 4   time_ts      datetime64[ns, UTC]
 5   title        object
 6   url          object
 7   text         object
 8   deleted      object
 9   dead         object
 10  descendants  float64
 11  author       object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
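The missing counts can be reproduced on a tiny frame by lowering the row limit. The sketch below is only an illustration: the frame `small` and the helper `info_text` are hypothetical names that are not part of the original notebook, and the limit of 2 rows is chosen only to trigger the behaviour.

```python
import io

import pandas as pd

# Hypothetical small frame standing in for the large BigQuery result
small = pd.DataFrame({"id": [1, 2, 3], "by": ["a", None, "c"]})


def info_text(frame, **kwargs):
    """Return the text that frame.info() would print (hypothetical helper)."""
    buf = io.StringIO()
    frame.info(buf=buf, **kwargs)
    return buf.getvalue()


# Row limit above which info() skips the non-null counts by default
print(pd.get_option("display.max_info_rows"))

# With the default limit, the counts are shown for this 3-row frame
shown = info_text(small)

# Temporarily lowering the limit below the frame's length hides them
with pd.option_context("display.max_info_rows", 2):
    hidden = info_text(small)

print("Non-Null Count" in shown, "Non-Null Count" in hidden)
```

Using pd.option_context keeps the change local, so the global display settings are restored once the block exits.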
Real memory usage
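Specifying memory_usage='deep' enables deep memory introspection and shows the real memory usage, which is often much higher than the standard estimate because object columns are otherwise counted as pointers only. The "+" suffix in the default output signals that the figure is a lower bound. In this example, real usage is more than five times higher than the estimate (1007.3 MB versus 179.4+ MB).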
# Full memory usage
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
 #   Column       Dtype
---  ------       -----
 0   id           int64
 1   by           object
 2   score        float64
 3   time         float64
 4   time_ts      datetime64[ns, UTC]
 5   title        object
 6   url          object
 7   text         object
 8   deleted      object
 9   dead         object
 10  descendants  float64
 11  author       object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 1007.3 MB
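To see which columns account for the difference, the per-column figures can be compared with DataFrame.memory_usage(). This is a minimal sketch on a hypothetical frame `demo` (not the Hacker News data); the title column is forced to object dtype on the assumption that it mirrors the object columns in the output above.

```python
import pandas as pd

# Hypothetical frame: one integer column and one column of Python strings
titles = [f"a fairly long story title number {i}" for i in range(1000)]
demo = pd.DataFrame({
    "id": list(range(1000)),
    "title": pd.Series(titles, dtype=object),  # object dtype, as in the output above
})

# Shallow estimate: object columns are counted as one pointer per row
shallow = demo.memory_usage(deep=False)

# Deep estimate: the Python string objects themselves are measured too
deep = demo.memory_usage(deep=True)

print(shallow)
print(deep)
print(f"ratio: {deep.sum() / shallow.sum():.1f}x")
```

Numeric columns report the same size in both modes, so the gap comes entirely from the object columns.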
Force count of non-null values
On large datasets, counting the non-null values in each column is deactivated to avoid a lengthy calculation. To force the display, set show_counts=True (this parameter was named null_counts in pandas versions before 1.2).
# Force non-null value counting
df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   id           1959809 non-null  int64
 1   by           1841269 non-null  object
 2   score        1841269 non-null  float64
 3   time         1934088 non-null  float64
 4   time_ts      1934088 non-null  datetime64[ns, UTC]
 5   title        1841267 non-null  object
 6   url          1839757 non-null  object
 7   text         1425167 non-null  object
 8   deleted      92818 non-null    object
 9   dead         394344 non-null   object
 10  descendants  1741598 non-null  float64
 11  author       1841269 non-null  object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
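info() only prints these counts. If the numbers are needed as data, for example to derive the number of missing values per column, they can be computed directly. A minimal sketch on a hypothetical frame `sample`:

```python
import pandas as pd

# Hypothetical frame with a few missing values
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "by": ["a", None, "c", None],
    "score": [1.0, None, 3.0, 4.0],
})

# Same figures as the "Non-Null Count" column of info()
non_null = sample.count()

# Missing values per column: total rows minus the non-null count
nulls = sample.isna().sum()

print(pd.DataFrame({"non_null": non_null, "null": nulls}))
```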
Compact mode
If you don’t want to display the full summary with details about each column, use the compact mode with verbose=False.
# Non verbose mode
df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Columns: 12 entries, id to author
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
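Two further parameters of info(), buf and max_cols, are not covered in the sections above. The sketch below shows how they might be used: buf redirects the summary into a text buffer, and max_cols makes info() fall back to the compact form when a frame has more columns than the given limit. The frame `wide` and the helper `info_text` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical frame with five columns
wide = pd.DataFrame({f"col_{i}": [0, 1] for i in range(5)})


def info_text(frame, **kwargs):
    """Capture the output of frame.info() as a string (hypothetical helper)."""
    buf = io.StringIO()
    frame.info(buf=buf, **kwargs)
    return buf.getvalue()


full = info_text(wide)                    # one line per column
compact = info_text(wide, verbose=False)  # column range only
truncated = info_text(wide, max_cols=3)   # 5 columns > 3, so compact form

print(compact)
```

Capturing the text this way can be handy for logging the summary instead of printing it to the notebook.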