When working with large datasets in Pandas, understanding the structure, data types, and memory usage of your DataFrame is crucial. The info() method provides a concise summary of these aspects, and it accepts several parameters that can be tailored to your specific needs. In this post, we'll explore these options using a sample dataset from BigQuery's public datasets.
Setup
We'll start by importing the necessary libraries and querying a large dataset from BigQuery:
# Import libraries and load the BigQuery cell magic
import pandas as pd
%load_ext google.cloud.bigquery

%%bigquery df
# Get large sample data
SELECT *
FROM `bigquery-public-data.hacker_news.stories`
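The %%bigquery cell magic only works in a notebook. Outside one, the same DataFrame can be built with the BigQuery client library directly; a minimal sketch, assuming google-cloud-bigquery is installed and your credentials are configured:

# Equivalent query without the notebook magic
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT *
FROM `bigquery-public-data.hacker_news.stories`
"""
df = client.query(query).to_dataframe()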
Basic usage: without parameters
By default, info() provides a summary of the DataFrame, including the number of entries, the columns, their data types, and an estimate of memory usage.
Calling it on our example DataFrame without any options reports a memory usage of 179 MB. Note that it doesn’t display the number of non-null values for each column, because this dataset is too large.
# Basic usage without parameters
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
# Column Dtype
--- ------ -----
0 id int64
1 by object
2 score float64
3 time float64
4 time_ts datetime64[ns, UTC]
5 title object
6 url object
7 text object
8 deleted object
9 dead object
10 descendants float64
11 author object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
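Why is there no Non-Null Count column here? pandas skips it when the DataFrame exceeds the display.max_info_rows option, which defaults to roughly 1.69 million rows. A quick sketch for inspecting or raising that threshold (the exact default may vary between pandas versions):

# Inspect the row threshold above which info() omits non-null counts
print(pd.get_option('display.max_info_rows'))
# Raising it re-enables the counts for DataFrames up to the new size
pd.set_option('display.max_info_rows', 2_000_000)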
Real memory usage
Specifying memory_usage='deep' enables deep memory introspection and shows the real memory usage. This is particularly useful when working with large datasets containing object (string) columns: the standard estimate only counts the 8-byte pointers to the strings, not their contents, which is why the figure above carries a trailing + and can be significantly lower than the actual usage.
In this example, the real usage is roughly five times higher than previously estimated:
# Full memory usage
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
# Column Dtype
--- ------ -----
0 id int64
1 by object
2 score float64
3 time float64
4 time_ts datetime64[ns, UTC]
5 title object
6 url object
7 text object
8 deleted object
9 dead object
10 descendants float64
11 author object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 1007.3 MB
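To see which columns account for the difference, DataFrame.memory_usage reports per-column byte counts and accepts the same deep flag; a short sketch comparing both estimates:

# Per-column memory usage, shallow vs. deep, in MB
shallow = df.memory_usage(deep=False) / 1024 ** 2
deep = df.memory_usage(deep=True) / 1024 ** 2
print(pd.DataFrame({'shallow_mb': shallow, 'deep_mb': deep}).round(1))

The object columns such as title, url, and text dominate the deep figure, since the shallow estimate only counts their pointers.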
Force count of null values
On large DataFrames, counting the null values in each column is disabled to avoid a lengthy computation. To force the display, set show_counts=True.
# Force Null values counting
df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1959809 non-null int64
1 by 1841269 non-null object
2 score 1841269 non-null float64
3 time 1934088 non-null float64
4 time_ts 1934088 non-null datetime64[ns, UTC]
5 title 1841267 non-null object
6 url 1839757 non-null object
7 text 1425167 non-null object
8 deleted 92818 non-null object
9 dead 394344 non-null object
10 descendants 1741598 non-null float64
11 author 1841269 non-null object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
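If you only need the counts themselves rather than the full summary, you can compute them directly; a quick sketch:

# Non-null counts per column, matching the Non-Null Count column above
print(df.notna().sum())
# Or, complementarily, the number of null values per column
print(df.isna().sum())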
Compact mode
If you don’t want to display the full summary with details about each column, use the compact mode with verbose=False. This mode provides a high-level overview of the DataFrame, including the number of entries, the number of columns, and memory usage:
# Non verbose mode
df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Columns: 12 entries, id to author
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
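Finally, these parameters can be combined in a single call. For example, to get the real memory footprint and force the null counts at once:

# Deep memory introspection plus forced null counts
df.info(memory_usage='deep', show_counts=True)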