Notes for Data Scientists
Here I share technical notes, code chunks and short analyses that I found useful when working with data. Most of them are related to Python, SQL and Google Cloud. New post-its are added regularly, every time I discover gems ๐
Python ๐
- Format numbers the right way
- Get common elements between lists
- List files in a folder
- Open and read/write a file
- Convert nested field to a dict
- Get today or yesterday date
- Plot beautiful time series with seaborn
- Manage markers for seaborn lineplots
- Get the perfect color palette with Seaborn
- Plot great categorical distribution graphs
- Rearrange plot layers z-index
- Calculate and plot confidence intervals
- Read and write to Google Sheets
- Handle infinity values
- Split data into training and test sets
- Train ML algorithms with scikit-learn
- Estimate customer lifetime value
Pandas ๐ผ
- Handle dates in pandas
- Use date indexes with DataFrames
- Flatten nested JSON files
- Pivot DataFrames with non-numeric values
- Rename columns with method chaining
- Flatten multi-index columns
- Aggregate by multiple functions
- Calculate moving averages and rolling windows
- Select rows with NA values in any column
- Apply computation within groups in a DataFrame
- Read multiple CSV files into a DataFrame
- Explode and regroup list into rows
- Split values into columns
- Display more DataFrame cells in notebooks
- Use a column value if not NA, or another column
- Get differences between two DataFrames
- Get top or random rows within groups
- Format dates in bar plots
- Back-fill and forward-fill NaN values
- Use all options for DataFrames info
- Go up a hierarchical tree
- Cross-join DataFrames
- Make full use of value_counts()
- Create a DataFrame of zeros or nulls
- Encode categorical columns as numbers
SQL ๐งฎ
- Import foreign PostgreSQL database
- Join with OR condition
- Get first non-Null value on multiple tables
- Avoid division by zero errors
- Count words in PostgreSQL
- Compute time intervals in PostgreSQL
- Substract days to a date
- Round timestamp in PostgreSQL
- Convert Epoch to datetime in Redshift
- Use CASE in PostgreSQL
- Filter with HAVING clause
- Rank results with the adequate function
- Get cumulated sum
- Get rolling average
- Query new vs returning users
- Compute sessions and streaks
- Query on time windows
- Calculate period-to-period change
- Pivot rows to columns in BigQuery
Google Cloud Platform โ ๏ธ
- Introduction to Google AI Notebooks
- Create GCP notebooks with custom disk size
- Authenticate to Google API on a cloud-based notebook
- Authenticate to BigQuery in a Jupyter notebook
- Use Jupyter magic commands for BigQuery
- Schedule start and stop for a Compute Engine instance
- Copy large distant files to Google Cloud Storage
- Schedule BigQuery jobs
- Query Google Analytics 360 raw logs
Other topics ๐งช
- Combine cd & ls commands
- Batch convert files encoding
- Delete .DS_Store files
- Find files matching pattern in Terminal
- Compress and decompress files in Linux
- Change Git commit authors
- Manage multiple Git accounts
- Squash and reorder commits in Git
- Merge two Git repositories
- Use hash functions for group assignment