pandas is a Python library for data manipulation and analysis.
- Modin - speed up your Pandas workflows by changing a single line of code - :megaphone: 🐙 🐍
- Pandas Bokeh - a Bokeh plotting backend for Pandas and GeoPandas - 🐙 🐍
- jardin - a
pandas.DataFrame-based ORM - 🐙 🐍
- PrettyPandas - is a Pandas DataFrame Styler class that helps you create report quality tables - 🐙 🐍
- pandas-datareader - up to date remote data access for pandas - 🐙 🐍
- Pandas Profiling - Generates profile reports from a pandas
DataFrame- 🐙 🐍
- Pandaral·lel - A simple and efficient tool to parallelize your pandas operations on all your CPUs on Linux & macOS - 🐙 🐍
- A Gentle Visual Intro to Data Analysis in Python Using Pandas by Jay Alammar, 29 October 2018.
- Improve panda’s Excel Output by Chris Moffitt, 8 April 2015.
- Building a Repeatable Data Analysis Process with Jupyter Notebooks by Chris Moffitt, 20 November 2018.
pandas.io.json.json_normalizeis a function to normalize structured JSON into a flat
dataframe. Useful for working with data that comes from an JSON API.
Connect to a SQLite database
import pandas as pd import sqlite conn = sqlite3.connect("database.sqlite") df = pd.read_sql_query("SELECT * FROM table_name;", conn) df.head()
Using a SQLAlchemy engine to connect to a database
import pandas as pd import numpy as np from sqlalchemy import create_engine engine = create_engine("postgresql:///database") df = pd.read_sql_query("SELECT * FROM table;", con=engine) df.head()
Python compatible column names with slugify
Usually I'm dealing with data from external sources that don't have pretty columns names. I like to use slugify to convert them to Python compatible keys.
from slugify import slugify df.columns = [slugify(c, separator="_", to_lower=True) for c in df.columns]
Read CSV file with all cells as strings
>>> df = pd.read_csv("data/source/example.csv", dtype=str) >>> df.dtypes ID object NAME object SALARY object
DataFrame and view all rows
>>> with pd.option_context("display.max_rows", None): ... print(df.head(1).transpose()) ID 1 NAME Myles Braithwaite SALARY $10,000
Convert a column from continuous to categorical
>>> df["age_groups"] = pd.cut( ... df.age, ... bins=[0, 18, 65, 99], ... labels=["child", "adult", "elderly"] ... )
Kevin Markham (justmarkham) - https://twitter.com/justmarkham/status/1146040449678925824
Read a CSV file data in chunk size
Sometimes a CSV is just to large for the memory on your computer. You can tell the argument
chunksize how many rows of data you would like to load.
df_chunks = pd.read_csv("data/source/data.csv", chunksize=10000) df_chunks.head()
If you would like to load the scale down the data and load the it into one
def chunk_preprocessing(df): """Preprocess a chunk of the data before analysis. Arguments ————————- df : `pd.DataFrame` The chunk of the data source that needs to be processed. """ pass df_chunks_processed = [ chunk_preprocessing(df) for df in df_chunks ] df_concat = pd.concat(df_chunks_processed)
Pandas/SQL Rosetta Stone
SELECT * FROM table WHERE city IN ( "Toronto", "Richmond Hill" );
# City is ether Toronto or Richmond Hill: df[df['city'].isin(['Toronto', 'Richmond Hill'])] # City is not Markdale or Oakville: df[~df['city'].isin(['Markdale', 'Oakville'])]
See the pandas documentation for more information on