How to use StrapvizPy

This notebook shows how you can utilize the strapvizpy package within a project. We will be using the toy dataset iris from the seaborn package to demonstrate the usage. This dataset contains different iris flowers and their characteristics.

import strapvizpy

print(strapvizpy.__version__)

0.1.0

# import seaborn to load example data set
import seaborn as sns

iris = sns.load_dataset("iris")
iris.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

1. Bootstrapping functions

There are two functions in the bootstrap module, bootstrap_distribution and calculate_boot_stats. These two functions perform the bootstrapping and calculate the relevant statistics.

from strapvizpy.bootstrap import bootstrap_distribution, calculate_boot_stats

# Select a single column from data to work with
ex_data = iris["sepal_width"]

1.1 `bootstrap_distribution`

This function performs the bootstrap and returns an array of the results.

Function Inputs

sample : the data that will be bootstrapped
rep : the number of repetitions of bootstrapping
- this controls the size of the output array
n : number of samples in each bootstrap
- default is auto which means the distribution will be the same size as the original sample
estimator : what sample statistic we are calculating with the bootstrap
- mean, median, var (i.e. variance), or sd (i.e. standard deviation)
random_seed : we can set this for reproducibility

# returns 50 sample means via bootstrapping
bootstrap_distribution(ex_data, 50, random_seed=20)

array([3.07466667, 3.06533333, 3.044     , 3.04866667, 3.07466667,
       3.13266667, 3.062     , 3.08466667, 3.042     , 3.07333333,
       3.034     , 3.05933333, 3.054     , 3.12466667, 3.07      ,
       3.06133333, 3.102     , 3.04466667, 3.126     , 3.036     ,
       3.03933333, 3.06666667, 3.09466667, 3.07333333, 3.044     ,
       3.06466667, 3.096     , 3.03266667, 3.10266667, 3.078     ,
       3.1       , 3.04066667, 3.09866667, 3.046     , 3.09866667,
       3.08666667, 3.06066667, 3.03      , 3.07133333, 3.15333333,
       2.992     , 3.09266667, 3.03133333, 2.99666667, 3.04066667,
       3.056     , 3.00733333, 3.13666667, 3.12866667, 3.088     ])

# returns 75 sample variance via bootstrapping 
bootstrap_distribution(ex_data, 75, estimator="var", random_seed=20)

array([0.18535822, 0.19506489, 0.16939733, 0.18449822, 0.17415822,
       0.21633289, 0.205156  , 0.20529822, 0.20416933, 0.17728889,
       0.17771067, 0.15054622, 0.21435067, 0.19905822, 0.16276667,
       0.20583822, 0.19952933, 0.17273822, 0.20445733, 0.19337067,
       0.20251956, 0.16008889, 0.19503822, 0.21795556, 0.18179733,
       0.18468489, 0.18091733, 0.19379956, 0.16785956, 0.17344933,
       0.20373333, 0.18107956, 0.15573156, 0.14915067, 0.21959822,
       0.20155556, 0.22758622, 0.18743333, 0.18191156, 0.20728889,
       0.14540267, 0.18921289, 0.15881822, 0.16472222, 0.20494622,
       0.19459733, 0.17761289, 0.20365556, 0.19684489, 0.161056  ,
       0.17415289, 0.19053333, 0.20593956, 0.16809822, 0.23328889,
       0.19525156, 0.17689067, 0.17509156, 0.16032222, 0.19157156,
       0.16348267, 0.19065289, 0.22592222, 0.17166267, 0.182884  ,
       0.15885733, 0.17381156, 0.22510933, 0.19329822, 0.15343333,
       0.21625289, 0.16119289, 0.17781733, 0.19305067, 0.193664  ])

1.2 `calculate_bootstrap_stats`

This function performs bootstrapping and returns a dictionary of the sampling distribution statistics.

Inputs

sample : the data that will be bootstrapped
rep : the number of repetitions of bootstrapping
- this controls the size of the output array
n : number of samples in each bootstrap
- default is auto which means the distribution will be the same size as the original sample
level : the significance level of interest for the sampling distribution
- a value between 0 and 1
estimator : what sample statistic we are calculating with the bootstrap
- mean, median, var (i.e. variance), or sd (i.e. standard deviation)
random_seed : we can set this for reproducibility
pass_dist : specifies if the sampling distribution is returned from the function

# Get 100 sample means via bootstrapping and calcualte statistics at the
# 95% confidence interval
st = calculate_boot_stats(ex_data, 100, level=0.95, random_seed=20)
st

{'lower': 3.003133333333333,
 'upper': 3.1347666666666667,
 'sample_mean': 3.0573333333333337,
 'std_err': 0.03242293495523054,
 'level': 0.95,
 'sample_size': 150,
 'n': 'auto',
 'rep': 100,
 'estimator': 'mean'}

# Output data type
type(st)

dict

# Get 50 sample variances via bootstrapping at a 90% confidence level
# and return the bootstrap distribution along with the statistics

st = calculate_boot_stats(
    ex_data, 50, level=0.90, random_seed=20, estimator="var", pass_dist=True)
st

({'lower': 0.1528796222222222,
  'upper': 0.21722535555555553,
  'sample_var': 0.1887128888888889,
  'std_err': 0.01984580483579152,
  'level': 0.9,
  'sample_size': 150,
  'n': 'auto',
  'rep': 50,
  'estimator': 'var'},
 array([0.18535822, 0.19506489, 0.16939733, 0.18449822, 0.17415822,
        0.21633289, 0.205156  , 0.20529822, 0.20416933, 0.17728889,
        0.17771067, 0.15054622, 0.21435067, 0.19905822, 0.16276667,
        0.20583822, 0.19952933, 0.17273822, 0.20445733, 0.19337067,
        0.20251956, 0.16008889, 0.19503822, 0.21795556, 0.18179733,
        0.18468489, 0.18091733, 0.19379956, 0.16785956, 0.17344933,
        0.20373333, 0.18107956, 0.15573156, 0.14915067, 0.21959822,
        0.20155556, 0.22758622, 0.18743333, 0.18191156, 0.20728889,
        0.14540267, 0.18921289, 0.15881822, 0.16472222, 0.20494622,
        0.19459733, 0.17761289, 0.20365556, 0.19684489, 0.161056  ]))

# Data type when the distribution is returned along with the statistics
type(st)

tuple

type(st[0])

dict

type(st[1])

numpy.ndarray

2. Visualizations

There are two functions in the display module, plot_ci and tabulate_stats. These use the bootstrapping statistics to create report-ready visualizations and tables of the sampling distribution.

from strapvizpy.display import plot_ci, tabulate_stats

2.1 `plot_ci`

This function creates a histogram of a sampling distribution with its confidence interval and sample mean

Inputs:

sample : the data that will be bootstrapped
rep : the number of repetitions of bootstrapping
bin_size : the number of bins data will be split into for the histogram
n : number of samples in each bootstrap
- default is auto which means the distribution will be the same size as the original sample
ci_level : the significance level of interest for the sampling distribution
- value between 0 and 1
ci_random_seed : we can set this for reproducibility
title : title of the histogram
x_axis : name of the x axis
y_axis : name of the y axis
path: path to where function should be saved
- default is None which means the plot will not be saved

# Plot sampling distibution of 1000 sample means at a 95% confidence interval
plot_ci(ex_data, rep=1000, ci_level=0.95, ci_random_seed=123);

(<module 'matplotlib.pyplot' from 'C:\\Users\\margo.DESKTOP-T66VM01\\miniconda3\\envs\\strapy\\lib\\site-packages\\matplotlib\\pyplot.py'>,
 '')

# Plot sampling distibution of 1000 sample means at a 99% confidence interval
# with a unique title and a bin size of 50

title = "Bootstrapped Iris Septal Width"
plot_ci(
    ex_data, rep=1000, bin_size=50, ci_level=0.99,
    title=title, ci_random_seed=123);

2.2 `tabulate_stats`

This function creates two tables that summarize the sampling distribution and the parameters for creating the bootstrapped samples and saves them as html files.

Inputs

stat : summary statistics produced by the calculate_boot_stats() function
precision : the precision of the table values
- how many decimal places are shown
estimator : indicates if the bootstrapped statistic is shown in the summary statistics table
alpha : indicates if the significance level should be shown in the summary statistics table
path : can specify a folder path where you want to save the tables

st = calculate_boot_stats(ex_data, 1000, level=0.95, random_seed=20)

stats_table, parameter_table = tabulate_stats(st)

stats_table

Bootstrapping sample statistics from sample with 150 records
Lower Bound CI	Upper Bound CI	Standard Error	Sample mean	Significance Level
2.99	3.12	0.03	3.06	0.050

parameter_table

Parameters used for bootstrapping
Sample Size	Repetition	Significance Level
150	1000	0.050

# tables are pandas styler objects not pandas dataframes
type(stats_table)

pandas.io.formats.style.Styler

# Changes the styler object back to a pandas dataframe
stats_table.data

	Lower Bound CI	Upper Bound CI	Standard Error	Sample mean	Significance Level
0	2.9933	3.124667	0.033979	3.057333	0.05

type(stats_table.data)

pandas.core.frame.DataFrame

How to use StrapvizPy

1. Bootstrapping functions

1.1 bootstrap_distribution

1.2 calculate_bootstrap_stats

2. Visualizations

2.1 plot_ci

2.2 tabulate_stats

1.1 `bootstrap_distribution`

1.2 `calculate_bootstrap_stats`

2.1 `plot_ci`

2.2 `tabulate_stats`