esat.model package#
Submodules#
esat.model.batch_sa module#
- class esat.model.batch_sa.BatchSA(V: ndarray, U: ndarray, factors: int, models: int = 20, method: str = 'ls-nmf', seed: int = 42, H: ndarray | None = None, W: ndarray | None = None, H_ratio: float = 0.9, init_method: str = 'column_mean', init_norm: bool = True, fuzziness: float = 5.0, max_iter: int = 20000, converge_delta: float = 0.1, converge_n: int = 100, best_robust: bool = True, robust_mode: bool = False, robust_n: int = 200, robust_alpha: float = 4.0, parallel: bool = True, cpus: int = -1, optimized: bool = True, verbose: bool = True)#
Bases:
object
The batch SA class is used to create multiple SA models using the same input configuration and different random seeds for the initialization of the W and H matrices.
The batch SA class allows for the parallel execution of multiple NMF models.
The set of parameters for the batch includes both the initialization parameters and the model run parameters. A construction sketch follows the parameter list below.
- Parameters:
V (np.ndarray) – The input data matrix containing M samples (rows) by N features (columns).
U (np.ndarray) – The uncertainty associated with the data points in the V matrix, of shape M x N.
factors (int) – The number of factors, sources, SA will create through the W and H matrices.
models (int) – The number of SA models to create. Default = 20.
method (str) – The NMF algorithm to be used for updating the W and H matrices. Options are: ‘ls-nmf’ and ‘ws-nmf’.
seed (int) – The random seed used for initializing the W and H matrices. Default is 42.
H (np.ndarray) – Optional, predefined factor profile matrix. Accepts profiles of size one to ‘factors’.
W (np.ndarray) – Optional, predefined factor contribution matrix.
H_ratio (float) – Optional; used when H has been provided and contains one or more profiles. H_ratio defines how much weight the predefined profiles carry relative to the randomly created ones; a value of 1.0 means the predefined profiles are scaled only relative to each other and account for 100% of their features. Default: 0.9
init_method (str) – The initialization method used when W and/or H are not provided. The default is column means (‘column_mean’); any option other than ‘kmeans’ or ‘cmeans’ falls back to column means.
init_norm (bool) – When using init_method either ‘kmeans’ or ‘cmeans’, this option allows for normalizing the input dataset prior to clustering.
fuzziness (float) – The amount of fuzziness to apply to fuzzy c-means clustering. Default is 5.
max_iter (int) – The maximum number of iterations to update W and H matrices. Default: 20000
converge_delta (float) – The change in the loss value where the model will be considered converged. Default: 0.1
converge_n (int) – The number of iterations where the change in the loss value is less than converge_delta for the model to be considered converged. Default: 100
best_robust (bool) – Use the Q(robust) loss value to determine which model is the best, instead of Q(true). Default = True.
robust_mode (bool) – Used to turn on the robust mode, use the robust loss value in the update algorithm. Default: False
robust_n (int) – When robust_mode=True, the number of iterations to use the default mode before turning on the robust mode to prevent reducing the impact of non-outliers. Default: 200
robust_alpha (float) – When robust_mode=True, the cutoff of the uncertainty scaled residuals to decrease the weights. Robust weights are calculated as the uncertainty multiplied by the square root of the scaled residuals over robust_alpha. Default: 4.0
parallel (bool) – Run the individual models in parallel; this is not the same as the optimized, parallelized option of an SA ws-nmf model. Default = True.
cpus (int) – The number of cpus to use for parallel processing. Default is the number of cpus - 1.
optimized (bool) – The two update algorithms are also implemented in Rust, compiled with maturin, providing an optimized implementation for rapid training of SA models. Setting optimized to True will run the compiled Rust functions.
verbose (bool) – Allows for increased verbosity of the initialization and model training steps.
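A minimal construction sketch with synthetic data (the V and U values below are hypothetical placeholders, not representative inputs):

```python
import numpy as np

from esat.model.batch_sa import BatchSA

rng = np.random.default_rng(42)
V = rng.uniform(1.0, 10.0, size=(200, 15))  # 200 samples x 15 features
U = 0.1 * V + 0.1                           # hypothetical uncertainties, same shape as V

batch = BatchSA(V=V, U=U, factors=4, models=10, method='ls-nmf', seed=42)
```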
- details()#
- static load(file_path: str)#
Load a previously saved Batch SA pickle file.
- Parameters:
file_path (str) – File path to a previously saved Batch SA pickle file
- Returns:
On successful load, will return a previously saved Batch SA object. Will return None on load fail.
- Return type:
BatchSA
- save(batch_name: str, output_directory: str, pickle_model: bool = False, pickle_batch: bool = True, header: list | None = None)#
Save the collection of SA models. They can be saved as individual files (csv and json) for each model, as individual pickle files (one per SA model), or as a single pickle file of the batch SA object.
- Parameters:
batch_name (str) – The name to use for the batch save files.
output_directory (str) – The output directory to save the batch SA files to.
pickle_model (bool) – Pickle the individual models, creating a separate pickle file for each SA model. Default = False.
pickle_batch (bool) – Pickle the batch SA object, which will contain all the SA objects. Default = True.
header (list) – A list of headers, feature names, to add to the top of the csv files. Default: None
- Returns:
The path to the output directory when no pickle file is created, or the path to the pickle file. Returns None if the save fails.
- Return type:
str
- train(min_limit: int | None = None)#
Execute the training sequence for the batch of SA models using the shared configuration parameters. See the usage example below.
- Parameters:
min_limit (int) – The maximum allowed training time, in minutes. Default is None (no limit). The limit does not interrupt the current iteration, but training halts once a single model exceeds it.
- Returns:
True and an empty string if model training is successful; if training fails, the function returns False and an error message explaining why training ended.
- Return type:
bool, str
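Continuing the construction sketch above, an example of the train/save/load cycle (the batch name and ./output directory are hypothetical; the directory must exist):

```python
success, message = batch.train()  # returns (bool, str)
if success:
    path = batch.save(batch_name='demo_batch',
                      output_directory='./output',
                      pickle_batch=True)
    if path is not None:
        reloaded = BatchSA.load(path)  # returns None on load failure
else:
    print(f"Training failed: {message}")
```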
esat.model.ls_nmf module#
- class esat.model.ls_nmf.LSNMF#
Bases:
object
- static update(V: ndarray, We: ndarray, W: ndarray, H: ndarray)#
The update procedure for the least-squares nmf (ls-nmf) algorithm.
The ls-nmf algorithm is described in the publication ‘LS-NMF: A modified non-negative matrix factorization algorithm utilizing uncertainty estimates’ (https://doi.org/10.1186/1471-2105-7-175). A NumPy sketch of this update follows the parameter list below.
- Parameters:
V (np.ndarray) – The input dataset.
We (np.ndarray) – The weights calculated from the input uncertainty dataset.
W (np.ndarray) – The factor contribution matrix.
H (np.ndarray) – The factor profile matrix.
- Returns:
The updated W and H matrices.
- Return type:
np.ndarray, np.ndarray
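A minimal NumPy sketch of this multiplicative update as described in the cited paper; the epsilon guard is an assumption for numerical safety, and the weights We are typically computed as 1/U**2 in LS-NMF (the compiled implementations may differ in detail):

```python
import numpy as np

def ls_nmf_update(V, We, W, H, eps=1e-12):
    # Weighted multiplicative updates; We is typically 1/U**2 in LS-NMF.
    H = H * (W.T @ (We * V)) / (W.T @ (We * (W @ H)) + eps)
    W = W * ((We * V) @ H.T) / ((We * (W @ H)) @ H.T + eps)
    return W, H
```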
esat.model.recombinator module#
- class esat.model.recombinator.OptimalBlockLength(b_star_sb, b_star_cb)#
Bases:
NamedTuple
- b_star_cb: float#
Alias for field number 1
- b_star_sb: float#
Alias for field number 0
- esat.model.recombinator.lam(kk: ndarray) → ndarray#
Helper function, calculates the flattop kernel weights.
Adapted for Python August 12, 2018 by Michael C. Nowotny
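A sketch of the flat-top (trapezoidal) kernel weights, assuming the standard definition from Politis and White (an illustration, not the library source):

```python
import numpy as np

def flattop_weights(kk: np.ndarray) -> np.ndarray:
    # 1 for |t| <= 0.5, tapering linearly to 0 at |t| = 1, 0 beyond.
    a = np.abs(kk)
    return np.where(a <= 0.5, 1.0, np.where(a <= 1.0, 2.0 * (1.0 - a), 0.0))
```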
- esat.model.recombinator.mlag(x: ndarray, n: int | None = 1, init: float | None = 0.0) → ndarray#
Purpose: generates a matrix of n lags from a matrix (or vector) containing a set of vectors (for use in VAR routines).
Usage: xlag = mlag(x, nlag) or xlag1 = mlag(x), which defaults to 1 lag, where x is an nobs by nvar NumPy array.
- Parameters:
x (np.ndarray) – an nobs by nvar NumPy array
n (int) – the number of contiguous lags for each vector in x. Default: 1
init (float) – the value used to fill the initial (pre-sample) lag entries. Default: 0.0
- Returns:
xlag – a matrix of lags (nobs x nvar*nlag): x1(t-1), x1(t-2), … x1(t-nlag), x2(t-1), … x2(t-nlag), … See the usage example below.
original Matlab version written by: James P. LeSage, Dept of Economics University of Toledo 2801 W. Bancroft St, Toledo, OH 43606 jpl@jpl.econ.utoledo.edu
Adapted for Python August 12, 2018 by Michael C. Nowotny
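A usage sketch (the input values are arbitrary):

```python
import numpy as np

from esat.model.recombinator import mlag

x = np.arange(12.0).reshape(6, 2)  # nobs=6 observations, nvar=2 variables
xlag = mlag(x, n=2)                # 6 x 4: x1(t-1), x1(t-2), x2(t-1), x2(t-2)
```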
- esat.model.recombinator.optimal_block_length(data: ndarray) → Sequence[OptimalBlockLength]#
This is a function to select the optimal (in the sense of minimising the MSE of the estimator of the long-run variance) block length for the stationary bootstrap or circular bootstrap. The code follows Politis and White, 2001, “Automatic Block-Length Selection for the Dependent Bootstrap”.
December 2007: corrected to deal with an error in Lahiri’s paper, as published by Nordman in the Annals of Statistics.
- NOTE: The optimal block length for the stationary bootstrap is an average and does not need to be an integer. The optimal block length for the circular bootstrap should be an integer; Politis and White suggest rounding the output up to the nearest integer.
- Parameters:
data (np.ndarray) – an (n x k) matrix
- Returns:
A sequence of k OptimalBlockLength tuples, one per column of data, where b_star_sb is the optimal block length for the stationary bootstrap and b_star_cb is the optimal block length for the circular bootstrap. See the usage example below.
original Matlab version written by: Andrew Patton
4 December, 2002 Revised (to include CB): 13 January, 2003.
Helpful suggestions for this code were received from Dimitris Politis and Kevin Sheppard.
Modified 23.8.2003 by Kevin Sheppard for speed issues
Adapted for Python August 12, 2018 by Michael C. Nowotny
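A usage sketch with synthetic data:

```python
import numpy as np

from esat.model.recombinator import optimal_block_length

rng = np.random.default_rng(42)
data = rng.normal(size=(500, 3))  # n=500 observations, k=3 series
for k, block in enumerate(optimal_block_length(data)):
    # b_star_sb may be fractional; round b_star_cb up per Politis and White.
    print(k, block.b_star_sb, int(np.ceil(block.b_star_cb)))
```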
esat.model.sa module#
- class esat.model.sa.SA(V: ndarray, U: ndarray, factors: int, method: str = 'ls-nmf', seed: int = 42, optimized: bool = True, parallelized: bool = True, verbose: bool = False)#
Bases:
object
The primary Source Apportionment model object which holds and manages the configuration, data, and meta-data for executing and analyzing ESAT output.
The SA class object contains all the parameters and data required for executing one of the implemented NMF algorithms.
The SA class contains the core logic for managing all the steps in the ESAT workflow. These include:
1) The initialization of the factor profile (H) and factor contribution (W) matrices, where these matrices can be set using passed-in values or randomly determined from the input data through mean distributions, k-means, or fuzzy c-means clustering.
2) The execution of the specified NMF algorithm for updating the W and H matrices. The two currently implemented algorithms are least-squares NMF (LS-NMF) and weighted semi-NMF (WS-NMF).
A construction sketch follows the parameter list below.
- Parameters:
V (np.ndarray) – The input data matrix containing M samples (rows) by N features (columns).
U (np.ndarray) – The uncertainty associated with the data points in the V matrix, of shape M x N.
factors (int) – The number of factors, sources, SA will create in the W and H matrices.
method (str) – The NMF algorithm to be used for updating the W and H matrices. Options are: ‘ls-nmf’ and ‘ws-nmf’.
seed (int) – The random seed used for initializing the W and H matrices. Default is 42.
optimized (bool) – The two update algorithms are also implemented in Rust, compiled with maturin, providing an optimized implementation for rapid training of SA models. Setting optimized to True will run the compiled Rust functions.
parallelized (bool) – The Rust implementation of ‘ws-nmf’ has a parallelized version for increased optimization. This parameter is only used when method=’ws-nmf’ and optimized=True; setting parallelized=True runs the parallel version of the function.
verbose (bool) – Allows for increased verbosity of the initialization and model training steps.
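A minimal construction sketch with synthetic data (the V and U values are hypothetical placeholders):

```python
import numpy as np

from esat.model.sa import SA

rng = np.random.default_rng(0)
V = rng.uniform(1.0, 10.0, size=(200, 15))  # 200 samples x 15 features
U = 0.1 * V + 0.1                           # hypothetical uncertainties

sa = SA(V=V, U=U, factors=4, method='ls-nmf', seed=42)
```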
- initialize(H: ndarray | None = None, W: ndarray | None = None, init_method: str = 'column_mean', init_norm: bool = True, fuzziness: float = 5.0, H_ratio: float = 0.9)#
Initialize the factor profile (H) and factor contribution matrices (W).
The W and H matrices can be created using several methods or be passed in by the user. The shapes of these matrices are W: (M, factors) and H: (factors, N). There are three methods for initializing the W and H matrices: 1) K-means clustering (‘kmeans’), which clusters the input dataset into the specified number of factors and assigns the contributions to those factors; the H matrix is calculated from the centroids of those clusters. 2) Fuzzy c-means clustering (‘cmeans’), which clusters the input dataset in the same way as k-means but sets the contributions based upon the ratio of the distances to the clusters. 3) Random sampling based upon the square root of the mean of the features (columns), the default method. See the examples below.
- Parameters:
H (np.ndarray) – The factor profile matrix of shape (factors, N), provided by the user when not using one of the three initialization methods. H is always a non-negative matrix. If fewer than the specified factors are provided, the factor profiles in the H matrix are inserted into the randomly sampled factor profile matrix of shape (factors, N).
W (np.ndarray) – The factor contribution matrix of shape (M, factors), provided by the user when not using one of the three initialization methods. When using method=ws-nmf, the W matrix can contain negative values.
init_method (str) – The initialization method used when W and/or H are not provided. The default option is column means (‘column_mean’); other valid options are ‘kmeans’ and ‘cmeans’.
init_norm (bool) – When using init_method either ‘kmeans’ or ‘cmeans’, this option allows for normalizing the input dataset prior to clustering.
fuzziness (float) – The amount of fuzziness to apply to fuzzy c-means clustering. Default is 5. See fuzzy c-means clustering documentation.
H_ratio (float) – When H is provided with some number of factor profiles (H is not None), H_ratio determines how much the features in the provided H contribute to the complete feature profile ratio. Default: 0.9
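A sketch of the initialization options, continuing the construction example above (H_partial is a hypothetical flat profile of shape (1, N)):

```python
# Default column-mean initialization.
sa.initialize()

# Clustering-based initialization.
sa.initialize(init_method='kmeans', init_norm=True)

# Supply one predefined profile; the remaining profiles are randomly sampled.
H_partial = np.full((1, 15), 1.0 / 15)
sa.initialize(H=H_partial, H_ratio=0.9)
```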
- static load(file_path: str)#
Load a previously saved SA pickle file.
- Parameters:
file_path (str) – File path to a previously saved SA pickle file
- Returns:
On successful load, will return a previously saved SA object. Will return None on load fail.
- Return type:
SA
- save(model_name: str, output_directory: str, pickle_model: bool = False, header: list | None = None)#
Save the SA model to file.
Two options are provided for saving the output of SA to file: 1) saving the SA output to separate files (csv and json), and 2) saving the SA model to a binary pickle object. The files are written to the provided output_directory path, if it exists, using model_name for the file names.
- Parameters:
model_name (str) – The name for the model save files.
output_directory (str) – The path to save the files to, path must exist.
pickle_model (bool) – Save the model to a pickle file. Default = False.
header (list) – A list of headers, feature names, to add to the top of the csv files. Default: None
- Returns:
The path to the output directory if pickle_model=False, or the path to the pickle file. Returns None if the save fails.
- Return type:
str
- summary()#
Provides a summary of the model configuration and results if completed.
- train(max_iter: int = 20000, converge_delta: float = 0.1, converge_n: int = 100, model_i: int = -1, robust_mode: bool = False, robust_n: int = 200, robust_alpha: float = 4, update_step: str | None = None)#
Train the SA model by iteratively updating the W and H matrices, reducing the loss value Q until convergence.
The train method runs the update algorithm until the convergence criteria are met or the maximum number of iterations is reached. The stopping conditions are specified by the input parameters to the train method: the maximum number of iterations is set by max_iter (default 20000), and convergence is defined as a change in the loss value Q of less than converge_delta (default 0.1) over converge_n steps (default 100).
The loss function has an alternative mode in which the weights are modified to decrease the impact of data points with a high uncertainty-scaled residual (greater than robust_alpha). This is the same loss function that calculates the Q(robust) value; setting robust_mode=True switches to using the robust value for updating W and H. robust_n is the number of iterations to run in the default mode before switching to the robust mode, allowing a partial solution to be found before reducing the impact of outlier residuals. robust_alpha is the cutoff value for the uncertainty-scaled residuals, and the weight adjustment is the square root of the scaled residuals over robust_alpha. A sketch of this re-weighting follows the parameter list below.
- Parameters:
max_iter (int) – The maximum number of iterations to update W and H matrices. Default: 20000
converge_delta (float) – The change in the loss value where the model will be considered converged. Default: 0.1
converge_n (int) – The number of iterations where the change in the loss value is less than converge_delta for the model to be considered converged. Default: 100
model_i (int) – The model index, used for identifying models for parallelized processing.
robust_mode (bool) – Used to turn on the robust mode, use the robust loss value in the update algorithm. Default: False
robust_n (int) – When robust_mode=True, the number of iterations to use the default mode before turning on the robust mode to prevent reducing the impact of non-outliers. Default: 200
robust_alpha (float) – When robust_mode=True, the cutoff of the uncertainty-scaled residuals used to decrease the weights. Robust weights are calculated as the uncertainty multiplied by the square root of the scaled residuals over robust_alpha. Default: 4.0
update_step (str) – A replacement for the update method, used for algorithm experimentation.
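A sketch of the robust re-weighting as described in the parameter descriptions above; the function name and exact form are assumptions, not the library source:

```python
import numpy as np

def robust_uncertainty(V, U, W, H, robust_alpha=4.0):
    # Uncertainty-scaled residuals; entries above robust_alpha have their
    # uncertainty inflated, which decreases their weight in the loss.
    scaled = np.abs(V - W @ H) / U
    return np.where(scaled > robust_alpha,
                    U * np.sqrt(scaled / robust_alpha), U)
```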
esat.model.ws_nmf module#
- class esat.model.ws_nmf.WSNMF#
Bases:
object
- static update(V: ndarray, We: ndarray, W: ndarray, H: ndarray)#
Weighted Semi-NMF algorithm.
The details of the semi-nmf algorithm are described in ‘Convex and Semi-Nonnegative Matrix Factorizations’ (https://doi.org/10.1109/TPAMI.2008.277). The algorithm described here does not include the use of uncertainty or weights. The paper ‘Semi-NMF and Weighted Semi-NMF Algorithms Comparison’ by Eric Velten de Melo and Jacques Wainer provides some additional details for part of the weighted semi-NMF algorithm as defined in this function.
The update procedure defined in this function was created by merging the main concepts of these two papers. A sketch of the unweighted semi-NMF update follows the parameter list below.
- Parameters:
V (np.ndarray) – The input dataset.
We (np.ndarray) – The weights calculated from the input uncertainty dataset.
W (np.ndarray) – The factor contribution matrix, prior W is not used by this algorithm but provided here for testing.
H (np.ndarray) – The factor profile matrix.
- Returns:
The updated W and H matrices.
- Return type:
np.ndarray, np.ndarray
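A minimal NumPy sketch of the unweighted semi-NMF update from Ding et al.; the ESAT implementation additionally folds in the uncertainty weights We, which this sketch omits:

```python
import numpy as np

def semi_nmf_update(V, W, H, eps=1e-12):
    # Split a matrix into its positive and negative parts.
    pos = lambda A: (np.abs(A) + A) / 2.0
    neg = lambda A: (np.abs(A) - A) / 2.0

    # W is unconstrained (may be negative): least squares given H.
    W = V @ H.T @ np.linalg.pinv(H @ H.T)

    # H stays non-negative via a multiplicative update on its transpose.
    VtW, WtW, Ht = V.T @ W, W.T @ W, H.T
    Ht = Ht * np.sqrt((pos(VtW) + Ht @ neg(WtW)) /
                      (neg(VtW) + Ht @ pos(WtW) + eps))
    return W, Ht.T
```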