esat.data package#

Submodules#

esat.data.analysis module#

class esat.data.analysis.BatchAnalysis(batch_sa: BatchSA, data_handler: DataHandler | None = None)#

Bases: object

Class for running batch solution analysis.

Parameters:

batch_sa (BatchSA) – A completed ESAT batch source apportionment to run solution analysis on.

plot_loss()#

Plot the loss value for each model in the batch solution as it changes over time.

A model stops updating once its convergence criteria are met; such models can be identified as those that stop before reaching the maximum number of iterations. The ideal loss curve should resemble a y=1/x hyperbola, but because of data uncertainty the curve may not be entirely smooth.
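As a rough illustration of why some curves end early, the sketch below assumes a convergence rule in which the loss must change by less than converge_delta for converge_n consecutive iterations. The function has_converged and the exact rule are assumptions made for illustration, not ESAT's confirmed implementation.

```python
def has_converged(loss_history, converge_n=10, converge_delta=0.01):
    """Assumed rule: the loss changed by less than converge_delta
    in each of the last converge_n iterations."""
    if len(loss_history) <= converge_n:
        return False
    recent = loss_history[-(converge_n + 1):]
    # Absolute change between consecutive iterations in the window.
    changes = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return all(change < converge_delta for change in changes)


# A 1/x-shaped loss curve: large early drops, small late ones.
losses = [100.0 / step for step in range(1, 201)]

# First iteration count at which the assumed rule is satisfied.
stop_at = next(
    (i for i in range(1, len(losses) + 1)
     if has_converged(losses[:i], converge_n=5, converge_delta=0.01)),
    None,
)
print(stop_at)
```

Under this rule a model whose curve flattens early stops well before the maximum number of iterations, which is the pattern visible in the loss plot.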

plot_loss_distribution()#

Plot the distribution of batch model Q(True) and Q(Robust).

A very broad distribution is often the result of ‘loose’ convergence criteria; increasing converge_n and decreasing converge_delta will tighten the criteria. If the Q(True) and Q(Robust) distributions are very similar, the solution may be overfit, with enough sources/factors available to capture most of the outlier behavior. In this case, reducing the number of factors can resolve the overfitting.
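To make the two loss values concrete, here is a minimal sketch that assumes the EPA PMF 5.0 convention: Q(True) sums all squared uncertainty-scaled residuals, while Q(Robust) leaves out points whose scaled residual exceeds 4 in absolute value. The function q_values and the robust_limit parameter are hypothetical names, and ESAT's own calculation may differ in detail.

```python
def q_values(observed, estimated, uncertainty, robust_limit=4.0):
    """Return (Q_true, Q_robust) for equally shaped nested lists."""
    q_true = 0.0
    q_robust = 0.0
    for v_row, e_row, u_row in zip(observed, estimated, uncertainty):
        for v, e, u in zip(v_row, e_row, u_row):
            scaled = (v - e) / u  # uncertainty-scaled residual
            q_true += scaled ** 2
            # Assumed PMF5-style rule: outliers are excluded from Q(Robust).
            if abs(scaled) <= robust_limit:
                q_robust += scaled ** 2
    return q_true, q_robust


observed = [[1.0, 2.0], [3.0, 10.0]]
estimated = [[1.1, 1.8], [2.9, 4.0]]
uncertainty = [[0.1, 0.2], [0.1, 1.0]]

q_true, q_robust = q_values(observed, estimated, uncertainty)
print(q_true, q_robust)
```

Here one poorly fit point accounts for most of Q(True), so the two values differ widely. When a model fits its outliers as well, the two values move together.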

plot_temporal_residuals(feature_idx: int)#

Plot the temporal residuals for a specified feature, by index, of all models in the SA batch.

Parameters:

feature_idx (int) – The index of the feature to plot.

class esat.data.analysis.ModelAnalysis(datahandler: DataHandler, model: SA, selected_model: int | None = None)#

Bases: object

Class for running model analysis and generating plots. A collection of model statistic methods and plot generation functions.

Parameters:
  • datahandler (DataHandler) – The datahandler instance used for processing the input and uncertainty datasets used by the SA model.

  • model (SA) – A completed SA model with output used for calculating model statistics and generating plots.

  • selected_model (int) – If SA model is part of a batch, the model id/index that will be used for plot labels.

calculate_statistics(results: ndarray | None = None)#

Calculate general statistics from the results of the NMF model run.

Generates a pd.DataFrame with a set of metrics for each feature. The resulting dataframe is accessible as .statistics. These metrics focus on residual analysis, including normality tests of the residuals using three different test metrics.

Parameters:

results (np.ndarray) – By default, this function uses the ESAT model WH matrix for calculating metrics. This can be overridden by providing an np.ndarray in the ‘results’ parameter. Default = None.

features_metrics(est_V: ndarray | None = None)#

Create a dataframe of the feature metrics and error for model analysis.

Parameters:

est_V (np.ndarray) – Overrides the use of the ESAT model’s WH matrix in the residual calculation. Default = None.

Returns:

The features of the input dataset compared to the results of the model, as a pd.DataFrame

Return type:

pd.DataFrame
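The residual calculation behind these metrics compares the input data V with the model estimate WH (or with est_V when it is supplied). The sketch below shows that comparison with plain Python lists and a per-feature RMSE. The helpers matmul and feature_rmse are hypothetical, and RMSE is used only as an example metric, not as a statement of which metrics ESAT reports.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def feature_rmse(V, W, H, est_V=None):
    """Root-mean-square residual for each feature (column) of V."""
    estimate = est_V if est_V is not None else matmul(W, H)
    n_samples = len(V)
    rmse = []
    for j in range(len(V[0])):
        squared = [(V[i][j] - estimate[i][j]) ** 2 for i in range(n_samples)]
        rmse.append((sum(squared) / n_samples) ** 0.5)
    return rmse


V = [[2.0, 5.0], [1.0, 3.0], [3.0, 5.0]]  # 3 samples x 2 features
W = [[1.0, 1.0], [1.0, 0.0], [1.0, 2.0]]  # 3 samples x 2 factors
H = [[1.0, 3.0], [1.0, 1.0]]              # 2 factors x 2 features

rmse = feature_rmse(V, W, H)
print(rmse)
```

The first feature is reproduced exactly by WH, while the second has a single sample that is off by one unit.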

plot_estimated_observed(feature_idx: int)#

Create a plot that shows the estimated concentrations of a feature vs the observed concentrations.

Parameters:

feature_idx (int) – The index of the feature to plot.

plot_estimated_timeseries(feature_idx: int)#

Create a plot that shows the estimated values of a timeseries for a specific feature, selected by feature index.

Parameters:

feature_idx (int) – The index of the feature to plot.

plot_factor_composition()#

Create a radar plot of the composition of all factors across all features.

plot_factor_contributions(feature_idx: int, contribution_threshold: float = 0.05)#

Create a plot of the factor contributions and the normalized contribution.

Parameters:
  • feature_idx (int) – The index of the feature to plot.

  • contribution_threshold (float) – The contribution percentage of a factor above which to include on the plot.
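One way to read contribution_threshold is as a cutoff on each factor's share of a feature's total estimated mass. The sketch below assumes that interpretation. The function factor_shares and the way the share is computed (summing W over samples and scaling by the factor profile value in H) are illustrative assumptions, not ESAT's confirmed calculation.

```python
def factor_shares(W, H, feature_idx, contribution_threshold=0.05):
    """Return {factor index: share} for factors above the threshold."""
    n_factors = len(H)
    # Total mass each factor contributes to the feature over all samples.
    totals = [sum(row[k] for row in W) * H[k][feature_idx]
              for k in range(n_factors)]
    grand_total = sum(totals)
    if grand_total == 0:
        return {}
    shares = [total / grand_total for total in totals]
    return {k: share for k, share in enumerate(shares)
            if share > contribution_threshold}


W = [[1.0, 2.0, 0.1], [3.0, 2.0, 0.1]]      # 2 samples x 3 factors
H = [[1.0, 0.5], [1.0, 2.0], [1.0, 0.1]]    # 3 factors x 2 features

kept = factor_shares(W, H, feature_idx=0)
print(kept)
```

With the default threshold of 0.05, the third factor, which supplies roughly 2% of the feature's mass, is left off the plot.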

plot_factor_fingerprints(grouped: bool = False)#

Create a stacked bar plot of the factor profiles (fingerprints).

plot_factor_profile(factor_idx: int, H: ndarray | None = None, W: ndarray | None = None)#

Create a bar plot of a factor profile.

Parameters:
  • factor_idx (int) – The index of the factor to plot (1 -> k).

  • H (np.ndarray) – Overrides the factor profile matrix in the ESAT model used for the plot.

  • W (np.ndarray) – Overrides the factor contribution matrix in the ESAT model used for the plot.

plot_factor_surface(factor_idx: int = 1, feature_idx: int | None = None, percentage: bool = True, zero_threshold: float = 0.0001)#

Creates a 3d surface plot of the specified factor_idx’s concentration percentage or mass.

Parameters:
  • factor_idx (int) – The factor index to plot, showing all features for that factor. If factor_idx is None, the plot shows feature_idx for all factors.

  • feature_idx (int) – The feature to include in the plot if factor_idx is None; otherwise all features are shown for the specified factor_idx.

  • percentage (bool) – Plot the concentration as a scaled value, percentage of the sum of all factors, or as the calculated mass. Default = True.

  • zero_threshold (float) – Values below this threshold are considered zero on the plot.

plot_g_space(factor_1: int, factor_2: int)#

Create a scatter plot showing one factor’s contributions vs another factor’s contributions.

Parameters:
  • factor_1 (int) – The index of the factor to plot along the x-axis.

  • factor_2 (int) – The index of the factor to plot along the y-axis.

plot_residual_histogram(feature_idx: int, abs_threshold: float = 3.0, est_V: ndarray | None = None)#

Create a plot of a histogram of the residuals for a specific feature.

Parameters:
  • feature_idx (int) – The index of the feature for the plot.

  • abs_threshold (float) – The function generates a list of residuals whose absolute value exceeds this limit.

  • est_V (np.ndarray) – Overrides the use of the ESAT model’s WH matrix in the residual calculation. Default = None.

Returns:

The list of residuals that exceed the absolute value of the threshold, as a pd.DataFrame

Return type:

pd.DataFrame

esat.data.datahandler module#

class esat.data.datahandler.DataHandler(input_path: str, uncertainty_path: str, index_col: str | None = None, drop_col: list | None = None, sn_threshold: float = 2.0, load: bool = True)#

Bases: object

The class for cleaning and preparing input datasets for use in ESAT.

The DataHandler class is intended to provide a standardized way of cleaning and preparing data from file to ESAT models.

The input and uncertainty data files are specified by their file paths. Input files can be .csv or tab separated text files. Other file formats are not supported at this time.

#TODO: Add additional supported file formats by expanding the __read_data function.

Parameters:
  • input_path (str) – The file path to the input dataset.

  • uncertainty_path (str) – The file path to the uncertainty dataset. #TODO: Add the option of generating an uncertainty dataset from a provided input dataset, using a random selection of some percentage range of the input dataset cell values.

  • index_col (str) – The name of the index column if it is not the first column in the dataset. Default = None, which will use the 1st column.

  • drop_col (list) – A list of columns to drop from the dataset. Default = None.

  • sn_threshold (float) – The threshold for the signal to noise ratio values.

  • load (bool) – Load the input and uncertainty data files, used internally for load_dataframe.
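For context on sn_threshold, the sketch below computes a per-feature signal-to-noise ratio using the EPA PMF 5.0 definition, where only the part of a value above its uncertainty counts as signal. Whether ESAT uses exactly this formula, and how it applies sn_threshold afterwards, is an assumption here. The function signal_to_noise is a hypothetical helper.

```python
def signal_to_noise(concentrations, uncertainties):
    """S/N of one feature, following the PMF 5.0 definition (assumed)."""
    ratios = []
    for x, s in zip(concentrations, uncertainties):
        # Values at or below their uncertainty contribute no signal.
        ratios.append((x - s) / s if x > s else 0.0)
    return sum(ratios) / len(ratios)


feature_values = [4.0, 1.0, 9.0, 0.5]
feature_uncertainty = [1.0, 2.0, 3.0, 1.0]

sn = signal_to_noise(feature_values, feature_uncertainty)
sn_threshold = 2.0  # the DataHandler default
below_threshold = sn < sn_threshold
print(sn, below_threshold)
```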

get_data()#

Get the processed input and uncertainty dataset ready for use in ESAT.

Returns:

The processed input dataset and the processed uncertainty dataset as numpy arrays.

Return type:

np.ndarray, np.ndarray

static load_dataframe(input_df: DataFrame, uncertainty_df: DataFrame)#

Pass in pandas dataframes for the input and uncertainty datasets, instead of using files.

Parameters:
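  • input_df (pd.DataFrame) – The input dataset as a pandas dataframe.

  • uncertainty_df (pd.DataFrame) – The uncertainty dataset as a pandas dataframe.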

Returns:

Instance of DataHandler using dataframes as input.

Return type:

DataHandler

plot_data_uncertainty(feature_idx)#

Create a plot of the data vs the uncertainty for a specified feature, by the feature index.

Parameters:

feature_idx (int) – The index of the feature, column, of the input and uncertainty dataset to plot.

plot_feature_data(x_idx, y_idx)#

Create a plot of one data feature (column) vs another data feature (column), specified by the feature indices.

Parameters:
  • x_idx (int) – The feature index for the x-axis values.

  • y_idx (int) – The feature index for the y-axis values.

plot_feature_timeseries(feature_selection)#

Create a plot of a feature, or list of features, as a timeseries.

Parameters:

feature_selection (int or list) – A single feature index or a list of feature indices to plot as a timeseries.

set_category(feature: str, category: str = 'strong')#

Set the S/N category for the feature. The options are ‘strong’, ‘weak’ or ‘bad’. All features are set to ‘strong’ by default, which doesn’t modify the feature’s behavior in models. Features categorized as ‘weak’ have their uncertainty tripled, and ‘bad’ features are excluded from analysis.

Parameters:
  • feature (str) – The name or label of the feature.

  • category (str) – The new category of the feature: ‘strong’, ‘weak’ or ‘bad’. Default = ‘strong’.

Returns:

True if the change was successful, otherwise False.

Return type:

bool
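The effect of the three categories on the uncertainty data can be sketched as follows: ‘strong’ columns are left unchanged, ‘weak’ columns are multiplied by three, and ‘bad’ columns are dropped. The function apply_categories and its arguments are hypothetical and only mirror the behavior described above. They are not part of the DataHandler API.

```python
def apply_categories(uncertainty, features, categories):
    """Return (kept feature names, adjusted uncertainty rows)."""
    scale = {"strong": 1.0, "weak": 3.0}
    for name, category in categories.items():
        if category not in ("strong", "weak", "bad"):
            raise ValueError(f"Unknown category for {name}: {category}")
    # Features without an explicit category default to 'strong'.
    keep = [j for j, name in enumerate(features)
            if categories.get(name, "strong") != "bad"]
    adjusted = [
        [row[j] * scale[categories.get(features[j], "strong")] for j in keep]
        for row in uncertainty
    ]
    return [features[j] for j in keep], adjusted


features = ["SO4", "NO3", "Pb"]
uncertainty = [[0.1, 0.2, 0.3], [0.2, 0.4, 0.6]]
categories = {"NO3": "weak", "Pb": "bad"}

kept_features, adjusted = apply_categories(uncertainty, features, categories)
print(kept_features, adjusted)
```

Tripling the uncertainty lowers the weight of a ‘weak’ feature in the fit without removing it.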

esat.data.test_tools module#

class esat.data.test_tools.CompareAnalyzer(input_df, pmf_profile_df, pmf_contributions_df, ls_profile_df, ws_profile_df, ls_mapping, ws_mapping, ls_contributions_df, ws_contributions_df, features, datetimestamps)#

Bases: object

Compare ESAT output with the PMF5 output.

feature_histogram(feature: str | None = None, feature_i: int = 0, normalized: bool = False, threshold: float = 3.0)#
plot_factor_contribution(feature: str | None = None, feature_i: int = 0)#
plot_factors()#
plot_feature_timeseries(factor_n: int, feature_n, show_input: bool = True)#
plot_fingerprints(ls_nmf_r2: -1, ws_nmf_r2: -1)#
timeseries_plot(feature: str | None = None, feature_i: int = 0)#

Module contents#