esat.data package#

Submodules#

esat.data.analysis module#

class esat.data.analysis.BatchAnalysis(batch_sa: BatchSA, data_handler: DataHandler = None)#

Bases: object

Class for running batch solution analysis.

Parameters:

batch_sa (BatchSA) – A completed ESAT batch source apportionment to run solution analysis on.
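
A minimal usage sketch, assuming a completed batch run held in batch_sa and the DataHandler used to build it in dh (both variable names are placeholders):

>>> from esat.data.analysis import BatchAnalysis
>>> ba = BatchAnalysis(batch_sa=batch_sa, data_handler=dh)
>>> ba.plot_loss()               # loss curves for every model in the batch
>>> ba.plot_loss_distribution()  # distribution of Q(True) and Q(Robust)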

plot_loss(show: bool = True)#

Plot the loss value for each model in the batch solution as it changes over time.

A model stops updating once the convergence criterion is met, which can be identified by curves that end before reaching the maximum number of iterations. An ideal loss curve resembles a y=1/x hyperbola, though because of the data uncertainty the curve may not be entirely smooth.

plot_loss_distribution(show: bool = True)#

Plot the distribution of batch model Q(True) and Q(Robust).

A very broad distribution is often the result of a ‘loose’ convergence criterion; increasing converge_n and decreasing converge_delta will tighten it. If the Q(True) and Q(Robust) distributions are very similar, the solution may be overfit, meaning enough sources/factors are available to capture the majority of the outlier behavior. In that case, reducing the number of factors can resolve the overfitting.
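
A sketch of re-running the batch with a tighter convergence criterion; converge_n and converge_delta are taken from the note above, while the import path, the remaining BatchSA arguments shown (V, U, factors, models, max_iter), and the train() call are assumptions:

>>> from esat.model.batch_sa import BatchSA      # import path is an assumption
>>> # V, U: processed data and uncertainty arrays, e.g. from DataHandler.get_data()
>>> batch_sa = BatchSA(V=V, U=U, factors=6, models=20, max_iter=20000,
...                    converge_delta=0.1,  # smaller delta -> stricter criterion
...                    converge_n=50)       # more consecutive steps below delta required
>>> _ = batch_sa.train()                    # assumed training entry point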

plot_temporal_residuals(feature_idx: int, show: bool = True)#

Plot the temporal residuals for a specified feature. Only the best model’s residuals are visible initially; the other models’ traces are set to ‘legendonly’ and can be toggled on from the plot legend.

class esat.data.analysis.ModelAnalysis(datahandler: DataHandler, model: SA, selected_model: int = None)#

Bases: object

Class for running model analysis and generating plots. A collection of model statistic methods and plot generation functions.

Parameters:
  • datahandler (DataHandler) – The datahandler instance used for processing the input and uncertainty datasets used by the SA model.

  • model (SA) – A completed SA model with output used for calculating model statistics and generating plots.

  • selected_model (int) – If the SA model is part of a batch, the model id/index used for plot labels.
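
A minimal construction sketch, assuming the fitted models of a batch are available as a list in batch_sa.results and the index of the best model in batch_sa.best_model (both attribute names are assumptions):

>>> from esat.data.analysis import ModelAnalysis
>>> best_i = batch_sa.best_model                        # assumed attribute
>>> ma = ModelAnalysis(datahandler=dh,
...                    model=batch_sa.results[best_i],  # assumed attribute
...                    selected_model=best_i)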

aggregate_factors_for_plotting()#

Aggregate each factor’s V_prime_k for plotting, reducing to max_samples using DataHandler’s binning.

Returns:

Dictionary mapping factor index to aggregated V_prime_k DataFrame.

Return type:

dict

calculate_statistics(results: ndarray = None)#

Calculate general statistics from the results of the NMF model run.

Generates a pd.DataFrame with a set of metrics for each feature, accessible afterwards as .statistics. The metrics focus on residual analysis, including normality tests of the residuals computed with three different test statistics.

Parameters:

results (np.ndarray) – By default the function uses the ESAT model’s WH matrix to calculate the metrics; this can be overridden by providing an np.ndarray in the ‘results’ parameter. Default = None.
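
For example, using a ModelAnalysis instance ma as constructed above:

>>> ma.calculate_statistics()   # uses the model's WH matrix by default
>>> ma.statistics               # per-feature residual metrics as a pd.DataFrame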

features_metrics(est_V: ndarray = None)#

Create a dataframe of the feature metrics and error for model analysis.

Parameters:

est_V (np.ndarray) – Overrides the use of the ESAT model’s WH matrix in the residual calculation. Default = None.

Returns:

The features of the input dataset compared to the results of the model, as a pd.DataFrame

Return type:

pd.DataFrame

plot_all_factors(factor_list: list = None, H: ndarray = None, W: ndarray = None, show: bool = True)#

Create a vertical set of subplots for all factor profiles, similar to plot_factor_profile.

Parameters:
  • factor_list (list) – A list of factor indices to plot, if None will plot all factors.

  • H (np.ndarray) – Overrides the factor profile matrix in the ESAT model used for the plot.

  • W (np.ndarray) – Overrides the factor contribution matrix in the ESAT model used for the plot.

plot_all_factors_3d(H=None, W=None, show: bool = True, plot_type: str = 'profile')#

Create a 3D bar plot of the factor profiles and their contributions.

Parameters:
  • H (np.ndarray, optional) – The factor profile matrix, if None will use the model’s H matrix.

  • W (np.ndarray, optional) – The factor contribution matrix, if None will use the model’s W matrix.

  • show (bool) – If True, the plot will be displayed. Default is True.

  • plot_type (str) – Should be either “profile”, “conc”, or “both”.

plot_estimated_observed(feature_idx: int = None, feature_name: str = None, show: bool = True)#

Create a plot that shows the estimated concentrations of a feature vs the observed concentrations.

Parameters:
  • feature_idx (int, optional) – The index of the feature to plot.

  • feature_name (str, optional) – The name of the feature to plot.

  • show (bool) – If True, the plot will be displayed. Default is True.

plot_estimated_timeseries(feature_idx: int = None, feature_name: str = None, show: bool = True)#

Create a plot that shows the estimated values of a timeseries for a specific feature.

Parameters:
  • feature_idx (int, optional) – The index of the feature to plot.

  • feature_name (str, optional) – The name of the feature to plot.

  • show (bool) – If True, the plot will be displayed. Default is True.

plot_factor_composition()#

Create a radar plot showing the composition of each factor across all features.

plot_factor_contributions(feature_idx: int, contribution_threshold: float = 0.05, show: bool = True)#

Create a plot of the factor contributions and the normalized contribution.

Parameters:
  • feature_idx (int) – The index of the feature to plot.

  • contribution_threshold (float) – The minimum contribution percentage a factor must have to be included in the plot.

  • show (bool) – If True, the plot will be displayed. Default is True.

plot_factor_fingerprints(grouped: bool = False, show: bool = True)#

Create a stacked bar plot of the factor profiles (fingerprints).

plot_factor_profile(factor_idx: int, H: ndarray = None, W: ndarray = None, show: bool = True)#

Create a bar plot of a factor profile.

Parameters:
  • factor_idx (int) – The index of the factor to plot (1 -> k).

  • H (np.ndarray) – Overrides the factor profile matrix in the ESAT model used for the plot.

  • W (np.ndarray) – Overrides the factor contribution matrix in the ESAT model used for the plot.

  • show (bool) – If True, the plot will be displayed. Default is True.
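
For example, to plot the second factor of the selected model, or to plot against explicitly supplied matrices (my_H and my_W are hypothetical arrays with the same shapes as the model’s H and W):

>>> ma.plot_factor_profile(factor_idx=2)
>>> ma.plot_factor_profile(factor_idx=2, H=my_H, W=my_W)  # override the model's matrices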

plot_factor_surface(factor_idx: int = 1, feature_idx: int = None, percentage: bool = True, zero_threshold: float = 0.0001)#

Create a 3D surface plot of the specified factor’s concentration percentage or mass.

Parameters:
  • factor_idx (int) – The factor index to plot, showing all features for that factor; if factor_idx is None, the plot shows feature_idx across all factors.

  • feature_idx (int) – The feature to plot across all factors when factor_idx is None; otherwise all features of the specified factor_idx are shown.

  • percentage (bool) – If True, plot the concentration as a percentage of the sum over all factors; otherwise plot the calculated mass. Default = True.

  • zero_threshold (float) – Values below this threshold are considered zero on the plot.

plot_g_space(factor_1: int, factor_2: int, show: bool = True)#

Create a scatter plot showing one factor’s contributions vs another factor’s contributions.

Parameters:
  • factor_1 (int) – The index of the factor to plot along the x-axis.

  • factor_2 (int) – The index of the factor to plot along the y-axis.

  • show (bool) – If True, the plot will be displayed. Default is True.

plot_residual_histogram(feature_idx: int = None, feature_name: str = None, abs_threshold: float = 3.0, est_V: ndarray = None, show: bool = True)#

Create a plot of a histogram of the residuals for a specific feature.

Parameters:
  • feature_idx (int, optional) – The index of the feature to plot.

  • feature_name (str, optional) – The name of the feature to plot.

  • abs_threshold (float) – Residuals whose absolute value exceeds this threshold are collected into a list.

  • est_V (np.ndarray) – Overrides the use of the ESAT model’s WH matrix in the residual calculation. Default = None.

  • show (bool) – If True, the plot will be displayed. Default is True.
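
For example, to review the residuals of the first feature and list those with an absolute value above 3:

>>> ma.plot_residual_histogram(feature_idx=0, abs_threshold=3.0)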

esat.data.datahandler module#

class esat.data.datahandler.DataHandler(input_path: str, uncertainty_path: str, index_col: str = None, drop_col: list = None, drop_nans: bool = True, loc_cols: str | list = None, sn_threshold: float = 2.0, load: bool = True, loc_metadata: dict = None, max_plotting_n: int = 10000)#

Bases: object

The class for cleaning and preparing input datasets for use in ESAT.

The DataHandler class is intended to provide a standardized way of cleaning and preparing data from file for use in ESAT models.

The input and uncertainty data files are specified by their file paths. Input files can be .csv or tab-separated text files. Other file formats are not supported at this time.

#TODO: Add additional supported file formats by expanding the __read_data function.

Parameters:
  • input_path (str) – The file path to the input dataset.

  • uncertainty_path (str) – The file path to the uncertainty dataset. #TODO: Add the option of generating an uncertainty dataset from a provided input dataset, using a random selection of some percentage range of the input dataset cell values.

  • index_col (str) – The name of the index column if it is not the first column in the dataset. Default = None, which uses the first column.

  • drop_col (list) – A list of columns to drop from the dataset. Default = None.

  • loc_cols (str | list) – Location information columns, such as latitude/longitude or another identifier, used to identify the location of each record.

  • sn_threshold (float) – The threshold for the signal-to-noise ratio values.

  • load (bool) – Load the input and uncertainty data files, used internally for load_dataframe.

  • loc_metadata (dict) – Optional dictionary containing metadata about the locations in the dataset, such as latitude and longitude.
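
A minimal construction sketch; the file paths and index column name are placeholders:

>>> from esat.data.datahandler import DataHandler
>>> dh = DataHandler(input_path="input_data.csv",
...                  uncertainty_path="uncertainty_data.csv",
...                  index_col="Date",     # placeholder index column name
...                  sn_threshold=2.0)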

aggregate_output(output_array: ndarray) DataFrame#

Aggregate an output numpy array using the same bins/labels as used in _aggregate_data. Returns a pandas DataFrame.

get_data()#

Get the processed input and uncertainty dataset ready for use in ESAT.

Returns:

The processed input dataset and the processed uncertainty dataset as numpy arrays.

Return type:

np.ndarray, np.ndarray
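
For example, to obtain the arrays passed to an ESAT model:

>>> V, U = dh.get_data()   # processed input and uncertainty matrices as numpy arrays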

static load_dataframe(input_df: DataFrame, uncertainty_df: DataFrame)#

Pass in pandas dataframes for the input and uncertainty datasets, instead of using files.

Parameters:
  • input_df (pd.DataFrame) – The input dataset as a pandas DataFrame.

  • uncertainty_df (pd.DataFrame) – The uncertainty dataset as a pandas DataFrame.

Returns:

Instance of DataHandler using dataframes as input.

Return type:

DataHandler
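
A sketch using pandas DataFrames instead of file paths; the file names are placeholders and the DataFrames are assumed to share the same columns and index:

>>> import pandas as pd
>>> input_df = pd.read_csv("input_data.csv", index_col=0)
>>> uncertainty_df = pd.read_csv("uncertainty_data.csv", index_col=0)
>>> dh = DataHandler.load_dataframe(input_df=input_df, uncertainty_df=uncertainty_df)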

merge(data_handlers: list, source_labels: list)#

Merge a list of DataHandler instances into this DataHandler instance. All instances must have the same features as the current instance. Adds a ‘source_label’ column indicating the origin of each row.

Parameters:
  • data_handlers (list) – A list of DataHandler instances to merge.

  • source_labels (list) – A list of labels (str) indicating the source of each DataHandler.

Returns:

True if merging was successful, otherwise False.

Return type:

bool
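
For example, to fold two site-specific handlers into the current instance (the handler variables and labels are placeholders):

>>> merged = dh.merge(data_handlers=[dh_site_a, dh_site_b],
...                   source_labels=["site_a", "site_b"])   # returns True on success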

plot_2d_histogram(x_col: str, y_col: str, show: bool = True, nbins: int = 100)#

Plots a 2D histogram of two features in the input data.

Parameters:
  • x_col (str) – The name of the feature to plot on the x-axis.

  • y_col (str) – The name of the feature to plot on the y-axis.

  • show (bool) – Whether to display the plot immediately.

  • nbins (int) – The number of bins to use for the histogram in both x and y dimensions.

Returns:

The Plotly figure object containing the 2D histogram.

Return type:

Plotly.graph_objects.Figure

plot_data_uncertainty(show: bool = True, include_menu: bool = True, feature_idx: int = None)#

Create a plot of the data vs the uncertainty for a specified feature, with a dropdown menu for feature selection.

plot_feature_correlation_heatmap(method: str = 'pearson', show: bool = True)#

Plots a correlation heatmap for the features in the DataFrame.

Parameters:
  • method (str) – Correlation method: ‘pearson’, ‘spearman’, or ‘kendall’.

  • show (bool) – Whether to display the plot immediately.

Returns:

The Plotly heatmap figure.

Return type:

plotly.graph_objects.Figure

plot_feature_data(x_idx, y_idx, show: bool = True)#

Create a plot of one data feature (column) vs another, specified by their feature indices.

Parameters:
  • x_idx (int) – The feature index for the x-axis values.

  • y_idx (int) – The feature index for the y-axis values.

plot_feature_timeseries(feature_selection, show: bool = True)#

Create a plot of a feature, or list of features, as a timeseries.

Parameters:

feature_selection (int or list) – A single or list of feature indices to plot as a timeseries.

plot_ridgeline(log_x=True, fill=False, max_height=800, min_spacing=0.5, max_spacing=1.5, nbins=500, show=True)#

Create a ridgeline plot of the feature distributions in the input data.

Parameters:
  • log_x (bool) – Whether to use a logarithmic scale for the x-axis.

  • fill (bool) – Whether to fill the area under the curves.

  • max_height (int) – The maximum height of the plot in pixels.

  • min_spacing (float) – The minimum spacing between the ridgelines.

  • max_spacing (float) – The maximum spacing between the ridgelines.

  • nbins (int) – The number of bins to use for the histogram in the x-axis.

  • show (bool) – Whether to display the plot immediately.

Returns:

The Plotly figure object containing the ridgeline plot.

Return type:

plotly.graph_objects.Figure

plot_superimposed_histograms(show: bool = True, nbins: int = 50)#

Plots superimposed histograms for each feature in the input data using a colormap.

set_category(feature: str, category: str = 'strong')#

Set the S/N category for the feature; options are ‘strong’, ‘weak’ or ‘bad’. All features are set to ‘strong’ by default, which does not modify the feature’s behavior in models. Features categorized as ‘weak’ have their uncertainty tripled, and ‘bad’ features are excluded from analysis.

Parameters:
  • feature (str) – The name or label of the feature.

  • category (str) – The new category of the feature.

Returns:

True if the change was successful, otherwise False.

Return type:

bool
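
For example, to down-weight a noisy feature and exclude an unreliable one (the feature names are placeholders):

>>> dh.set_category(feature="NO2", category="weak")   # uncertainty tripled in models
>>> dh.set_category(feature="PM10", category="bad")   # excluded from analysis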

split_locations()#

When the input data has location information, this function splits the data and uncertainty into separate DataHandler instances, one for each location.

Returns:

A list of DataHandler instances, one for each unique location in the input data.

Return type:

list
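
For example, when the handler was constructed with loc_cols set:

>>> site_handlers = dh.split_locations()   # one DataHandler per unique location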

esat.data.test_tools module#

class esat.data.test_tools.CompareAnalyzer(input_df, pmf_profile_df, pmf_contributions_df, ls_profile_df, ws_profile_df, ls_mapping, ws_mapping, ls_contributions_df, ws_contributions_df, features, datetimestamps)#

Bases: object

Compare ESAT output with the PMF5 output.

feature_histogram(feature: str = None, feature_i: int = 0, normalized: bool = False, threshold: float = 3.0)#
plot_factor_contribution(feature: str = None, feature_i: int = 0)#
plot_factors()#
plot_feature_timeseries(factor_n: int, feature_n, show_input: bool = True)#
plot_fingerprints(ls_nmf_r2: -1, ws_nmf_r2: -1)#
timeseries_plot(feature: str = None, feature_i: int = 0)#

Module contents#