eval package#
Submodules#
eval.factor_comparison module#
- class eval.factor_comparison.FactorCompare(input_df: DataFrame, uncertainty_df: DataFrame, base_profile_df: DataFrame, base_contribution_df: DataFrame, batch_sa: BatchSA, sa_output_file: str | None = None, method: str = 'all')#
Bases:
object
Compare a single base solution against a collection of solutions. Used to compare the output of ESAT with the output generated by PMF5, and used by the Simulator to compare the output of models run on synthetic data with the known synthetic profiles.
#TODO: Factor compare for models with a different number of factors
- Parameters:
input_df (pd.DataFrame) – The input dataset dataframe.
uncertainty_df (pd.DataFrame) – The uncertainty dataset dataframe.
base_profile_df (pd.DataFrame) – The base profile (H) dataframe.
base_contribution_df (pd.DataFrame) – The base contribution (W) dataframe.
factors_columns (list) – The list of factor names.
features (list) – The list of feature names.
batch_sa (BatchSA) – A completed instance of BatchSA whose models will be evaluated against the base model.
sa_output_file (str) – The path to a completed BatchSA save file.
method (str) – The selection method for the best mapping: correlation of ‘W’, ‘H’, ‘WH’, or ‘all’.
- static calculate_correlation(factor1, factor2)#
Calculates the correlation between two factors.
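The source does not state which correlation measure calculate_correlation uses. The following is a minimal sketch, assuming a Pearson correlation between two equal-length factor vectors. The helper name pearson_r and the sample values are hypothetical and are not part of ESAT.

```python
import math


def pearson_r(factor1, factor2):
    """Pearson correlation of two equal-length numeric sequences (illustrative only)."""
    if len(factor1) != len(factor2):
        raise ValueError("factors must have the same length")
    n = len(factor1)
    mean1 = sum(factor1) / n
    mean2 = sum(factor2) / n
    # Covariance and variances are left unnormalized because the 1/n terms cancel.
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(factor1, factor2))
    var1 = sum((a - mean1) ** 2 for a in factor1)
    var2 = sum((b - mean2) ** 2 for b in factor2)
    if var1 == 0 or var2 == 0:
        # A constant vector has no defined correlation; returning 0.0 is a choice made here.
        return 0.0
    return cov / math.sqrt(var1 * var2)


base_profile = [0.1, 0.4, 0.2, 0.9]
model_profile = [0.2, 0.8, 0.4, 1.8]  # exactly twice the base profile
r = pearson_r(base_profile, model_profile)
r_squared = r ** 2
```

Because the second profile is a scaled copy of the first, the correlation here is 1 up to floating-point rounding. Correlation ignores scale, which suits comparing factor profiles whose magnitudes differ between solutions.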
- combine_factors(factors, model_correlation, model_contributions, factor_contributions, base_k: bool = False)#
Combine the results from parallelized calculations.
- compare(verbose: bool = True)#
Run the model comparison.
- Parameters:
verbose (bool) – Display the results of the comparison.
- static load_pmf_output(factors: int, input_df: DataFrame, uncertainty_df: DataFrame, pmf_profile_file: str, pmf_contribution_file: str, batch_sa: BatchSA)#
Load the output of a completed PMF5 base model, specifying the profile and contribution files.
- Parameters:
factors (int) – The number of factors used in the PMF5 models.
input_df (pd.DataFrame) – The input dataset dataframe.
uncertainty_df (pd.DataFrame) – The uncertainty dataset dataframe.
pmf_profile_file (str) – The path to the PMF5 output factor profile file.
pmf_contribution_file (str) – The path to the PMF5 output factor contribution file.
batch_sa (BatchSA) – The completed BatchSA instance which will be compared to the PMF5 output.
- Returns:
An initialized instance of FactorCompare.
- Return type:
FactorCompare
- print_results(model: int | None = None)#
Print the results of the comparison, defaulting to the model with the highest correlation mapping unless model is specified.
- Parameters:
model (int) – The model index for printing the results of a specific model.
eval.simulator module#
- class eval.simulator.Simulator(seed: int, factors_n: int, features_n: int, samples_n: int, outliers: bool = True, outlier_p: float = 0.1, outlier_mag: float = 2.0, contribution_max: int = 10, noise_mean_min: float = 0.1, noise_mean_max: float = 0.12, noise_scale: float = 0.02, uncertainty_mean_min: float = 0.05, uncertainty_mean_max: float = 0.05, uncertainty_scale: float = 0.01, verbose: bool = True)#
Bases:
object
The ESAT Simulator provides methods for generating customized synthetic source profiles and datasets. These synthetic datasets can then be passed to SA or BatchSA instances. The results of those model runs can be evaluated against the known synthetic profiles using the Simulator compare function. A visualization of the comparison is available with the plot_comparison function.
The synthetic profile matrix (H) is generated from a uniform distribution on [0.0, 1.0). The synthetic contribution matrix (W) is generated from a uniform distribution on [0.0, 1.0), multiplied by contribution_max. The synthetic dataset is the matrix product WH with noise and outliers added. Noise is drawn from a normal distribution and scaled by the dataset. Outliers are applied at random to the fraction of dataset elements given by outlier_p, multiplying each selected value by outlier_mag. Uncertainty is drawn from a normal distribution and scaled by the dataset.
#TODO: Looper, batch simulator mode
- Parameters:
seed (int) – The seed for the random number generator.
factors_n (int) – The number of synthetic factors to generate.
features_n (int) – The number of synthetic features in the dataset.
samples_n (int) – The number of synthetic samples in the dataset.
outliers (bool) – Include outliers in the synthetic dataset.
outlier_p (float) – The decimal percentage of outliers in the dataset.
outlier_mag (float) – The magnitude of the outliers on the dataset elements.
contribution_max (int) – The maximum value in the synthetic contribution matrix (W).
noise_mean_min (float) – The minimum value for the randomly selected mean decimal percentage of the synthetic dataset for noise, by feature.
noise_mean_max (float) – The maximum value for the randomly selected mean decimal percentage of the synthetic dataset for noise, by feature.
noise_scale (float) – The scale of the normal distribution for the noise, standard deviation of the distribution.
uncertainty_mean_min (float) – The minimum value for the randomly selected mean decimal percentage of the uncertainty of the synthetic dataset, by feature.
uncertainty_mean_max (float) – The maximum value for the randomly selected mean decimal percentage of the uncertainty of the synthetic dataset, by feature.
uncertainty_scale (float) – The scale of the normal distribution for the uncertainty, standard deviation of the distribution.
verbose (bool) – Turn on verbosity for added logging.
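The generation procedure described for the Simulator can be sketched with NumPy. This is an approximation under stated assumptions, not the ESAT implementation: the variable names mirror the constructor parameters, the noise is assumed to be additive, and the absolute value applied to the uncertainty is a choice made here to keep it non-negative.

```python
import numpy as np

# Illustrative values; the names mirror the Simulator parameters.
seed, factors_n, features_n, samples_n = 42, 3, 8, 100
contribution_max = 10
outlier_p, outlier_mag = 0.1, 2.0
noise_mean_min, noise_mean_max, noise_scale = 0.1, 0.12, 0.02
unc_mean_min, unc_mean_max, unc_scale = 0.05, 0.05, 0.01

rng = np.random.default_rng(seed)

# H: profiles, uniform on [0.0, 1.0). W: contributions, uniform on [0.0, 1.0) * contribution_max.
H = rng.uniform(0.0, 1.0, size=(factors_n, features_n))
W = rng.uniform(0.0, 1.0, size=(samples_n, factors_n)) * contribution_max
base = W @ H  # shape (samples_n, features_n)

# Noise: one mean fraction per feature, then normal draws scaled by the data.
# ASSUMPTION: the noise is added to the data; the text does not give the exact form.
noise_mean = rng.uniform(noise_mean_min, noise_mean_max, size=features_n)
noise = base * rng.normal(noise_mean, noise_scale, size=base.shape)
data = base + noise

# Outliers: multiply a random fraction (about outlier_p) of the elements by outlier_mag.
outlier_mask = rng.random(base.shape) < outlier_p
data[outlier_mask] *= outlier_mag

# Uncertainty: same pattern as the noise.
# ASSUMPTION: the absolute value keeps the uncertainty non-negative.
unc_mean = rng.uniform(unc_mean_min, unc_mean_max, size=features_n)
uncertainty = np.abs(data * rng.normal(unc_mean, unc_scale, size=data.shape))
```

Fixing the seed makes the synthetic dataset reproducible, which matters when the same data is fed to several models for comparison.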
- compare(batch_sa: BatchSA, selected_model: int | None = None)#
Run the profile comparison, evaluating the results of each model in the BatchSA instance. The results for every model are available in simulator.factor_compare.model_results.
The model with the highest average R squared value for the factor mapping is defined as the best_model. This can differ from the most optimal model, which is the model with the lowest loss value. If they differ, the best mapping for the most optimal model is also provided.
The mapping details for a specific model can also be retrieved by passing its index as the selected_model parameter. This requires that compare has already been completed on the instance.
- Parameters:
batch_sa (BatchSA) – Completed instance of BatchSA to compare the output models to the known synthetic profiles.
selected_model (int) – If specified, displays the best mapping for the specified model.
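Choosing a best mapping by average R squared can be illustrated with a brute-force search. This is a sketch under assumptions, not the ESAT algorithm: it takes a precomputed matrix of R squared values, where r2[i][j] compares known factor i with model factor j, and tries every one-to-one assignment. The function name best_mapping and the values are hypothetical.

```python
from itertools import permutations


def best_mapping(r2):
    """Return (mapping, mean_r2), where mapping[i] is the model factor matched to known factor i."""
    k = len(r2)
    best_perm, best_score = None, float("-inf")
    # Exhaustive search over all k! one-to-one assignments; practical only for small k.
    for perm in permutations(range(k)):
        score = sum(r2[i][perm[i]] for i in range(k)) / k
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score


# Hypothetical R squared values for a 3-factor model compared with 3 known profiles.
r2 = [
    [0.10, 0.95, 0.20],
    [0.90, 0.15, 0.30],
    [0.25, 0.05, 0.85],
]
mapping, mean_r2 = best_mapping(r2)
```

Here the search matches known factors 0, 1, 2 to model factors 1, 0, 2. Repeating the search for every model in a batch and keeping the model with the highest mean gives a selection in the spirit of best_model as described above.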
- generate_profiles(profiles: ndarray | None = None)#
Generate the synthetic profiles. Runs on Simulator initialization, but a custom profile matrix can be passed in to use in place of the randomly generated synthetic profiles.
- Parameters:
profiles (np.ndarray) – A custom profile matrix to be used in place of the random synthetic profile. The matrix must have shape (factors_n, features_n).
- get_data()#
Get the synthetic data and uncertainty dataframes to use with the DataHandler.
- Return type:
pd.DataFrame, pd.DataFrame
- static load(file_path: str)#
Load a previously saved ESAT Simulator pickle file.
- Parameters:
file_path (str) – File path to a previously saved ESAT Simulator pickle file.
- Returns:
The previously saved Simulator object on a successful load, otherwise None.
- Return type:
Simulator | None
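The load-or-None behaviour described for load can be sketched with the standard pickle module. This is an illustration only: the helper load_pickle_or_none, the file name, and the stand-in dictionary are hypothetical, and the exceptions that ESAT actually handles are not given in the source.

```python
import os
import pickle
import tempfile


def load_pickle_or_none(file_path):
    """Return the unpickled object, or None if the file is missing or unreadable."""
    try:
        with open(file_path, "rb") as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        return None


# Round trip with a stand-in object in place of a Simulator instance.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "synthetic.pkl")
    with open(path, "wb") as f:
        pickle.dump({"seed": 42, "factors_n": 3}, f)
    restored = load_pickle_or_none(path)
    missing = load_pickle_or_none(os.path.join(tmp_dir, "does_not_exist.pkl"))
```

Callers of a function with this contract should check for None before using the result. Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be opened this way.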
- plot_comparison(model_i: int | None = None)#
Plot the results of the output comparison for the model with the highest correlated mapping, if model_i is not specified. Otherwise, plots the output comparison of model_i to the synthetic profiles.
- Parameters:
model_i (int) – The model index for the comparison. When not specified, defaults to the model with the highest correlation mapping.
- plot_profile_comparison(model_i: int | None = None)#
Plot the results of the output comparison for the model with the highest correlated mapping, if model_i is not specified. Otherwise, plots the output comparison of model_i to the synthetic profiles.
- Parameters:
model_i (int) – The model index for the comparison. When not specified, defaults to the model with the highest correlation mapping.
- plot_synthetic_contributions()#
Plot the factor contribution matrix.
- save(sim_name: str = 'synthetic', output_directory: str = '.')#
Save the generated synthetic data and uncertainty datasets, and the simulator instance binary.
- Parameters:
sim_name (str) – The name for the data and uncertainty dataset files.
output_directory (str) – The path to the directory where the files will be saved.
- Returns:
True if save is successful, otherwise False.
- Return type:
bool
- update_contribution(factor_i: int, curve_type: str, scale: float = 0.1, frequency: float = 0.5, maximum: float = 1.0, minimum: float = 0.1)#
Update the contributions for a specific factor to follow the curve type. The values are randomly sampled from a normal distribution around the curve type as defined by the magnitude, frequency and/or slope values. The input and uncertainty data are recalculated with each update.
- Parameters:
factor_i (int) – The factor contribution to update, by index.
curve_type (str) – The type of curve that describes the factor contribution. Options include: ‘uniform’, ‘increasing’, ‘decreasing’, ‘logistic’, ‘periodic’. Default: uniform.
scale (float) – The scale of the normal distribution of the curve value to be resampled.
frequency (float) – The frequency of slope change in periodic and logistic curves.
maximum (float) – The maximum value for all curves.
minimum (float) – The minimum value for all curves.
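The idea of resampling a contribution around a curve can be sketched in plain Python. The curve formulas below are assumptions made for illustration; the source does not define them. In particular, the logistic steepness, the reading of frequency as cycles across the sample range, and the clamping to [minimum, maximum] are all hypothetical, as is the function name curve_values.

```python
import math
import random


def curve_values(curve_type, samples_n, scale=0.1, frequency=0.5,
                 maximum=1.0, minimum=0.1, seed=0):
    """Sample one value per sample index around a simple curve (illustrative only)."""
    rng = random.Random(seed)
    span = maximum - minimum
    values = []
    for i in range(samples_n):
        t = i / max(samples_n - 1, 1)  # position of the sample in [0, 1]
        if curve_type == "uniform":
            center = rng.uniform(minimum, maximum)
        elif curve_type == "increasing":
            center = minimum + span * t
        elif curve_type == "decreasing":
            center = maximum - span * t
        elif curve_type == "logistic":
            # ASSUMPTION: frequency controls the steepness of the S-curve.
            center = minimum + span / (1.0 + math.exp(-20.0 * frequency * (t - 0.5)))
        elif curve_type == "periodic":
            # ASSUMPTION: frequency is the number of sine cycles across the samples.
            center = minimum + span * 0.5 * (1.0 + math.sin(2.0 * math.pi * frequency * t))
        else:
            raise ValueError(f"unknown curve_type: {curve_type!r}")
        # Draw around the curve, then clamp to the allowed range (an assumption).
        value = rng.gauss(center, scale * span)
        values.append(min(max(value, minimum), maximum))
    return values


increasing = curve_values("increasing", 50, scale=0.0)
periodic = curve_values("periodic", 50)
```

With scale set to 0.0 the samples fall exactly on the curve, which is a convenient way to inspect a curve shape before adding spread.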