dsiae

Data Space Inversion (DSI) Autoencoder (AE) emulator implementation.

AdaptiveLoss

Bases: LossBase

Adaptive loss that balances reconstruction and distribution terms dynamically.

Automatically adjusts the weighting between reconstruction and distribution preservation based on their relative magnitudes during training.
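The exact balancing rule is not described here. As a rough illustration only, one common scheme rescales the distribution term so that it contributes a fixed fraction of the reconstruction term. The helper `adaptive_total` and its `target_ratio` argument are hypothetical names, not part of dsiae.

```python
def adaptive_total(recon, dist, target_ratio=1.0, eps=1e-12):
    """Combine two loss terms so the weighted distribution term
    equals target_ratio * recon (an assumed balancing rule)."""
    # The weight is derived from the current magnitudes of both terms.
    weight = target_ratio * recon / (dist + eps)
    return recon + weight * dist, weight


# A distribution term 100x larger than the reconstruction term
# gets a weight of about 0.01, so both contribute equally.
total, weight = adaptive_total(recon=0.5, dist=50.0)
print(total, weight)
```

In a training loop the weight would normally be treated as a constant for the current step, so that gradients do not flow through the ratio itself.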

AutoEncoder

__init__(input_dim, latent_dim=2, hidden_dims=(128, 64), lr=0.001, activation='relu', loss='Huber', dropout_rate=0.0, random_state=0)

Initialize AutoEncoder.

Parameters

input_dim : int Input feature dimension. Required.
latent_dim : int, default 2 Latent space dimension.
hidden_dims : tuple, default (128, 64) Hidden layer sizes for the encoder (reversed for the decoder).
lr : float, default 0.001 Learning rate.
activation : str, default 'relu' Activation function name.
loss : str, default 'Huber' Loss function name.
dropout_rate : float, default 0.0 Dropout rate (0.0-1.0).
random_state : int, default 0 Random seed.
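The hidden_dims description says the encoder sizes are reversed for the decoder. A minimal sketch of the resulting layer widths, assuming a plain fully connected stack; `layer_widths` is a hypothetical helper used only for illustration.

```python
def layer_widths(input_dim, latent_dim=2, hidden_dims=(128, 64)):
    """Return (encoder, decoder) layer widths for a mirrored autoencoder."""
    # Encoder narrows from the input down to the latent space.
    encoder = [input_dim, *hidden_dims, latent_dim]
    # Decoder mirrors the encoder back up to the input dimension.
    decoder = encoder[::-1]
    return encoder, decoder


enc, dec = layer_widths(input_dim=500)
print(enc)  # [500, 128, 64, 2]
print(dec)  # [2, 64, 128, 500]
```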

decode(Z)

Decode latent representation back to input space.

Parameters

Z : np.ndarray Latent representation with shape (n_samples, latent_dim).

Returns

np.ndarray Reconstructed data with shape (n_samples, input_dim).

encode(X)

Encode input data to latent representation.

Parameters

X : np.ndarray, pd.DataFrame, or pd.Series Input data to encode to latent space.

Returns

np.ndarray Latent representation with shape (n_samples, latent_dim).

fit(X, X_val=None, epochs=100, batch_size=32, validation_split=0.1, early_stopping=True, patience=10, lr_schedule=None, verbose=2, sample_weight=None, validation_sample_weight=None)

Train the autoencoder.

Parameters

X : ndarray Training data. Required.
X_val : Optional[ndarray], default None Validation data (optional).
epochs : int, default 100 Maximum number of epochs.
batch_size : int, default 32 Batch size.
validation_split : float, default 0.1 Validation split fraction (used only if X_val is None).
early_stopping : bool, default True Enable early stopping.
patience : int, default 10 Early stopping patience.
lr_schedule : Optional[Any], default None Learning rate scheduler callback.
verbose : int, default 2 Verbosity level.
sample_weight : Optional[ndarray], default None Training sample weights.
validation_sample_weight : Optional[ndarray], default None Validation sample weights.

Returns

Any Training history.
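How early_stopping and patience interact is not spelled out above. The sketch below follows the usual convention (stop once the validation loss has failed to improve for `patience` consecutive epochs); whether fit uses exactly this rule is an assumption, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 0-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience was never exhausted: training runs to the last epoch.
    return len(val_losses) - 1


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.69]
stop = early_stopping_epoch(losses, patience=3)
print(stop)  # 5
```

With a short patience the late improvement at the final epoch is never seen, which is the usual trade-off between training time and the risk of stopping too soon.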

hyperparam_search

Perform grid search over autoencoder hyperparameters.

Systematically evaluates different combinations of latent dimensions, network architectures, and learning rates to find optimal configurations based on validation loss performance.

Parameters

X : np.ndarray Training data for hyperparameter optimization.

latent_dims : list of int, default [2, 3, 5] Latent space dimensions to evaluate.

hidden_dims_list : list of tuple, default [(64, 32), (128, 64)] Network architectures to test. Each tuple specifies hidden layer sizes.

lrs : list of float, default [1e-2, 1e-3] Learning rates to evaluate.

epochs : int, default 50 Training epochs for each configuration.

batch_size : int, default 32 Batch size for training.

random_state : int, default 42 Random seed for reproducible train/validation splits.

Returns

dict Mapping from (latent_dim, hidden_dims, lr) tuples to validation loss values. Lower values indicate better performance.

Notes

Uses 10% of data for validation via train_test_split. Each configuration is trained independently with early stopping disabled to ensure fair comparison across hyperparameter combinations.

Examples

results = AutoEncoder.hyperparam_search(X_train, epochs=100)
best_params = min(results.keys(), key=results.get)
print(f"Best configuration: {best_params}")
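Because the returned dict maps (latent_dim, hidden_dims, lr) tuples to validation losses, it can be ranked and unpacked with plain Python. The dictionary below is a hand-written stand-in in the documented format; its loss values are placeholders, not real results.

```python
# Placeholder results: (latent_dim, hidden_dims, lr) -> validation loss.
results = {
    (2, (64, 32), 1e-2): 0.41,
    (2, (128, 64), 1e-3): 0.35,
    (5, (128, 64), 1e-3): 0.22,
    (5, (64, 32), 1e-2): 0.30,
}

# Rank configurations from lowest to highest validation loss.
ranked = sorted(results.items(), key=lambda item: item[1])
best_params, best_loss = ranked[0]

# Unpack the winning key to reuse the settings when building a model.
latent_dim, hidden_dims, lr = best_params
print(latent_dim, hidden_dims, lr, best_loss)
```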

load(folder)

Load trained models from disk.

save(folder)

Save trained models to disk.

DSIAE

Bases: Emulator

Data Space Inversion Autoencoder (DSIAE) emulator.

__init__(pst=None, data=None, transforms=None, latent_dim=None, energy_threshold=1.0, verbose=False)

Initialize the DSIAE emulator.

Parameters

pst : Optional[Pst], default None PEST control file object.
data : Optional[Union[DataFrame, ObservationEnsemble]], default None Training data (DataFrame or ObservationEnsemble).
transforms : Optional[List[Dict[str, Any]]], default None List of dicts defining preprocessing transformations.
latent_dim : Optional[int], default None Latent space dimension. If None, determined from energy_threshold.
energy_threshold : float, default 1.0 Variance threshold for automatic latent dimension (0.0-1.0).
verbose : bool, default False Enable verbose logging.
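How latent_dim is derived from energy_threshold is not detailed here. A plausible reading, shown as a sketch only, is to keep the smallest number of principal components whose cumulative variance fraction reaches the threshold. The function `latent_dim_from_energy` is a hypothetical helper, and the SVD-based rule is an assumption.

```python
import numpy as np


def latent_dim_from_energy(X, energy_threshold=1.0):
    """Smallest number of principal components whose cumulative
    variance fraction reaches energy_threshold (assumed rule)."""
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    energy = np.cumsum(s**2) / np.sum(s**2)  # cumulative variance fraction
    # Small tolerance so a threshold of 1.0 is reachable despite round-off.
    k = int(np.searchsorted(energy, energy_threshold - 1e-12)) + 1
    return min(k, len(s))


# Synthetic data whose variance is dominated by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 1.0, 0.1, 0.1, 0.1])
k90 = latent_dim_from_energy(X, 0.9)
k_full = latent_dim_from_energy(X, 1.0)
print(k90, k_full)
```

Under this reading, the default threshold of 1.0 retains every component with non-negligible variance, so lowering it is what produces actual compression.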

check_for_pdc()

Check for prior-data conflict (PDC).

encode(X)

Encode input data into latent space representation.

This method transforms input observation data into the lower-dimensional latent space learned by the autoencoder. The encoding process applies any configured data transformations before passing the data through the encoder network.

Parameters

X : np.ndarray or pd.DataFrame Input observation data to encode. Should have the same feature structure as the training data. If DataFrame, the index will be preserved in the output. Shape should be (n_samples, n_features) where n_features matches the original observation space dimension.

Returns

pd.DataFrame Encoded latent space representation with shape (n_samples, latent_dim). If input was a DataFrame, the original index is preserved. Column names will be generated automatically for the latent dimensions.

Raises

ValueError If the encoder has not been fitted (emulator not trained), or if the input data shape is incompatible with the trained model.

Notes

This method automatically applies the same data transformations that were used during training, ensuring consistent preprocessing. The transformations are applied via the stored transformer_pipeline.

The latent space representation can be used for:

- Dimensionality reduction and visualization
- Parameter space exploration
- Input to optimization routines
- Analysis of model behavior in reduced space

Examples

Encode training data

latent_repr = emulator.encode(training_data)

Encode new observations

new_latent = emulator.encode(new_observations)
print(f"Latent dimensions: {new_latent.shape[1]}")

fit(validation_split=0.1, hidden_dims=(128, 64), lr=0.001, epochs=300, batch_size=128, early_stopping=True, dropout_rate=0.0, random_state=42, loss_type='energy', loss_kwargs=None, sample_weight=None)

Fit the autoencoder emulator to training data.

Parameters

validation_split : float, default 0.1 Fraction of data to use for validation.
hidden_dims : tuple, default (128, 64) Hidden layer dimensions for encoder/decoder.
lr : float, default 1e-3 Learning rate for Adam optimizer.
epochs : int, default 300 Maximum training epochs.
batch_size : int, default 128 Training batch size.
early_stopping : bool, default True Whether to use early stopping on validation loss.
dropout_rate : float, default 0.0 Dropout rate for regularization during training.
random_state : int, default 42 Random seed for reproducibility.
loss_type : str, default 'energy' Type of loss function to use. Options: 'energy', 'mmd', 'wasserstein', 'statistical', 'adaptive', 'mse', 'huber'.
loss_kwargs : dict, optional Additional parameters for the loss function.
sample_weight : np.ndarray, optional Sample weights for training. Shape should be (n_samples,).

Returns

DSIAE Self (fitted emulator instance).

hyperparam_search

Grid search over autoencoder hyperparameters.

Parameters

latent_dims : list of int, optional Latent dimensions to test. If None, uses latent_dim_mults.
latent_dim_mults : list of float, default [0.5, 1.0, 2.0] Multipliers applied to the current latent_dim if latent_dims is not provided.
hidden_dims_list : list of tuple, default [(64, 32), (128, 64)] Hidden layer architectures to test.
lrs : list of float, default [1e-2, 1e-3] Learning rates to test.
epochs : int, default 50 Training epochs for each configuration.
batch_size : int, default 32 Training batch size.
random_state : int, default 0 Random seed for reproducibility.

Returns

dict Mapping from (latent_dim, hidden_dims, lr) to validation loss.

load(filename) classmethod

Load the emulator from a file.

predict(pvals)

Generate predictions from the emulator.

Parameters

pvals : np.ndarray, pd.Series, or pd.DataFrame Parameter values for prediction in latent space. Shape should match latent_dim.

Returns

pd.Series Predicted observation values in original scale.

Raises

ValueError If emulator not fitted or input dimensions incorrect.

prepare_dsivc(decvar_names, t_d=None, pst=None, oe=None, track_stack=False, dsi_args=None, percentiles=[0.25, 0.75, 0.5], mou_population_size=None, ies_exe_path='pestpp-ies')

Prepare Data Space Inversion Variable Control (DSIVC) control files.

Parameters

decvar_names : list or str Names of decision variables for optimization.
t_d : str, optional Template directory path. Uses existing if None.
pst : Pst, optional PST control file object. Uses existing if None.
oe : ObservationEnsemble, optional Observation ensemble. Uses existing if None.
track_stack : bool, default False Whether to include individual ensemble realizations as observations.
dsi_args : dict, optional DSI configuration arguments.
percentiles : list, default [0.25, 0.75, 0.5] Percentiles to calculate from ensemble statistics.
mou_population_size : int, optional Population size for multi-objective optimization.
ies_exe_path : str, default "pestpp-ies" Path to PEST++ IES executable.

Returns

Pst PEST++ control file object for DSIVC optimization.

Notes

Sets up multi-objective optimization with decision variables constrained to training data bounds. Creates stack statistics observations for ensemble matching and configures PEST++-MOU options.

prepare_pestpp(t_d, pst=None, verbose=False, use_runstor=False)

Prepare PEST++ interface for DSIAE. Wraps base implementation.

save(filename)

Save the emulator to a file.

Bundles the pickled object and the TensorFlow model into a zip archive.
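The bundling idea can be illustrated with the standard library alone. This is a sketch of the general pattern, not the actual save implementation: the member names `emulator.pkl` and `model.bin`, and the `bundle`/`unbundle` helpers, are hypothetical, and raw bytes stand in for a serialized TensorFlow model.

```python
import io
import pickle
import zipfile


def bundle(obj, model_bytes):
    """Pack a pickled object and serialized model bytes into one zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("emulator.pkl", pickle.dumps(obj))
        zf.writestr("model.bin", model_bytes)
    return buffer.getvalue()


def unbundle(archive_bytes):
    """Recover the object and the model bytes from the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        obj = pickle.loads(zf.read("emulator.pkl"))
        model_bytes = zf.read("model.bin")
    return obj, model_bytes


state = {"latent_dim": 3, "energy_threshold": 1.0}
archive = bundle(state, b"serialized-model-placeholder")
restored_state, restored_model = unbundle(archive)
print(restored_state)
```

Keeping both parts in one archive means a single file is enough to restore the emulator, which is presumably why load takes just a filename.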

EnergyLoss

Bases: LossBase

Energy distance loss combining MSE reconstruction with energy distance.

The energy distance measures dissimilarity between probability distributions and helps ensure the reconstructed samples preserve the overall data distribution.
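For intuition, the energy distance between two samples can be estimated from mean pairwise Euclidean distances. The NumPy sketch below uses the simple V-statistic estimate and adds it to an MSE term with a `lambda_energy` weight; the function names are hypothetical and the actual EnergyLoss may differ in estimator and scaling.

```python
import numpy as np


def energy_distance(x, y):
    """V-statistic estimate of the squared energy distance
    between samples x and y (rows are samples)."""
    def mean_pairwise(a, b):
        # Mean Euclidean distance over all pairs of rows.
        diff = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()

    return 2.0 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)


def energy_loss_sketch(x, x_hat, lambda_energy=1e-3):
    """MSE reconstruction term plus a weighted energy distance term."""
    mse = np.mean((x - x_hat) ** 2)
    return float(mse + lambda_energy * energy_distance(x, x_hat))


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x + 2.0  # same shape of distribution, shifted mean

print(energy_distance(x, x), energy_distance(x, y))
```

The pairwise computation costs O(n^2) in the batch size, which is consistent with the description of this loss as computationally expensive.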

MMDLoss

Bases: LossBase

Maximum Mean Discrepancy loss for distribution matching.

MMD measures the distance between distributions in a reproducing kernel Hilbert space. It is more computationally efficient than energy distance.

StatisticalLoss

Bases: LossBase

Multi-component statistical loss for comprehensive distribution matching.

Combines reconstruction error with multiple statistical measures:

- Moment matching (mean, variance, skewness, kurtosis)
- Correlation structure preservation
- Optional distribution distance (MMD or Energy)
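The moment-matching component can be pictured as a penalty on per-feature differences in the first four moments. A minimal NumPy sketch, assuming a simple mean squared difference; `moments` and `moment_matching_sketch` are hypothetical helpers, and the actual weighting inside StatisticalLoss is not shown here.

```python
import numpy as np


def moments(a, eps=1e-12):
    """Per-feature mean, variance, skewness and kurtosis (rows are samples)."""
    mean = a.mean(axis=0)
    centered = a - mean
    var = np.mean(centered**2, axis=0)
    z = centered / (np.sqrt(var) + eps)  # standardized values
    skew = np.mean(z**3, axis=0)
    kurt = np.mean(z**4, axis=0)
    return np.stack([mean, var, skew, kurt])


def moment_matching_sketch(x, y):
    """Mean squared difference between the moment tables of x and y."""
    return float(np.mean((moments(x) - moments(y)) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(500, 4))
y = 2.0 * x + 1.0  # different mean and variance, same shape

print(moment_matching_sketch(x, x), moment_matching_sketch(x, y))
```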

WassersteinLoss

Bases: LossBase

Sliced Wasserstein distance loss for distribution matching.

Uses random projections to approximate the Wasserstein-1 distance, which is particularly effective for high-dimensional distributions.

correlation_loss(x, y)

Penalize differences in correlation structure between datasets.
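One simple way to express such a penalty is the mean squared difference between the two feature correlation matrices. The sketch below assumes that form; `correlation_loss_sketch` is a hypothetical stand-in and not the library function.

```python
import numpy as np


def correlation_loss_sketch(x, y):
    """Mean squared difference between feature correlation matrices."""
    cx = np.corrcoef(x, rowvar=False)  # columns are features
    cy = np.corrcoef(y, rowvar=False)
    return float(np.mean((cx - cy) ** 2))


rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
# Two strongly correlated features versus two independent ones.
x = np.hstack([z, z + 0.1 * rng.normal(size=(500, 1))])
y = rng.normal(size=(500, 2))

print(correlation_loss_sketch(x, x), correlation_loss_sketch(x, y))
```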

create_distribution_loss(loss_type='energy', **kwargs)

Factory function to create distribution-aware loss functions.

Parameters

loss_type : str Type of loss function to create:

- 'energy': EnergyLoss (default, robust but computationally expensive)
- 'mmd': MMDLoss (efficient, good for high-dim data)
- 'wasserstein': WassersteinLoss (good for smooth distributions)
- 'statistical': StatisticalLoss (comprehensive statistical matching)
- 'adaptive': AdaptiveLoss (automatically balances terms)
- 'mse': Standard MSE (no distribution matching)
- 'huber': Huber loss (robust to outliers, no distribution matching)

**kwargs : dict Additional parameters specific to each loss type.

Returns

tf.keras.losses.Loss Configured loss function

Examples

Energy loss with custom weighting

loss = create_distribution_loss('energy', lambda_energy=1e-3)

MMD loss with RBF kernel

loss = create_distribution_loss('mmd', lambda_mmd=1e-2, sigma=2.0)

Statistical loss with all components

loss = create_distribution_loss('statistical', lambda_moments=1e-2, lambda_corr=1e-3, lambda_dist=5e-3)

create_observation_weights(data, observed_values, critical_features, weight_type='inverse_distance', temperature=1.0, normalize=True, clip_range=(0.1, 10.0))

Create sample weights based on proximity to observed values.

Parameters

data : pd.DataFrame or np.ndarray Training data with shape (n_samples, n_features).
observed_values : list of float Target observed values at critical features.
critical_features : list of int Column indices of critical observation features.
weight_type : str, default 'inverse_distance' Type of weighting: 'inverse_distance', 'gaussian', 'exponential'.
temperature : float, default 1.0 Temperature parameter for weight decay (lower = sharper weighting).
normalize : bool, default True Whether to normalize weights to mean = 1.0.
clip_range : tuple, default (0.1, 10.0) Range to clip extreme weights (min, max).

Returns

np.ndarray Sample weights with shape (n_samples,)
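To make the options concrete, here is a hedged NumPy sketch of proximity-based weighting. The three decay formulas, the use of Euclidean distance, and the normalize-then-clip order are assumptions chosen for illustration; `observation_weights_sketch` is a hypothetical helper, not the library function.

```python
import numpy as np


def observation_weights_sketch(data, observed_values, critical_features,
                               weight_type="inverse_distance", temperature=1.0,
                               normalize=True, clip_range=(0.1, 10.0)):
    """Weight each sample by its closeness to the observed values."""
    data = np.asarray(data, dtype=float)
    target = np.asarray(observed_values, dtype=float)
    # Euclidean distance from each sample to the observed values,
    # measured only over the critical feature columns.
    dist = np.linalg.norm(data[:, critical_features] - target, axis=1)
    scaled = dist / temperature  # lower temperature = sharper decay
    if weight_type == "inverse_distance":
        w = 1.0 / (1.0 + scaled)
    elif weight_type == "gaussian":
        w = np.exp(-0.5 * scaled**2)
    elif weight_type == "exponential":
        w = np.exp(-scaled)
    else:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    if normalize:
        w = w / w.mean()  # mean weight of 1.0 before clipping
    if clip_range is not None:
        w = np.clip(w, clip_range[0], clip_range[1])
    return w


data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
w = observation_weights_sketch(data, observed_values=[0.0], critical_features=[0])
print(w)
```

Weights produced this way have the (n_samples,) shape that the sample_weight argument of fit expects.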

create_pest_observation_weights(pst, emulator_data, weight_scaling=1.0, **kwargs)

Create sample weights using PEST observation data.

Parameters

pst : Pst PEST control file object with observation data.
emulator_data : pd.DataFrame Training data for the emulator.
weight_scaling : float, default 1.0 Overall scaling factor for weights.
**kwargs Additional arguments passed to create_observation_weights.

Returns

np.ndarray Sample weights based on PEST observations

maximum_mean_discrepancy(x, y, kernel='rbf', sigma=1.0)

Compute Maximum Mean Discrepancy between two distributions.
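As an illustration of the quantity involved, the sketch below computes the biased (V-statistic) estimate of squared MMD with an RBF kernel in NumPy. The name `mmd_rbf` is hypothetical, and the library function may use a different estimator or bandwidth convention.

```python
import numpy as np


def mmd_rbf(x, y, sigma=1.0):
    """Biased estimate of squared MMD with an RBF kernel (rows are samples)."""
    def kernel(a, b):
        # Squared Euclidean distance between every pair of rows.
        sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma**2))

    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = x + 3.0  # shifted copy of the same sample

print(mmd_rbf(x, x), mmd_rbf(x, y))
```

The bandwidth sigma sets the length scale at which differences are detected, so it is usually chosen relative to the spread of the data.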

wasserstein_distance_sliced(x, y, num_projections=50)

Approximate Wasserstein-1 distance using sliced Wasserstein distance.
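The sliced approach projects both samples onto random directions, where the one-dimensional Wasserstein-1 distance reduces to comparing sorted values. A minimal NumPy sketch, assuming equal sample counts in x and y; `sliced_wasserstein_sketch` and its `seed` argument are hypothetical and not part of the library signature.

```python
import numpy as np


def sliced_wasserstein_sketch(x, y, num_projections=50, seed=0):
    """Average 1-D Wasserstein-1 distance over random projections.
    Assumes x and y contain the same number of samples (rows)."""
    rng = np.random.default_rng(seed)
    # Random unit directions, one per column.
    dirs = rng.normal(size=(x.shape[1], num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # Project, then sort each projection: for equal-size samples the
    # 1-D Wasserstein-1 distance is the mean gap between sorted values.
    px = np.sort(x @ dirs, axis=0)
    py = np.sort(y @ dirs, axis=0)
    return float(np.mean(np.abs(px - py)))


rng = np.random.default_rng(1)
x = rng.normal(size=(300, 3))
y = x + 2.0  # shifted copy of the same sample

print(sliced_wasserstein_sketch(x, x), sliced_wasserstein_sketch(x, y))
```

Each projection costs only a sort, so the total cost grows as O(num_projections * n log n) rather than quadratically in the sample size.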