dsiae
Data Space Inversion (DSI) Autoencoder (AE) emulator implementation.
AdaptiveLoss
Bases: LossBase
Adaptive loss that balances reconstruction and distribution terms dynamically.
Automatically adjusts the weighting between reconstruction and distribution preservation based on their relative magnitudes during training.
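The exact update rule is not documented here. As a rough illustration of balancing by relative magnitude, the sketch below picks a weight so that the distribution term contributes a fixed fraction of the reconstruction term. The names `adaptive_weight`, `combined_loss`, and `target_ratio` are hypothetical and are not part of this module.

```python
def adaptive_weight(recon_loss, dist_loss, target_ratio=0.1, eps=1e-12):
    """Weight that makes lambda * dist_loss roughly target_ratio * recon_loss.

    Hypothetical helper: the real AdaptiveLoss may use a different rule.
    """
    return target_ratio * recon_loss / (dist_loss + eps)


def combined_loss(recon_loss, dist_loss, target_ratio=0.1):
    """Reconstruction term plus the rescaled distribution term."""
    lam = adaptive_weight(recon_loss, dist_loss, target_ratio)
    return recon_loss + lam * dist_loss
```

With this rule, a distribution term that is numerically much larger than the reconstruction error is scaled down rather than dominating the gradient.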
AutoEncoder
__init__(input_dim, latent_dim=2, hidden_dims=(128, 64), lr=0.001, activation='relu', loss='Huber', dropout_rate=0.0, random_state=0)
Initialize AutoEncoder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_dim | int | Input feature dimension. | required |
| latent_dim | int | Latent space dimension. | 2 |
| hidden_dims | tuple | Tuple of hidden layer sizes for encoder (reversed for decoder). | (128, 64) |
| lr | float | Learning rate. | 0.001 |
| activation | str | Activation function name. | 'relu' |
| loss | str | Loss function name. | 'Huber' |
| dropout_rate | float | Dropout rate (0.0-1.0). | 0.0 |
| random_state | int | Random seed. | 0 |
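To make "reversed for decoder" concrete, the sketch below lists the layer widths implied by a given `hidden_dims`. `layer_sizes` is a hypothetical helper, not part of the class, and it assumes the decoder mirrors the encoder exactly.

```python
def layer_sizes(input_dim, latent_dim=2, hidden_dims=(128, 64)):
    """Return (encoder_widths, decoder_widths) for a mirrored autoencoder."""
    encoder = [input_dim, *hidden_dims, latent_dim]
    decoder = encoder[::-1]  # same widths, traversed in reverse
    return encoder, decoder


encoder, decoder = layer_sizes(500)
print(encoder)
print(decoder)
```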
decode(Z)
Decode latent representation back to input space.
Parameters
Z : np.ndarray Latent representation with shape (n_samples, latent_dim).
Returns
np.ndarray Reconstructed data with shape (n_samples, input_dim).
encode(X)
Encode input data to latent representation.
Parameters
X : np.ndarray, pd.DataFrame, or pd.Series Input data to encode to latent space.
Returns
np.ndarray Latent representation with shape (n_samples, latent_dim).
fit(X, X_val=None, epochs=100, batch_size=32, validation_split=0.1, early_stopping=True, patience=10, lr_schedule=None, verbose=2, sample_weight=None, validation_sample_weight=None)
Train the autoencoder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Training data. | required |
| X_val | Optional[ndarray] | Validation data (optional). | None |
| epochs | int | Max epochs. | 100 |
| batch_size | int | Batch size. | 32 |
| validation_split | float | Validation split fraction (if X_val is None). | 0.1 |
| early_stopping | bool | Enable early stopping. | True |
| patience | int | Early stopping patience. | 10 |
| lr_schedule | Optional[Any] | Learning rate scheduler callback. | None |
| verbose | int | Verbosity level. | 2 |
| sample_weight | Optional[ndarray] | Training sample weights. | None |
| validation_sample_weight | Optional[ndarray] | Validation sample weights. | None |
Returns:
| Type | Description |
|---|---|
| Any | Training history. |
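The interaction of `early_stopping` and `patience` can be pictured with the plain-Python sketch below. It follows the common convention of stopping once the validation loss has failed to improve for `patience` consecutive epochs; that `fit` behaves exactly this way is an assumption, and `early_stopping_epoch` is a hypothetical name.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop.

    Assumes the usual rule: stop after `patience` epochs in a row
    without a new best validation loss.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # new best: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: ran all epochs
```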
hyperparam_search(X, latent_dims=[2, 3, 5], hidden_dims_list=[(64, 32), (128, 64)], lrs=[0.01, 0.001], epochs=50, batch_size=32, random_state=42)
staticmethod
Perform grid search over autoencoder hyperparameters.
Systematically evaluates different combinations of latent dimensions, network architectures, and learning rates to find optimal configurations based on validation loss performance.
Parameters
X : np.ndarray Training data for hyperparameter optimization.
latent_dims : list of int, default [2, 3, 5]
Latent space dimensions to evaluate.
hidden_dims_list : list of tuple, default [(64, 32), (128, 64)]
Network architectures to test. Each tuple specifies hidden layer sizes.
lrs : list of float, default [1e-2, 1e-3]
Learning rates to evaluate.
epochs : int, default 50
Training epochs for each configuration.
batch_size : int, default 32
Batch size for training.
random_state : int, default 42
Random seed for reproducible train/validation splits.
Returns
dict Mapping from (latent_dim, hidden_dims, lr) tuples to validation loss values. Lower values indicate better performance.
Notes
Uses 10% of data for validation via train_test_split. Each configuration is trained independently with early stopping disabled to ensure fair comparison across hyperparameter combinations.
Examples
results = AutoEncoder.hyperparam_search(X_train, epochs=100)
best_params = min(results.keys(), key=results.get)
print(f"Best configuration: {best_params}")
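The shape of the returned dictionary can be reproduced without training anything. The sketch below enumerates the same default grid with `itertools.product` and uses a stand-in scoring function in place of the validation loss. `grid_search` and `toy_evaluate` are hypothetical names for illustration only.

```python
from itertools import product


def grid_search(evaluate, latent_dims=(2, 3, 5),
                hidden_dims_list=((64, 32), (128, 64)),
                lrs=(1e-2, 1e-3)):
    """Map each (latent_dim, hidden_dims, lr) combination to a score."""
    results = {}
    for latent_dim, hidden_dims, lr in product(latent_dims, hidden_dims_list, lrs):
        results[(latent_dim, hidden_dims, lr)] = evaluate(latent_dim, hidden_dims, lr)
    return results


def toy_evaluate(latent_dim, hidden_dims, lr):
    # Stand-in for "train the model and return its validation loss".
    return abs(latent_dim - 3) + abs(lr - 1e-3) + 1.0 / sum(hidden_dims)


results = grid_search(toy_evaluate)
best_params = min(results, key=results.get)
```

The default grid has 3 x 2 x 2 = 12 configurations, so the cost grows multiplicatively with each list that is extended.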
load(folder)
Load trained models from disk.
save(folder)
Save trained models to disk.
DSIAE
Bases: Emulator
Data Space Inversion Autoencoder (DSIAE) emulator.
__init__(pst=None, data=None, transforms=None, latent_dim=None, energy_threshold=1.0, verbose=False)
Initialize the DSIAE emulator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pst | Optional[Pst] | PEST control file object. | None |
| data | Optional[Union[DataFrame, ObservationEnsemble]] | Training data (DataFrame or ObservationEnsemble). | None |
| transforms | Optional[List[Dict[str, Any]]] | List of dicts defining preprocessing transformations. | None |
| latent_dim | Optional[int] | Latent space dimension. If None, determined from energy_threshold. | None |
| energy_threshold | float | Variance threshold for automatic latent dimension (0.0-1.0). | 1.0 |
| verbose | bool | Enable verbose logging. | False |
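How `energy_threshold` is turned into a latent dimension is not spelled out here. One plausible reading, shown as a sketch only, is a PCA-style rule: keep the smallest number of components whose cumulative share of variance reaches the threshold. The function name `latent_dim_from_energy` and the use of an SVD are assumptions, not the documented implementation.

```python
import numpy as np


def latent_dim_from_energy(X, energy_threshold=1.0):
    """Smallest number of components whose cumulative variance share
    reaches `energy_threshold` (assumed PCA-style rule)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)  # cumulative variance share
    # Small tolerance so a threshold of 1.0 is reachable despite round-off.
    k = int(np.searchsorted(energy, energy_threshold - 1e-12)) + 1
    return min(k, len(s))


# Two orthogonal directions carrying 90% and 10% of the variance.
X = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

Under this reading, the default of 1.0 keeps every component with non-zero variance, so lowering the threshold is what yields a compressed latent space.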
check_for_pdc()
Check for prior-data conflict (PDC).
encode(X)
Encode input data into latent space representation.
This method transforms input observation data into the lower-dimensional latent space learned by the autoencoder. The encoding process applies any configured data transformations before passing the data through the encoder network.
Parameters
X : np.ndarray or pd.DataFrame Input observation data to encode. Should have the same feature structure as the training data. If DataFrame, the index will be preserved in the output. Shape should be (n_samples, n_features) where n_features matches the original observation space dimension.
Returns
pd.DataFrame Encoded latent space representation with shape (n_samples, latent_dim). If input was a DataFrame, the original index is preserved. Column names will be generated automatically for the latent dimensions.
Raises
ValueError If the encoder has not been fitted (emulator not trained), or if the input data shape is incompatible with the trained model.
Notes
This method automatically applies the same data transformations that were
used during training, ensuring consistent preprocessing. The transformations
are applied via the stored transformer_pipeline.
The latent space representation can be used for:
- Dimensionality reduction and visualization
- Parameter space exploration
- Input to optimization routines
- Analysis of model behavior in reduced space
Examples
Encode training data
latent_repr = emulator.encode(training_data)
Encode new observations
new_latent = emulator.encode(new_observations)
print(f"Latent dimensions: {new_latent.shape[1]}")
fit(validation_split=0.1, hidden_dims=(128, 64), lr=0.001, epochs=300, batch_size=128, early_stopping=True, dropout_rate=0.0, random_state=42, loss_type='energy', loss_kwargs=None, sample_weight=None)
Fit the autoencoder emulator to training data.
Parameters
validation_split : float, default 0.1
Fraction of data to use for validation.
hidden_dims : tuple, default (128, 64)
Hidden layer dimensions for encoder/decoder.
lr : float, default 1e-3
Learning rate for Adam optimizer.
epochs : int, default 300
Maximum training epochs.
batch_size : int, default 128
Training batch size.
early_stopping : bool, default True
Whether to use early stopping on validation loss.
dropout_rate : float, default 0.0
Dropout rate for regularization during training.
random_state : int, default 42
Random seed for reproducibility.
loss_type : str, default 'energy'
Type of loss function to use. Options: 'energy', 'mmd', 'wasserstein', 'statistical', 'adaptive', 'mse', 'huber'.
loss_kwargs : dict, optional
Additional parameters for the loss function.
sample_weight : np.ndarray, optional
Sample weights for training. Shape should be (n_samples,).
Returns
DSIAE Self (fitted emulator instance).
hyperparam_search(latent_dims=None, latent_dim_mults=[0.5, 1.0, 2.0], hidden_dims_list=[(64, 32), (128, 64)], lrs=[0.01, 0.001], epochs=50, batch_size=32, random_state=0)
Grid search over autoencoder hyperparameters.
Parameters
latent_dims : list of int, optional
Latent dimensions to test. If None, uses latent_dim_mults.
latent_dim_mults : list of float, default [0.5, 1.0, 2.0]
Multipliers for current latent_dim if latent_dims not provided.
hidden_dims_list : list of tuple, default [(64, 32), (128, 64)]
Hidden layer architectures to test.
lrs : list of float, default [1e-2, 1e-3]
Learning rates to test.
epochs : int, default 50
Training epochs for each configuration.
batch_size : int, default 32
Training batch size.
random_state : int, default 0
Random seed for reproducibility.
Returns
dict Mapping from (latent_dim, hidden_dims, lr) to validation loss.
load(filename)
classmethod
Load the emulator from a file.
predict(pvals)
Generate predictions from the emulator.
Parameters
pvals : np.ndarray, pd.Series, or pd.DataFrame Parameter values for prediction in latent space. Shape should match latent_dim.
Returns
pd.Series Predicted observation values in original scale.
Raises
ValueError If emulator not fitted or input dimensions incorrect.
prepare_dsivc(decvar_names, t_d=None, pst=None, oe=None, track_stack=False, dsi_args=None, percentiles=[0.25, 0.75, 0.5], mou_population_size=None, ies_exe_path='pestpp-ies')
Prepare Data Space Inversion Variable Control (DSIVC) control files.
Parameters
decvar_names : list or str
Names of decision variables for optimization.
t_d : str, optional
Template directory path. Uses existing if None.
pst : Pst, optional
PST control file object. Uses existing if None.
oe : ObservationEnsemble, optional
Observation ensemble. Uses existing if None.
track_stack : bool, default False
Whether to include individual ensemble realizations as observations.
dsi_args : dict, optional
DSI configuration arguments.
percentiles : list, default [0.25, 0.75, 0.5]
Percentiles to calculate from ensemble statistics.
mou_population_size : int, optional
Population size for multi-objective optimization.
ies_exe_path : str, default "pestpp-ies"
Path to PEST++ IES executable.
Returns
Pst PEST++ control file object for DSIVC optimization.
Notes
Sets up multi-objective optimization with decision variables constrained to training data bounds. Creates stack statistics observations for ensemble matching and configures PEST++-MOU options.
prepare_pestpp(t_d, pst=None, verbose=False, use_runstor=False)
Prepare PEST++ interface for DSIAE. Wraps base implementation.
save(filename)
Save the emulator to a file.
Bundles the pickled object and the TensorFlow model into a zip archive.
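The bundling idea can be sketched with the standard library alone. The archive entry names (`emulator.pkl`, `model.bin`) and the helpers `save_bundle` and `load_bundle` are hypothetical; the real archive layout is not documented here, and a byte string stands in for the serialized TensorFlow model.

```python
import io
import pickle
import zipfile


def save_bundle(target, state, model_bytes):
    """Write pickled state and serialized model bytes into one zip archive."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("emulator.pkl", pickle.dumps(state))  # hypothetical entry name
        zf.writestr("model.bin", model_bytes)             # hypothetical entry name


def load_bundle(source):
    """Read both entries back from the archive."""
    with zipfile.ZipFile(source) as zf:
        state = pickle.loads(zf.read("emulator.pkl"))
        model_bytes = zf.read("model.bin")
    return state, model_bytes


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
save_bundle(buffer, {"latent_dim": 3}, b"fake-weights")
buffer.seek(0)
state, model_bytes = load_bundle(buffer)
```

Because the archive contains a pickle, it should only be loaded from a trusted source.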
EnergyLoss
Bases: LossBase
Energy distance loss combining MSE reconstruction with energy distance.
The energy distance measures dissimilarity between probability distributions and helps ensure the reconstructed samples preserve the overall data distribution.
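For reference, a NumPy sketch of the standard sample estimate, 2 E|X - Y| - E|X - X'| - E|Y - Y'|, combined with an MSE term. It is not the class's TensorFlow implementation. The weight name `lambda_energy` is taken from the factory example in this module, while `pairwise_dist`, `energy_distance`, and `energy_loss` are hypothetical names.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_distance(x, y):
    """Sample estimate of 2 E|X-Y| - E|X-X'| - E|Y-Y'| (V-statistic form)."""
    return (2.0 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())


def energy_loss(x, x_hat, lambda_energy=1e-3):
    """MSE reconstruction error plus a weighted energy-distance term."""
    mse = np.mean((x - x_hat) ** 2)
    return mse + lambda_energy * energy_distance(x, x_hat)
```

The pairwise distance matrices scale quadratically with batch size, which is why this loss is described elsewhere on this page as computationally expensive.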
MMDLoss
Bases: LossBase
Maximum Mean Discrepancy loss for distribution matching.
MMD measures the distance between distributions in a reproducing kernel Hilbert space. More computationally efficient than energy distance.
StatisticalLoss
Bases: LossBase
Multi-component statistical loss for comprehensive distribution matching.
Combines reconstruction error with multiple statistical measures:
- Moment matching (mean, variance, skewness, kurtosis)
- Correlation structure preservation
- Optional distribution distance (MMD or Energy)
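The moment-matching component can be illustrated as below. This is a sketch under the assumption that per-feature moments are compared with a squared difference; the actual weighting and normalization inside `StatisticalLoss` are not documented here, and `moments` and `moment_loss` are hypothetical names.

```python
import numpy as np


def moments(a, eps=1e-8):
    """Per-feature mean, variance, skewness, and kurtosis, stacked row-wise."""
    mean = a.mean(axis=0)
    centered = a - mean
    var = (centered ** 2).mean(axis=0)
    z = centered / (np.sqrt(var) + eps)  # standardized values
    skew = (z ** 3).mean(axis=0)
    kurt = (z ** 4).mean(axis=0)
    return np.stack([mean, var, skew, kurt])


def moment_loss(x, y):
    """Mean squared difference between the first four moments of x and y."""
    return float(np.mean((moments(x) - moments(y)) ** 2))
```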
WassersteinLoss
Bases: LossBase
Sliced Wasserstein distance loss for distribution matching.
Uses random projections to approximate the Wasserstein-1 distance, which is particularly effective for high-dimensional distributions.
correlation_loss(x, y)
Penalize differences in correlation structure between datasets.
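A minimal NumPy sketch of one way to do this: compare the feature-by-feature Pearson correlation matrices of the two datasets. Whether the library uses a mean squared difference or another norm is an assumption, and `correlation_loss_sketch` is a hypothetical name chosen to avoid implying it matches the real function.

```python
import numpy as np


def correlation_loss_sketch(x, y):
    """Mean squared difference between the correlation matrices of x and y."""
    cx = np.corrcoef(x, rowvar=False)  # columns are features
    cy = np.corrcoef(y, rowvar=False)
    return float(np.mean((cx - cy) ** 2))


# Perfectly correlated features versus perfectly anti-correlated features.
x = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])
```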
create_distribution_loss(loss_type='energy', **kwargs)
Factory function to create distribution-aware loss functions.
Parameters
loss_type : str
Type of loss function to create:
- 'energy': EnergyLoss (default, robust but computationally expensive)
- 'mmd': MMDLoss (efficient, good for high-dim data)
- 'wasserstein': WassersteinLoss (good for smooth distributions)
- 'statistical': StatisticalLoss (comprehensive statistical matching)
- 'adaptive': AdaptiveLoss (automatically balances terms)
- 'mse': Standard MSE (no distribution matching)
- 'huber': Huber loss (robust to outliers, no distribution matching)
**kwargs : dict
Additional parameters specific to each loss type.
Returns
tf.keras.losses.Loss Configured loss function
Examples
Energy loss with custom weighting
loss = create_distribution_loss('energy', lambda_energy=1e-3)
MMD loss with RBF kernel
loss = create_distribution_loss('mmd', lambda_mmd=1e-2, sigma=2.0)
Statistical loss with all components
loss = create_distribution_loss('statistical',
    lambda_moments=1e-2,
    lambda_corr=1e-3,
    lambda_dist=5e-3)
create_observation_weights(data, observed_values, critical_features, weight_type='inverse_distance', temperature=1.0, normalize=True, clip_range=(0.1, 10.0))
Create sample weights based on proximity to observed values.
Parameters
data : pd.DataFrame or np.ndarray
Training data with shape (n_samples, n_features)
observed_values : list of float
Target observed values at critical features
critical_features : list of int
Column indices of critical observation features
weight_type : str, default 'inverse_distance'
Type of weighting: 'inverse_distance', 'gaussian', 'exponential'
temperature : float, default 1.0
Temperature parameter for weight decay (lower = sharper weighting)
normalize : bool, default True
Whether to normalize weights to mean = 1.0
clip_range : tuple, default (0.1, 10.0)
Range to clip extreme weights (min, max)
Returns
np.ndarray Sample weights with shape (n_samples,)
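The parameters above suggest a distance-based weighting pipeline. The sketch below is one plausible version, not the library's code. The three decay formulas, the point at which `temperature` enters, and the order of normalizing before clipping are all assumptions, and `observation_weights_sketch` is a hypothetical name.

```python
import numpy as np


def observation_weights_sketch(data, observed_values, critical_features,
                               weight_type="inverse_distance", temperature=1.0,
                               normalize=True, clip_range=(0.1, 10.0)):
    """Weight each sample by how close it is to the observed values."""
    data = np.asarray(data, dtype=float)
    target = np.asarray(observed_values, dtype=float)
    # Euclidean distance measured on the critical columns only.
    dist = np.linalg.norm(data[:, critical_features] - target, axis=1)
    scaled = dist / temperature  # lower temperature gives sharper decay
    if weight_type == "inverse_distance":
        w = 1.0 / (1.0 + scaled)           # assumed form
    elif weight_type == "gaussian":
        w = np.exp(-0.5 * scaled ** 2)     # assumed form
    elif weight_type == "exponential":
        w = np.exp(-scaled)                # assumed form
    else:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    if normalize:
        w = w / w.mean()                   # rescale to mean 1.0
    return np.clip(w, *clip_range)         # assumed order: clip last


samples = [[0.0, 5.0], [1.0, 5.0], [10.0, 5.0]]
w = observation_weights_sketch(samples, observed_values=[0.0], critical_features=[0])
```

Note that clipping after normalization can move the mean slightly away from 1.0 when extreme weights are truncated.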
create_pest_observation_weights(pst, emulator_data, weight_scaling=1.0, **kwargs)
Create sample weights using PEST observation data.
Parameters
pst : Pst
PEST control file object with observation data
emulator_data : pd.DataFrame
Training data for the emulator
weight_scaling : float, default 1.0
Overall scaling factor for weights
**kwargs
Additional arguments passed to create_observation_weights
Returns
np.ndarray Sample weights based on PEST observations
maximum_mean_discrepancy(x, y, kernel='rbf', sigma=1.0)
Compute Maximum Mean Discrepancy between two distributions.
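As a reference point, the biased (V-statistic) estimate of squared MMD with an RBF kernel can be written in a few lines of NumPy. This is a sketch of the textbook estimator, not the module's TensorFlow code. Treating `sigma` as the kernel bandwidth in exp(-d^2 / (2 sigma^2)) is an assumption, and `rbf_kernel` and `mmd_squared` are hypothetical names.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(x, y, sigma=1.0):
    """Biased estimate: mean k(x,x) + mean k(y,y) - 2 mean k(x,y)."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())
```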
wasserstein_distance_sliced(x, y, num_projections=50)
Approximate Wasserstein-1 distance using sliced Wasserstein distance.
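The sliced approach projects both samples onto random unit directions, where the one-dimensional Wasserstein-1 distance reduces to the mean absolute difference of sorted values. The NumPy sketch below assumes both inputs have the same number of samples. The `seed` argument and the name `sliced_wasserstein` are hypothetical additions for reproducibility and are not part of the documented signature.

```python
import numpy as np


def sliced_wasserstein(x, y, num_projections=50, seed=0):
    """Average 1-D Wasserstein-1 distance over random projections.

    Assumes x and y have the same number of rows.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(num_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    px = np.sort(x @ dirs.T, axis=0)  # sort each projection independently
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean(np.abs(px - py)))
```

Increasing `num_projections` reduces the variance of the estimate at a cost that grows linearly with the number of projections.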