pyuncertainnumber.calibration.knn¶
Classes¶
Unified kNN-based calibrator for black-box models or precomputed simulations. |
Functions¶
|
Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive. |
Module Contents¶
- class pyuncertainnumber.calibration.knn.KNNCalibrator(knn: int = 100, a_tol: float = 0.05, evaluate_model: bool = False)¶
Bases:
CalibratorUnified kNN-based calibrator for black-box models or precomputed simulations.
- Parameters:
knn (int) – Number of neighbors per observed row. Default: 100.
a_tol (float) – Tolerance for matching simulated \(\xi\) to a requested \(\xi^*\) (when reusing). A simulation is kept if \(\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}\). Default: 0.05.
evaluate_model (bool) – If True, call the black-box model for each \(\xi\) in
xi_liston a shared \(\theta\) grid. If False, reusesimulated_data(requires y/theta/xi arrays). Default: False.random_state (int) – Seed for reproducibility (affects theta_sampler and resampling). Default: 42.
Note
Setup (unified approach):
If
evaluate_model=Falseandsimulated_datais provided:Reuse pre-computed simulations
Build a per-design kNN index by filtering rows with \(\|\xi - \xi^*\|_\infty < a_{\text{tol}}\) for each \(\xi^*\) in
xi_list
If
evaluate_model=True:Simulate \(y = \text{model}(\theta, \xi)\) for each \(\xi\) in
xi_listUse a shared \(\theta\) grid drawn once from
theta_sampler(n_samples)Build per-design kNN indices on this shared grid
Calibration workflow (single/multi-design):
For each observation pair \((y_{\text{obs}}, \xi)\):
Standardize \(y_{\text{obs}}\) using the per-design scaler
Find k nearest neighbors in y-space
Map neighbor indices to \(\theta\) values for that design
Stack \(\theta\) samples across all observations/designs (or apply voting/intersection)
KNN calibration illustration.¶
- knn = 100¶
- a_tol¶
- evaluate_model = False¶
- random_state = 42¶
- _theta_grid: numpy.ndarray | None = None¶
- _theta_by_xi: Dict[Tuple[float, Ellipsis], numpy.ndarray]¶
- _y_by_xi: Dict[Tuple[float, Ellipsis], numpy.ndarray]¶
- _scaler_by_xi: Dict[Tuple[float, Ellipsis], sklearn.preprocessing.StandardScaler]¶
- _neigh_by_xi: Dict[Tuple[float, Ellipsis], sklearn.neighbors.NearestNeighbors]¶
- _grid_idx_by_xi: Dict[Tuple[float, Ellipsis], numpy.ndarray]¶
- _posterior: Dict[str, Any] | None = None¶
- _sim_y: numpy.ndarray | None = None¶
- _sim_theta: numpy.ndarray | None = None¶
- _sim_xi: numpy.ndarray | None = None¶
- static _key_from_xi(xi) Tuple[float, Ellipsis]¶
Stable tuple key for a scalar/vector design ξ.
- setup(model: Callable[[numpy.ndarray, float | numpy.ndarray], numpy.ndarray] | None = None, theta_sampler: Callable[[int], numpy.ndarray] | None = None, simulated_data: Dict[str, numpy.ndarray] | None = None, xi_list: List[float | numpy.ndarray] | None = None, n_samples: int = 10000)¶
Prepare per-design kNN structures by either reusing simulated_data or simulating for each design.
- Parameters:
model (callable, optional) – Black-box simulator with signature
model(theta, xi) -> y(vectorized over theta).theta_sampler (callable, optional) – Sampler for \(\theta\); required when
evaluate_model=True.simulated_data (dict, optional) – Dict with keys {“y”: (n, dy), “theta”: (n, dθ), “xi”: (n, dξ)} when reusing sims.
xi_list (list, optional) – List of designs; each item can be scalar or array-like. If None, defaults to [0.0].
n_samples (int) – Number of \(\theta\) samples to draw when
evaluate_model=True. Default: 10000.
- nearest(y: numpy.ndarray | List[float], xi: float | numpy.ndarray, k: int | None = None, return_dist: bool = False)¶
Return k nearest neighbors for y at design xi.
- Parameters:
y (array-like) – Query outputs, shape (m, d_y) or (d_y,).
xi (scalar or array-like) – Design key to select the per-design index.
k (int, optional) – Number of neighbors; defaults to
self.knn.return_dist (bool) – If True, also return distances and raw indices. Default: False.
- Returns:
Shape (m*k, dθ) stacked \(\theta\) for all query rows. distances (ndarray, optional): Returned if
return_dist=True. indices (ndarray, optional): Returned ifreturn_dist=True.- Return type:
theta_neighbors (ndarray)
- calibrate(observations, resample_n: int | None = None, combine: str = 'stack', combine_params: dict | None = None)¶
Run kNN calibration and aggregate posterior θ across neighbor-hit blocks.
- Parameters:
observations – Observed simulator or model outputs to calibrate against.
resample_n (int | None) – If set, resample posterior θ samples to this size. If None, return all aggregated θ without resampling.
combine (str) –
Aggregation mode. One of:
’stack’: concatenate all kNN θ; optional de-duplication.
’intersect’: retain θ hit at least min_count times across neighbor blocks.
combine_params (dict | None) –
Optional parameters controlling aggregation and KDE weighting.
Supported keys:
dedup (bool): Default False. Remove duplicate θ (only for ‘stack’).
theta_match_tol (float): Default 1e-9. Tolerance or rounding quantum for comparing/merging θ values.
min_count (int | None): Minimum occurrences for ‘intersect’. Default is max(1, ceil(0.5 * total_blocks)), meaning θ must appear in about half of neighbor lists.
use_kde (bool): Default False. If True, fit KDE on aggregated θ to compute log-scores and normalized weights.
kde_bandwidth (float | None): Bandwidth for KDE. If None (default), use Scott’s rule.
Tip
Two aggregation modes are supported:
stack: Concatenate all kNN θ into a single array. Supports optional de-duplication of nearly identical θ values.
intersect: Keep θ values that occur in at least min_count neighbor blocks across all observations/design points (default ≈ half of all blocks).
Optional density weighting via KDE can be applied after aggregation to compute normalized posterior weights.
- Returns:
- A dictionary with keys:
’mode’ (str): Always ‘knn’.
’theta’ (ndarray): Posterior samples of shape (N, dθ); resampled if resample_n is provided.
’weights’ (ndarray | None): None for stack/intersect, or a length-N array of KDE weights if use_kde=True.
’meta’ (dict): Aggregation info; may include KDE bandwidth if density weighting is used.
- Return type:
dict
- _round_rows(A: numpy.ndarray, tol: float) tuple[numpy.ndarray, numpy.ndarray]¶
Round rows of A to multiples of tol and return (unique_rows, counts).
- Parameters:
A (ndarray) – Input array to process.
tol (float) – Tolerance for rounding. If tol <= 0, exact matching is used.
- Returns:
- (unique_rows, counts) where unique_rows are the deduplicated rows
and counts are the occurrence counts.
- Return type:
tuple
- _kde_logweights(X, bw=0.5, n_max_exact=5000)¶
Compute KDE-based log-weights for posterior samples X.
- Parameters:
X (ndarray) – Posterior samples, shape (n, d).
bw (float) – Bandwidth for Gaussian kernel. Default: 0.5.
n_max_exact (int) – Max n for exact pairwise KDE. Above this, fall back to sklearn.KernelDensity. Default: 5000.
- Returns:
logp (ndarray): Log-density values at X, shape (n,).
w (ndarray): Normalized weights, shape (n,).
- Return type:
tuple
- get_posterior() Any¶
Return the last computed posterior dict; raises if calibrate() hasn’t been called.
- pyuncertainnumber.calibration.knn.estimate_p_theta_knn(observed_data, simulated_data, xi_star, knn: int = 20, a_tol: float = 0.05)¶
Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive.
This method restricts the simulation archive to runs at (or near) the target design \(\xi^*\), then fits a kNN model in output (y) space. For each observed output y_obs, it retrieves the k-nearest simulated outputs and returns the corresponding \(\theta\) values as approximate posterior samples.
- Parameters:
observed_data (ndarray) – Array of observed outputs y_obs, shape (n_obs, d_y). Must match the dimensionality of simulated outputs.
simulated_data (list) –
List of arrays [y, θ, ξ], containing:
y (ndarray): Simulation output, shape (n, d_y), e.g., transformed y with only KPIs
θ (ndarray): Parameters and variables to be calibrated, shape (n, d_theta)
ξ (ndarray): Conditioning controllable factors, shape (n, d_xi), e.g., design parameters
knn (int) – Number of nearest neighbors to query per observed sample. Default: 20.
xi_star (scalar or array-like) – Target design \(\xi^*\) at which the posterior is estimated.
a_tol (float) – Tolerance for matching simulations to \(\xi^*\). Default: 0.05. A simulation is kept if \(\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}\).
- Returns:
- \(\theta\) samples from the posterior, stacked across all observed y.
Shape: (n_obs × knn, d_theta).
- Return type:
ndarray
- Raises:
ValueError – If filtering leaves no simulations at \(\xi^*\).
RuntimeError – If kNN search fails due to inconsistent dimensions.
Note
Scaling of outputs y is performed internally via StandardScaler for robustness against different KPI magnitudes.
The parameter
knnacts as a smoothing parameter: higher values broaden the posterior but reduce sharpness.The choice of
a_toltrades off strict design conditioning vs. sample size. Too small → few matches; too large → weaker conditioning.
Example
>>> import numpy as np >>> from sklearn.preprocessing import StandardScaler >>> from sklearn.neighbors import NearestNeighbors >>> # Fake simulator archive >>> theta_sim = np.random.uniform(-5, 5, size=(5000, 2)) >>> xi_sim = np.zeros((5000, 1)) >>> y_sim = np.sum(theta_sim**2, axis=1, keepdims=True) ... + 0.1*np.random.randn(5000, 1) >>> simulated_data = [y_sim, theta_sim, xi_sim] >>> # Observed data >>> theta_true = np.array([1.5, -2.0]) >>> y_obs = np.sum(theta_true**2) + 0.1*np.random.randn(1) >>> # Estimate posterior >>> theta_post = estimate_p_theta_knn( ... observed_data=np.array([[y_obs]]), ... simulated_data=simulated_data, ... knn=50, ... xi_star=0.0 ... ) >>> theta_post.shape (50, 2) >>> theta_post.mean(axis=0) array([ 1.4, -2.1]) # close to true [1.5, -2.0]