pyuncertainnumber.calibration.knn

Classes

KNNCalibrator

Unified kNN-based calibrator for black-box models or precomputed simulations.

Functions

estimate_p_theta_knn(observed_data, simulated_data, ...)

Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive.

Module Contents

class pyuncertainnumber.calibration.knn.KNNCalibrator(knn: int = 100, a_tol: float = 0.05, evaluate_model: bool = False)

Bases: Calibrator

Unified kNN-based calibrator for black-box models or precomputed simulations.

Parameters:

knn (int) – Number of neighbors per observed row. Default: 100.
a_tol (float) – Tolerance for matching simulated \(\xi\) to a requested \(\xi^*\) (when reusing). A simulation is kept if \(\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}\). Default: 0.05.
evaluate_model (bool) – If True, call the black-box model for each \(\xi\) in xi_list on a shared \(\theta\) grid. If False, reuse simulated_data (requires y/theta/xi arrays). Default: False.
random_state (int) – Seed for reproducibility (affects theta_sampler and resampling). Default: 42.

Note

Setup (unified approach):

If evaluate_model=False and simulated_data is provided:
- Reuse pre-computed simulations
- Build a per-design kNN index by filtering rows with \(\|\xi - \xi^*\|_\infty < a_{\text{tol}}\) for each \(\xi^*\) in xi_list
If evaluate_model=True:
- Simulate \(y = \text{model}(\theta, \xi)\) for each \(\xi\) in xi_list
- Use a shared \(\theta\) grid drawn once from theta_sampler(n_samples)
- Build per-design kNN indices on this shared grid
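The design-matching rule used when reusing simulations can be sketched in plain NumPy. This is a minimal illustration of the \(\|\xi - \xi^*\|_\infty < a_{\text{tol}}\) filter only; the helper name `filter_by_design` and the toy `xi_sim` array are hypothetical and not part of the library.

```python
import numpy as np

def filter_by_design(xi_sim, xi_star, a_tol=0.05):
    """Return row indices where ||xi_sim - xi_star||_inf < a_tol (hypothetical helper)."""
    xi_sim = np.asarray(xi_sim, dtype=float)
    if xi_sim.ndim == 1:
        xi_sim = xi_sim[:, None]              # treat a 1-D array as shape (n, 1)
    xi_star = np.atleast_1d(np.asarray(xi_star, dtype=float))
    dist_inf = np.max(np.abs(xi_sim - xi_star), axis=1)   # infinity norm per row
    return np.flatnonzero(dist_inf < a_tol)

# Toy archive of designs: two clusters, around xi = 0 and xi = 1
xi_sim = np.array([[0.0], [0.04], [0.05], [1.0], [1.02]])

# One index set per requested design, as in the per-design setup described above
idx_by_design = {xi: filter_by_design(xi_sim, xi) for xi in (0.0, 1.0)}
```

Because the inequality is strict, the row at distance exactly 0.05 from \(\xi^* = 0\) is excluded in this sketch.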

Calibration workflow (single/multi-design):

For each observation pair \((y_{\text{obs}}, \xi)\):

Standardize \(y_{\text{obs}}\) using the per-design scaler
Find k nearest neighbors in y-space
Map neighbor indices to \(\theta\) values for that design
Stack \(\theta\) samples across all observations/designs (or apply voting/intersection)
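The workflow above can be sketched for a single design as follows. This is an illustrative approximation, not the library's code: a brute-force NumPy search stands in for `sklearn.neighbors.NearestNeighbors`, manual mean/std scaling stands in for `StandardScaler`, and the toy model \(y = \theta_1 + \theta_2\) and the helper `knn_theta` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy archive at one design (hypothetical model): y = theta_1 + theta_2 + noise
theta = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = theta.sum(axis=1, keepdims=True) + 0.01 * rng.standard_normal((2000, 1))

# Per-design scaler: mean and standard deviation of the simulated outputs
mu, sigma = y.mean(axis=0), y.std(axis=0)
y_std = (y - mu) / sigma

def knn_theta(y_obs, k):
    """Standardize y_obs, find k nearest simulated outputs, map them to theta."""
    q = (np.atleast_2d(y_obs) - mu) / sigma                          # (m, d_y)
    d = np.linalg.norm(y_std[None, :, :] - q[:, None, :], axis=2)    # (m, n)
    idx = np.argsort(d, axis=1)[:, :k]                               # (m, k)
    return theta[idx.ravel()]                                        # (m*k, d_theta)

# Stack the neighbor theta blocks across all observations
y_obs = np.array([[0.50], [0.52]])
theta_post = np.vstack([knn_theta(row, k=20) for row in y_obs])
```

In this toy setting every retained \(\theta\) satisfies \(\theta_1 + \theta_2 \approx 0.5\), which is all the observation can constrain.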

Figure: KNN calibration illustration.

knn = 100

a_tol

evaluate_model = False

random_state = 42

_theta_grid: numpy.ndarray | None = None

_theta_by_xi: Dict[Tuple[float, Ellipsis], numpy.ndarray]

_y_by_xi: Dict[Tuple[float, Ellipsis], numpy.ndarray]

_scaler_by_xi: Dict[Tuple[float, Ellipsis], sklearn.preprocessing.StandardScaler]

_neigh_by_xi: Dict[Tuple[float, Ellipsis], sklearn.neighbors.NearestNeighbors]

_grid_idx_by_xi: Dict[Tuple[float, Ellipsis], numpy.ndarray]

_posterior: Dict[str, Any] | None = None

_sim_y: numpy.ndarray | None = None

_sim_theta: numpy.ndarray | None = None

_sim_xi: numpy.ndarray | None = None

static _key_from_xi(xi) → Tuple[float, Ellipsis]: Stable tuple key for a scalar/vector design ξ.

Prepare per-design kNN structures by either reusing simulated_data or simulating for each design.

Parameters:

model (callable, optional) – Black-box simulator with signature model(theta, xi) -> y (vectorized over theta).
theta_sampler (callable, optional) – Sampler for \(\theta\); required when evaluate_model=True.
simulated_data (dict, optional) – Dict with keys {"y": (n, d_y), "theta": (n, dθ), "xi": (n, dξ)}; used when reusing simulations (evaluate_model=False).
xi_list (list, optional) – List of designs; each item can be scalar or array-like. If None, defaults to [0.0].
n_samples (int) – Number of \(\theta\) samples to draw when evaluate_model=True. Default: 10000.
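To make the expected inputs concrete, the sketch below builds a `model`, a `theta_sampler`, and a `simulated_data` dict with the shapes listed above. The simulator, the uniform prior, and the array sizes are hypothetical choices for illustration; only the signatures and dict keys come from the parameter descriptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def theta_sampler(n_samples):
    """Hypothetical prior sampler: returns theta with shape (n_samples, 2)."""
    return rng.uniform(-5.0, 5.0, size=(n_samples, 2))

def model(theta, xi):
    """Hypothetical simulator, vectorized over rows of theta; returns y of shape (n, 1)."""
    theta = np.atleast_2d(theta)
    return theta[:, :1] + float(xi) * theta[:, 1:2]

# Archive in the dict layout used when reusing simulations:
# one shared theta grid evaluated at every design in xi_list
xi_list = [0.0, 1.0]
theta_grid = theta_sampler(1000)
simulated_data = {
    "y": np.vstack([model(theta_grid, xi) for xi in xi_list]),
    "theta": np.vstack([theta_grid for _ in xi_list]),
    "xi": np.vstack([np.full((len(theta_grid), 1), xi) for xi in xi_list]),
}
```

All three arrays share the same number of rows, so row i of "theta" and "xi" describes the inputs that produced row i of "y".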

nearest(y: numpy.ndarray | List[float], xi: float | numpy.ndarray, k: int | None = None, return_dist: bool = False)

Return k nearest neighbors for y at design xi.

Parameters:

y (array-like) – Query outputs, shape (m, d_y) or (d_y,).
xi (scalar or array-like) – Design key to select the per-design index.
k (int, optional) – Number of neighbors; defaults to self.knn.
return_dist (bool) – If True, also return distances and raw indices. Default: False.

Returns:

theta_neighbors (ndarray): Stacked \(\theta\) for all query rows, shape (m*k, dθ).
distances (ndarray, optional): Returned if return_dist=True.
indices (ndarray, optional): Returned if return_dist=True.

Return type:

ndarray, or a tuple that also contains distances and indices if return_dist=True

calibrate(observations, resample_n: int | None = None, combine: str = 'stack', combine_params: dict | None = None)

Run kNN calibration and aggregate posterior θ across neighbor-hit blocks.

Parameters:

observations – Observed simulator or model outputs to calibrate against.
resample_n (int | None) – If set, resample posterior θ samples to this size. If None, return all aggregated θ without resampling.
combine (str) –
Aggregation mode. One of:
- 'stack': concatenate all kNN θ; optional de-duplication.
- 'intersect': retain θ hit at least min_count times across neighbor blocks.
combine_params (dict | None) –
Optional parameters controlling aggregation and KDE weighting.

Supported keys:
- dedup (bool): Default False. Remove duplicate θ (only for 'stack').
- theta_match_tol (float): Default 1e-9. Tolerance or rounding quantum for comparing/merging θ values.
- min_count (int | None): Minimum occurrences for 'intersect'. Default is max(1, ceil(0.5 * total_blocks)), i.e. θ must appear in at least half of the neighbor blocks.
- use_kde (bool): Default False. If True, fit KDE on aggregated θ to compute log-scores and normalized weights.
- kde_bandwidth (float | None): Bandwidth for KDE. If None (default), use Scott’s rule.

Tip

Two aggregation modes are supported:

stack: Concatenate all kNN θ into a single array. Supports optional de-duplication of nearly identical θ values.
intersect: Keep θ values that occur in at least min_count neighbor blocks across all observations/design points (default ≈ half of all blocks).

Optional density weighting via KDE can be applied after aggregation to compute normalized posterior weights.
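The difference between the two modes can be illustrated with a small NumPy sketch. The function `combine_blocks` is hypothetical and simplified: it assumes each θ is counted at most once per neighbor block and that matching is done by rounding to multiples of a tolerance, which may differ from the library's internals.

```python
import math
import numpy as np

def combine_blocks(blocks, mode="stack", min_count=None, tol=1e-9):
    """Aggregate per-observation theta blocks (hypothetical helper).

    blocks: list of (k, d_theta) arrays, one per observation/design.
    """
    if mode == "stack":
        return np.vstack(blocks)
    if mode == "intersect":
        if min_count is None:
            min_count = max(1, math.ceil(0.5 * len(blocks)))
        # Deduplicate inside each block so counts mean "number of blocks hit"
        per_block = [np.unique(np.round(b / tol) * tol, axis=0) for b in blocks]
        rows, counts = np.unique(np.vstack(per_block), axis=0, return_counts=True)
        return rows[counts >= min_count]
    raise ValueError(f"unknown mode: {mode!r}")

# Three neighbor blocks that share only the theta value [1, 1]
blocks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[1.0, 1.0], [2.0, 2.0]]),
    np.array([[1.0, 1.0], [3.0, 3.0]]),
]
stacked = combine_blocks(blocks, mode="stack")
common = combine_blocks(blocks, mode="intersect")   # default min_count = 2
```

With three blocks the default threshold is ceil(1.5) = 2, so only the θ shared by all blocks survives in `common`, while `stacked` keeps all six rows including repeats.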

Returns:

A dictionary with keys:

'mode' (str): Always 'knn'.
'theta' (ndarray): Posterior samples of shape (N, dθ); resampled if resample_n is provided.
'weights' (ndarray | None): None unless use_kde=True, in which case a length-N array of normalized KDE weights.
'meta' (dict): Aggregation info; may include KDE bandwidth if density weighting is used.

Return type:

dict
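A returned dictionary of this form can be post-processed directly. The sketch below resamples θ to a fixed size, optionally using the weights; it assumes sampling with replacement and uses a hand-built `posterior` dict, so it only illustrates the data layout and is not a description of what resample_n does internally.

```python
import numpy as np

def resample_posterior(posterior, n, seed=42):
    """Draw n rows of posterior['theta'] with replacement (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    theta = posterior["theta"]
    # p=None means uniform sampling, matching the weights=None case
    idx = rng.choice(len(theta), size=n, replace=True, p=posterior["weights"])
    return theta[idx]

# Hand-built stand-in for the dict layout described above
posterior = {
    "mode": "knn",
    "theta": np.arange(20, dtype=float).reshape(10, 2),
    "weights": None,
    "meta": {},
}
uniform_draw = resample_posterior(posterior, n=5)

# With all weight on the first sample, every draw returns that row
posterior_w = dict(posterior, weights=np.eye(10)[0])
weighted_draw = resample_posterior(posterior_w, n=5)
```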

_round_rows(A: numpy.ndarray, tol: float) → tuple[numpy.ndarray, numpy.ndarray]

Round rows of A to multiples of tol and return (unique_rows, counts).

Parameters:

A (ndarray) – Input array to process.
tol (float) – Tolerance for rounding. If tol <= 0, exact matching is used.

Returns:

(unique_rows, counts), where unique_rows are the deduplicated rows and counts are their occurrence counts.

Return type:

tuple
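One plausible way to implement this kind of tolerance-based row matching is to snap each entry to the nearest multiple of tol and then call `numpy.unique` along axis 0. The function below is a sketch under that assumption, with a hypothetical name, and is not taken from the library source.

```python
import numpy as np

def round_rows(A, tol):
    """Round rows to multiples of tol, then return (unique_rows, counts)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    # tol <= 0 means exact matching: skip the rounding step
    Q = np.round(A / tol) * tol if tol > 0 else A
    return np.unique(Q, axis=0, return_counts=True)

A = np.array([[0.10, 0.20],
              [0.11, 0.19],    # merges with the first row when tol = 0.1
              [0.50, 0.50]])
rows, counts = round_rows(A, tol=0.1)
rows_exact, counts_exact = round_rows(A, tol=0.0)
```

Here `rows` contains two merged rows with counts 2 and 1, while exact matching keeps all three rows distinct.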

_kde_logweights(X, bw=0.5, n_max_exact=5000)

Compute KDE-based log-weights for posterior samples X.

Parameters:

X (ndarray) – Posterior samples, shape (n, d).
bw (float) – Bandwidth for Gaussian kernel. Default: 0.5.
n_max_exact (int) – Max n for exact pairwise KDE. Above this, fall back to sklearn.neighbors.KernelDensity. Default: 5000.

Returns:

logp (ndarray): Log-density values at X, shape (n,).
w (ndarray): Normalized weights, shape (n,).

Return type:

tuple
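The exact pairwise branch can be sketched as a Gaussian KDE evaluated at the samples themselves. This is a minimal illustration: the isotropic kernel, the inclusion of each point's own kernel, and the choice of weights proportional to the estimated density are assumptions, and the helper name is hypothetical.

```python
import numpy as np

def kde_logweights(X, bw=0.5):
    """Exact pairwise Gaussian KDE at the sample points (illustrative sketch)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)    # (n, n) squared distances
    log_k = -0.5 * sq / bw**2
    # Log-sum-exp over kernels, stabilized by subtracting the row maximum
    m = log_k.max(axis=1, keepdims=True)
    log_sum = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1))
    log_norm = np.log(n) + d * np.log(bw) + 0.5 * d * np.log(2.0 * np.pi)
    logp = log_sum - log_norm
    w = np.exp(logp - logp.max())                                # assumed: weights ∝ density
    return logp, w / w.sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((200, 2)), [[10.0, 10.0]]])   # last row is an outlier
logp, w = kde_logweights(X, bw=0.5)
```

The pairwise matrix costs O(n²) memory, which is consistent with switching to a tree-based estimator above n_max_exact.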

get_posterior() → Any: Return the last computed posterior dict; raises if calibrate() hasn’t been called.

pyuncertainnumber.calibration.knn.estimate_p_theta_knn(observed_data, simulated_data, xi_star, knn: int = 20, a_tol: float = 0.05)

Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive.

This method restricts the simulation archive to runs at (or near) the target design \(\xi^*\), then fits a kNN model in output (y) space. For each observed output y_obs, it retrieves the k-nearest simulated outputs and returns the corresponding \(\theta\) values as approximate posterior samples.

Parameters:

observed_data (ndarray) – Array of observed outputs y_obs, shape (n_obs, d_y). Must match the dimensionality of simulated outputs.
simulated_data (list) –
List of arrays [y, θ, ξ], containing:
- y (ndarray): Simulation output, shape (n, d_y), e.g., transformed y with only KPIs
- θ (ndarray): Parameters and variables to be calibrated, shape (n, d_theta)
- ξ (ndarray): Conditioning controllable factors, shape (n, d_xi), e.g., design parameters
knn (int) – Number of nearest neighbors to query per observed sample. Default: 20.
xi_star (scalar or array-like) – Target design \(\xi^*\) at which the posterior is estimated.
a_tol (float) – Tolerance for matching simulations to \(\xi^*\). Default: 0.05. A simulation is kept if \(\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}\).

Returns:

\(\theta\) samples from the posterior, stacked across all observed y. Shape: (n_obs × knn, d_theta).

Return type:

ndarray

Raises:

ValueError – If filtering leaves no simulations at \(\xi^*\).
RuntimeError – If kNN search fails due to inconsistent dimensions.

Note

Scaling of outputs y is performed internally via StandardScaler for robustness against different KPI magnitudes.
The parameter knn acts as a smoothing parameter: higher values broaden the approximate posterior, while lower values sharpen it at the cost of fewer samples per observation.
The choice of a_tol trades off strict design conditioning vs. sample size. Too small → few matches; too large → weaker conditioning.
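The smoothing effect of knn can be seen in a deterministic toy case. The sketch below uses a hypothetical 1-D archive where y equals θ on a regular grid and measures the spread of the retained θ for several values of k; it uses a brute-force NumPy search rather than the library function.

```python
import numpy as np

# Hypothetical 1-D archive: y equals theta on a regular grid
theta = np.linspace(-1.0, 1.0, 2001)
y = theta.copy()
y_obs = 0.0

def posterior_std(k):
    """Spread of the theta values attached to the k nearest simulated outputs."""
    idx = np.argsort(np.abs(y - y_obs))[:k]
    return theta[idx].std()

# Larger k pulls in neighbors that are farther from y_obs, widening the sample
spread = {k: posterior_std(k) for k in (10, 100, 500)}
```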

Example

>>> import numpy as np
>>> from pyuncertainnumber.calibration.knn import estimate_p_theta_knn
>>> # Fake simulator archive
>>> theta_sim = np.random.uniform(-5, 5, size=(5000, 2))
>>> xi_sim = np.zeros((5000, 1))
>>> y_sim = (np.sum(theta_sim**2, axis=1, keepdims=True)
...          + 0.1*np.random.randn(5000, 1))
>>> simulated_data = [y_sim, theta_sim, xi_sim]
>>> # Observed data
>>> theta_true = np.array([1.5, -2.0])
>>> y_obs = np.sum(theta_true**2) + 0.1*np.random.randn(1)
>>> # Estimate posterior
>>> theta_post = estimate_p_theta_knn(
...     observed_data=y_obs.reshape(1, 1),  # shape (n_obs, d_y)
...     simulated_data=simulated_data,
...     knn=50,
...     xi_star=0.0
... )
>>> theta_post.shape
(50, 2)
>>> # y depends only on θ1² + θ2², so the samples lie near the circle of
>>> # radius 2.5 through theta_true rather than concentrating at theta_true.