pyuncertainnumber.calibration.knn

Classes

KNNCalibrator

Unified kNN-based calibrator for black-box models or precomputed simulations.

Functions

estimate_p_theta_knn(observed_data, simulated_data, ...)

Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive.

Module Contents

class pyuncertainnumber.calibration.knn.KNNCalibrator(knn: int = 100, a_tol: float = 0.05, evaluate_model: bool = False)

Bases: Calibrator

Unified kNN-based calibrator for black-box models or precomputed simulations.

Parameters:
  • knn (int) – Number of neighbors per observed row. Default: 100.

  • a_tol (float) – Tolerance for matching a simulated \(\xi\) to a requested \(\xi^*\) when reusing precomputed simulations. A simulation is kept if \(\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}\). Default: 0.05.

  • evaluate_model (bool) – If True, call the black-box model for each \(\xi\) in xi_list on a shared \(\theta\) grid. If False, reuse simulated_data (requires y/theta/xi arrays). Default: False.

  • random_state (int) – Seed for reproducibility (affects theta_sampler and resampling). Default: 42.

Note

Setup (unified approach):

  • If evaluate_model=False and simulated_data is provided:

    • Reuse pre-computed simulations

    • Build a per-design kNN index by filtering rows with \(\|\xi - \xi^*\|_\infty < a_{\text{tol}}\) for each \(\xi^*\) in xi_list

  • If evaluate_model=True:

    • Simulate \(y = \text{model}(\theta, \xi)\) for each \(\xi\) in xi_list

    • Use a shared \(\theta\) grid drawn once from theta_sampler(n_samples)

    • Build per-design kNN indices on this shared grid
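
The design filter used when reusing simulations can be sketched as follows. This is a minimal illustration of the infinity-norm rule stated above, not the library's code; the helper name filter_by_design and the toy xi_sim array are assumptions made for the example.

```python
import numpy as np

def filter_by_design(xi_sim, xi_star, a_tol=0.05):
    """Boolean mask of rows with ||xi_sim - xi_star||_inf < a_tol (hypothetical helper)."""
    xi_sim = np.asarray(xi_sim, dtype=float)
    if xi_sim.ndim == 1:
        xi_sim = xi_sim[:, None]          # treat a 1-D array as n scalar designs
    xi_star = np.atleast_1d(np.asarray(xi_star, dtype=float))
    # Infinity-norm (largest absolute component) distance to the requested design
    dist = np.max(np.abs(xi_sim - xi_star), axis=1)
    return dist < a_tol                   # strict inequality, as in the rule above

# Toy archive of scalar designs
xi_sim = np.array([[0.00], [0.03], [0.05], [0.20], [-0.04]])
mask = filter_by_design(xi_sim, xi_star=0.0, a_tol=0.05)
kept = np.flatnonzero(mask)               # row indices that survive the filter
```

Because the inequality is strict, a design exactly a_tol away from \(\xi^*\) (the third row here) is dropped.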

Calibration workflow (single/multi-design):

For each observation pair \((y_{\text{obs}}, \xi)\):

  1. Standardize \(y_{\text{obs}}\) using the per-design scaler

  2. Find k nearest neighbors in y-space

  3. Map neighbor indices to \(\theta\) values for that design

  4. Stack \(\theta\) samples across all observations/designs (or apply voting/intersection)
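
The four steps above can be sketched with NumPy alone. Here a plain mean/std standardization and a brute-force distance search stand in for the per-design StandardScaler and NearestNeighbors objects; the toy archive, the helper knn_theta, and the observation values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-design archive: theta samples and their simulated outputs
theta = rng.uniform(-5.0, 5.0, size=(2000, 2))
y_sim = np.sum(theta**2, axis=1, keepdims=True)

# Per-design scaler (plain mean/std, standing in for StandardScaler)
mu, sigma = y_sim.mean(axis=0), y_sim.std(axis=0)
y_sim_std = (y_sim - mu) / sigma

def knn_theta(y_obs, k):
    """Steps 1-3 for one design: standardize, search in y-space, map to theta."""
    y_obs_std = (np.atleast_2d(y_obs) - mu) / sigma                # step 1
    d = np.linalg.norm(y_obs_std[:, None, :] - y_sim_std[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]                             # step 2
    return theta[idx.ravel()]                                      # step 3

# Step 4 ('stack'): concatenate the neighbor blocks of all observations
observations = [np.array([6.25]), np.array([6.0])]
theta_post = np.vstack([knn_theta(y, k=25) for y in observations])
```

With two observations and k=25, theta_post has 50 rows. In this toy model y depends only on the norm of \(\theta\), so the retrieved samples lie near a circle rather than around a single point.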

Figure: KNN calibration illustration.

knn = 100
a_tol
evaluate_model = False
random_state = 42
_theta_grid: numpy.ndarray | None = None
_theta_by_xi: Dict[Tuple[float, ...], numpy.ndarray]
_y_by_xi: Dict[Tuple[float, ...], numpy.ndarray]
_scaler_by_xi: Dict[Tuple[float, ...], sklearn.preprocessing.StandardScaler]
_neigh_by_xi: Dict[Tuple[float, ...], sklearn.neighbors.NearestNeighbors]
_grid_idx_by_xi: Dict[Tuple[float, ...], numpy.ndarray]
_posterior: Dict[str, Any] | None = None
_sim_y: numpy.ndarray | None = None
_sim_theta: numpy.ndarray | None = None
_sim_xi: numpy.ndarray | None = None
static _key_from_xi(xi) -> Tuple[float, ...]

Stable tuple key for a scalar/vector design ξ.
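
One plausible way to build such a key is shown below. This is a sketch under the assumption that the key is simply the design flattened to a tuple of Python floats; the actual method may round or normalize differently, and key_from_xi is an illustrative name.

```python
import numpy as np

def key_from_xi(xi):
    """Hashable tuple of floats for a scalar or vector design (illustrative)."""
    return tuple(float(v) for v in np.atleast_1d(xi).ravel())

# Scalars, lists, and arrays describing the same design map to the same key,
# so the key can index a per-design dictionary.
index_by_xi = {key_from_xi(0.5): "index for xi = 0.5"}
found = key_from_xi(np.array([0.5])) in index_by_xi
```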

setup(model: Callable[[numpy.ndarray, float | numpy.ndarray], numpy.ndarray] | None = None, theta_sampler: Callable[[int], numpy.ndarray] | None = None, simulated_data: Dict[str, numpy.ndarray] | None = None, xi_list: List[float | numpy.ndarray] | None = None, n_samples: int = 10000)

Prepare per-design kNN structures by either reusing simulated_data or simulating for each design.

Parameters:
  • model (callable, optional) – Black-box simulator with signature model(theta, xi) -> y (vectorized over theta).

  • theta_sampler (callable, optional) – Sampler for \(\theta\); required when evaluate_model=True.

  • simulated_data (dict, optional) – Dict with keys {"y": (n, dy), "theta": (n, dθ), "xi": (n, dξ)} when reusing simulations.

  • xi_list (list, optional) – List of designs; each item can be scalar or array-like. If None, defaults to [0.0].

  • n_samples (int) – Number of \(\theta\) samples to draw when evaluate_model=True. Default: 10000.
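
The callables expected by setup can be illustrated with a toy simulator. The sketch below only shows the documented signatures, model(theta, xi) -> y vectorized over theta and theta_sampler(n) -> theta; the uniform prior, the quadratic model, and the dictionary y_by_xi are assumptions made for the example and do not call the library.

```python
import numpy as np

rng = np.random.default_rng(42)

def theta_sampler(n):
    """Draw n samples of theta, shape (n, 2). Hypothetical uniform prior."""
    return rng.uniform(-5.0, 5.0, size=(n, 2))

def model(theta, xi):
    """Toy simulator, vectorized over the rows of theta; returns shape (n, 1)."""
    theta = np.atleast_2d(theta)
    xi = float(np.ravel(xi)[0])
    return np.sum(theta**2, axis=1, keepdims=True) + xi

# Shared theta grid drawn once, then evaluated at every design
theta_grid = theta_sampler(1000)
xi_list = [0.0, 1.0]
y_by_xi = {xi: model(theta_grid, xi) for xi in xi_list}
```

Following the signature above, such callables would be passed as setup(model=model, theta_sampler=theta_sampler, xi_list=xi_list, n_samples=1000) on a calibrator created with evaluate_model=True.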

nearest(y: numpy.ndarray | List[float], xi: float | numpy.ndarray, k: int | None = None, return_dist: bool = False)

Return k nearest neighbors for y at design xi.

Parameters:
  • y (array-like) – Query outputs, shape (m, d_y) or (d_y,).

  • xi (scalar or array-like) – Design key to select the per-design index.

  • k (int, optional) – Number of neighbors; defaults to self.knn.

  • return_dist (bool) – If True, also return distances and raw indices. Default: False.

Returns:

  • theta_neighbors (ndarray): Stacked \(\theta\) for all query rows, shape (m*k, dθ).

  • distances (ndarray, optional): Returned if return_dist=True.

  • indices (ndarray, optional): Returned if return_dist=True.

Return type:

ndarray, or tuple if return_dist=True

calibrate(observations, resample_n: int | None = None, combine: str = 'stack', combine_params: dict | None = None)

Run kNN calibration and aggregate posterior θ across neighbor-hit blocks.

Parameters:
  • observations – Observed simulator or model outputs to calibrate against.

  • resample_n (int | None) – If set, resample posterior θ samples to this size. If None, return all aggregated θ without resampling.

  • combine (str) –

    Aggregation mode. One of:

    • 'stack': concatenate all kNN θ; optional de-duplication.

    • 'intersect': retain θ hit at least min_count times across neighbor blocks.

  • combine_params (dict | None) –

    Optional parameters controlling aggregation and KDE weighting.

    Supported keys:

    • dedup (bool): Default False. Remove duplicate θ (only for 'stack').

    • theta_match_tol (float): Default 1e-9. Tolerance or rounding quantum for comparing/merging θ values.

    • min_count (int | None): Minimum occurrences for 'intersect'. Default is max(1, ceil(0.5 * total_blocks)), meaning θ must appear in at least half of the neighbor blocks.

    • use_kde (bool): Default False. If True, fit KDE on aggregated θ to compute log-scores and normalized weights.

    • kde_bandwidth (float | None): Bandwidth for KDE. If None (default), use Scott’s rule.

Tip

Two aggregation modes are supported:

  • stack: Concatenate all kNN θ into a single array. Supports optional de-duplication of nearly identical θ values.

  • intersect: Keep θ values that occur in at least min_count neighbor blocks across all observations/design points (default ≈ half of all blocks).

Optional density weighting via KDE can be applied after aggregation to compute normalized posterior weights.
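
The difference between the two modes can be sketched on a tiny example. This is an illustration only: it matches θ rows exactly (the tolerance handling of theta_match_tol is left out), and the helper intersect_blocks and the toy blocks are assumptions.

```python
import math
import numpy as np

def intersect_blocks(blocks, min_count=None):
    """Keep theta rows that occur in at least min_count neighbor blocks."""
    if min_count is None:
        min_count = max(1, math.ceil(0.5 * len(blocks)))   # documented default
    # De-duplicate within each block so a row counts at most once per block
    per_block = [np.unique(b, axis=0) for b in blocks]
    rows, counts = np.unique(np.vstack(per_block), axis=0, return_counts=True)
    return rows[counts >= min_count]

a, b, c = [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]
blocks = [np.array([a, b]), np.array([b, c]), np.array([b, a, a])]

stacked = np.vstack(blocks)                  # 'stack': every neighbor hit, 7 rows
common = intersect_blocks(blocks)            # 'intersect': default min_count = 2
strict = intersect_blocks(blocks, min_count=3)
```

With three blocks the default min_count is 2, so rows a and b survive; requiring all three blocks leaves only b.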

Returns:

A dictionary with keys:
  • 'mode' (str): Always 'knn'.

  • 'theta' (ndarray): Posterior samples of shape (N, dθ); resampled if resample_n is provided.

  • 'weights' (ndarray | None): None for stack/intersect, or a length-N array of KDE weights if use_kde=True.

  • 'meta' (dict): Aggregation info; may include KDE bandwidth if density weighting is used.

Return type:

dict
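
The effect of resample_n on the returned θ can be sketched as below. The source does not say how resampling is performed, so drawing with replacement (optionally weighted) is an assumption, as are the helper resample_posterior and the hand-built posterior dict that mimics the documented keys.

```python
import numpy as np

def resample_posterior(theta, resample_n, weights=None, seed=42):
    """Draw resample_n rows of theta with replacement, optionally weighted."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(theta), size=resample_n, replace=True, p=weights)
    return theta[idx]

# Hand-built stand-in for the dictionary described above
posterior = {
    "mode": "knn",
    "theta": np.arange(12, dtype=float).reshape(6, 2),
    "weights": None,
    "meta": {},
}
theta_rs = resample_posterior(posterior["theta"], resample_n=100)
```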

_round_rows(A: numpy.ndarray, tol: float) -> tuple[numpy.ndarray, numpy.ndarray]

Round rows of A to multiples of tol and return (unique_rows, counts).

Parameters:
  • A (ndarray) – Input array to process.

  • tol (float) – Tolerance for rounding. If tol <= 0, exact matching is used.

Returns:

(unique_rows, counts), where unique_rows are the deduplicated rows and counts are the occurrence counts.

Return type:

tuple
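
A minimal sketch of this rounding-then-counting idea, assuming np.round to the nearest multiple of tol followed by np.unique; the function name round_rows and the sample array are illustrative, and the private method may differ in detail.

```python
import numpy as np

def round_rows(A, tol):
    """Snap rows to multiples of tol, then return (unique_rows, counts)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    if tol > 0:
        A = np.round(A / tol) * tol        # quantize to the tol grid
    # tol <= 0 skips quantization, so rows must match exactly
    return np.unique(A, axis=0, return_counts=True)

A = np.array([[1.0000, 2.0], [1.0004, 2.0], [3.0, 4.0]])
rows, counts = round_rows(A, tol=1e-2)            # first two rows merge
rows_exact, counts_exact = round_rows(A, tol=0.0) # all three rows stay distinct
```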

_kde_logweights(X, bw=0.5, n_max_exact=5000)

Compute KDE-based log-weights for posterior samples X.

Parameters:
  • X (ndarray) – Posterior samples, shape (n, d).

  • bw (float) – Bandwidth for Gaussian kernel. Default: 0.5.

  • n_max_exact (int) – Max n for exact pairwise KDE. Above this, fall back to sklearn.KernelDensity. Default: 5000.

Returns:

  • logp (ndarray): Log-density values at X, shape (n,).

  • w (ndarray): Normalized weights, shape (n,).

Return type:

tuple
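
The exact pairwise branch can be sketched as an isotropic Gaussian KDE evaluated at the samples themselves. This is an illustration, not the library's implementation: the function name kde_logweights is hypothetical, and taking the weights proportional to the estimated density is an assumption.

```python
import numpy as np

def kde_logweights(X, bw=0.5):
    """Gaussian-KDE log-density at each sample and weights normalized to sum 1."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, d = X.shape
    # Pairwise squared distances, O(n^2) memory: only sensible for moderate n
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    log_k = -0.5 * sq / bw**2
    # Log of the mean kernel value per row, via a stable log-sum-exp
    m = log_k.max(axis=1, keepdims=True)
    log_mean = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1)) - np.log(n)
    logp = log_mean - 0.5 * d * np.log(2.0 * np.pi * bw**2)
    # Assumption: weights proportional to the estimated density
    w = np.exp(logp - logp.max())
    return logp, w / w.sum()

rng = np.random.default_rng(1)
# A tight cluster plus one isolated point far from it
X = np.vstack([rng.normal(0.0, 0.3, size=(200, 2)), [[4.0, 4.0]]])
logp, w = kde_logweights(X, bw=0.5)
```

The isolated point receives the smallest weight, which is the intended effect of density weighting. The quadratic memory cost of the pairwise matrix is the reason a cutoff such as n_max_exact is useful.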

get_posterior() -> Any

Return the last computed posterior dict; raises if calibrate() hasn’t been called.

pyuncertainnumber.calibration.knn.estimate_p_theta_knn(observed_data, simulated_data, xi_star, knn: int = 20, a_tol: float = 0.05)

Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive.

This method restricts the simulation archive to runs at (or near) the target design \(\xi^*\), then fits a kNN model in output (y) space. For each observed output y_obs, it retrieves the k-nearest simulated outputs and returns the corresponding \(\theta\) values as approximate posterior samples.

Parameters:
  • observed_data (ndarray) – Array of observed outputs y_obs, shape (n_obs, d_y). Must match the dimensionality of simulated outputs.

  • simulated_data (list) –

    List of arrays [y, θ, ξ], containing:

    • y (ndarray): Simulation output, shape (n, d_y), e.g., transformed y with only KPIs

    • θ (ndarray): Parameters and variables to be calibrated, shape (n, d_theta)

    • ξ (ndarray): Conditioning controllable factors, shape (n, d_xi), e.g., design parameters

  • knn (int) – Number of nearest neighbors to query per observed sample. Default: 20.

  • xi_star (scalar or array-like) – Target design \(\xi^*\) at which the posterior is estimated.

  • a_tol (float) – Tolerance for matching simulations to \(\xi^*\). Default: 0.05. A simulation is kept if \(\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}\).

Returns:

\(\theta\) samples from the posterior, stacked across all observed y.

Shape: (n_obs × knn, d_theta).

Return type:

ndarray

Raises:
  • ValueError – If filtering leaves no simulations at \(\xi^*\).

  • RuntimeError – If kNN search fails due to inconsistent dimensions.

Note

  • Scaling of outputs y is performed internally via StandardScaler for robustness against different KPI magnitudes.

  • The parameter knn acts as a smoothing parameter: higher values broaden the posterior, while lower values give a sharper but noisier estimate.

  • The choice of a_tol trades off strict design conditioning vs. sample size. Too small → few matches; too large → weaker conditioning.
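
The smoothing role of knn can be seen on a toy archive: the k-th nearest simulated output is never closer to the observation than the (k-1)-th, so a larger knn admits outputs (and hence θ values) farther from y_obs. The sketch below uses NumPy only; the archive, y_obs, and the helper y_radius are assumptions, and scaling is omitted because it does not change the neighbor order for a one-dimensional output.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(-5.0, 5.0, size=(5000, 2))
y_sim = np.sum(theta**2, axis=1)
y_obs = 6.25

# Rank every archived output by its distance to the observation, once
order = np.argsort(np.abs(y_sim - y_obs))

def y_radius(k):
    """Largest y-space distance among the k nearest archived outputs."""
    return float(np.abs(y_sim[order[k - 1]] - y_obs))

# The neighborhood in y-space can only grow as k increases
radii = [y_radius(k) for k in (10, 50, 250)]
```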

Example

>>> import numpy as np
>>> from pyuncertainnumber.calibration.knn import estimate_p_theta_knn
>>> # Fake simulator archive
>>> theta_sim = np.random.uniform(-5, 5, size=(5000, 2))
>>> xi_sim = np.zeros((5000, 1))
>>> y_sim = (np.sum(theta_sim**2, axis=1, keepdims=True)
...          + 0.1*np.random.randn(5000, 1))
>>> simulated_data = [y_sim, theta_sim, xi_sim]
>>> # Observed data
>>> theta_true = np.array([1.5, -2.0])
>>> y_obs = np.sum(theta_true**2) + 0.1*np.random.randn(1)
>>> # Estimate posterior
>>> theta_post = estimate_p_theta_knn(
...     observed_data=y_obs.reshape(1, 1),  # shape (n_obs, d_y)
...     simulated_data=simulated_data,
...     knn=50,
...     xi_star=0.0
... )
>>> theta_post.shape
(50, 2)
>>> # y depends only on the norm of θ, so the samples concentrate near the
>>> # circle of radius 2.5 (the norm of theta_true) rather than at theta_true.
>>> radius = np.linalg.norm(theta_post, axis=1)