pyuncertainnumber.calibration.knn
=================================

.. py:module:: pyuncertainnumber.calibration.knn


Classes
-------

.. autoapisummary::

   pyuncertainnumber.calibration.knn.KNNCalibrator


Functions
---------

.. autoapisummary::

   pyuncertainnumber.calibration.knn.estimate_p_theta_knn


Module Contents
---------------

.. py:class:: KNNCalibrator(knn: int = 100, a_tol: float = 0.05, evaluate_model: bool = False)

   Bases: :py:obj:`Calibrator`

   Unified kNN-based calibrator for black-box models or precomputed simulations.

   :param knn: Number of neighbors per observed row. Default: 100.
   :type knn: int
   :param a_tol: Tolerance for matching simulated :math:`\xi` to a requested :math:`\xi^*` (when reusing simulations).
                 A simulation is kept if :math:`\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}`. Default: 0.05.
   :type a_tol: float
   :param evaluate_model: If True, call the black-box model for each :math:`\xi` in ``xi_list`` on a shared :math:`\theta` grid.
                          If False, reuse ``simulated_data`` (requires y/theta/xi arrays). Default: False.
   :type evaluate_model: bool
   :param random_state: Seed for reproducibility (affects ``theta_sampler`` and resampling). Default: 42.
   :type random_state: int

   .. note::

      **Setup (unified approach)**:

      - If ``evaluate_model=False`` and ``simulated_data`` is provided:

        * Reuse pre-computed simulations
        * Build a per-design kNN index by filtering rows with :math:`\|\xi - \xi^*\|_\infty < a_{\text{tol}}` for each :math:`\xi^*` in ``xi_list``

      - If ``evaluate_model=True``:

        * Simulate :math:`y = \text{model}(\theta, \xi)` for each :math:`\xi` in ``xi_list``
        * Use a shared :math:`\theta` grid drawn once from ``theta_sampler(n_samples)``
        * Build per-design kNN indices on this shared grid

      **Calibration workflow (single/multi-design)**:

      For each observation pair :math:`(y_{\text{obs}}, \xi)`:

      1. Standardize :math:`y_{\text{obs}}` using the per-design scaler
      2. Find the k nearest neighbors in y-space
      3. Map neighbor indices to :math:`\theta` values for that design
      4. Stack :math:`\theta` samples across all observations/designs (or apply voting/intersection)

   .. figure:: /_static/knn_illustration.png
      :alt: knn_illustration
      :align: center
      :width: 50%

      KNN calibration illustration.


   .. py:attribute:: knn
      :value: 100


   .. py:attribute:: a_tol


   .. py:attribute:: evaluate_model
      :value: False


   .. py:attribute:: random_state
      :value: 42


   .. py:attribute:: _theta_grid
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: _theta_by_xi
      :type:  Dict[Tuple[float, Ellipsis], numpy.ndarray]


   .. py:attribute:: _y_by_xi
      :type:  Dict[Tuple[float, Ellipsis], numpy.ndarray]


   .. py:attribute:: _scaler_by_xi
      :type:  Dict[Tuple[float, Ellipsis], sklearn.preprocessing.StandardScaler]


   .. py:attribute:: _neigh_by_xi
      :type:  Dict[Tuple[float, Ellipsis], sklearn.neighbors.NearestNeighbors]


   .. py:attribute:: _grid_idx_by_xi
      :type:  Dict[Tuple[float, Ellipsis], numpy.ndarray]


   .. py:attribute:: _posterior
      :type:  Optional[Dict[str, Any]]
      :value: None


   .. py:attribute:: _sim_y
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: _sim_theta
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: _sim_xi
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:method:: _key_from_xi(xi) -> Tuple[float, Ellipsis]
      :staticmethod:


      Stable tuple key for a scalar/vector design ξ.



   .. py:method:: setup(model: Optional[Callable[[numpy.ndarray, Union[float, numpy.ndarray]], numpy.ndarray]] = None, theta_sampler: Optional[Callable[[int], numpy.ndarray]] = None, simulated_data: Optional[Dict[str, numpy.ndarray]] = None, xi_list: Optional[List[Union[float, numpy.ndarray]]] = None, n_samples: int = 10000)

      Prepare per-design kNN structures, either by reusing ``simulated_data`` or by simulating for each design.

      :param model: Black-box simulator with signature ``model(theta, xi) -> y`` (vectorized over theta).
      :type model: callable, optional
      :param theta_sampler: Sampler for :math:`\theta`; required when ``evaluate_model=True``.
      :type theta_sampler: callable, optional
      :param simulated_data: Dict with keys {"y": (n, dy), "theta": (n, dθ), "xi": (n, dξ)} when reusing simulations.
      :type simulated_data: dict, optional
      :param xi_list: List of designs; each item can be scalar or array-like. If None, defaults to [0.0].
      :type xi_list: list, optional
      :param n_samples: Number of :math:`\theta` samples to draw when ``evaluate_model=True``. Default: 10000.
      :type n_samples: int



   .. py:method:: nearest(y: Union[numpy.ndarray, List[float]], xi: Union[float, numpy.ndarray], k: Optional[int] = None, return_dist: bool = False)

      Return the k nearest neighbors for y at design xi.

      :param y: Query outputs, shape (m, d_y) or (d_y,).
      :type y: array-like
      :param xi: Design key used to select the per-design index.
      :type xi: scalar or array-like
      :param k: Number of neighbors; defaults to ``self.knn``.
      :type k: int, optional
      :param return_dist: If True, also return distances and raw indices. Default: False.
      :type return_dist: bool

      :returns: - **theta_neighbors** (ndarray): Stacked :math:`\theta` for all query rows, shape (m*k, dθ).
                - **distances** (ndarray, optional): Returned if ``return_dist=True``.
                - **indices** (ndarray, optional): Returned if ``return_dist=True``.
      :rtype: ndarray (plus distances and indices if ``return_dist=True``)



   .. py:method:: calibrate(observations, resample_n: int | None = None, combine: str = 'stack', combine_params: dict | None = None)

      Run kNN calibration and aggregate posterior θ across neighbor-hit blocks.

      :param observations: Observed simulator or model outputs to calibrate against.
      :param resample_n: If set, resample posterior θ samples to this size.
                         If ``None``, return all aggregated θ without resampling.
      :type resample_n: int | None
      :param combine: Aggregation mode. One of:

                      - **'stack'**: concatenate all kNN θ; optional de-duplication.
                      - **'intersect'**: retain θ hit at least ``min_count`` times across neighbor blocks.
      :type combine: str
      :param combine_params: Optional parameters controlling aggregation and KDE weighting.
                             Supported keys:

                             - **dedup** (bool): Default ``False``. Remove duplicate θ (only for 'stack').
                             - **theta_match_tol** (float): Default ``1e-9``. Tolerance or rounding quantum
                               for comparing/merging θ values.
                             - **min_count** (int | None): Minimum occurrences for 'intersect'.
                               Default is ``max(1, ceil(0.5 * total_blocks))``, meaning θ must appear
                               in about half of the neighbor lists.
                             - **use_kde** (bool): Default ``False``. If ``True``, fit a KDE on the aggregated θ
                               to compute log-scores and normalized weights.
                             - **kde_bandwidth** (float | None): Bandwidth for the KDE. If ``None`` (default),
                               use Scott's rule.
      :type combine_params: dict | None

      .. tip::

         Two aggregation modes are supported:

         - **stack**: Concatenate all kNN θ into a single array. Supports optional
           de-duplication of nearly identical θ values.
         - **intersect**: Keep θ values that occur in at least ``min_count``
           neighbor blocks across all observations/design points (default ≈ half of all blocks).

         Optional density weighting via KDE can be applied after aggregation
         to compute normalized posterior weights.

      :returns: A dictionary with keys:

                - **'mode'** (str): Always ``'knn'``.
                - **'theta'** (ndarray): Posterior samples of shape ``(N, dθ)``; resampled if
                  ``resample_n`` is provided.
                - **'weights'** (ndarray | None): ``None`` for stack/intersect without KDE, or a length-``N`` array
                  of KDE weights if ``use_kde=True``.
                - **'meta'** (dict): Aggregation info; may include the KDE bandwidth if
                  density weighting is used.
      :rtype: dict



   .. py:method:: _round_rows(A: numpy.ndarray, tol: float) -> tuple[numpy.ndarray, numpy.ndarray]

      Round rows of A to multiples of tol and return (unique_rows, counts).

      :param A: Input array to process.
      :type A: ndarray
      :param tol: Tolerance for rounding. If tol <= 0, exact matching is used.
      :type tol: float

      :returns: (unique_rows, counts), where unique_rows are the deduplicated rows
                and counts are the occurrence counts.
      :rtype: tuple



   .. py:method:: _kde_logweights(X, bw=0.5, n_max_exact=5000)

      Compute KDE-based log-weights for posterior samples X.

      :param X: Posterior samples, shape (n, d).
      :type X: ndarray
      :param bw: Bandwidth for the Gaussian kernel. Default: 0.5.
      :type bw: float
      :param n_max_exact: Maximum n for exact pairwise KDE. Above this, fall back
                          to sklearn.KernelDensity. Default: 5000.
      :type n_max_exact: int

      :returns: - **logp** (ndarray): Log-density values at X, shape (n,).
                - **w** (ndarray): Normalized weights, shape (n,).
      :rtype: tuple



   .. py:method:: get_posterior() -> Any

      Return the last computed posterior dict; raises if ``calibrate()`` has not been called.



.. py:function:: estimate_p_theta_knn(observed_data, simulated_data, xi_star, knn: int = 20, a_tol: float = 0.05)

   Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive.

   This method restricts the simulation archive to runs at (or near) the target design
   :math:`\xi^*`, then fits a kNN model in output (y) space. For each observed output
   y_obs, it retrieves the k nearest simulated outputs and returns the corresponding
   :math:`\theta` values as approximate posterior samples.

   :param observed_data: Array of observed outputs y_obs, shape (n_obs, d_y). Must match the dimensionality
                         of the simulated outputs.
   :type observed_data: ndarray
   :param simulated_data: List of arrays [y, θ, ξ], containing:

                          - **y** (ndarray): Simulation output, shape (n, d_y), e.g., transformed y with only KPIs
                          - **θ** (ndarray): Parameters and variables to be calibrated, shape (n, d_theta)
                          - **ξ** (ndarray): Conditioning controllable factors, shape (n, d_xi), e.g., design parameters
   :type simulated_data: list
   :param xi_star: Target design :math:`\xi^*` at which the posterior is estimated.
   :type xi_star: scalar or array-like
   :param knn: Number of nearest neighbors to query per observed sample. Default: 20.
   :type knn: int
   :param a_tol: Tolerance for matching simulations to :math:`\xi^*`.
                 A simulation is kept if :math:`\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}`. Default: 0.05.
   :type a_tol: float

   :returns: :math:`\theta` samples from the posterior, stacked across all observed y.
             Shape: (n_obs × knn, d_theta).
   :rtype: ndarray

   :raises ValueError: If filtering leaves no simulations at :math:`\xi^*`.
   :raises RuntimeError: If the kNN search fails due to inconsistent dimensions.

   .. note::

      - Outputs y are scaled internally via StandardScaler for robustness against different KPI magnitudes.
      - The parameter ``knn`` acts as a smoothing parameter: larger values broaden the posterior
        and make it less sharp.
      - The choice of ``a_tol`` trades off strict design conditioning against sample size.
        Too small → few matches; too large → weaker conditioning.

   .. rubric:: Example

   >>> import numpy as np
   >>> from pyuncertainnumber.calibration.knn import estimate_p_theta_knn
   >>> # Fake simulator archive
   >>> theta_sim = np.random.uniform(-5, 5, size=(5000, 2))
   >>> xi_sim = np.zeros((5000, 1))
   >>> y_sim = (np.sum(theta_sim**2, axis=1, keepdims=True)
   ...          + 0.1*np.random.randn(5000, 1))
   >>> simulated_data = [y_sim, theta_sim, xi_sim]
   >>> # Observed data: a single scalar output
   >>> theta_true = np.array([1.5, -2.0])
   >>> y_obs = np.sum(theta_true**2) + 0.1*np.random.randn()
   >>> # Estimate posterior; observed_data has shape (n_obs, d_y) = (1, 1)
   >>> theta_post = estimate_p_theta_knn(
   ...     observed_data=np.array([[y_obs]]),
   ...     simulated_data=simulated_data,
   ...     knn=50,
   ...     xi_star=0.0
   ... )
   >>> theta_post.shape
   (50, 2)
   >>> # y depends on theta only through ||theta||^2, so the samples spread over
   >>> # the ring ||theta|| ≈ 2.5 rather than clustering at theta_true.
   >>> radius = np.linalg.norm(theta_post, axis=1)
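The steps described for ``estimate_p_theta_knn`` (filter the archive by ``a_tol``, standardize y, query the k nearest neighbors, stack the matching θ) can be illustrated with a minimal sketch. This is not the library's implementation: ``knn_posterior_sketch`` is a hypothetical helper, and it assumes plain NumPy standardization and a brute-force Euclidean search in place of ``StandardScaler`` and ``NearestNeighbors``.

```python
import numpy as np


def knn_posterior_sketch(y_obs, y_sim, theta_sim, xi_sim, xi_star, knn=20, a_tol=0.05):
    """Hypothetical helper (illustration only): kNN posterior samples at xi_star."""
    y_obs = np.atleast_2d(np.asarray(y_obs, dtype=float))
    xi_star = np.atleast_1d(np.asarray(xi_star, dtype=float))

    # 1. Keep simulations whose design is within a_tol of xi_star (infinity norm).
    keep = np.max(np.abs(xi_sim - xi_star), axis=1) < a_tol
    if not np.any(keep):
        raise ValueError("No simulations within a_tol of xi_star.")
    y_kept, theta_kept = y_sim[keep], theta_sim[keep]

    # 2. Standardize outputs with the mean/std of the retained simulations
    #    (assumed stand-in for StandardScaler).
    mu = y_kept.mean(axis=0)
    sigma = y_kept.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant outputs
    z_sim = (y_kept - mu) / sigma
    z_obs = (y_obs - mu) / sigma

    # 3. Brute-force squared Euclidean distances in standardized y-space,
    #    then take the k closest simulations per observation.
    k = min(knn, len(z_sim))
    d2 = ((z_obs[:, None, :] - z_sim[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]

    # 4. Map neighbor indices to theta and stack across observations.
    return theta_kept[idx.ravel()]


# Synthetic archive mirroring the example above (seeded for repeatability).
rng = np.random.default_rng(0)
theta_sim = rng.uniform(-5, 5, size=(5000, 2))
xi_sim = np.zeros((5000, 1))
y_sim = np.sum(theta_sim**2, axis=1, keepdims=True) + 0.1 * rng.standard_normal((5000, 1))

y_obs = np.array([[6.25]])  # noise-free output at theta = [1.5, -2.0]
theta_post = knn_posterior_sketch(y_obs, y_sim, theta_sim, xi_sim, xi_star=0.0, knn=50)
print(theta_post.shape)  # (50, 2)
```

Because this toy output depends on θ only through its squared norm, the returned samples lie near the ring of squared radius 6.25; the brute-force search is adequate for small archives, while a tree-based index is the usual choice for larger ones.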