pyuncertainnumber.calibration.knn
=================================

.. py:module:: pyuncertainnumber.calibration.knn


Classes
-------

.. autoapisummary::

   pyuncertainnumber.calibration.knn.KNNCalibrator


Functions
---------

.. autoapisummary::

   pyuncertainnumber.calibration.knn.estimate_p_theta_knn


Module Contents
---------------

.. py:class:: KNNCalibrator(knn: int = 100, a_tol: float = 0.05, evaluate_model: bool = False)

   Bases: :py:obj:`Calibrator`


   Unified kNN-based calibrator for black-box models or precomputed simulations.

   :param knn: Number of neighbors per observed row. Default: 100.
   :type knn: int
   :param a_tol: Tolerance for matching simulated :math:`\xi` to a requested :math:`\xi^*`
                 (when reusing). A simulation is kept if :math:`\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}`.
                 Default: 0.05.
   :type a_tol: float
   :param evaluate_model: If True, call the black-box model for each :math:`\xi` in ``xi_list``
                          on a shared :math:`\theta` grid. If False, reuse ``simulated_data`` (requires y/theta/xi arrays).
                          Default: False.
   :type evaluate_model: bool
   :param random_state: Seed for reproducibility (affects theta_sampler and resampling). Default: 42.
   :type random_state: int
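
   The ``a_tol`` matching rule can be illustrated with plain NumPy. This is a minimal
   sketch of the inequality stated above, not the library's internal code; the names
   ``xi_sim``, ``xi_star``, ``dist_inf`` and ``keep`` are hypothetical.

   .. code-block:: python

      import numpy as np

      # Hypothetical archive of simulated designs, shape (n, d_xi).
      xi_sim = np.array([[0.00, 1.00],
                         [0.04, 1.03],
                         [0.10, 1.00],
                         [0.02, 0.90]])
      xi_star = np.array([0.0, 1.0])  # requested design
      a_tol = 0.05

      # Infinity norm per row: the largest absolute component-wise deviation.
      dist_inf = np.max(np.abs(xi_sim - xi_star), axis=1)
      keep = dist_inf < a_tol  # strict inequality, as in the rule above

      matched = xi_sim[keep]

   Because the infinity norm is used, a row is rejected as soon as any single design
   component deviates by ``a_tol`` or more.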

   .. note::

      **Setup (unified approach)**:
      
      - If ``evaluate_model=False`` and ``simulated_data`` is provided:
      
        * Reuse pre-computed simulations
        * Build a per-design kNN index by filtering rows with :math:`\|\xi - \xi^*\|_\infty < a_{\text{tol}}`
          for each :math:`\xi^*` in ``xi_list``
      
      - If ``evaluate_model=True``:
      
        * Simulate :math:`y = \text{model}(\theta, \xi)` for each :math:`\xi` in ``xi_list``
        * Use a shared :math:`\theta` grid drawn once from ``theta_sampler(n_samples)``
        * Build per-design kNN indices on this shared grid
      
      **Calibration workflow (single/multi-design)**:
      
      For each observation pair :math:`(y_{\text{obs}}, \xi)`:
      
      1. Standardize :math:`y_{\text{obs}}` using the per-design scaler
      2. Find k nearest neighbors in y-space
      3. Map neighbor indices to :math:`\theta` values for that design
      4. Stack :math:`\theta` samples across all observations/designs (or apply voting/intersection)
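
   The four workflow steps can be sketched for a single design with scikit-learn.
   This is an illustration under assumptions, not the class's actual implementation:
   the toy simulator and the names ``scaler``, ``index`` and ``theta_post`` are
   hypothetical.

   .. code-block:: python

      import numpy as np
      from sklearn.neighbors import NearestNeighbors
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(0)

      # Hypothetical simulations for one design: theta -> y.
      theta = rng.uniform(-1.0, 1.0, size=(2000, 2))
      y = np.column_stack([theta[:, 0] + theta[:, 1], theta[:, 0] - theta[:, 1]])

      # Per-design scaler and kNN index, both fitted on the simulated outputs.
      scaler = StandardScaler().fit(y)
      index = NearestNeighbors(n_neighbors=10).fit(scaler.transform(y))

      # Step 1: standardize the observation with the same scaler.
      y_obs = np.array([[0.5, 0.1]])
      y_obs_std = scaler.transform(y_obs)

      # Step 2: k nearest neighbors in standardized y-space.
      _, idx = index.kneighbors(y_obs_std)  # idx has shape (n_obs, k)

      # Steps 3-4: map neighbor indices to theta and stack across observations.
      theta_post = theta[idx.ravel()]       # shape (n_obs * k, d_theta)

   In this toy problem the map from :math:`\theta` to y is invertible, so the
   retrieved samples cluster around the single :math:`\theta` that reproduces ``y_obs``.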

   .. figure:: /_static/knn_illustration.png
       :alt: knn_illustration
       :align: center
       :width: 50%

       KNN calibration illustration.


   .. py:attribute:: knn
      :value: 100


   .. py:attribute:: a_tol


   .. py:attribute:: evaluate_model
      :value: False


   .. py:attribute:: random_state
      :value: 42


   .. py:attribute:: _theta_grid
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: _theta_by_xi
      :type:  Dict[Tuple[float, Ellipsis], numpy.ndarray]


   .. py:attribute:: _y_by_xi
      :type:  Dict[Tuple[float, Ellipsis], numpy.ndarray]


   .. py:attribute:: _scaler_by_xi
      :type:  Dict[Tuple[float, Ellipsis], sklearn.preprocessing.StandardScaler]


   .. py:attribute:: _neigh_by_xi
      :type:  Dict[Tuple[float, Ellipsis], sklearn.neighbors.NearestNeighbors]


   .. py:attribute:: _grid_idx_by_xi
      :type:  Dict[Tuple[float, Ellipsis], numpy.ndarray]


   .. py:attribute:: _posterior
      :type:  Optional[Dict[str, Any]]
      :value: None


   .. py:attribute:: _sim_y
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: _sim_theta
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:attribute:: _sim_xi
      :type:  Optional[numpy.ndarray]
      :value: None


   .. py:method:: _key_from_xi(xi) -> Tuple[float, Ellipsis]
      :staticmethod:


      Stable tuple key for a scalar/vector design ξ.
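
      One way such a key could be built is sketched below. The helper
      ``key_from_xi`` is a hypothetical stand-in written for illustration; the
      actual method may round or normalize values differently.

      .. code-block:: python

         import numpy as np

         def key_from_xi(xi):
             """Hypothetical helper: flatten a scalar or array-like design into a tuple of floats."""
             flat = np.atleast_1d(np.asarray(xi, dtype=float)).ravel()
             return tuple(float(v) for v in flat)

         # A scalar and the equivalent one-element array map to the same hashable key.
         index_by_xi = {key_from_xi(0.5): "index for xi = 0.5"}
         same_design = index_by_xi[key_from_xi(np.array([0.5]))]

      A tuple of floats is hashable, which is what allows it to serve as a key in
      the per-design dictionaries listed above.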


   .. py:method:: setup(model: Optional[Callable[[numpy.ndarray, Union[float, numpy.ndarray]], numpy.ndarray]] = None, theta_sampler: Optional[Callable[[int], numpy.ndarray]] = None, simulated_data: Optional[Dict[str, numpy.ndarray]] = None, xi_list: Optional[List[Union[float, numpy.ndarray]]] = None, n_samples: int = 10000)

      Prepare per-design kNN structures by either reusing simulated_data or simulating for each design.

      :param model: Black-box simulator with signature ``model(theta, xi) -> y``
                    (vectorized over theta).
      :type model: callable, optional
      :param theta_sampler: Sampler for :math:`\theta`; required when ``evaluate_model=True``.
      :type theta_sampler: callable, optional
      :param simulated_data: Dict with keys ``"y"``, ``"theta"`` and ``"xi"`` holding arrays of shape
                             (n, dy), (n, dθ) and (n, dξ); used when reusing simulations (``evaluate_model=False``).
      :type simulated_data: dict, optional
      :param xi_list: List of designs; each item can be scalar or array-like.
                      If None, defaults to [0.0].
      :type xi_list: list, optional
      :param n_samples: Number of :math:`\theta` samples to draw when ``evaluate_model=True``. Default: 10000.
      :type n_samples: int
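
      The inputs described above might be prepared as in the following sketch. The
      simulator, the sampler and all variable names are hypothetical; only the
      callable signatures and the dict keys follow the parameter descriptions.

      .. code-block:: python

         import numpy as np

         rng = np.random.default_rng(0)

         def theta_sampler(n):
             """Hypothetical prior sampler: returns theta with shape (n, d_theta)."""
             return rng.uniform(-5.0, 5.0, size=(n, 2))

         def model(theta, xi):
             """Hypothetical simulator, vectorized over the rows of theta."""
             xi = np.atleast_1d(np.asarray(xi, dtype=float))
             return np.sum(theta**2, axis=1, keepdims=True) + xi.sum()

         # Precomputed archive in the dict layout used when reusing simulations.
         theta = theta_sampler(1000)
         xi_value = 0.0
         simulated_data = {
             "y": model(theta, xi_value),                    # (n, d_y)
             "theta": theta,                                 # (n, d_theta)
             "xi": np.full((theta.shape[0], 1), xi_value),   # (n, d_xi)
         }

      Going by the documented signature, the archive would then be passed as
      ``setup(simulated_data=simulated_data, xi_list=[0.0])`` on a calibrator created
      with ``evaluate_model=False``, or ``model`` and ``theta_sampler`` would be passed
      instead when ``evaluate_model=True``.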


   .. py:method:: nearest(y: Union[numpy.ndarray, List[float]], xi: Union[float, numpy.ndarray], k: Optional[int] = None, return_dist: bool = False)

      Return k nearest neighbors for y at design xi.

      :param y: Query outputs, shape (m, d_y) or (d_y,).
      :type y: array-like
      :param xi: Design key to select the per-design index.
      :type xi: scalar or array-like
      :param k: Number of neighbors; defaults to ``self.knn``.
      :type k: int, optional
      :param return_dist: If True, also return distances and raw indices. Default: False.
      :type return_dist: bool

      :returns: ``theta_neighbors`` (ndarray), shape (m*k, dθ): stacked :math:`\theta` for all query rows.
                If ``return_dist=True``, ``distances`` (ndarray) and ``indices`` (ndarray) are returned as well.
      :rtype: ndarray (plus two further ndarrays if ``return_dist=True``)


   .. py:method:: calibrate(observations, resample_n: int | None = None, combine: str = 'stack', combine_params: dict | None = None)

      Run kNN calibration and aggregate posterior θ across neighbor-hit blocks.

      :param observations: Observed simulator or model outputs to calibrate against.
      :param resample_n: If set, resample posterior θ samples to this size.
                         If `None`, return all aggregated θ without resampling.
      :type resample_n: int | None
      :param combine: Aggregation mode. One of:

                      - **'stack'**: concatenate all kNN θ; optional de-duplication.

                      - **'intersect'**: retain θ hit at least `min_count` times across neighbor blocks.
      :type combine: str
      :param combine_params: Optional parameters controlling aggregation and KDE weighting.

                             Supported keys:

                             - **dedup** (bool): Default `False`. Remove duplicate θ (only for 'stack').

                             - **theta_match_tol** (float): Default `1e-9`. Tolerance or rounding quantum for comparing/merging θ values.

                             - **min_count** (int | None): Minimum occurrences for 'intersect'. Default is `max(1, ceil(0.5 * total_blocks))`, meaning θ must appear in about half of neighbor lists.

                             - **use_kde** (bool): Default `False`. If `True`, fit KDE on aggregated θ to compute log-scores and normalized weights.

                             - **kde_bandwidth** (float | None): Bandwidth for KDE. If `None` (default), use Scott's rule.
      :type combine_params: dict | None

      .. tip::

         Two aggregation modes are supported:
         
         - **stack**: Concatenate all kNN θ into a single array. Supports optional de-duplication of nearly identical θ values.
         
         - **intersect**: Keep θ values that occur in at least `min_count` neighbor blocks across all observations/design points (default ≈ half of all blocks).
         
         Optional density weighting via KDE can be applied after aggregation to compute normalized posterior weights.

      :returns:

                A dictionary with keys:
                    - **'mode'** (str): Always `'knn'`.
                    - **'theta'** (ndarray): Posterior samples of shape `(N, dθ)`;
                      resampled if `resample_n` is provided.
                    - **'weights'** (ndarray | None): `None` for stack/intersect,
                      or a length-`N` array of KDE weights if `use_kde=True`.
                    - **'meta'** (dict): Aggregation info; may include KDE bandwidth
                      if density weighting is used.
      :rtype: dict
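
      The difference between the two aggregation modes can be sketched with NumPy.
      This is an illustration, not the method's actual code: the ``blocks`` data are
      made up, exact row equality is used instead of ``theta_match_tol``, and it is
      assumed that each neighbor block counts a given θ at most once.

      .. code-block:: python

         import math
         import numpy as np

         # Hypothetical neighbor blocks: one (k, d_theta) array per observation/design.
         blocks = [
             np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]),
             np.array([[0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]),
             np.array([[0.5, 0.6], [0.9, 1.0], [0.1, 0.2]]),
         ]

         # 'stack': concatenate every neighbor hit, duplicates included.
         stacked = np.vstack(blocks)

         # 'intersect': count in how many blocks each distinct theta row appears.
         per_block_unique = np.vstack([np.unique(b, axis=0) for b in blocks])
         rows, counts = np.unique(per_block_unique, axis=0, return_counts=True)

         min_count = max(1, math.ceil(0.5 * len(blocks)))  # documented default
         intersected = rows[counts >= min_count]

      With three blocks the default ``min_count`` is 2, so only θ rows shared by at
      least two blocks survive, whereas ``stacked`` keeps all nine hits.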


   .. py:method:: _round_rows(A: numpy.ndarray, tol: float) -> tuple[numpy.ndarray, numpy.ndarray]

      Round rows of A to multiples of tol and return (unique_rows, counts).

      :param A: Input array to process.
      :type A: ndarray
      :param tol: Tolerance for rounding. If tol <= 0, exact matching is used.
      :type tol: float

      :returns:

                (unique_rows, counts) where unique_rows are the deduplicated rows
                    and counts are the occurrence counts.
      :rtype: tuple
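
      A possible reading of this helper is sketched below. ``round_rows`` is a
      hypothetical stand-in that follows the description above; the real method may
      differ in details such as the ordering of the returned rows.

      .. code-block:: python

         import numpy as np

         def round_rows(A, tol):
             """Hypothetical stand-in: snap rows to multiples of tol, then count unique rows."""
             A = np.asarray(A, dtype=float)
             if tol > 0:
                 A = np.round(A / tol) * tol  # tol <= 0 falls through to exact matching
             return np.unique(A, axis=0, return_counts=True)

         A = np.array([[1.0000000001, 2.0],
                       [1.0,          2.0],
                       [3.0,          4.0]])
         unique_rows, counts = round_rows(A, tol=1e-6)

      Here the first two rows differ by less than ``tol`` and are merged into one row
      with a count of 2.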


   .. py:method:: _kde_logweights(X, bw=0.5, n_max_exact=5000)

      Compute KDE-based log-weights for posterior samples X.

      :param X: Posterior samples, shape (n, d).
      :type X: ndarray
      :param bw: Bandwidth for Gaussian kernel. Default: 0.5.
      :type bw: float
      :param n_max_exact: Max n for exact pairwise KDE. Above this, fall back to
                          sklearn.KernelDensity. Default: 5000.
      :type n_max_exact: int

      :returns:     - **logp** (ndarray): Log-density values at X, shape (n,).
                    - **w** (ndarray): Normalized weights, shape (n,).
      :rtype: tuple
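
      The exact pairwise branch could look like the following sketch. It assumes an
      isotropic Gaussian kernel and weights proportional to the estimated density;
      ``kde_logweights`` is a hypothetical function, and the fallback to
      ``sklearn.neighbors.KernelDensity`` for large ``n`` is not shown.

      .. code-block:: python

         import numpy as np
         from scipy.special import logsumexp

         def kde_logweights(X, bw=0.5):
             """Hypothetical exact Gaussian KDE evaluated at its own sample points."""
             X = np.asarray(X, dtype=float)
             n, d = X.shape
             # Pairwise squared Euclidean distances, shape (n, n).
             sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
             # Log of the isotropic Gaussian kernel normalizing constant.
             log_norm = -0.5 * d * np.log(2.0 * np.pi * bw**2)
             # log p(x_i) = log((1/n) * sum_j K(x_i - x_j)), computed stably.
             logp = logsumexp(-0.5 * sq_dist / bw**2, axis=1) - np.log(n) + log_norm
             # Normalize in log space so the weights sum to one.
             w = np.exp(logp - logsumexp(logp))
             return logp, w

         rng = np.random.default_rng(0)
         X = rng.normal(size=(200, 2))
         logp, w = kde_logweights(X)

      The pairwise distance matrix needs memory of order ``n**2``, which is a
      plausible reason for capping the exact computation at ``n_max_exact`` samples.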


   .. py:method:: get_posterior() -> Any

      Return the last computed posterior dict; raises if calibrate() hasn't been called.


.. py:function:: estimate_p_theta_knn(observed_data, simulated_data, xi_star, knn: int = 20, a_tol: float = 0.05)

   Estimate the posterior distribution p(θ) using k-Nearest Neighbors (kNN) on a simulation archive.

   This method restricts the simulation archive to runs at (or near) the target design :math:`\xi^*`,
   then fits a kNN model in output (y) space. For each observed output y_obs, it retrieves the
   k-nearest simulated outputs and returns the corresponding :math:`\theta` values as approximate
   posterior samples.

   :param observed_data: Array of observed outputs y_obs, shape (n_obs, d_y).
                         Must match the dimensionality of simulated outputs.
   :type observed_data: ndarray
   :param simulated_data: List of arrays [y, θ, ξ], containing:

                          - **y** (ndarray): Simulation output, shape (n, d_y), e.g., transformed y with only KPIs
                          - **θ** (ndarray): Parameters and variables to be calibrated, shape (n, d_theta)
                          - **ξ** (ndarray): Conditioning controllable factors, shape (n, d_xi), e.g., design parameters
   :type simulated_data: list
   :param knn: Number of nearest neighbors to query per observed sample. Default: 20.
   :type knn: int
   :param xi_star: Target design :math:`\xi^*` at which the posterior is estimated.
   :type xi_star: scalar or array-like
   :param a_tol: Tolerance for matching simulations to :math:`\xi^*`. Default: 0.05.
                 A simulation is kept if :math:`\|\xi_{\text{sim}} - \xi^*\|_\infty < a_{\text{tol}}`.
   :type a_tol: float

   :returns:

             :math:`\theta` samples from the posterior, stacked across all observed y.
                 Shape: (n_obs × knn, d_theta).
   :rtype: ndarray

   :raises ValueError: If filtering leaves no simulations at :math:`\xi^*`.
   :raises RuntimeError: If kNN search fails due to inconsistent dimensions.

   .. note::

      - Scaling of outputs y is performed internally via StandardScaler for robustness
        against different KPI magnitudes.
      - The parameter ``knn`` acts as a smoothing parameter: larger values pool more
        neighbors per observation and give a broader, smoother posterior, while smaller
        values give a sharper but noisier one.
      - The choice of ``a_tol`` trades off strict design conditioning vs. sample size.
        Too small → few matches; too large → weaker conditioning.

   .. rubric:: Example

   >>> import numpy as np
   >>> from pyuncertainnumber.calibration.knn import estimate_p_theta_knn
   >>> rng = np.random.default_rng(0)
   >>> # Fake simulator archive at a single design xi = 0
   >>> theta_sim = rng.uniform(-5, 5, size=(5000, 2))
   >>> xi_sim = np.zeros((5000, 1))
   >>> y_sim = (np.sum(theta_sim**2, axis=1, keepdims=True)
   ...          + 0.1 * rng.standard_normal((5000, 1)))
   >>> simulated_data = [y_sim, theta_sim, xi_sim]
   >>> # One observed output with shape (n_obs, d_y) = (1, 1)
   >>> theta_true = np.array([1.5, -2.0])
   >>> y_obs = np.sum(theta_true**2) + 0.1 * rng.standard_normal((1, 1))
   >>> # Estimate posterior
   >>> theta_post = estimate_p_theta_knn(
   ...     observed_data=y_obs,
   ...     simulated_data=simulated_data,
   ...     knn=50,
   ...     xi_star=0.0,
   ... )
   >>> theta_post.shape
   (50, 2)
   >>> # y depends on theta only through theta_1**2 + theta_2**2, so the samples
   >>> # lie near the circle of radius 2.5 through theta_true; their mean is
   >>> # therefore not expected to recover theta_true itself.
   >>> radii = np.linalg.norm(theta_post, axis=1)