ClusterAnalysis: End-to-End Clustering-Based Bound Estimation (Machine Gnostics)¶
The ClusterAnalysis class provides a robust, automated workflow for estimating main cluster bounds in a dataset using Gnostic Distribution Functions (GDFs) and advanced clustering analysis. It is designed for interpretable, reproducible interval estimation in scientific, engineering, and data science applications.
Overview¶
ClusterAnalysis orchestrates the entire process of fitting a GDF (ELDF/EGDF), assessing data homogeneity, performing cluster boundary detection, and returning interpretable lower and upper cluster bounds (LCB, UCB) for the main data cluster.
- Automated Pipeline: Integrates GDF fitting, homogeneity testing, and cluster analysis.
- Flexible: Supports both local (ELDF) and global (EGDF) GDFs.
- Robust: Handles weighted data, bounded/unbounded domains, and advanced parameterization.
- Diagnostics: Detailed error/warning logging and reproducible parameter tracking.
- Memory-Efficient: Optional flushing of intermediate results.
- Visualization: Built-in plotting for GDF and cluster analysis results.
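To make the idea of bounding a main cluster concrete, here is a toy sketch in plain Python. It is not the library's algorithm: a Gaussian kernel density stands in for the fitted GDF, and a relative density cutoff stands in for the derivative- and slope-based criteria. The names `kernel_density` and `main_cluster_bounds`, the bandwidth, and the cutoff value are all assumptions made for illustration.

```python
import math


def kernel_density(data, grid, bandwidth=1.0):
    """Unnormalized Gaussian kernel density (a stand-in for a fitted GDF)."""
    return [
        sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
        for x in grid
    ]


def main_cluster_bounds(data, n_points=1000, bandwidth=1.0, rel_threshold=0.01):
    """Walk outward from the density peak until the density falls below
    rel_threshold times the peak value (a hypothetical criterion)."""
    lo = min(data) - 4 * bandwidth
    hi = max(data) + 4 * bandwidth
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]

    density = kernel_density(data, grid, bandwidth)
    peak = max(density)
    cutoff = rel_threshold * peak

    # Start at the first grid point that attains the peak density.
    left = right = density.index(peak)
    while left > 0 and density[left - 1] >= cutoff:
        left -= 1
    while right < n_points - 1 and density[right + 1] >= cutoff:
        right += 1
    return grid[left], grid[right]


data = [-13.5, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
lcb, ucb = main_cluster_bounds(data)
print(f"Toy main-cluster bounds: {lcb:.2f} to {ucb:.2f}")
```

In this toy version the isolated value at -13.5 falls outside the returned interval because the density between it and the rest of the data drops below the cutoff. The bounds reported by ClusterAnalysis itself may differ, since it works on a GDF rather than a kernel density.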
Key Features¶
- End-to-end cluster-based bound estimation
- Integrates GDF fitting, homogeneity testing, and clustering
- Supports local and global GDFs
- Handles weighted, bounded, and unbounded data
- Detailed error and warning logging
- Memory-efficient operation via flushing
- Visualization of GDF and cluster analysis results
Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
verbose | bool | False | Print detailed logs and progress information |
catch | bool | True | Store intermediate results and diagnostics |
derivative_threshold | float | 0.01 | Threshold for derivative-based cluster boundary detection |
slope_percentile | int | 70 | Percentile for slope-based boundary detection |
DLB | float or None | None | Data Lower Bound (absolute minimum, optional) |
DUB | float or None | None | Data Upper Bound (absolute maximum, optional) |
LB | float or None | None | Lower probable bound (optional) |
UB | float or None | None | Upper probable bound (optional) |
S | float or 'auto' | 'auto' | Scale parameter for GDF ('auto' for automatic estimation) |
varS | bool | False | Use variable scale parameter during optimization |
z0_optimize | bool | True | Optimize location parameter Z0 during fitting |
tolerance | float | 1e-5 | Convergence tolerance for optimization |
data_form | str | 'a' | Data form: 'a' (additive), 'm' (multiplicative) |
n_points | int | 1000 | Number of points for GDF evaluation |
homogeneous | bool | True | Assume data homogeneity |
weights | np.ndarray or None | None | Prior weights for data points |
wedf | bool | False | Use Weighted Empirical Distribution Function |
opt_method | str | 'L-BFGS-B' | Optimization method (scipy.optimize) |
max_data_size | int | 1000 | Max data size for smooth GDF generation |
flush | bool | False | Flush intermediate results after fitting to save memory |
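With this many options, it can help to keep overrides in one place and catch misspelled names early. The sketch below copies the defaults from the parameter table above into a dictionary and merges user overrides into it. The helper `build_config` and its checks are hypothetical and not part of the library; the range check on `slope_percentile` assumes it is an ordinary percentile between 0 and 100.

```python
# Defaults copied from the parameter table above.
DOCUMENTED_DEFAULTS = {
    "verbose": False,
    "catch": True,
    "derivative_threshold": 0.01,
    "slope_percentile": 70,
    "DLB": None,
    "DUB": None,
    "LB": None,
    "UB": None,
    "S": "auto",
    "varS": False,
    "z0_optimize": True,
    "tolerance": 1e-5,
    "data_form": "a",
    "n_points": 1000,
    "homogeneous": True,
    "weights": None,
    "wedf": False,
    "opt_method": "L-BFGS-B",
    "max_data_size": 1000,
    "flush": False,
}


def build_config(**overrides):
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise TypeError(f"Unknown parameter(s): {sorted(unknown)}")

    config = dict(DOCUMENTED_DEFAULTS)
    config.update(overrides)

    # The table documents only 'a' and 'm' as data forms.
    if config["data_form"] not in ("a", "m"):
        raise ValueError("data_form must be 'a' or 'm'")
    # Assumption: slope_percentile is a percentile in [0, 100].
    if not 0 <= config["slope_percentile"] <= 100:
        raise ValueError("slope_percentile must be between 0 and 100")
    return config


config = build_config(verbose=True, flush=True)
print(config["verbose"], config["flush"], config["S"])
```

The resulting dictionary could then be unpacked as `ClusterAnalysis(**config)`, assuming the constructor accepts these names as keyword arguments, which the table suggests but this sketch does not verify.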
Attributes¶
- LCB: float or None. Lower Cluster Bound (main cluster lower edge)
- UCB: float or None. Upper Cluster Bound (main cluster upper edge)
- params: dict. All parameters, intermediate results, errors, and warnings
- _fitted: bool. Indicates whether analysis has been completed
Methods¶
fit(data, plot=False)¶
Runs the full cluster analysis pipeline on the input data.
- data: np.ndarray, shape (n_samples,). Input data array for interval analysis
- plot: bool (optional). If True, generates plots for the fitted GDF and cluster analysis
Returns: tuple (LCB, UCB), the main cluster bounds
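Because fit expects a one-dimensional array, a small defensive pre-check before calling it can surface input problems early. The helper below is a hypothetical sketch, not part of the library. Its rules (reject empty input, reject NaN or infinite values, require one finite non-negative weight per data point) are assumptions about what a sensible input looks like, not documented behavior of ClusterAnalysis.

```python
import math


def prepare_inputs(values, weights=None):
    """Return (data, weights) as plain float lists after basic sanity checks."""
    data = [float(v) for v in values]
    if not data:
        raise ValueError("data must contain at least one value")
    if not all(math.isfinite(v) for v in data):
        raise ValueError("data must not contain NaN or infinite values")

    if weights is not None:
        weights = [float(w) for w in weights]
        if len(weights) != len(data):
            raise ValueError("weights must have one entry per data point")
        if any(w < 0 or not math.isfinite(w) for w in weights):
            raise ValueError("weights must be finite and non-negative")
    return data, weights


data, weights = prepare_inputs([-13.5, 0, 1, 2, 3], weights=[0.5, 1, 1, 1, 1])
print(len(data), weights)
```

The checked lists can be converted with `np.asarray(data)` before being passed to fit, and the weights supplied through the `weights` constructor parameter.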
results()¶
Returns a dictionary with the estimated bounds and key results.
Returns: dict of the form { 'LCB': float, 'UCB': float }
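A common next step is to use the returned bounds to separate main-cluster values from the rest. The sketch below works on any dictionary with 'LCB' and 'UCB' keys. The helper `split_by_bounds` is hypothetical, the example bounds are made-up values rather than real ClusterAnalysis output, and treating the bounds as inclusive is an assumption.

```python
def split_by_bounds(values, bounds):
    """Split values into (inside, outside) the interval [LCB, UCB]."""
    lcb, ucb = bounds.get("LCB"), bounds.get("UCB")
    # The attributes are documented as "float or None", so guard against None.
    if lcb is None or ucb is None:
        raise ValueError("cluster bounds are not available")

    inside = [v for v in values if lcb <= v <= ucb]
    outside = [v for v in values if not (lcb <= v <= ucb)]
    return inside, outside


# Hypothetical bounds for illustration only, not actual library output.
example_bounds = {"LCB": -1.0, "UCB": 11.0}
data = [-13.5, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

main_cluster, outliers = split_by_bounds(data, example_bounds)
print("Main cluster:", main_cluster)
print("Outside bounds:", outliers)
```

In real use, `example_bounds` would be replaced by the dictionary returned from `results()`.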
plot()¶
Visualizes the fitted GDF and cluster analysis results (if not flushed).
Returns: None (displays plot)
Example Usage¶
```python
import numpy as np
from machinegnostics.magcal import ClusterAnalysis

# Example data
data = np.array([-13.5, 0, 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])

# Initialize ClusterAnalysis
ca = ClusterAnalysis(verbose=True)

# Fit and get cluster bounds
LCB, UCB = ca.fit(data)
print(f"Main cluster bounds: LCB={LCB:.3f}, UCB={UCB:.3f}")

# Visualize results
ca.plot()

# Access results dictionary
results = ca.results()
print(results)
```
Notes¶
- Designed for robust, interpretable cluster-based bound estimation
- Works best with local GDFs (ELDF); global GDFs (EGDF) are supported
- If homogeneous=True but data is heterogeneous, a warning is issued
- All intermediate parameters, errors, and warnings are tracked in params
- For large datasets or memory-constrained environments, set flush=True to save memory (disables plotting)
References¶
- Gnostic Distribution Function theory and clustering methods (see mathematical gnostics literature)
- For details on underlying algorithms, see documentation for ELDF, EGDF, and DataCluster classes
Author: Nirmal Parmar
Date: 2025-09-24