ClusterAnalysis: End-to-End Clustering-Based Bound Estimation (Machine Gnostics)¶

The ClusterAnalysis class provides a robust, automated workflow for estimating main cluster bounds in a dataset using Gnostic Distribution Functions (GDFs) and advanced clustering analysis. It is designed for interpretable, reproducible interval estimation in scientific, engineering, and data science applications.

Overview¶

ClusterAnalysis orchestrates the entire process of fitting a GDF (ELDF/EGDF), assessing data homogeneity, performing cluster boundary detection, and returning interpretable lower and upper cluster bounds (LCB, UCB) for the main data cluster.

Automated Pipeline: Integrates GDF fitting, homogeneity testing, and cluster analysis.
Flexible: Supports both local (ELDF) and global (EGDF) GDFs.
Robust: Handles weighted data, bounded/unbounded domains, and advanced parameterization.
Diagnostics: Detailed error/warning logging and reproducible parameter tracking.
Memory-Efficient: Optional flushing of intermediate results.
Visualization: Built-in plotting for GDF and cluster analysis results.

Key Features¶

End-to-end cluster-based bound estimation
Integrates GDF fitting, homogeneity testing, and clustering
Supports local and global GDFs
Handles weighted, bounded, and unbounded data
Detailed error and warning logging
Memory-efficient operation via flushing
Visualization of GDF and cluster analysis results

Parameters¶

Parameter	Type	Default	Description
`verbose`	bool	False	Print detailed logs and progress information
`catch`	bool	True	Store intermediate results and diagnostics
`derivative_threshold`	float	0.01	Threshold for derivative-based cluster boundary detection
`slope_percentile`	int	70	Percentile for slope-based boundary detection
`DLB`	float or None	None	Data Lower Bound (absolute minimum, optional)
`DUB`	float or None	None	Data Upper Bound (absolute maximum, optional)
`LB`	float or None	None	Lower probable bound (optional)
`UB`	float or None	None	Upper probable bound (optional)
`S`	float or 'auto'	'auto'	Scale parameter for GDF ('auto' for automatic estimation)
`varS`	bool	False	Use variable scale parameter during optimization
`z0_optimize`	bool	True	Optimize location parameter Z0 during fitting
`tolerance`	float	1e-5	Convergence tolerance for optimization
`data_form`	str	'a'	Data form: 'a' (additive), 'm' (multiplicative)
`n_points`	int	1000	Number of points for GDF evaluation
`homogeneous`	bool	True	Assume data homogeneity
`weights`	np.ndarray or None	None	Prior weights for data points
`wedf`	bool	False	Use Weighted Empirical Distribution Function
`opt_method`	str	'L-BFGS-B'	Optimization method (scipy.optimize)
`max_data_size`	int	1000	Max data size for smooth GDF generation
`flush`	bool	False	Flush intermediate results after fitting to save memory

Attributes¶

LCB: float or None
Lower Cluster Bound (main cluster lower edge)
UCB: float or None
Upper Cluster Bound (main cluster upper edge)
params: dict
All parameters, intermediate results, errors, and warnings
_fitted: bool
Indicates whether analysis has been completed

Methods¶

`fit(data, plot=False)`¶

Runs the full cluster analysis pipeline on the input data.

data: np.ndarray, shape (n_samples,)
Input data array for interval analysis
plot: bool (optional)
If True, generates plots for the fitted GDF and cluster analysis

Returns:
tuple — (LCB, UCB) as the main cluster bounds

`results()`¶

Returns a dictionary with the estimated bounds and key results.

Returns:
dict — { 'LCB': float, 'UCB': float }

`plot()`¶

Visualizes the fitted GDF and cluster analysis results (if not flushed).

Returns:
None (displays plot)

Example Usage¶

import numpy as np
from machinegnostics.magcal import ClusterAnalysis

# Example data
data = np.array([ -13.5, 0, 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])

# Initialize ClusterAnalysis
ca = ClusterAnalysis(verbose=True)

# Fit and get cluster bounds
LCB, UCB = ca.fit(data)
print(f"Main cluster bounds: LCB={LCB:.3f}, UCB={UCB:.3f}")

# Visualize results
ca.plot()

# Access results dictionary
results = ca.results()
print(results)

Notes¶

Designed for robust, interpretable cluster-based bound estimation
Works best with local GDFs (ELDF); global GDFs (EGDF) are supported
If homogeneous=True but data is heterogeneous, a warning is issued
All intermediate parameters, errors, and warnings are tracked in params
For large datasets or memory-constrained environments, set flush=True to save memory (disables plotting)

References¶

Gnostic Distribution Function theory and clustering methods (see mathematical gnostics literature)
For details on underlying algorithms, see documentation for ELDF, EGDF, and DataCluster classes

Author: Nirmal Parmar
Date: 2025-09-24