DataHomogeneity: Homogeneity Analysis for EGDF (Machine Gnostics)¶
Analyze data homogeneity for EGDF objects using probability density function analysis. This class provides comprehensive homogeneity analysis for Estimating Global Distribution Functions (EGDF) by examining the shape and characteristics of their probability density functions (PDF).
Overview¶
The homogeneity criterion is based on the mathematical properties and expected PDF behavior of EGDF according to Mathematical Gnostic theory principles.
Gnostic Theory Foundation:
The EGDF is uniquely determined by the data sample and finds the optimal scale parameter automatically. Unlike local distribution functions, EGDF has limited flexibility and provides a unique representation for each homogeneous data sample. The key principle is that homogeneous data should produce a distribution with a single density maximum, while non-homogeneous data will exhibit multiple maxima or negative density values.
Homogeneity Criteria:
- EGDF (Estimating Global Distribution Function): Data is considered homogeneous if:
- PDF has exactly one global maximum (single peak)
- PDF contains no negative values
EGDF Characteristics:
- Uniqueness: EGDF finds the best scale parameter automatically, providing a unique model
- Robustness: EGDF is robust with respect to outliers
- Homogeneity Testing: Particularly suitable for reliable data homogeneity testing
- Global Nature: Uses normalized weights resulting in limited flexibility controlled by optimal scale
- Data-Driven: Primary parameters are the data themselves, following gnostic "let data speak" principle
Non-Homogeneity Detection:
EGDF can sensitively detect two main causes of non-homogeneity: 1. Outliers: Individual data points significantly different from others, creating local maxima 2. Clusters: Separate groups in the data, resulting in multiple density peaks
Analysis Pipeline:
- Validation: Ensures input is EGDF only (rejects QGDF/ELDF/QLDF)
- PDF Extraction: Retrieves PDF points from fitted EGDF object
- Smoothing: Applies Gaussian filtering for noise reduction
- Maxima Detection: Identifies peaks in the smoothed PDF
- Homogeneity Assessment: Evaluates based on peak count and PDF negativity
- Result Storage: Comprehensive parameter collection and storage
Key Features¶
- Automatic EGDF validation
- Robust peak detection with configurable smoothing
- Comprehensive error and warning tracking
- Memory management with optional data flushing
- Detailed visualization of analysis results
- Integration with existing GDF parameter systems
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
gdf |
EGDF | - | Fitted Estimating Global Distribution Function object. |
verbose |
bool | False | Controls output verbosity during analysis. |
catch |
bool | True | Enables comprehensive result storage in params dictionary. |
flush |
bool | False | Controls memory management of large arrays after analysis. |
smoothing_sigma |
float | 1.0 | Gaussian smoothing parameter for PDF preprocessing. |
min_height_ratio |
float | 0.01 | Minimum relative height threshold for peak detection. |
min_distance |
int (Optional) | None | Minimum separation between detected peaks in array indices. |
Attributes¶
- is_homogeneous:
boolorNonePrimary analysis result. None before fit(), True/False after analysis. - picks:
List[Dict]Detected maxima with detailed information. - z0:
floatorNoneGlobal optimum value from EGDF object or detected from PDF. - global_extremum_idx:
intorNoneArray index of the global maximum. - fitted:
boolRead-only property indicating if analysis has been completed.
Methods¶
fit(plot=False)¶
Perform comprehensive homogeneity analysis on the EGDF object. It analyzes the probability density function (PDF) of the fitted EGDF object to determine if the underlying data exhibits homogeneous characteristics.
- plot:
bool(optional, default=False) If True, generates plots for visual inspection of the analysis results.
Returns:
bool — True if data is homogeneous, False otherwise.
results()¶
Retrieve comprehensive homogeneity analysis results and metadata.
Returns:
Dict[str, Any] — Comprehensive results dictionary including core findings, maxima information, PDF characteristics, analysis configuration, and diagnostics.
plot(figsize=(12, 8), title=None)¶
Create a comprehensive visualization of the homogeneity analysis results, displaying the PDF, detected maxima, and status.
- figsize:
tuple(default=(12, 8)) Figure dimensions in inches as (width, height). - title:
str(optional) Custom plot title.
Returns: None
Example Usage¶
import numpy as np
from machinegnostics.magcal import EGDF, DataHomogeneity
# Homogeneous data
data = np.array([ -3, 0, 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
egdf = EGDF(data=data, catch=True)
egdf.fit()
# Homogeneity analysis
homogeneity = DataHomogeneity(egdf, verbose=True)
is_homogeneous = homogeneity.fit(plot=True)
print(f"Data is homogeneous: {is_homogeneous}")
# Access results
results = homogeneity.results()
print(f"Number of maxima detected: {len(results['picks'])}")

Notes¶
Mathematical Background:
The gnostic homogeneity analysis is based on the principle that homogeneous data should produce a unimodal PDF with specific characteristics for EGDF:
- EGDF Uniqueness: Each data sample has exactly one optimal EGDF representation
- Scale Optimization: EGDF automatically finds the best scale parameter S_opt
- Density Properties: Homogeneous data produces single maximum, non-negative density
- Numerical Sensitivity: Analysis must be numerical, not based on visual inspection
Why Only EGDF:
Homogeneity testing is only applicable to EGDF because:
- EGDF provides unique representation for each data sample
- Automatic scale parameter optimization enables reliable homogeneity testing
- Global nature with normalized weights makes it suitable for detecting data structure
- Robustness against outliers while maintaining sensitivity to detect them
Parameter Tuning Guidelines:
- smoothing_sigma: Start with 1.0, increase for noisy data to improve numerical stability
- min_height_ratio: Start with 0.01, increase to reduce false positives from noise
- min_distance: Usually auto-calculated, manually set for specific data characteristics
Author: Nirmal Parmar
Date: 2025-10-10