gcorrelation: Gnostic Correlation Metric
The gcorrelation
function computes the Gnostic correlation coefficient between two data samples using robust irrelevance-based weighting. This metric provides a robust alternative to the classical Pearson correlation, making it less sensitive to outliers and non-normal data distributions.
Overview
Gnostic correlation leverages irrelevance functions to construct robust weights for each data point, following the gnostic framework described by Kovanic & Humber (2015). This approach allows for a more reliable measure of association between variables, especially in the presence of noise or outliers.
- Robust to outliers: Uses irrelevance-based weighting.
- No normality assumption: Works well with non-Gaussian data.
- Flexible: Supports both 1D and 2D data (column-wise correlation).
Parameters
Parameter | Type | Description |
---|---|---|
data_1 |
np.ndarray, pandas Series, or DataFrame | First data sample (1D or 2D). Each column is treated as a variable. |
data_2 |
np.ndarray, pandas Series, or DataFrame | Second data sample (must have same number of rows as data_1 ). |
Returns
- float, np.ndarray, or pandas.DataFrame
The calculated Gnostic correlation coefficient(s): - If both inputs are 1D: returns a float.
- If either input is 2D: returns a correlation matrix (np.ndarray or pandas DataFrame if input was pandas).
Raises
- ValueError
- If input arrays have different lengths.
- If inputs are empty or not numpy arrays/pandas Series/DataFrame.
- If input shapes are incompatible.
Example Usage
import numpy as np
from machinegnostics.metrics import gcorrelation
# Example 1: 1D arrays (robust analog of Pearson correlation)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 2.1, 2.9, 4.2, 4.8])
gcor = gcorrelation(x, y)
print(f"Estimation correlation: {gcor:.3f}") # Output: Estimation correlation: 0.999
# Example 2: DataFrames (column-wise correlation matrix)
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'c': [1, 2, 1], 'd': [6, 5, 4]})
corr_matrix = gcorrelation(df1, df2)
print(corr_matrix)
Notes
- The location parameter is set by the mean (can be replaced by G-median for higher robustness).
- The geometric mean of the weights is used as the "best" weighting vector.
- For 2D arrays or DataFrames, the function computes the correlation for each pair of columns.
- The output is a DataFrame with appropriate column and index names if the input was pandas.