Skip to content

gcorrelation: Gnostic Correlation Metric

The gcorrelation function computes the Gnostic correlation coefficient between two data samples using robust irrelevance-based weighting. This metric provides a robust alternative to the classical Pearson correlation, making it less sensitive to outliers and non-normal data distributions.


Overview

Gnostic correlation leverages irrelevance functions to construct robust weights for each data point, following the gnostic framework described by Kovanic & Humber (2015). This approach allows for a more reliable measure of association between variables, especially in the presence of noise or outliers.

  • Robust to outliers: Uses irrelevance-based weighting.
  • No normality assumption: Works well with non-Gaussian data.
  • Flexible: Supports both 1D and 2D data (column-wise correlation).

Parameters

Parameter Type Description
data_1 np.ndarray, pandas Series, or DataFrame First data sample (1D or 2D). Each column is treated as a variable.
data_2 np.ndarray, pandas Series, or DataFrame Second data sample (must have same number of rows as data_1).

Returns

  • float, np.ndarray, or pandas.DataFrame
    The calculated Gnostic correlation coefficient(s):
  • If both inputs are 1D: returns a float.
  • If either input is 2D: returns a correlation matrix (np.ndarray or pandas DataFrame if input was pandas).

Raises

  • ValueError
  • If input arrays have different lengths.
  • If inputs are empty or not numpy arrays/pandas Series/DataFrame.
  • If input shapes are incompatible.

Example Usage

import numpy as np
from machinegnostics.metrics import gcorrelation

# Example 1: 1D arrays (robust analog of Pearson correlation)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 2.1, 2.9, 4.2, 4.8])
gcor = gcorrelation(x, y)
print(f"Estimation correlation: {gcor:.3f}")  # Output: Estimation correlation: 0.999

# Example 2: DataFrames (column-wise correlation matrix)
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'c': [1, 2, 1], 'd': [6, 5, 4]})
corr_matrix = gcorrelation(df1, df2)
print(corr_matrix)

Notes

  • The location parameter is set by the mean (can be replaced by G-median for higher robustness).
  • The geometric mean of the weights is used as the "best" weighting vector.
  • For 2D arrays or DataFrames, the function computes the correlation for each pair of columns.
  • The output is a DataFrame with appropriate column and index names if the input was pandas.