train_test_split: Random Train/Test Data Splitter

The train_test_split function provides a simple and flexible way to split your dataset into random training and testing subsets. It is compatible with numpy arrays and can also handle lists or tuples as input. This function is essential for evaluating machine learning models on unseen data and is a core utility in most ML workflows.

Overview

Splitting your data into training and testing sets is a fundamental step in machine learning. The train_test_split function allows you to:

Randomly partition your data into train and test sets.
Specify the proportion or absolute number of test samples.
Shuffle your data for unbiased splitting.
Use a random seed for reproducibility.
Split both features (X) and targets (y) in a consistent manner.

Parameters

Parameter	Type	Default	Description
`X`	array-like	—	Feature data to be split. Must be indexable and of consistent length.
`y`	array-like or None	None	Target data to be split alongside X. Must be same length as X.
`test_size`	float or int	0.25	If float, fraction of data for test set (0.0 < test_size < 1.0). If int, absolute number of test samples.
`shuffle`	bool	True	Whether to shuffle the data before splitting.
`random_seed`	int or None	None	Controls the shuffling for reproducibility.

Returns

X_train, X_test: np.ndarray
Train-test split of X.
y_train, y_test: np.ndarray or None
Train-test split of y. If y is None, these will also be None.

Raises

ValueError
If inputs are invalid or test_size is not appropriate.
TypeError
If test_size is not a float or int.

Example Usage

import numpy as np
from machinegnostics.models import train_test_split

# Create sample data
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_seed=42
)

print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

Notes

If y is not provided, only X will be split and y_train, y_test will be None.
If test_size is a float, it must be between 0.0 and 1.0 (exclusive).
If test_size is an int, it must be between 1 and len(X) - 1.
Setting shuffle=False will split the data in order, without randomization.
Use random_seed for reproducible splits.

Author: Nirmal Parmar
Date: 2025-05-01