train_test_split: Random Train/Test Data Splitter

The train_test_split function splits a dataset into random training and testing subsets. It accepts NumPy arrays as well as lists and tuples. Holding out a test set lets you evaluate a machine learning model on data it did not see during training.


Overview

Splitting your data into training and testing sets is a fundamental step in machine learning. The train_test_split function allows you to:

  • Randomly partition your data into train and test sets.
  • Specify the proportion or absolute number of test samples.
  • Shuffle your data for unbiased splitting.
  • Use a random seed for reproducibility.
  • Split both features (X) and targets (y) in a consistent manner.

Parameters

| Parameter   | Type               | Default    | Description |
|-------------|--------------------|------------|-------------|
| X           | array-like         | (required) | Feature data to be split. Must be indexable and of consistent length. |
| y           | array-like or None | None       | Target data to be split alongside X. Must be the same length as X. |
| test_size   | float or int       | 0.25       | If float, the fraction of the data used for the test set (0.0 < test_size < 1.0). If int, the absolute number of test samples. |
| shuffle     | bool               | True       | Whether to shuffle the data before splitting. |
| random_seed | int or None        | None       | Seed that controls the shuffling, for reproducibility. |

Returns

  • X_train, X_test: np.ndarray
    Train-test split of X.

  • y_train, y_test: np.ndarray or None
    Train-test split of y. If y is None, these will also be None.
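
The return contract above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: sketch_split is a hypothetical stand-in, and the rounding rule for a float test_size, the use of a NumPy Generator, and the choice to put the trailing indices in the test set are all assumptions made for illustration.

```python
import numpy as np


def sketch_split(X, y=None, test_size=0.25, shuffle=True, random_seed=None):
    """Hypothetical stand-in that mimics the documented return values."""
    X = np.asarray(X)
    n_samples = len(X)

    # Assumption: a float test_size is turned into a count by rounding up.
    if isinstance(test_size, float):
        n_test = int(np.ceil(test_size * n_samples))
    else:
        n_test = test_size

    indices = np.arange(n_samples)
    if shuffle:
        # Assumption: a seeded NumPy Generator drives the shuffle.
        indices = np.random.default_rng(random_seed).permutation(n_samples)

    # Assumption: leading indices form the train set, trailing ones the test set.
    n_train = n_samples - n_test
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    X_train, X_test = X[train_idx], X[test_idx]
    if y is None:
        # Matches the documented behaviour: no targets in, None out.
        return X_train, X_test, None, None

    y = np.asarray(y)
    # The same index arrays are reused, so rows of X and y stay aligned.
    return X_train, X_test, y[train_idx], y[test_idx]


X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = sketch_split(X, y, test_size=0.3, random_seed=42)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

_, _, y_train_none, y_test_none = sketch_split(X, test_size=3)
print(y_train_none, y_test_none)  # None None
```

Indexing X and y with the same index arrays is what keeps each feature row paired with its target after shuffling.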


Raises

  • ValueError
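    If the inputs are invalid or test_size falls outside its allowed range: a float must satisfy 0.0 < test_size < 1.0, and an int must lie between 1 and len(X) - 1.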

  • TypeError
    If test_size is not a float or int.
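
As a rough illustration of these error conditions, the sketch below validates test_size the way this page describes. The helper check_test_size is hypothetical and is not part of the library; rejecting bool values is an extra assumption, made because bool is a subclass of int in Python.

```python
def check_test_size(test_size, n_samples):
    """Hypothetical validator that mirrors the documented rules for test_size."""
    # Assumption: True/False are rejected even though bool subclasses int.
    if isinstance(test_size, bool) or not isinstance(test_size, (int, float)):
        raise TypeError("test_size must be a float or an int")

    if isinstance(test_size, float):
        # Floats are fractions and must lie strictly between 0.0 and 1.0.
        if not 0.0 < test_size < 1.0:
            raise ValueError("float test_size must satisfy 0.0 < test_size < 1.0")
    elif not 1 <= test_size <= n_samples - 1:
        # Ints are absolute counts and must leave at least one training sample.
        raise ValueError("int test_size must be between 1 and n_samples - 1")


# Try a few candidate values against a dataset of 10 samples.
for candidate in (0.3, 3, 1.5, 0, 10, "0.3"):
    try:
        check_test_size(candidate, n_samples=10)
        print(repr(candidate), "is accepted")
    except (TypeError, ValueError) as err:
        print(repr(candidate), "is rejected:", type(err).__name__)
```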


Example Usage

import numpy as np
from machinegnostics.models import train_test_split

# Create sample data
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_seed=42
)

print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

Notes

  • If y is not provided, only X will be split and y_train, y_test will be None.
  • If test_size is a float, it must be between 0.0 and 1.0 (exclusive).
  • If test_size is an int, it must be between 1 and len(X) - 1.
  • Setting shuffle=False will split the data in order, without randomization.
  • Use random_seed for reproducible splits.
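
The last two notes can be seen directly on index arrays. The sketch below uses NumPy's Generator as a stand-in for whatever random number generator the library uses internally, and it assumes that an ordered split places the leading samples in the train set; neither detail is confirmed by this page.

```python
import numpy as np

n_samples = 10
n_test = 3

# Seeding two generators with the same value yields the same permutation,
# which is what makes a shuffled split repeatable across runs.
first = np.random.default_rng(42).permutation(n_samples)
second = np.random.default_rng(42).permutation(n_samples)
print(np.array_equal(first, second))  # True

# With no shuffling the indices keep their original order, so the split is
# a plain slice (assumption: train takes the leading samples).
ordered = np.arange(n_samples)
train_idx, test_idx = ordered[:n_samples - n_test], ordered[n_samples - n_test:]
print(train_idx, test_idx)  # [0 1 2 3 4 5 6] [7 8 9]
```

An ordered split is typically the safer choice when the rows have a meaningful sequence, such as time-stamped measurements, because shuffling would mix later observations into the training set.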

Author: Nirmal Parmar
Date: 2025-05-01