train_test_split: Random Train/Test Data Splitter
The train_test_split
function provides a simple and flexible way to split your dataset into random training and testing subsets. It is compatible with numpy arrays and can also handle lists or tuples as input. This function is essential for evaluating machine learning models on unseen data and is a core utility in most ML workflows.
Overview
Splitting your data into training and testing sets is a fundamental step in machine learning. The train_test_split
function allows you to:
- Randomly partition your data into train and test sets.
- Specify the proportion or absolute number of test samples.
- Shuffle your data for unbiased splitting.
- Use a random seed for reproducibility.
- Split both features (
X
) and targets (y
) in a consistent manner.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
X |
array-like | — | Feature data to be split. Must be indexable and of consistent length. |
y |
array-like or None | None | Target data to be split alongside X. Must be same length as X. |
test_size |
float or int | 0.25 | If float, fraction of data for test set (0.0 < test_size < 1.0). If int, absolute number of test samples. |
shuffle |
bool | True | Whether to shuffle the data before splitting. |
random_seed |
int or None | None | Controls the shuffling for reproducibility. |
Returns
-
X_train, X_test:
np.ndarray
Train-test split of X. -
y_train, y_test:
np.ndarray
orNone
Train-test split of y. If y is None, these will also be None.
Raises
-
ValueError
If inputs are invalid ortest_size
is not appropriate. -
TypeError
Iftest_size
is not a float or int.
Example Usage
import numpy as np
from machinegnostics.models import train_test_split
# Create sample data
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, shuffle=True, random_seed=42
)
print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)
Notes
- If
y
is not provided, onlyX
will be split andy_train
,y_test
will beNone
. - If
test_size
is a float, it must be between 0.0 and 1.0 (exclusive). - If
test_size
is an int, it must be between 1 andlen(X) - 1
. - Setting
shuffle=False
will split the data in order, without randomization. - Use
random_seed
for reproducible splits.
Author: Nirmal Parmar
Date: 2025-05-01