| Title: | Data Normalization and Transformation |
|---|---|
| Description: | Provides functions for data normalization and transformation in preprocessing stages. Implements scaling methods (min-max, Z-score, L2 normalization) and power transformations (Box-Cox, Yeo-Johnson). Box-Cox transformation is described in Box and Cox (1964) <doi:10.1111/j.2517-6161.1964.tb00553.x>, Yeo-Johnson transformation in Yeo and Johnson (2000) <doi:10.1093/biomet/87.4.954>. |
| Authors: | Rui Gong [aut, cre] (ORCID: <https://orcid.org/0000-0001-5112-5696>) |
| Maintainer: | Rui Gong <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-05-24 08:45:17 UTC |
| Source: | https://github.com/gonrui/prepkit |
Normalizes a numeric vector by moving the decimal point of its values. The number of places moved depends on the maximum absolute value in the vector.

    norm_decimal(x, na.rm = TRUE)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| na.rm | Logical. Should NA values be ignored when determining the scaling factor? Default is `TRUE`. |

Formula: $x' = x / 10^{j}$, where $j$ is the smallest integer such that $\max(|x'|) < 1$.

A numeric vector with values typically in the range (-1, 1).
Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques (3rd ed.). Morgan Kaufmann.

    # Max value is 980, so j = 3 (divides by 1000) -> 0.98
    norm_decimal(c(10, 500, 980))

    # Works with negative numbers
    norm_decimal(c(-50, 50, 200))

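The decimal-scaling rule can be sketched in base R. This is an illustrative reimplementation, not the package's source: `decimal_scale` is a hypothetical name, and leaving an all-zero vector unchanged is an assumption.

```r
# Hypothetical helper illustrating decimal scaling; not prepkit's code.
decimal_scale <- function(x, na.rm = TRUE) {
  max_abs <- max(abs(x), na.rm = na.rm)
  # Assumption: nothing sensible to scale if the maximum is 0 or missing.
  if (!is.finite(max_abs) || max_abs == 0) return(x)
  # Smallest integer j such that max_abs / 10^j < 1
  j <- floor(log10(max_abs)) + 1
  x / 10^j
}

decimal_scale(c(10, 500, 980))   # j = 3, so every value is divided by 1000
decimal_scale(c(-50, 50, 200))   # the sign is preserved
```

Using `floor(log10(max_abs)) + 1` avoids a loop: a maximum of exactly 1000 gives j = 4, which keeps the scaled maximum strictly below 1.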
Scales the vector so that its Euclidean norm (L2 norm) is 1. This technique is often used in text mining and high-dimensional clustering, and is related to spatial sign preprocessing in robust statistics.

    norm_l2(x, na.rm = TRUE)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| na.rm | Logical. Remove NAs for norm calculation? Default is `TRUE`. |

Formula: $x' = \dfrac{x}{\lVert x \rVert_2} = \dfrac{x}{\sqrt{\sum_i x_i^2}}$

A numeric vector with an L2 norm of 1.
Serneels, S., De Nolf, E., & Van Espen, P. J. (2006). Spatial sign preprocessing: a simple way to impart moderate robustness to multivariate estimators. Journal of Chemical Information and Modeling, 46(3), 1402-1409. doi:10.1021/ci050498u
Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques (3rd ed.). Morgan Kaufmann.

    # Convert a vector to unit length
    x <- c(3, 4)
    norm_l2(x) # Returns c(0.6, 0.8)

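A minimal base R sketch of L2 scaling follows. `l2_scale` is a hypothetical name, and returning a zero vector unchanged is an assumption; the source does not say how `norm_l2` treats that case.

```r
# Hypothetical helper illustrating L2 (unit-length) scaling; not prepkit's code.
l2_scale <- function(x, na.rm = TRUE) {
  l2 <- sqrt(sum(x^2, na.rm = na.rm))
  # Assumption: a zero (or undefined) norm leaves the input untouched.
  if (is.na(l2) || l2 == 0) return(x)
  x / l2
}

u <- l2_scale(c(3, 4))
u               # 0.6 0.8
sqrt(sum(u^2))  # 1
```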
Centers a numeric vector on its mean and divides it by its range. The resulting vector has a mean of 0 and values within [-1, 1].

    norm_mean(x, na.rm = TRUE)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| na.rm | Logical. Should NA values be removed during calculation? Default is `TRUE`. |

Formula: $x' = \dfrac{x - \bar{x}}{\max(x) - \min(x)}$

A numeric vector. If the range is 0 (all values are identical), returns a centered vector (zeros).
Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques (3rd ed.). Morgan Kaufmann.

    # Result ranges from approx -0.5 to 0.5, mean is 0
    norm_mean(c(1, 2, 3, 4, 5))

    # Handles negative values
    norm_mean(c(-10, 0, 10))

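Mean normalization, including the zero-range case described above, can be sketched in base R. `mean_scale` is a hypothetical name used only for illustration.

```r
# Hypothetical helper illustrating mean normalization; not prepkit's code.
mean_scale <- function(x, na.rm = TRUE) {
  centered <- x - mean(x, na.rm = na.rm)
  rng <- diff(range(x, na.rm = na.rm))
  # As documented: with a zero range, return the centered vector (all zeros).
  if (is.na(rng) || rng == 0) return(centered)
  centered / rng
}

mean_scale(c(1, 2, 3, 4, 5))  # -0.50 -0.25 0.00 0.25 0.50
mean_scale(c(5, 5, 5))        # 0 0 0
```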
Linearly rescales a numeric vector to a target range, [0, 1] by default. Because the scale is set by the observed minimum and maximum, this method is sensitive to outliers.

    norm_minmax(x, min_val = 0, max_val = 1, na.rm = TRUE)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| min_val | The minimum value of the target range. Default is 0. |
| max_val | The maximum value of the target range. Default is 1. |
| na.rm | Logical. Should NA values be removed during min/max calculation? Default is `TRUE`. |

Formula: $x' = \text{min\_val} + \dfrac{(x - \min(x))\,(\text{max\_val} - \text{min\_val})}{\max(x) - \min(x)}$

A numeric vector scaled to the range [min_val, max_val].
Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques (3rd ed.). Morgan Kaufmann.

    norm_minmax(c(1, 2, 3, 4, 5))
    norm_minmax(c(1, 2, 3), min_val = -1, max_val = 1)

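The rescaling can be sketched in base R as below. `minmax_scale` is a hypothetical name, and mapping a constant vector to `min_val` is an assumption, since the source does not document that case for `norm_minmax`.

```r
# Hypothetical helper illustrating min-max scaling; not prepkit's code.
minmax_scale <- function(x, min_val = 0, max_val = 1, na.rm = TRUE) {
  lo <- min(x, na.rm = na.rm)
  hi <- max(x, na.rm = na.rm)
  # Assumption: a constant vector is mapped to the lower bound of the target range.
  if (is.na(hi - lo) || hi == lo) return(x * 0 + min_val)
  min_val + (x - lo) * (max_val - min_val) / (hi - lo)
}

minmax_scale(c(1, 2, 3, 4, 5))                      # 0.00 0.25 0.50 0.75 1.00
minmax_scale(c(1, 2, 3), min_val = -1, max_val = 1) # -1 0 1
minmax_scale(c(1, 2, 3, 4, 100))                    # the outlier squeezes the rest near 0
```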
Unlike Z-Score or Min-Max, the M-Score algorithm identifies the "Mode Range" (the most frequent value range) and maps it to 0. This effectively suppresses the noise of daily routine (e.g., stable step counts) and amplifies anomalies (e.g., frailty or sudden activity).
It maps:
Mode Range: 0 (Baseline/Routine)
Left Tail: negative values, down to -1 (Decline/Frailty)
Right Tail: positive values, up to 1 (Surge/Hyperactivity)

    norm_mode_range(x, tau = 0.8, digits = 0)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| tau | A numeric value (0 to 1). The threshold ratio for defining the mode plateau. Bins with … |
| digits | Integer or NULL. If not NULL, values are rounded to this many decimal places solely for identifying the mode. This makes the algorithm robust against sensor noise (e.g., 1.0001 vs 1.0002). Default is 0 (rounds to integer), which is ideal for step counts or heart rates. Set to … |

A robust normalization method designed for longitudinal behavioral data with a "routine plateau". Also known as Mode-Range Normalization (MRN).
A numeric vector in the range [-1, 1].
Gong, R. (2026). M-Score: A Robust Normalization Method for Detecting Anomalies in Longitudinal Behavioral Data. arXiv preprint. (Submitted)

    # Scenario 1: Integer data (standard)
    steps <- c(3000, 3000, 200, 5000)
    norm_mode_range(steps)

    # Scenario 2: Noisy sensor data (floating point)
    # Without rounding (digits = NULL), these would be seen as different values.
    # With digits = 1, they are grouped into the same mode.
    sensor_data <- c(9.81, 9.82, 9.80, 2.5, 15.0)
    norm_mode_range(sensor_data, digits = 1)

Standardizes a numeric vector using robust statistics: the median and the median absolute deviation (MAD). This method is less sensitive to outliers than Z-score standardization.

    norm_robust(x, na.rm = TRUE, constant = 1.4826)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| na.rm | Logical. Should NA values be removed? Default is `TRUE`. |
| constant | A scale factor for MAD calculation. Default is 1.4826, which ensures consistency with the standard deviation for normal distributions. |

Formula: $x' = \dfrac{x - \operatorname{median}(x)}{\operatorname{MAD}(x)}$, where $\operatorname{MAD}(x) = \text{constant} \cdot \operatorname{median}(|x - \operatorname{median}(x)|)$.

A numeric vector. If MAD is 0 (e.g., more than 50% of the values are identical), returns a centered vector (x - median) and issues a warning.
Huber, P. J. (1981). Robust Statistics. Wiley. ISBN: 978-0-471-41805-4.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383-393.

    # Data with an outlier
    x <- c(1, 2, 3, 4, 100)

    # Z-score is heavily affected by the outlier
    norm_zscore(x)

    # Robust scaler handles it better
    norm_robust(x)

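The median/MAD standardization can be sketched with base R's `median()` and `stats::mad()`. `robust_scale` is a hypothetical name; the zero-MAD branch follows the behavior documented above, but the warning text is made up for illustration.

```r
# Hypothetical helper illustrating median/MAD scaling; not prepkit's code.
robust_scale <- function(x, na.rm = TRUE, constant = 1.4826) {
  med <- median(x, na.rm = na.rm)
  centered <- x - med
  scale_mad <- mad(x, center = med, constant = constant, na.rm = na.rm)
  if (is.na(scale_mad) || scale_mad == 0) {
    warning("MAD is zero or undefined; returning the centered vector.")
    return(centered)
  }
  centered / scale_mad
}

x <- c(1, 2, 3, 4, 100)
robust_scale(x)  # median is 3 and MAD is 1.4826, so the inliers stay close to 0
```

In this example the outlier does not move the median or the MAD, so the four regular values keep a sensible spread while 100 stands out clearly.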
Standardizes a numeric vector by centering it to have a mean of 0 and scaling it to have a standard deviation of 1.

    norm_zscore(x, na.rm = TRUE)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| na.rm | Logical. Should NA values be removed during mean/sd calculation? Default is `TRUE`. |

Formula: $x' = \dfrac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ the standard deviation of $x$.

A numeric vector. If the input vector has zero variance (all values are identical), the function returns a centered vector (all zeros) and issues a warning.
Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques (3rd ed.). Morgan Kaufmann.

    # Standard usage
    norm_zscore(c(1, 2, 3, 4, 5))

    # Edge case: Zero variance
    norm_zscore(c(5, 5, 5))

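A base R sketch of the Z-score with the zero-variance fallback described above follows. `zscore_scale` is a hypothetical name, the warning text is illustrative, and the use of the sample standard deviation (`sd()`) is an assumption.

```r
# Hypothetical helper illustrating Z-score standardization; not prepkit's code.
zscore_scale <- function(x, na.rm = TRUE) {
  centered <- x - mean(x, na.rm = na.rm)
  s <- sd(x, na.rm = na.rm)  # assumption: sample standard deviation
  if (is.na(s) || s == 0) {
    warning("Standard deviation is zero or undefined; returning the centered vector.")
    return(centered)
  }
  centered / s
}

z <- zscore_scale(c(1, 2, 3, 4, 5))
c(mean(z), sd(z))  # mean 0, standard deviation 1
```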
Creates a comparison plot to visualize the effect of a transformation. It displays histograms and density curves for both the original and transformed data.

    pp_plot(x, y, title = "Distribution Comparison")

| Argument | Description |
|---|---|
| x | Numeric vector. The original data. |
| y | Numeric vector. The transformed data. |
| title | String. The main title of the plot. |

A ggplot object.

    # 1. Generate skewed data
    x <- rchisq(1000, df = 2)

    # 2. Transform it
    y <- trans_boxcox(x)

    # 3. Visualize
    pp_plot(x, y, title = "Box-Cox Transformation Effect")

A synthetic longitudinal dataset representing daily step counts of an older adult. Used to demonstrate the "Vanishing Variance" problem.

    data(sim_gait_data)

A data frame with 200 rows and 2 variables:
Integer. Time index (Days 1-200).
Numeric. Daily step count with habitual plateau and anomalies.
Generated via simulation logic in data-raw/.
Applies the Box-Cox transformation to normalize the data distribution. It automatically handles non-positive values by shifting the data. The optimal lambda parameter is estimated using Maximum Likelihood Estimation (MLE).

    trans_boxcox(x, lambda = "auto", force_pos = TRUE)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| lambda | A numeric value for the transformation power. If "auto" (the default), the optimal lambda is estimated by MLE. |
| force_pos | Logical. If `TRUE` (the default), non-positive data are shifted to be positive before the transformation. |

A numeric vector with the transformed values. The lambda used and the shift amount are attached as attributes: `attr(res, "lambda")` and `attr(res, "shift")`.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211-243. https://www.jstor.org/stable/2984418
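For a fixed lambda, the Box-Cox transform of Box and Cox (1964) is $(x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\log(x)$ for $\lambda = 0$. The base R sketch below applies only that formula. `boxcox_fixed` is a hypothetical helper: it neither estimates lambda by MLE nor shifts non-positive data, both of which `trans_boxcox` is described as doing.

```r
# Hypothetical helper applying Box-Cox with a known lambda; not prepkit's code.
boxcox_fixed <- function(x, lambda) {
  stopifnot(is.numeric(x), all(x > 0, na.rm = TRUE))  # Box-Cox needs positive data
  if (abs(lambda) < 1e-8) {
    log(x)                      # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

boxcox_fixed(c(1, 4, 9), lambda = 0.5)  # 0 2 4
boxcox_fixed(exp(1), lambda = 0)        # 1
```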
Applies a logarithmic transformation with an offset. Useful for handling right-skewed data.

    trans_log(x, base = exp(1), offset = 1)

| Argument | Description |
|---|---|
| x | A numeric vector. |
| base | A positive number. The base of the logarithm. Default is exp(1). |
| offset | A numeric value to add before taking the log. Default is 1. |

A numeric vector.
Bartlett, M. S. (1947). The use of transformations. Biometrics, 3(1), 39-52.
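Going by the argument descriptions, the transformation amounts to `log(x + offset, base)`. The sketch below shows that reading in base R; `log_offset` is a hypothetical name, and any input checks `trans_log` may perform are not reproduced.

```r
# Hypothetical helper illustrating a log transform with an offset; not prepkit's code.
log_offset <- function(x, base = exp(1), offset = 1) {
  log(x + offset, base = base)
}

log_offset(c(0, 9, 99), base = 10)  # 0 1 2
log_offset(0)                       # 0: the default offset of 1 keeps zeros finite
```

The offset matters mainly for count-like data containing zeros, where a plain `log(x)` would return `-Inf`.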
A power transformation similar to Box-Cox but supports both positive and negative values. Automatically estimates the optimal lambda using MLE.

    trans_yeojohnson(x, lambda = "auto")

| Argument | Description |
|---|---|
| x | A numeric vector. |
| lambda | A numeric value or "auto". |

A numeric vector with attribute "lambda".
Yeo, I.-K., & Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959. doi:10.1093/biomet/87.4.954
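For a fixed lambda, the Yeo-Johnson transform of Yeo and Johnson (2000) uses one power rule for non-negative values and a mirrored rule with exponent $2 - \lambda$ for negative values. The base R sketch below applies those piecewise formulas. `yeojohnson_fixed` is a hypothetical helper and does not perform the MLE step that `trans_yeojohnson` uses when `lambda = "auto"`.

```r
# Hypothetical helper applying Yeo-Johnson with a known lambda; not prepkit's code.
yeojohnson_fixed <- function(x, lambda) {
  eps <- 1e-8
  out <- as.numeric(x)
  pos <- !is.na(x) & x >= 0
  neg <- !is.na(x) & x < 0
  # Non-negative branch: ((x + 1)^lambda - 1) / lambda, or log(x + 1) at lambda = 0
  out[pos] <- if (abs(lambda) < eps) {
    log1p(x[pos])
  } else {
    ((x[pos] + 1)^lambda - 1) / lambda
  }
  # Negative branch: -((1 - x)^(2 - lambda) - 1) / (2 - lambda), or -log(1 - x) at lambda = 2
  out[neg] <- if (abs(lambda - 2) < eps) {
    -log1p(-x[neg])
  } else {
    -((1 - x[neg])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

yeojohnson_fixed(c(-2, 0, 3), lambda = 1)  # -2 0 3: lambda = 1 is the identity
```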