caret includes several functions to pre-process the predictor data. It assumes that all of the data are numeric (i.e. factors have been converted to dummy variables via model.matrix, dummyVars or other means).
Note that the later chapter on using recipes with train shows how that approach can offer a more diverse and customizable interface to pre-processing in the package.
3.1 Creating Dummy Variables
The function dummyVars can be used to generate a complete (less than full rank parameterized) set of dummy variables from one or more factors. The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method.
For example, the etitanic data set in the earth package includes two factors: pclass (passenger class, with levels 1st, 2nd, 3rd) and sex (with levels female, male). The base R function model.matrix would generate the following variables:
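As a rough illustration of the model.matrix parameterization, the sketch below uses a small made-up data frame (toy, with hypothetical values) that carries the same two factors as etitanic, rather than the earth data itself. The column names show the full-rank encoding: an intercept plus one column fewer than the number of levels for each factor.

```r
# Toy stand-in for etitanic: the values are hypothetical, only the factor
# levels follow the description in the text.
toy <- data.frame(
  survived = c(1, 1, 0, 0, 1, 0),
  pclass   = factor(c("1st", "2nd", "3rd", "1st", "3rd", "2nd"),
                    levels = c("1st", "2nd", "3rd")),
  sex      = factor(c("female", "male", "male", "female", "female", "male")),
  age      = c(29, 40, 22, 35, 18, 51)
)

# Base R treatment contrasts: the first level of each factor is dropped
mm <- model.matrix(survived ~ ., data = toy)
colnames(mm)
# "(Intercept)" "pclass2nd" "pclass3rd" "sexmale" "age"
```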
Using dummyVars:
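A minimal sketch of the dummyVars workflow, again on a made-up data frame (toy, hypothetical values) instead of the etitanic data, assuming the caret package is installed. The object returned by dummyVars is passed to predict to build the indicator columns.

```r
library(caret)

# Toy stand-in for etitanic (hypothetical values)
toy <- data.frame(
  survived = c(1, 1, 0, 0, 1, 0),
  pclass   = factor(c("1st", "2nd", "3rd", "1st", "3rd", "2nd"),
                    levels = c("1st", "2nd", "3rd")),
  sex      = factor(c("female", "male", "male", "female", "female", "male")),
  age      = c(29, 40, 22, 35, 18, 51)
)

# Build the encoding object from a formula and a data set ...
dummies <- dummyVars(survived ~ ., data = toy)

# ... then create the dummy variables with the predict method
dv <- predict(dummies, newdata = toy)
head(dv)
```

Because each factor gets one column per level and no intercept is added, the indicator columns belonging to one factor sum to one in every row.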
Note there is no intercept and each factor has a dummy variable for each level, so this parameterization may not be useful for some model functions, such as lm.
3.2 Zero- and Near Zero-Variance Predictors
In some situations, the data generating mechanism can create predictors that only have a single unique value (i.e. a “zero-variance predictor”). For many models (excluding tree-based models), this may cause the model to crash or the fit to be unstable.
Similarly, predictors might have only a handful of unique values that occur with very low frequencies. For example, in the drug resistance data, the nR11 descriptor (number of 11-membered rings) data have a few unique numeric values that are highly unbalanced:
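One way to inspect that distribution, assuming the caret package is installed and that its mdrr data set (which provides the mdrrDescr predictor table) is the drug resistance data referred to here:

```r
library(caret)

# Loads mdrrDescr (descriptors) and mdrrClass (outcome)
data(mdrr)

# Frequency of each distinct value of the nR11 descriptor
nr11_counts <- table(mdrrDescr$nR11)
data.frame(nr11_counts)
```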
The concern here is that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples or that a few samples may have an undue influence on the model. These “near-zero-variance” predictors may need to be identified and eliminated prior to modeling.
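To identify these types of predictors, the following two metrics can be calculated: the frequency ratio (the frequency of the most prevalent value divided by the frequency of the second most prevalent value) and the percentage of unique values (the number of unique values divided by the total number of samples, times 100).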
If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance.
We would not want to falsely identify data that have low granularity but are evenly distributed, such as data from a discrete uniform distribution. Using both criteria should not falsely detect such predictors.
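The two criteria can be sketched in base R as follows. This is an illustrative re-implementation, not caret's own code: the helper name nzv_metrics and the example vectors are hypothetical, and the cut-offs (a frequency ratio of 95/5 and a unique value percentage of 10) are taken to match the defaults of nearZeroVar.

```r
# Hypothetical helper: frequency ratio and percent of unique values
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value to the second most common value;
  # set to 0 here when there is only one distinct value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = percent_unique)
}

# Hypothetical helper: apply both thresholds together
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  m <- nzv_metrics(x)
  unname(m["freqRatio"] > freq_cut && m["percentUnique"] < unique_cut)
}

unbalanced <- c(rep(0, 95), rep(1, 4), 2)  # a few rare values
uniform    <- rep(1:4, each = 25)          # low granularity, evenly spread

nzv_metrics(unbalanced)  # freqRatio 23.75, percentUnique 3
nzv_metrics(uniform)     # freqRatio 1,     percentUnique 4
is_nzv(unbalanced)       # TRUE
is_nzv(uniform)          # FALSE
```

The evenly distributed vector has a low percentage of unique values but a frequency ratio of one, so requiring both conditions keeps it from being flagged.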
Looking at the MDRR data, the nearZeroVar function can be used to identify near zero-variance variables (the saveMetrics argument can be used to show the details and defaults to FALSE):
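A short sketch of that call, assuming the caret package is installed and that the MDRR predictors are available as mdrrDescr after data(mdrr):

```r
library(caret)
data(mdrr)

# With saveMetrics = TRUE the result is a data frame with one row per
# predictor and the columns freqRatio, percentUnique, zeroVar and nzv
nzv <- nearZeroVar(mdrrDescr, saveMetrics = TRUE)

# Show the first few predictors flagged as near zero-variance
head(nzv[nzv$nzv, ])
```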
By default, nearZeroVar will return the positions of the variables that are flagged to be problematic.
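Those positions can be used directly to drop the flagged columns. The sketch below assumes caret is installed. The names nzv_pos and filteredDescr are illustrative, and the length check guards against the case where nothing is flagged (a negative index of length zero would otherwise drop every column).

```r
library(caret)
data(mdrr)

# Integer positions of the near zero-variance columns
nzv_pos <- nearZeroVar(mdrrDescr)

# Remove the flagged predictors, if any
filteredDescr <- if (length(nzv_pos) > 0) mdrrDescr[, -nzv_pos] else mdrrDescr
dim(filteredDescr)
```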
3.3 Identifying Correlated Predictors
While there are some models that thrive on correlated predictors (such as pls), other models may benefit from reducing the level of correlation between the predictors.
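Given a correlation matrix, the findCorrelation function flags predictors for removal by considering the absolute values of the pairwise correlations: when two predictors are highly correlated, it compares the mean absolute correlation of each one and flags the predictor with the larger value.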
For the previous MDRR data, there are 65 descriptors that are almost perfectly correlated (|correlation| > 0.999), such as the total information index of atomic composition (IAC) and the total information content index (neighborhood symmetry of 0-order) (TIC0) (correlation = 1). The code chunk below shows the effect of removing descriptors with absolute correlations above 0.75.
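A sketch of such a chunk, assuming caret is installed and that the near zero-variance predictors are removed first, as in the previous section. The variable names are illustrative.

```r
library(caret)
data(mdrr)

# Start from the descriptors with near zero-variance columns removed
nzv_pos <- nearZeroVar(mdrrDescr)
filteredDescr <- if (length(nzv_pos) > 0) mdrrDescr[, -nzv_pos] else mdrrDescr

# Pairwise correlations and the number of nearly perfect pairs
descrCor <- cor(filteredDescr)
sum(abs(descrCor[upper.tri(descrCor)]) > 0.999)

# Flag and drop predictors so that no pair exceeds the 0.75 cutoff
highlyCorDescr <- findCorrelation(descrCor, cutoff = 0.75)
if (length(highlyCorDescr) > 0) {
  filteredDescr <- filteredDescr[, -highlyCorDescr]
}

# Distribution of the remaining pairwise correlations
descrCor2 <- cor(filteredDescr)
summary(descrCor2[upper.tri(descrCor2)])
```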
3.4 Linear Dependencies
The function findLinearCombos uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist). For example, consider the following matrix that could have been produced by a less-than-full-rank parameterization of a two-way experimental layout:
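One matrix of this kind can be built in base R as below. It is a reconstruction chosen for illustration (the name ltfr_design is hypothetical): an intercept column, two indicator columns for a two-level factor and three indicator columns for a three-level factor.

```r
ltfr_design <- cbind(
  c(1, 1, 1, 1, 1, 1),  # intercept
  c(1, 1, 1, 0, 0, 0),  # factor A, level 1
  c(0, 0, 0, 1, 1, 1),  # factor A, level 2
  c(1, 0, 0, 1, 0, 0),  # factor B, level 1
  c(0, 1, 0, 0, 1, 0),  # factor B, level 2
  c(0, 0, 1, 0, 0, 1)   # factor B, level 3
)
ltfr_design

# Six columns but only four are linearly independent
qr(ltfr_design)$rank
```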
Note that columns two and three add up to the first column. Similarly, columns four, five and six add up to the first column.
findLinearCombos will return a list that enumerates these dependencies. For each linear combination, it will incrementally remove columns from the matrix and test to see if the dependencies have been resolved. findLinearCombos will also return a vector of column positions that can be removed to eliminate the linear dependencies:
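A minimal sketch of the call, assuming caret is installed. The 6 x 6 design matrix is a hypothetical example in which columns two and three, and columns four to six, each sum to the first column.

```r
library(caret)

# Less-than-full-rank design for a two-way layout (illustrative)
ltfr_design <- cbind(
  c(1, 1, 1, 1, 1, 1),
  c(1, 1, 1, 0, 0, 0),
  c(0, 0, 0, 1, 1, 1),
  c(1, 0, 0, 1, 0, 0),
  c(0, 1, 0, 0, 1, 0),
  c(0, 0, 1, 0, 0, 1)
)

combo_info <- findLinearCombos(ltfr_design)
combo_info$linearCombos  # list of column index sets that are dependent
combo_info$remove        # column positions suggested for removal

# Dropping the suggested columns leaves a full-rank matrix
reduced <- ltfr_design[, -combo_info$remove]
qr(reduced)$rank == ncol(reduced)
```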
These types of dependencies can arise when large numbers of binary chemical fingerprints are used to describe the structure of a molecule.
3.5 The