Method:
- Preprocessing and feature construction
• Grouping modalities (for categorical variables)
Details on preprocessing and feature construction:
For categorical variables with more than 25 categories, levels with fewer than 100 instances in the training data were aggregated into a "small" category, those with 100-500 instances into a "medium" category, and those with 500-1000 instances into a "large" category. All other levels were kept as they were, with empty values treated as a separate blank category.
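The grouping rule above can be sketched as follows. This is a minimal illustration in Python, not the authors' R code. The function names are hypothetical, the bin boundaries are assumed to be half-open (the text does not say which bin a level with exactly 100, 500 or 1000 instances falls into), the blank level is assumed to stay separate regardless of its count, and sending levels unseen in training to "small" is also an assumption.

```python
from collections import Counter


def group_rare_levels(train_values, max_levels=25):
    """Learn a raw-level -> grouped-level mapping from the training data."""
    # Missing values (None) are folded into the blank level "".
    counts = Counter("" if v is None else v for v in train_values)
    if len(counts) <= max_levels:
        # Variables with few levels are left untouched.
        return {level: level for level in counts}
    mapping = {}
    for level, n in counts.items():
        if level == "":
            mapping[level] = ""        # assumption: blank is never aggregated
        elif n < 100:
            mapping[level] = "small"
        elif n < 500:                  # assumption: 100 <= n < 500
            mapping[level] = "medium"
        elif n < 1000:                 # assumption: 500 <= n < 1000
            mapping[level] = "large"
        else:
            mapping[level] = level     # frequent levels are kept as-is
    return mapping


def apply_grouping(values, mapping, unseen="small"):
    """Apply a learned mapping; levels unseen in training go to `unseen` (assumption)."""
    return [mapping.get("" if v is None else v, unseen) for v in values]
```

The mapping is learned once on the training data and then applied unchanged to the test data, so that both sets share the same grouped levels.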
- Feature selection
• Feature ranking with correlation or other criterion (specify below)
Details on feature selection:
For continuous variables, we split the instances into 1% quantiles and, using half the training data, took the mean response for each quantile. We then applied this mean as a prediction to the other half and calculated the AUC of that prediction, which we used to rank the variables. For categorical variables, we calculated the mean response for each level and used this as the prediction. The procedure was repeated for different random draws of the data, and the average AUC score was used to rank variables.
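A minimal Python sketch of this ranking procedure is given here for illustration; it is not the authors' R code. The names `auc` and `score_feature`, the default of five repeats, the fallback to the overall mean for bins or levels absent from the fitting half, and the assumption that continuous values contain no missing entries are all choices made for the example. Labels are assumed to be coded 0/1.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney statistic), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.5  # AUC is undefined with a single class; treat as uninformative
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def score_feature(x, y, categorical=False, n_bins=100, n_repeats=5, seed=0):
    """Average hold-out AUC of the per-bin (or per-level) mean response."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        half = len(idx) // 2
        fit, hold = idx[:half], idx[half:]
        if categorical:
            cuts = None
        else:
            # Quantile cut points estimated on the fitting half only.
            xs = sorted(x[i] for i in fit)
            cuts = sorted({xs[(len(xs) * q) // n_bins] for q in range(1, n_bins)})

        def key(v):
            return v if cuts is None else bisect_right(cuts, v)

        sums, counts = defaultdict(float), defaultdict(int)
        for i in fit:
            sums[key(x[i])] += y[i]
            counts[key(x[i])] += 1
        overall = sum(y[i] for i in fit) / len(fit)
        preds = []
        for i in hold:
            k = key(x[i])
            # Assumption: unseen bins or levels fall back to the overall mean.
            preds.append(sums[k] / counts[k] if k in counts else overall)
        aucs.append(auc(preds, [y[i] for i in hold]))
    return sum(aucs) / len(aucs)
```

Variables would then be sorted by the returned score. A score near 0.5 suggests the variable carries little signal on its own.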
- Classification
- Base classifier
• Decision tree, stub, or Random Forest
- Loss function
• Logistic loss or cross-entropy (like in logistic regression)
- Regularizer
• None
- Ensemble method
• Boosting
- Unlabeled data
Did you make use of the unlabeled test data for training?
• No
Details on classification:
We used boosting with classification trees and shrinkage, with Bernoulli loss.
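To make the combination of boosting, shrinkage and Bernoulli loss concrete, here is a toy Python sketch; it is not the authors' R code. It is simplified to depth-one trees (stumps), and the function names, the Newton-step leaf values, and the `pos_weight` parameter (included to illustrate the up-weighting of "1" responses mentioned elsewhere in this fact sheet) are all assumptions made for the example. It also assumes both classes are present in `y`.

```python
import math


def sigmoid(z):
    z = max(-35.0, min(35.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, resid, hess, w):
    """Pick the single split that best fits the weighted residuals."""
    best = None
    for j in range(len(X[0])):
        order = sorted(range(len(X)), key=lambda i: X[i][j])
        tot_g = sum(w[i] * resid[i] for i in order)
        tot_w = sum(w[i] for i in order)
        g_left = w_left = 0.0
        for pos in range(len(order) - 1):
            i, nxt = order[pos], order[pos + 1]
            g_left += w[i] * resid[i]
            w_left += w[i]
            if X[i][j] == X[nxt][j]:
                continue  # cannot split between equal values
            g_right, w_right = tot_g - g_left, tot_w - w_left
            gain = g_left ** 2 / w_left + g_right ** 2 / w_right
            if best is None or gain > best[0]:
                best = (gain, j, (X[i][j] + X[nxt][j]) / 2)
    _, j, thr = best if best else (0.0, 0, float("inf"))
    leaves = []
    for side in (True, False):
        members = [i for i in range(len(X)) if (X[i][j] <= thr) == side]
        num = sum(w[i] * resid[i] for i in members)
        den = sum(w[i] * hess[i] for i in members)
        leaves.append(num / den if den > 1e-12 else 0.0)  # Newton step per leaf
    return j, thr, leaves[0], leaves[1]


def fit_boosting(X, y, n_trees=50, shrinkage=0.1, pos_weight=1.0):
    """Gradient boosting on the Bernoulli (logistic) deviance with stumps."""
    w = [pos_weight if yi == 1 else 1.0 for yi in y]
    p0 = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    f0 = math.log(p0 / (1.0 - p0))  # initial log-odds
    F = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        p = [sigmoid(f) for f in F]
        resid = [yi - pi for yi, pi in zip(y, p)]  # negative gradient of the loss
        hess = [pi * (1.0 - pi) for pi in p]
        j, thr, left, right = fit_stump(X, resid, hess, w)
        # Shrinkage scales each tree's contribution.
        stumps.append((j, thr, shrinkage * left, shrinkage * right))
        F = [f + shrinkage * (left if row[j] <= thr else right)
             for f, row in zip(F, X)]
    return f0, stumps


def predict_proba(model, X):
    """Return predicted probabilities of the "1" class."""
    f0, stumps = model
    return [sigmoid(f0 + sum(l if row[j] <= thr else r for j, thr, l, r in stumps))
            for row in X]
```

Because the model outputs probabilities rather than hard class labels, its predictions can be fed directly into an AUC calculation. A smaller `shrinkage` generally needs more trees to reach the same fit.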
- Model selection/hyperparameter selection
• The on-line feed-back on 10% of the test set was used
• K-fold or leave-one-out cross-validation (using training data)
Details on model selection:
We attempted to maximise the AUC score, either under cross-validation or on the 10% sample.
- Unscrambling the small dataset
Did you unscramble the small dataset?
• No
Comment about the following:
- Quantitative advantages
* Models fit in an hour or so, thus not too long
* Feature selection very fast, generally effective
* Results very interpretable
* Ran it all on mid-range desktop computer
- Qualitative advantages
* Trees able to measure variable significance, so can experiment with alternative variable choices
* Computes probabilities rather than straight classifications
* Deals with missing values well.
* Have not seen the feature selection approach before, although it's a very simple idea
- Other methods
Some of the steps that improved model performance were
* Aggregating categorical variables with large number of levels
* Up-weighting the "1" responses, as they were less frequent
* Checking lower ranked variables to see if they added value to the model.
- Software implementation
- Availability
• Off-the-shelf third party freeware or shareware
- Language
• Other
Details on software implementation:
Used the R statistical package
- Hardware implementation
- Platform
• Windows
- Memory
• <= 2GB
- Parallelism
• None
Details on hardware implementation:
Most models were run on a laptop with an Intel Core 2 Duo 2.66GHz processor, 2GB RAM and a 120GB hard drive.
- Code efficiency and versatility:
Was the time constraint imposed by the fast challenge a difficulty, or did you feel that enough time was given to prepare the data and train the model?
• Enough time