Orange
Server time : 2010-08-01 09:11
KDD Cup 2009 KDD 09

Winning the KDD Cup Orange Challenge with Ensemble Selection

Team name: IBM Research
Team leader:
Alexandru Niculescu-Mizil
Institution: IBM T.J. Watson Research Center
Country: United States
Team member 1:
Claudia Perlich
Institution: IBM T.J. Watson Research Center
Team member 2:
Grzegorz Swirszcz
Institution: IBM T.J. Watson Research Center
Team member 3:
Vikas Sindhwani
Institution: IBM T.J. Watson Research Center
Team member 4:
Yan Liu
Institution: IBM T.J. Watson Research Center
Team member 5:
Prem Melville
Institution: IBM T.J. Watson Research Center
Team member 6:
Dong Wang
Institution: IBM China Research Laboratory
Team member 7:
Jing Xiao
Institution: IBM China Research Laboratory
Team member 8:
Jianying Hu
Institution: IBM T.J. Watson Research Center
Team member 9:
Moninder Singh
Institution: IBM T.J. Watson Research Center
Team member 10:
Wei Xiong Shang
Institution: IBM China Research Laboratory
Team member 11:
Yan Feng Zhu
Institution: IBM China Research Laboratory
 
Background:

To make final predictions we used ensembles trained via Ensemble Selection [1]. We used a range of classifiers, parameter setting and feature sets, as well as post training calibration using Platt Scaling to generate model libraries containing 700-1200 models. The base classifiers we used were: random forests [2] and boosted trees [3] trained using the FEST package [4], logistic regression trained using the BBR package [5], SVMs trained using SVMPerf [6], LibLinear[12] and LibSVM [7], decision trees, TANs and naive bayes trained using Weka [8], Sparse Network of Winnows trained using the SNoW package [9], and KNN, linear regression and co-clustering [10] trained using in house code.

All the base classifiers were trained using ten-fold cross-validation (we actually only got to train on two of the ten folds for the fast challenge, and four for the slow challenge). The combined predictions on the test folds were used by Ensemble Selection to train final ensembles optimized for AUC, as described in [11].


 
Method:
  • Preprocessing and feature construction

    • Normalizations (for numerical variables)
    • Replacement of the missing values
    • Discretization (for numerical variables)
    • Principal Component Analysis
    • Other preprocessing not listed above (precise below)

    Details on preprocessing and feature construction:

    We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.

    We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating wether the value was missing or not. We also tried another approach for imputing missing values based on KNN.

    On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.

    We tried PCA on the large data set, but it did not seem to help.

    Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we build shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also build shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.

    We also generated additional features using a co-clustering based technique [13].

  • Feature selection

    • Feature ranking with correlation or other criterion (precise below)

    Details on feature selection:

    We used Pearson Correlation and mutual information to rank features. We also ranked features by the weight they got in a linear SVM.

  • Classification
     
    • Base classifier

      • Decision tree, stub, or Random Forest
      • Linear classifier (Fisher's discriminant, SVM, linear regression)
      • Non-linear kernel method (SVM, kernel ridge regression, kernel logistic regression)
      • Naïve Bayes
      • Bayesian Network (other than Naïve Bayes)
      • Nearest neighbors
      • Other classifier not listed above (precise below)
    • Loss function

      • Hinge loss (like in SVM)
      • Square loss (like in ridge regression)
      • Logistic loss or cross-entropy (like in logistic regression)
      • Exponential loss (like in boosting)
    • Regularizer

      • One-norm (sum of weight magnitudes, like in Lasso)
      • Two-norm (||w||^2, like in ridge regression and regular SVM)
    • Ensemble method

      • Boosting
      • Bagging (check this if you use Random Forest)
      • Other ensemble method
    • Unlabeled data

      Did you make use of the unlabeled test data for training?
      • Yes

    Details on classification:

    Besides the classifiers listed above, we also used Sparse Network of Winnows and a co-clustering. The final model was generated via Ensemble Selection.

  • Model selection/hyperparameter selection

    • The on-line feed-back on 10% of the test set was used
    • K-fold or leave-one-out cross-validation (using training data)

    Details on model selection:

    Rather than performing model selection, we used all the models as candidates for inclusion in the final ensemble. 10-fold cross validation was used to generate a unbiassed hill-climbing set for Ensemble Selection. The on-line feedback was used mainly for sanity check, and to chose among a restricted set of ensemble building strategies for the final models.

  • Unscrambling the small dataset

    Did you unscramble the small dataset?
    • Yes

    We got feedback on 20% of the data, but that wasn't too much help since we barely used the on-line feedback information. We also used the small set to generate several extra features that we then added to the large data set. This again did not make a big difference, as we believe we could have just as well using a feature selected set from the large data for this purpose. We also submitted two models that will count towards the final ranking in the slow challenge, both of them trained using the large data set (in our experiments the cross-validated performance on the small data set was lower than the one we obtained on the large data set). But even this will not provide us with much advantage as the two submitted models ended up being very similar to each other.

Results:
  Method Churn Appetency Upselling Score
Train Test Train Test Train Test
Small Submission 1.0000 0.7651 1.0000 0.8819 1.0000 0.9092 0.8521
Large (slow track) Submission 1.0000 0.7651 1.0000 0.8816 1.0000 0.9091 0.8519
Large (fast track) Final Submission 1.0000 0.7611 1.0000 0.8830 1.0000 0.9038 0.8493
 
Comment about the following:
  • Quantitative advantages

    The largest advantage, both quantitative and qualitative, was training a large variety of classifiers. Different classifiers were best on the three different problems. Judging from our internal cross-validation, logistic regression was better on churn, random forests were the winner on appetency , and on upselling boosted trees yielded the highest performance. On top of this, using Ensemble Selection significantly boosted the performance over using the best single model, according to the on-line feedback.

  • Qualitative advantages

    Besides the points above, the realization that some of the numerical variables are not linearly correlated with the target, and the recoding of these variables using decision trees and binning, led to significant gains in the performance of the linear models.

    We also believe that optimizing the ensemble directly to AUC provided an advantage, but we have no hard evidence for it :).

  • Other methods

    We are not sure what to say here, as we do not know yet if our submissions were a success, nor do we know what other people have tried.

    Maybe one thing to mention would be that based on internal cross-validation, we determined that the performance we obtained on the small data set was lower than the performance we obtained on the large one. So due to time and computational constraints we focussed on the large data set. But we consider this a bug more than a feature, as we would have liked to explore the small data set more, and train slower models that we could not afford to train on the large data set.

  • Software implementation
     
    • Availability
      • Freeware or shareware in house software
      • Off-the-shelf third party freeware or shareware
    • Language
      • C/C++
      • Java
      • Matlab

    Details on software implementation:

  • Hardware implementation
     
    • Platform
      • Windows
      • Linux or other Unix
      • Mac OS
    • Memory
      <= 8 GB
    • Parallelism
      • Run in parallel different algorithms on different machines

    Details on hardware implementation.

    The main workhorse was a linux cluster of nine dual processor machines with 3G memory. Besides that, several models and the final ensembles were trained on individual machines.

  • Code efficiency and versatility:

    Was the time constraint imposed by the fast challenge a difficulty
    or did you feel way enough time was given to prepare the data and train the model?
    • Enough time

    Do you think this would have affected your results significantly?
    •Yes
References:

[1] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, Alex Ksikes: Ensemble Selection from Libraries of Models. -- Proceedings of the 21st International Conference on Machine Learning, 2004.

[2] Breiman, L.: Random Forests. -- Machine Learning Journal, 2001.

[3] Schapire, R.:The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classi cation, 2001.

[4] R. Caruana, N. Karampatziakis, and A. Yessenalina: An Empirical Evaluation of Supervised Learning in High Dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[5] Genkin, A., Lewis, D., and Madigan, D.: Largescale bayesian logistic regression for text categorization. Technometrics, 2006.

[6] T. Joachims: A Support Vector Method for Multivariate Performance Measures, Proceedings of the 22nd International Conference on Machine Learning, 2005.

[7] Chih-Chung Chang and Chih-Jen Lin: LIBSVM : a library for support vector machines, 2001.

[8] Witten, I. H., & Frank, E.: Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann, 2005.

[9] Andrew J. Carlson, Chad M. Cumby, Jeff L. Rosen and Dan Roth: SNoW User's Guide. UIUC Tech report UIUC-DCS-R-99-210, 1999.

[10] V. Sindhwani, J. Hu, A. Mojsilovic: Regularized Co-Clustering With Dual Supervision. In Neural Information Processing Systems, 2008.

[11] Rich Caruana, Art Munson, Alexandru Niculescu-Mizil: Getting the Most Out of Ensemble Selection. Proceedings of the 6th International Conference on Data Mining, 2006.

[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 2008.

[13] Jing Xiao, Lusheng Wang, Xiaowen Liu, Tao Jiang: An Efficient Voting Algorithm for Finding Additive Biclusters with Random Background. Journal of Computational Biology,2008.