KDD Cup 2009

University of Melbourne entry

Team name: Uni Melb
Team leader:
Hugh Miller
Institution: Department of Mathematics and Statistics, The University of Melbourne
Country: Australia
Team member 1:
Sandy Clarke
Team member 2:
Steve Lane
Team member 3:
Andrew Lonie
Team member 4:
David Lazaridis
Team member 5:
Slave Petrovski
Team member 6:
Owen Jones
 
Background:

Our entry comprised two main steps. First, we applied a cross-validated feature selection step, using a quantile method for continuous variables and a mean-based method for categorical variables, targeting the AUC measure. We then built of models on a


 
Method:
  • Preprocessing and feature construction

    • Grouping modalities (for categorical variables)

    Details on preprocessing and feature construction:

    For categorical variables with more than 25 categories, levels with fewer than 100 instances in the training data were aggregated into a "small" category, those with 100-500 instances into a "medium" category, and those with 500-1000 instances into a "large" category. All other categories were kept, with empty values treated as a separate blank category.
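
    The grouping rule above could be sketched as follows. This is a minimal illustration, not the code used for the entry: the function name `group_rare_levels`, the handling of counts that fall exactly on 100, 500 or 1000, and the choice to keep the blank level separate regardless of its size are all assumptions.

```python
from collections import Counter


def group_rare_levels(values, min_levels=25):
    """Collapse low-frequency levels of one categorical column.

    Thresholds follow the text; the treatment of counts exactly equal to
    100, 500 or 1000 is an assumption, since the text does not specify it.
    """
    # Empty or missing entries become their own "blank" level.
    cleaned = ["blank" if v is None or v == "" else v for v in values]
    counts = Counter(cleaned)

    # Variables with 25 or fewer categories are left untouched.
    if len(counts) <= min_levels:
        return cleaned

    def bucket(level):
        if level == "blank":  # assumption: blank is never merged
            return level
        n = counts[level]
        if n < 100:
            return "small"
        if n < 500:
            return "medium"
        if n < 1000:
            return "large"
        return level  # frequent levels are kept as they are

    return [bucket(v) for v in cleaned]
```

    The counts must come from the training data only, so in practice the level-to-bucket mapping would be learned once and then reused on the test set.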

  • Feature selection

    • Feature ranking with correlation or other criterion (specified below)

    Details on feature selection:

    For continuous variables, we split the instances into 1% quantiles and, using half the training data, took the mean response for each quantile. We then applied this mean as a prediction to the other half and calculated the AUC of that prediction, using it to rank the variables. For categorical variables, we calculated the mean response for each level and used this as the prediction. The method was repeated for different draws of the data, and the average AUC score was used to rank variables.
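
    A rough sketch of the ranking score for one continuous variable is given below. It is an illustration only: the function names, the way quantile edges are computed, the fallback to the overall mean for empty bins, and the number of repeats are assumptions, and missing values are not handled. Both classes must be present in the holdout half.

```python
import random
from bisect import bisect_right


def auc(scores, labels):
    """AUC via the Mann-Whitney statistic; tied scores get their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def continuous_feature_auc(x, y, n_bins=100, seed=0):
    """Holdout AUC of the binned mean response for a single half/half split."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    fit, hold = idx[:half], idx[half:]

    # Bin edges at the quantiles of the fitting half (1% steps when n_bins=100).
    xs = sorted(x[i] for i in fit)
    edges = [xs[(len(xs) * q) // n_bins] for q in range(1, n_bins)]

    sums, counts = [0.0] * n_bins, [0] * n_bins
    for i in fit:
        b = bisect_right(edges, x[i])
        sums[b] += y[i]
        counts[b] += 1

    # Assumption: empty bins fall back to the overall mean response.
    overall = sum(y[i] for i in fit) / len(fit)
    means = [s / c if c else overall for s, c in zip(sums, counts)]

    # The bin mean from the fitting half is the prediction on the other half.
    preds = [means[bisect_right(edges, x[i])] for i in hold]
    return auc(preds, [y[i] for i in hold])


def rank_score(x, y, n_repeats=5):
    """Average the holdout AUC over several different splits."""
    return sum(continuous_feature_auc(x, y, seed=s) for s in range(n_repeats)) / n_repeats
```

    For a categorical variable the same scheme would apply with the per-level mean response in place of the per-quantile mean. Variables are then sorted by `rank_score`, highest first.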

  • Classification
     
    • Base classifier

      • Decision tree, stub, or Random Forest
    • Loss function

      • Logistic loss or cross-entropy (like in logistic regression)
    • Regularizer

      • None
    • Ensemble method

      • Boosting
    • Unlabeled data

      Did you make use of the unlabeled test data for training?
      • No

    Details on classification:

    We used boosting with classification trees and shrinkage, with Bernoulli loss.
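
    To make the ingredients concrete, here is a toy pure-Python sketch of boosting with shrinkage and Bernoulli (logistic) loss. It is not the R code used for the entry: it uses single-split stumps rather than full classification trees, the number of trees and the shrinkage value are arbitrary, and all function names are hypothetical. It assumes both classes are present and at least one feature is not constant.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, resid, hess):
    """Pick the single split that best fits the residuals; Newton-step leaf values."""
    best = None
    for f in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            sl = sum(resid[i] for i in left)
            sr = sum(resid[i] for i in right)
            # Maximising this quantity minimises the squared error of the split.
            gain = sl * sl / len(left) + sr * sr / len(right)
            if best is None or gain > best[0]:
                hl = sum(hess[i] for i in left)
                hr = sum(hess[i] for i in right)
                best = (gain, f, t, sl / max(hl, 1e-12), sr / max(hr, 1e-12))
    return best[1:]


def boost(X, y, n_trees=50, shrinkage=0.1):
    """Gradient boosting on the log-odds scale with Bernoulli loss."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1 - p0))  # start from the overall log-odds
    F = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        p = [sigmoid(v) for v in F]
        resid = [yi - pi for yi, pi in zip(y, p)]  # negative gradient of the loss
        hess = [pi * (1 - pi) for pi in p]         # second derivative of the loss
        f, t, lv, rv = fit_stump(X, resid, hess)
        stumps.append((f, t, lv, rv))
        # Shrinkage: only a fraction of each stump's fitted value is added.
        F = [Fi + shrinkage * (lv if row[f] <= t else rv) for Fi, row in zip(F, X)]
    return f0, shrinkage, stumps


def predict_proba(model, row):
    """Predicted probability of the "1" response for one row."""
    f0, shrinkage, stumps = model
    score = f0 + shrinkage * sum(lv if row[f] <= t else rv for f, t, lv, rv in stumps)
    return sigmoid(score)
```

    The model outputs probabilities rather than hard class labels, which is what an AUC-based evaluation needs.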

  • Model selection/hyperparameter selection

    • The online feedback on 10% of the test set was used
    • K-fold or leave-one-out cross-validation (using training data)

    Details on model selection:

    We attempted to maximise the AUC score, either under cross-validation or on the 10% feedback sample.

  • Unscrambling the small dataset

    Did you unscramble the small dataset?
    • No
Results:
  Dataset             Method                            Churn           Appetency       Upselling       Score
                                                        Train   Test    Train   Test    Train   Test
  Small               small3                            0.7983  0.7375  0.8951  0.8245  0.9147  0.8620  0.8080
  Large (slow track)  The generally satisfactory model  0.8144  0.7570  0.9338  0.8836  0.9679  0.9048  0.8484
  Large (fast track)  hfinal                            0.8098  0.7087  0.9208  0.8669  0.9541  0.8996  0.8251
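
  The Score values are consistent with the arithmetic mean of the three test AUCs (churn, appetency, upselling). That reading is inferred from the numbers rather than stated in this fact sheet; a quick check, with the dictionary and function names chosen here for illustration:

```python
# Test AUCs (churn, appetency, upselling) and the reported Score, from the results table.
test_auc = {
    "Small": (0.7375, 0.8245, 0.8620),
    "Large (slow track)": (0.7570, 0.8836, 0.9048),
    "Large (fast track)": (0.7087, 0.8669, 0.8996),
}
reported = {
    "Small": 0.8080,
    "Large (slow track)": 0.8484,
    "Large (fast track)": 0.8251,
}


def mean_auc(aucs):
    """Arithmetic mean of the per-task AUCs."""
    return sum(aucs) / len(aucs)


for name, aucs in test_auc.items():
    diff = abs(mean_auc(aucs) - reported[name])
    print(f"{name}: mean {mean_auc(aucs):.4f}, reported {reported[name]:.4f}, diff {diff:.1e}")
```

  The means agree with the reported scores to within rounding of the four-decimal inputs.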
 
Comment about the following:
  • Quantitative advantages

    * Models fit in an hour or so, thus not too long
    * Feature selection very fast, generally effective
    * Results very interpretable
    * Ran it all on mid-range desktop computer

  • Qualitative advantages

    * Trees able to measure variable significance, so can experiment with alternative variable choices
    * Computes probabilities rather than straight classifications
    * Deals with missing values well.
    * Have not seen the feature selection approach before, although it's a very simple idea

  • Other methods

    Some of the steps that improved model performance were
    * Aggregating categorical variables with large number of levels
    * Up-weighting the "1" responses, as they were less frequent
    * Checking lower ranked variables to see if they added value to the model.
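
    One possible way to up-weight the "1" responses is sketched below. The weighting actually used is not stated here; balancing the total weight of the two classes is an assumption made for illustration, as is the function name.

```python
def class_balanced_weights(y):
    """Weight each "1" response so both classes carry equal total weight.

    Assumes y contains only 0s and 1s and that both classes are present.
    """
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    w_pos = n_neg / n_pos  # larger than 1 when the "1" class is the rarer one
    return [w_pos if yi == 1 else 1.0 for yi in y]
```

    Such per-instance weights would then be passed to the model fit, so that errors on the rare class cost more in the loss.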

  • Software implementation
     
    • Availability
      • Off-the-shelf third party freeware or shareware
    • Language
      • Other

    Details on software implementation:

    Used the R statistical package

  • Hardware implementation
     
    • Platform
      • Windows
    • Memory
      • <= 2GB
    • Parallelism
      • None

    Details on hardware implementation:

    Most of the models were run on a laptop with an Intel Core 2 Duo 2.66GHz processor, 2GB RAM, and a 120GB hard drive.

  • Code efficiency and versatility:

    Was the time constraint imposed by the fast challenge a difficulty,
    or did you feel enough time was given to prepare the data and train the model?
    • Enough time