Orange
KDD Cup 2009

Fast Scoring on a Large Database

The challenge is over. Read the proceedings in JMLR W&CP vol. 7.

The platform is still open for post-challenge submissions!


Statistics

Registered 4871
Entrants 526
Entries 9717
Complete valid entries (*) 5994
(*) Complete valid entries include training and test results for all tasks


Introduction

Download the recording of Vincent Lemaire's presentation of Orange Labs

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from the input variables that describe each instance. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexing, based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that have contributed most to the output prediction can be a key factor in a marketing application.
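As an illustration of what "producing a score" means, here is a minimal sketch, not the Orange Labs platform. It assumes a logistic model over a few numerical and categorical input variables; the variable names, weights and customer records are all hypothetical and chosen only to make the example run.

```python
import math

# Hypothetical weights for illustration only; a real model would learn them from data.
NUMERIC_WEIGHTS = {"monthly_bill": 0.02, "months_as_customer": -0.05}
CATEGORICAL_WEIGHTS = {("contract", "prepaid"): 0.8, ("contract", "postpaid"): -0.3}
BIAS = -1.0


def score(instance):
    """Return a score in (0, 1) for one instance described by its input variables."""
    total = BIAS
    for name, weight in NUMERIC_WEIGHTS.items():
        value = instance.get(name)
        if value is not None:  # a missing numerical value contributes nothing
            total += weight * value
    for (name, category), weight in CATEGORICAL_WEIGHTS.items():
        if instance.get(name) == category:
            total += weight
    return 1.0 / (1.0 + math.exp(-total))


def rank_population(population):
    """Score every instance and return (customer_id, score) pairs, highest score first."""
    scored = [(cid, score(inst)) for cid, inst in population.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy population (made-up records, not challenge data).
population = {
    "c1": {"monthly_bill": 40.0, "months_as_customer": 36, "contract": "postpaid"},
    "c2": {"monthly_bill": 25.0, "months_as_customer": 3, "contract": "prepaid"},
    "c3": {"monthly_bill": None, "months_as_customer": 12, "contract": "prepaid"},
}
ranking = rank_population(population)
for customer_id, value in ranking:
    print(customer_id, round(value, 3))
```

The ranked list is the kind of output an information system could consume, for example to contact the highest-scoring customers first.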

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore, part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

Task Description

The task is to estimate the churn, appetency and up-selling probabilities of customers; hence there are three target values to predict. The challenge is staged in phases to test how rapidly each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants who have access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.
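The three-target setup can be sketched as three independent binary problems that share the same input variables. The example below is a deliberately naive baseline under stated assumptions, not a recommended method: it estimates each probability from the smoothed positive rate of a single categorical variable. The variable name `contract`, the function names and the toy labels are all hypothetical.

```python
from collections import defaultdict

TARGETS = ("churn", "appetency", "upselling")


def fit_rate_model(rows, labels, variable, prior_weight=1.0):
    """Estimate P(target = 1 | value of one categorical variable) with smoothing."""
    base_rate = sum(labels) / len(labels)
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, total]
    for row, label in zip(rows, labels):
        counts[row[variable]][0] += label
        counts[row[variable]][1] += 1
    # Shrink each per-value rate toward the overall base rate.
    table = {
        value: (pos + prior_weight * base_rate) / (total + prior_weight)
        for value, (pos, total) in counts.items()
    }
    return table, base_rate


def predict(model, row, variable):
    """Return the estimated probability; unseen values fall back to the base rate."""
    table, base_rate = model
    return table.get(row[variable], base_rate)


# Toy training data (made up): one input variable, three binary targets.
rows = [
    {"contract": "prepaid"},
    {"contract": "prepaid"},
    {"contract": "postpaid"},
    {"contract": "postpaid"},
]
labels = {
    "churn": [1, 0, 0, 0],
    "appetency": [0, 0, 1, 0],
    "upselling": [0, 0, 1, 1],
}

# One model per target, then one probability per customer and per target.
models = {t: fit_rate_model(rows, labels[t], "contract") for t in TARGETS}
predictions = {t: [predict(models[t], r, "contract") for r in rows] for t in TARGETS}
```

Treating the targets separately keeps the sketch simple; nothing in the task description requires it, and a participant could equally share features or models across the three targets.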

Schedule

Challenge schedule (dates in 2009)

March 10
Small dataset (slow track only): Nothing available yet.
Large dataset (fast and slow tracks): Start of the FAST large challenge. Data tables without target values made available for the large dataset. Toy training target values made available for practice purposes. Objective: participants can download data, ask questions, finalize their methodology, and try the submission process.

April 6
Small dataset (slow track only): Nothing available yet.
Large dataset (fast and slow tracks): Training target values available for the large dataset for the real problems (churn, appetency, and up-selling). Feedback: results on 10% of the test set available online when submissions are made.

April 10
Small dataset (slow track only): Nothing available yet.
Large dataset (fast and slow tracks): Deadline for the FAST large challenge. Submissions must be received before midnight, time zone of the challenge web server.

April 11
Small dataset (slow track only): Data tables and training target values made available for the small dataset.
Large dataset (fast and slow tracks): The challenge continues for the large dataset in the slow track.

May 11
Small and large datasets: Deadline for the SLOW challenge. Submissions must be received before midnight, time zone of the challenge web server.

Workshop and proceedings
May 20
Abstracts due. The fact sheets will serve as abstract.
May 30
Notification of acceptance.
June 20
Eight-page full papers due for the participants who have been selected. Please use LaTeX to format your paper. Download instructions (PDF), a LaTeX example paper and the JMLR W&CP LaTeX style file. Email your submission to kddcup09@clopinet.com
June 28
KDD Cup workshop.
July 20
Paper reviews sent back.
August 31
Revised papers due.
September
Proceedings publication.

Competition rules