Fast Scoring on a Large Database
The platform is still open for post-challenge submissions!
Statistics
| Registered | 1804 |
| Entrants | 475 |
| Entries | 8232 |
| Complete valid entries (*) | 4953 |
(*) Complete valid entries include training and test results for all tasks.
Introduction
Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).
The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from input variables that describe the instances. Scores are then used by the information system (IS), for example to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction, and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.
The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables) and unbalanced class distributions. Time efficiency is often a crucial point. Therefore, part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.
Task Description
The task is to estimate the churn, appetency and up-selling probability of customers, hence there are three target values to be predicted. The challenge is staged in phases to test the rapidity with which each team is able to produce results. A large number of variables (15,000) is made available for prediction. However, to engage participants having access to less computing power, a smaller version of the dataset with only 230 variables will be made available in the second part of the challenge.
- Churn (wikipedia definition): Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks.
- Appetency: In our context, the appetency is the propensity to buy a service or a product.
- Up-selling (wikipedia definition): Up-selling is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Up-selling usually involves marketing more profitable services or products, but up-selling can also be simply exposing the customer to other options he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.
Schedule
Challenge schedule
| Date (2009) | Small dataset (slow track only) | Large dataset (fast and slow tracks) |
| --- | --- | --- |
| March 10 | Nothing available yet. | Start of the FAST large challenge. Data tables without target values made available for the large dataset. Toy training target values made available for practice purpose. Objective: participants can download data, ask questions, finalize their methodology, try the submission process. |
| April 6 | Nothing available yet. | Training target values available for the large dataset for the real problems (churn, appetency, and upselling). Feed-back: results on 10% of the test set available on-line when submissions are made. |
| April 10 | Nothing available yet. | Deadline for the FAST large challenge. Submissions must be received before midnight, time zone of the challenge web server. |
| April 11 | Data tables and training target values made available for the small dataset. | The challenge continues for the large dataset in the slow track. |
| May 11 | Deadline for the SLOW challenge (small and large datasets). Submissions must be received before midnight, time zone of the challenge web server. | |
Workshop and proceedings
| May 20 | Abstracts due. The fact sheets will serve as abstract. |
| May 30 | Notification of acceptance. |
| June 20 | Eight page full papers due for the participants who have been selected. Please use Latex to format your paper. Download instructions (PDF), a Latex example paper and the JMLR W&CP Latex style file. Email your submission to kddcup09@clopinet.com |
| June 28 | KDD cup workshop. |
| July 20 | Paper reviews sent back. |
| August 31 | Revised papers due. |
| September | Proceedings publication. |
Competition rules
- Conditions of participation: Anybody who complies with the rules of the challenge (KDDcup 2009) is welcome to participate. Only the organizers listed on the Credits page are excluded from participating. The KDDcup 2009 is part of the competition program of the Knowledge Discovery in Databases conference (KDD 2009), Paris, June 28-July 1, 2009. Participants are not required to attend the KDDcup 2009 workshop, which will be held at the conference and is open to anyone who registers. The proceedings of the competition will be published by the Journal of Machine Learning Research Workshop and Conference Proceedings (JMLR W&CP).
- Anonymity: All entrants must identify themselves by registering on the KDDcup 2009 website. However, they may elect to remain anonymous by choosing a nickname and checking the box "Make my profile anonymous". If this box is checked, only the nickname will appear in the result tables instead of the real name. Participant emails will not appear anywhere on the website and will be used only by the organizers to communicate with the participants. To be eligible for prizes the participants will have to publicly reveal their identity and uncheck the box "Make my profile anonymous".
- Data: The dataset is available for download from the Dataset page to registered participants. The data are split into several archives to facilitate downloading, and two versions are made available ("small" with 230 variables, and "large" with 15,000 variables). The participants may enter results on either or both versions, which correspond to the same data entries, the 230 variables of the small version being just a subset of the 15,000 variables of the large version. Both training and test data are available without the true target labels. For practice purposes, "toy" training labels are available together with the training data from the onset of the challenge in the fast track. The results on toy targets (T) will not count for the final evaluation. The real training labels of the tasks "churn" (C), "appetency" (A), and "up-selling" (U) will be made available for download separately halfway through the challenge.
- Challenge duration and tracks: The challenge starts March 10, 2009 and ends May 11, 2009. There are two challenge tracks:
- FAST (large) challenge: Results submitted on the LARGE dataset within five days of the release of the real training labels will count towards the fast challenge.
- SLOW challenge: Results on the small dataset and results on the large dataset not qualifying for the fast challenge, submitted before the KDDcup 2009 deadline May 11, 2009, will count toward the SLOW challenge.
If more than one submission is made in either track and with either dataset, the last submission before the track deadline will be taken into account to determine the ranking of participants and attribute the prizes. You may compete in both tracks. There are prizes in both tracks.
- On-line feed-back: During the challenge, the training set performances will be available on the Result page, as well as partial information on test set performances: the test set performances on the toy task (T) and the performances on a fixed 10% subset of the test examples for the real tasks (C, A, U). After the challenge is over, the performances on the whole test set will be calculated and substituted in the result tables.
- Submission method: The method of submission is via the form on the Submission page. To be ranked, submissions must comply with the Instructions. A submission should include results on both training and test set on at least one of the tasks (T, C, A, U), but it may include results on several tasks. A submission will be considered "complete" and eligible for prizes if it contains 6 files corresponding to training and test data predictions for the tasks C, A, and U, either for the small or for the large dataset (or for both). Results on the practice task T will not count as part of the competition. If you encounter problems with the submission process, please contact the Challenge Webmaster. Multiple submissions are allowed, but please limit yourself to 5 submissions per day maximum. For your final entry in the slow track, you may submit results on either or both small and large datasets in the same archive (hence you get 2 chances of winning).
- Evaluation and ranking: For each entrant, only the last valid entry, as defined in the Instructions, will count towards determining the winner in each track (fast and slow). We limit each participating person to a single final entry in each track (see the FAQ for the conditions under which you can work in teams). Valid entries must include results on all three real tasks.
The method of scoring is posted on the Evaluation page. Prizes will be attributed only to entries performing better than the baseline method (Naive Bayes). The results of the baseline method are provided in the Result table. These are not the best results obtained by the organization team at Orange: they are easy to outperform, but difficult to attain by chance.
- Reproducibility: Participation is not conditioned on delivering code or publishing methods. However, we will ask the top-ranking participants to voluntarily fill out a fact sheet about their methods, contribute papers to the proceedings, and help reproduce their results.
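The rules above on tracks, complete submissions, and final entries can be read as a small selection procedure. The Python sketch below is only an illustration of one possible reading, not the organizers' implementation. The names `Submission`, `is_complete`, `track_of`, and `final_entries` are hypothetical. It assumes that "before midnight" means strictly before 00:00 of the following day in server time, that each submission covers a single dataset (the rules also allow both datasets in one archive), and that a complete large-dataset submission received before the fast deadline counts only towards the fast track.

```python
from dataclasses import dataclass
from datetime import datetime

# Deadlines from the schedule. Modelling "before midnight" as "strictly before
# 00:00 of the next day" is an assumption of this sketch.
FAST_DEADLINE = datetime(2009, 4, 11)   # end of April 10
SLOW_DEADLINE = datetime(2009, 5, 12)   # end of May 11

REAL_TASKS = ("C", "A", "U")            # churn, appetency, up-selling
SPLITS = ("train", "test")


@dataclass(frozen=True)
class Submission:
    """Hypothetical record of one upload by one entrant."""
    received: datetime
    dataset: str          # "small" or "large"
    files: frozenset      # set of (task, split) pairs, e.g. ("C", "train")


def is_complete(sub):
    """True if the 6 files (train and test for C, A, U) are all present."""
    needed = {(task, split) for task in REAL_TASKS for split in SPLITS}
    return needed <= sub.files


def track_of(sub):
    """Return "fast", "slow", or None if the submission arrived too late."""
    if sub.dataset == "large" and sub.received < FAST_DEADLINE:
        return "fast"
    if sub.received < SLOW_DEADLINE:
        return "slow"
    return None


def final_entries(subs):
    """Keep, for each track, the last complete submission before its deadline."""
    final = {}
    for sub in sorted(subs, key=lambda s: s.received):
        track = track_of(sub)
        if track is not None and is_complete(sub):
            final[track] = sub   # later submissions overwrite earlier ones
    return final
```

With this reading, an entrant who uploads several complete archives in the fast window is ranked on the last one, and an archive containing only toy-task (T) files never replaces a complete entry because it fails the completeness check.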