KDD Cup 2009
The data are available for download only by registered users. Please register/login to gain access to the data.
The large dataset archives have been available since the onset of the challenge. The small dataset will be made available at the end of the fast challenge.

The training and test sets each contain 50,000 examples. The data are split similarly for the small and large versions, but the samples are ordered differently within the training sets and within the test sets.

Both the small and large datasets contain numerical and categorical variables. For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical. For the small dataset, the first 190 variables are numerical and the last 40 are categorical.

Toy target values are available for practice purposes only. Predictions of the toy target values will not be part of the final evaluation.
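The column layout described above (numerical variables first, categorical variables last) can be sketched in Python as follows. This is only an illustration: the tab delimiter, the use of an empty field for a missing value, and the names `LAYOUTS`, `column_slices`, and `parse_row` are assumptions made for the example and are not specified on this page.

```python
import math

# Column counts taken from the dataset description:
# numerical variables come first, categorical variables last.
LAYOUTS = {
    "small": {"numeric": 190, "categorical": 40},
    "large": {"numeric": 14740, "categorical": 260},
}


def column_slices(version):
    """Return (numeric_slice, categorical_slice) for 'small' or 'large'."""
    layout = LAYOUTS[version]
    n_numeric = layout["numeric"]
    n_total = n_numeric + layout["categorical"]
    return slice(0, n_numeric), slice(n_numeric, n_total)


def parse_row(line, version, sep="\t"):
    """Split one text line into numeric and categorical values.

    Assumption (not stated in the source): fields are separated by `sep`
    and a missing value appears as an empty field.
    """
    fields = line.rstrip("\r\n").split(sep)
    numeric_slice, categorical_slice = column_slices(version)
    expected = categorical_slice.stop
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    # Missing numeric values become NaN; missing categories become None.
    numeric = [float(f) if f != "" else math.nan for f in fields[numeric_slice]]
    categorical = [f if f != "" else None for f in fields[categorical_slice]]
    return numeric, categorical
```

Keeping the counts in one table makes it easy to check that they add up to the advertised totals (230 and 15,000 variables) before any file is read.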
| Small version (230 var.) | Large version (15,000 var.), zip of text files | Toy targets (large) |
|---|---|---|
| orange_small_train.data.zip (8.2 Mbytes)<br>orange_small_test.data.zip (8.2 Mbytes) | orange_large_train.data.chunk1 (52.7 Mbytes)<br>orange_large_train.data.chunk2 (52.7 Mbytes)<br>orange_large_train.data.chunk3 (52.6 Mbytes)<br>orange_large_train.data.chunk4 (52.5 Mbytes)<br>orange_large_train.data.chunk5 (52.6 Mbytes)<br>orange_large_test.data.chunk1 (52.8 Mbytes)<br>orange_large_test.data.chunk2 (52.5 Mbytes)<br>orange_large_test.data.chunk3 (52.6 Mbytes)<br>orange_large_test.data.chunk4 (52.6 Mbytes)<br>orange_large_test.data.chunk5 (52.6 Mbytes) | orange_large_train_toy.labels |

| Real binary targets (small) | Real binary targets (large) |
|---|---|
| orange_small_train_appentency.labels<br>orange_small_train_churn.labels<br>orange_small_train_upselling.labels | orange_large_train_appetency.labels<br>orange_large_train_churn.labels<br>orange_large_train_upselling.labels |

The target values (.labels files) have one example per line in the same order as the corresponding data files. Note that churn, appetency, and up-selling are three separate binary classification problems. The target values are +1 or -1. We refer to examples having +1 (resp. -1) target values as positive (resp. negative) examples.
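As a minimal sketch of the label format described above, the following Python reads one target per line, checks that each value is +1 or -1, and lines up the three tasks example by example. The function names `parse_labels`, `count_classes`, and `combine_tasks` are hypothetical helpers introduced for illustration only.

```python
def parse_labels(lines):
    """Parse an iterable of text lines, one target (+1 or -1) per line."""
    labels = []
    for lineno, line in enumerate(lines, start=1):
        value = int(line.strip())  # int() accepts "1", "+1" and "-1"
        if value not in (1, -1):
            raise ValueError(f"line {lineno}: expected +1 or -1, got {value}")
        labels.append(value)
    return labels


def count_classes(labels):
    """Return (number of positive examples, number of negative examples)."""
    positives = sum(1 for y in labels if y == 1)
    return positives, len(labels) - positives


def combine_tasks(churn, appetency, upselling):
    """Pair up the three separate binary targets, example by example.

    The label files follow the order of the data file, so the i-th entry
    of each list refers to the same example.
    """
    if not (len(churn) == len(appetency) == len(upselling)):
        raise ValueError("label lists must have the same length")
    return list(zip(churn, appetency, upselling))
```

For instance, after downloading `orange_small_train_churn.labels`, one could call `parse_labels(open("orange_small_train_churn.labels"))` and expect a list with one entry per training example.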
The Matlab matrices are numeric. When loaded, the data matrix is called X. The categorical variables are mapped to integers. Missing values are replaced by NaN for the original numeric variables while they are mapped to 0 for categorical variables.
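The missing-value convention described above (NaN for numeric variables, 0 for categorical variables) can be imitated in Python as in the sketch below. The actual integer assigned to each category in the Matlab matrices is not documented here, so the first-seen, 1-based coding used in this example is an assumption, and `encode_categorical` and `missing_mask` are hypothetical names.

```python
import math


def encode_categorical(values):
    """Map category strings to integer codes, with missing (None) as 0.

    Assumption: codes are assigned in order of first appearance, starting
    at 1. The mapping used for the Matlab matrices may differ.
    """
    codes = {}
    encoded = []
    for value in values:
        if value is None:
            encoded.append(0)  # 0 is reserved for missing categories
        else:
            # setdefault stores a new code only if the category is unseen
            encoded.append(codes.setdefault(value, len(codes) + 1))
    return encoded, codes


def missing_mask(numeric_column):
    """Return True where a numeric value is missing (stored as NaN)."""
    return [math.isnan(x) for x in numeric_column]
```

Reserving 0 for missing categories keeps every real category at a positive code, so a test such as `code == 0` identifies missing entries without a separate mask.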