Baillehache Pascal's personal website

A Python script to import datasets from OpenML

When I need some data to test AI/ML algorithms, OpenML has become my reference. It's a huge compilation (several thousand!) of curated datasets, with results and analysis, and a Python API to interact with that treasure chest. I needed a small tool to bring these datasets into 'my' world, i.e. to import them in a format compatible with my C library LibCapy. That's what the script below does, and I share it here as it may help someone else.

The usage is as follows:

usage: OpenMLImport [-h] [-o OUTPUT_FOLDER] [-n NAME] -i TASK_ID [-s] [-b]

Script importing datasets from OpenML and converting them in a format compatible with
LibCapy

options:
  -h, --help            show this help message and exit
  -o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
  -n NAME, --name NAME
  -i TASK_ID, --task-id TASK_ID
  -s, --split
  -b, --best

The script takes a task id as argument (cf. OpenML), imports the dataset, and converts it to a format compatible with CapyDataset.loadFromPath(). -o specifies the folder where the imported dataset is saved, -n specifies the name of the dataset (the names of the created files are based on that name), -s also imports the splits used by OpenML for 10-fold cross validation, and finally -b also imports information about the best run over all evaluations, for performance comparison.

For example, to import the iris dataset with its split and best run, one could use the command python openml_import.py -o ./ -n iris -i 59 -s -b. This will create the files iris.csv:

5
in,in,in,in,out
sepallength,sepalwidth,petallength,petalwidth,class
num,num,num,num,cat
150
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

iris_split.txt:

10 135 15
49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 43 14 37 23 10 99 87 97 62 92 119 111 144 116 125
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132
43 14 37 23 10 99 87 97 62 92 119 111 144 116 125 49 15 47 0 44 76 89 58 54 53 149 114 112 115 139 42 41 31 48 7 94 93 95 51 72 135 124 141 105 143 34 22 45 3 16 79 55 98 80 77 120 122 145 100 107 4 27 38 19 33 96 50 61 65 73 109 126 110 148 103 46 6 18 36 24 83 90 68 52 75 134 104 117 127 121 26 13 32 20 17 63 88 64 91 81 137 101 130 118 108 29 40 12 35 8 86 82 84 74 57 140 133 106 128 102 25 1 21 30 28 67 78 66 70 59 136 138 129 113 132 5 11 9 39 2 85 56 69 71 60 142 123 131 147 146

and iris_best.txt:

run_id                                                     2012939
task_id                                                         59
setup_id                                                    157622
flow_id                                                       6048
flow_name        sklearn.pipeline.Pipeline(dualimputer=helper.d...
data_id                                                         61
data_name                                                     iris
function                                       predictive_accuracy
upload_time                                    2017-04-06 23:29:28
uploader                                                      1104
uploader_name                                      Jeroen van Hoof
value                                                     0.986667
values                                                        None
array_data                                                    None
Name: 3642, dtype: object
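
The first two files can be read back with a few lines of Python, for instance to check an import before handing it to LibCapy. The sketch below is only inferred from the example files shown above: five header lines (number of fields, in/out role, label and num/cat type of each field, number of samples) followed by the samples, and for the split file a header with the number of folds, training samples and test samples, followed by one line per fold listing the training indices and then the test indices. The function names read_capy_csv and read_split are hypothetical, they are not part of LibCapy, and the sketch assumes every fold has the sizes given in the split header.

```python
import csv
import io


def read_capy_csv(stream):
    """Read the five header lines and the samples from a text stream."""
    reader = csv.reader(stream)
    nb_field = int(next(reader)[0])
    roles = next(reader)    # 'in' or 'out' for each field
    labels = next(reader)   # name of each field
    types = next(reader)    # 'num' or 'cat' for each field
    nb_sample = int(next(reader)[0])
    assert len(roles) == len(labels) == len(types) == nb_field
    samples = []
    for _ in range(nb_sample):
        row = next(reader)
        # Numerical fields become floats, categorical fields stay as text
        samples.append(
            [float(v) if t == 'num' else v for v, t in zip(row, types)])
    return roles, labels, types, samples


def read_split(stream):
    """Return one (train_indices, test_indices) pair per fold."""
    nb_fold, nb_train, nb_test = map(int, stream.readline().split())
    folds = []
    for _ in range(nb_fold):
        indices = [int(v) for v in stream.readline().split()]
        # Assumption: every fold has the sizes announced in the header
        assert len(indices) == nb_train + nb_test
        folds.append((indices[:nb_train], indices[nb_train:]))
    return folds


# Tiny made-up files in the same layout as iris.csv and iris_split.txt
csv_text = (
    "3\n"
    "in,in,out\n"
    "length,width,class\n"
    "num,num,cat\n"
    "4\n"
    "5.1,3.5,a\n"
    "4.9,3.0,a\n"
    "7.0,3.2,b\n"
    "6.4,3.2,b\n")
split_text = "2 3 1\n0 1 2 3\n1 2 3 0\n"

roles, labels, types, samples = read_capy_csv(io.StringIO(csv_text))
folds = read_split(io.StringIO(split_text))
train_indices, test_indices = folds[0]
test_samples = [samples[i] for i in test_indices]
print(labels, test_samples)
```

With real files, the in-memory streams would be replaced by open('iris.csv', newline='') and open('iris_split.txt').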

The script is as follows; you can also download it by clicking here.

Edit on 2022/12/12: Correction of a bug in the importation of splits.

"""
    OpenMLImport - A Python script to import datasets from openml.org in a
    format compatible with LibCapy.
    Copyright (C) 2022 Pascal Baillehache baillehache.pascal@gmail.com
    https://baillehachepascal.dev
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License
    along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
from pathlib import Path
import argparse
import openml

# Parsing of the arguments
parser = \
    argparse.ArgumentParser(
        prog='OpenMLImport',
        description='Script importing datasets from OpenML and converting ' +
                    'them in a format compatible with LibCapy')
parser.add_argument('-o', '--output-folder', default=Path('./'), type=Path)
parser.add_argument('-n', '--name', default='dataset')
parser.add_argument('-i', '--task-id', required=True, type=int)
parser.add_argument('-s', '--split', action='store_true')
parser.add_argument('-b', '--best', action='store_true')
args = parser.parse_args()
print(f"Importing {args.name} (id:{args.task_id}) to {args.output_folder}")

# Import the data from OpenML
task = openml.tasks.get_task(args.task_id)
inputs, outputs = task.get_X_and_y(dataset_format="dataframe")
nb_input = inputs.shape[1]
nb_sample = inputs.shape[0]
nb_output = 1
print(
    f"The dataset contains {nb_sample} samples with " +
    f"{nb_input} inputs and {nb_output} outputs")

# Open the result CSV file
path_csv = args.output_folder / (args.name + '.csv')
with open(str(path_csv), "w") as f:

    # Create the header
    nb_field = nb_input + nb_output
    f.write(f"{nb_field}\n")
    categories = (['in'] * nb_input) + ['out']
    f.write(f"{','.join(categories)}\n")
    labels = inputs.columns.values.tolist() + [outputs.name]
    f.write(f"{','.join(labels)}\n")
    types = []
    for i, d in enumerate(inputs.dtypes):
        # Any signed, unsigned or floating point dtype is numerical
        # (e.g. uint8 or float32), everything else is categorical
        if d.kind in "iuf":
            types += ['num']
        else:
            types += ['cat']
        print(f"{labels[i]}: {d} -> {types[-1]}")
    if str(outputs.dtypes) == "category":
        types += ['cat']
    else:
        types += ['num']
    print(f"{labels[-1]}: {outputs.dtypes} -> {types[-1]}")
    f.write(f"{','.join(types)}\n")
    f.write(f"{nb_sample}\n")

    # Create the samples
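    for i in range(inputs.shape[0]):
        # Access the output by position and convert it to a string, so
        # that numerical outputs (regression tasks) are written too
        row = \
            ','.join(inputs.values[i].astype(str)) + ',' + \
            str(outputs.iloc[i]) + '\n'
        f.write(row)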

if args.split:
    # Open the result split file
    path_split = args.output_folder / (args.name + '_split.txt')
    with open(str(path_split), "w") as f:
        nb_repeats, nb_folds, nb_samples = task.get_split_dimensions()
        print("Importing k-fold cross validation data:")
        print(
            f"nb_repeats {nb_repeats} nb_folds {nb_folds} " +
            f"nb_samples {nb_samples}")
        if nb_repeats != 1 or nb_folds != 10 or nb_samples != 1:
            print(
                "The task must have one repeat, 10 folds and one sample")
        else:
            for i_fold in range(nb_folds):
                train_indices, test_indices = \
                    task.get_train_test_split_indices(
                        repeat=0, fold=i_fold, sample=0)
                if i_fold == 0:
                    header_txt = str(nb_folds) + ' '
                    nb_training_sample = train_indices.shape[0]
                    header_txt += str(nb_training_sample) + ' '
                    nb_test_sample = test_indices.shape[0]
                    header_txt += str(nb_test_sample) + '\n'
                    f.write(header_txt)
                split_txt = ' '.join(map(str, train_indices)) + ' '
                split_txt += ' '.join(map(str, test_indices)) + '\n'
                f.write(split_txt)

if args.best:
    # Search the best run
    print('Importing best run')
    evals = openml.evaluations.list_evaluations(
        function="predictive_accuracy",
        tasks=[args.task_id], output_format="dataframe")
    evals = evals.sort_values(by="value", ascending=False)
    path_best = args.output_folder / (args.name + '_best.txt')
    with open(str(path_best), "w") as f:
        f.write(str(evals.iloc[0]))

print('Importation completed.')
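
Since the best run is saved as the text representation of a pandas Series, its accuracy can be recovered for comparison by looking for the line starting with 'value'. This is a minimal sketch assuming the layout of the iris_best.txt example shown earlier; read_best_value and my_accuracy are hypothetical names used only for illustration.

```python
def read_best_value(text):
    """Return the 'value' field of a best run file, or None if absent."""
    for line in text.splitlines():
        # Each line is a field name followed by its right-aligned value
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "value":
            return float(fields[1])
    return None


# Excerpt in the same layout as iris_best.txt
best_text = (
    "run_id                                                     2012939\n"
    "function                                       predictive_accuracy\n"
    "value                                                     0.986667\n"
    "values                                                        None\n")

best = read_best_value(best_text)
my_accuracy = 0.96  # hypothetical accuracy of one's own model
print(f"best run: {best}, gap: {best - my_accuracy:.4f}")
```

With a real file, best_text would be the result of Path('iris_best.txt').read_text().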

2022-11-24