A Python script to import datasets from OpenML

When I need some data to test AI/ML algorithms, OpenML has become my reference. It's a huge compilation (several thousand!) of curated datasets, with results and analysis, and a Python API to interact with that treasure chest. I needed a small tool to bring these datasets into 'my' world, i.e. to import them in a format compatible with my C library LibCapy. That's what the script below does, and I share it here as it may help someone else.

The usage looks as follows:

The script takes a task id as argument (cf. OpenML), imports the corresponding dataset, and converts it to a format compatible with CapyDataset.loadFromPath(). -o specifies the folder where the imported dataset is saved, -n specifies the name of the dataset (the names of the created files are based on that name), -s also imports the splits used by OpenML for 10-fold cross-validation, and finally -b also imports information about the best run over all evaluations, for performance comparison.
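To make the options above concrete, here is a minimal sketch of how such a command line could be parsed with Python's standard argparse module. It is not the actual script: the function names build_parser and output_paths, the dest names, and the choice of which options are required or have defaults are my assumptions for illustration. Only the flags (-i, -o, -n, -s, -b) and the output file names (name.csv, name_split.txt, name_best.txt) come from the description in this post.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser mirroring the flags described in this post.

    Assumption: -i and -n are required and -o defaults to the current
    folder; the real script may behave differently.
    """
    parser = argparse.ArgumentParser(
        prog="openml_import.py",
        description="Import an OpenML task's dataset (illustrative sketch).",
    )
    parser.add_argument("-i", dest="task_id", type=int, required=True,
                        help="OpenML task id")
    parser.add_argument("-o", dest="out_dir", default="./",
                        help="folder where the imported dataset is saved")
    parser.add_argument("-n", dest="name", required=True,
                        help="dataset name, used as base name of the files")
    parser.add_argument("-s", dest="with_splits", action="store_true",
                        help="also import the 10-fold cross-validation splits")
    parser.add_argument("-b", dest="with_best", action="store_true",
                        help="also import information about the best run")
    return parser


def output_paths(args):
    """Return the files that would be created for the parsed arguments."""
    out_dir = Path(args.out_dir)
    paths = {"data": out_dir / f"{args.name}.csv"}
    if args.with_splits:
        paths["split"] = out_dir / f"{args.name}_split.txt"
    if args.with_best:
        paths["best"] = out_dir / f"{args.name}_best.txt"
    return paths


# Parse an explicit argument list instead of sys.argv so that the
# sketch can be run as-is.
args = build_parser().parse_args(
    ["-o", "./", "-n", "iris", "-i", "59", "-s", "-b"]
)
paths = output_paths(args)
for kind, path in paths.items():
    print(kind, path)
```

Using store_true for -s and -b keeps them as optional switches, which matches the "also imports" wording above: without them only the CSV file would be produced.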

For example, to import the iris dataset with its splits and best run, one could use the command python openml_import.py -o ./ -n iris -i 59 -s -b. This will create the files iris.csv, iris_split.txt and iris_best.txt.

The script is as follows (you can also download it by clicking here):

Edit on 2022/12/12: fixed a bug in the import of splits.

2022-11-24
in AI/ML, All, Python,
Copyright 2021-2024 Baillehache Pascal