10.6 Building a Dataset
In this guide we walk through the commands and steps needed to create and use a Quevedo dataset. It might be helpful to have an environment available where you can test the different commands as you read, and maybe some data that you can import. Creating a dataset is rarely straightforward or right on the first try, so don’t worry about making mistakes and having to repeat the process. (But please don’t delete your original data! Keep it safe in a backup or in the cloud. Quevedo only works with data it has copied, so deleting a Quevedo dataset is safe as long as you keep copies of your original data somewhere.)
In this guide we use git and DVC to manage repository versions and workflows, but they are not necessary, so you can ignore those steps if you don’t use them. We also assume Quevedo is installed and available as the quevedo command; if not, please follow the steps in Section 10.2.1.1.
10.6.1 Create repo
To initialize the directory where the data and annotations will
live, use the create
command:
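As a sketch (assuming, per the note below about the -D flag, that the target directory is passed with -D before the subcommand; dataset_name is a placeholder):

```shell
$ quevedo -D dataset_name create
```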
It will offer you the opportunity to customize the configuration
file (Section
10.4) for the repository, and set your annotation schema (Section
10.4.2) and other information. You can modify it later by editing
the config.toml
file, or using the config
command.
From this point on, we will run commands with the dataset directory as the working directory, so change to it with cd dataset_name, and we won’t need the -D flag anymore.
If you want to use git and/or DVC, initialize the repository with the commands:
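For instance, from inside the dataset directory:

```shell
$ git init
$ dvc init
$ git add .
$ git commit -m "Create Quevedo dataset"
```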
10.6.2 Add data
Once we have the structure, the first step is to import our data.
This can be done using the add_images
command, specifying both
the source directory and target subset. To specify the target subset,
use the -l
flag if it’s a logogram subset, or
-g
for graphemes. You can specify multiple source
directories, which will be imported to the same subset.
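For example, a sketch where the subset name and source directories are placeholders (check the command’s help for the exact argument order):

```shell
$ quevedo add_images -g my_graphemes path/to/images other/images
```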
To track these data with DVC, run:
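A sketch, assuming the images were imported into a grapheme subset called my_graphemes (the directory layout inside the dataset may differ; dvc add creates small .dvc pointer files that git should track):

```shell
$ dvc add graphemes/my_graphemes
$ git add .
$ git commit -m "Import source images"
```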
10.6.3 Automatically process the data
After the images are imported, we may want to use some preliminary
automatic processing, like adding some tags that can be determined by
code, preprocessing the images, etc. Create a scripts
directory if it doesn’t exist, and write your code there according to
the user script documentation (Section
10.8.2). Then you can run it on the appropriate subsets with the
run_script
command.
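A sketch of such an invocation (the script name is a placeholder, and the exact way of passing the script and subsets is an assumption; consult Section 10.8.2 or the command’s help):

```shell
$ quevedo run_script preprocess -l my_logograms
```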
10.6.4 Annotate the data
Most of the important information in a dataset, apart from the source data itself, is the human annotations on those data (otherwise, why bother, right?). Since Quevedo deals with visual data, a graphical interface is needed for annotation, and it is provided in the form of a web interface (Section 10.5). Remember to first set in the configuration file the annotation schema that you want to use; then you can launch the server with the web command. If using git and DVC, remember to add and commit any modifications.
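For example:

```shell
$ quevedo web
```

After annotating, commit the changes as usual, e.g. with git add . followed by git commit.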
10.6.5 Augment the data
Once logograms are manually annotated, Quevedo can extract the graphemes contained within them to augment the number of grapheme instances available to us, with the extract command. If what we have are graphemes, we can generate artificial examples of logograms with the generate command. With these two commands, the amount of data available for training increases, hopefully improving our algorithms.
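For example (a sketch; both commands may accept subset flags like the -l and -g seen above, so check their help for details):

```shell
$ quevedo extract
$ quevedo generate
```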
These steps can be added to a DVC pipelines file so that DVC tracks the procedure and its results, and when we distribute the dataset other people can reproduce the full process. To have DVC automatically fill in the pipelines file, run the commands with dvc run:
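A sketch using dvc run (stage names are placeholders; a real invocation would also declare dependencies with -d and outputs with -o so that DVC can cache them):

```shell
$ dvc run -n extract quevedo extract
$ dvc run -n generate quevedo generate
```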
10.6.6 Splits and folds
For experimentation, we often need to divide our files into a train split on which to train the algorithms, and a test or evaluation split which acts as a stand-in for “new” or “unknown” data. We may also want to do cross-validation, in which evaluation is done on different runs of the experiment using different train/test partitions. Or in other cases, we may want to exclude some data from all training and testing, making a heldout set which is only looked at in the very end to really evaluate performance.
To support the different needs of researchers, the strategy adopted by Quevedo is to assign annotations to “folds”. Groups of folds can then be marked as used for training, for testing, or for neither. Which folds to use, and which to assign to training or testing, is set in the configuration file (Section 10.4.3).
Quevedo can assign folds to your annotations randomly, so that different folds have approximately the same number of annotations, using the split command:
- Split all logograms into the default folds:
- Assign all graphemes in “some_set” to either fold 0 or 1:
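As a sketch of the two cases above (the subset argument and the fold-range flags here are assumptions; check the split command’s help for the actual names):

```shell
# Split all logograms into the default folds
$ quevedo split -l my_logograms

# Assign all graphemes in "some_set" to either fold 0 or 1
$ quevedo split -g some_set --start 0 --end 1
```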
10.6.7 Train and test the neural network
Now that our data are properly organized and annotated, we can try training a neural network and evaluating its results. The first step is to prepare the files needed for training, then to call the darknet binary with the train command. Finally, the test command evaluates some quick metrics on the trained neural networks, and can also print all predictions so you can use your statistical software for a more in-depth analysis.
The commands can also be chained, so it is enough to run (but it will probably take some time):
$ quevedo prepare train test
Remember that first you must have installed darknet (Section
10.2.1.1) and configured the neural network (Section
10.2.2) in the Quevedo configuration file. If you have more than
one network, specify which one to use with the -N
flag:
$ quevedo -N other_network prepare train test
To keep track of the neural network in DVC, we recommend setting preparation, training and testing up as different stages in the pipeline, so that intermediate artifacts can be cached and the expensive training process only performed when necessary, letting DVC track the produced metrics. If you have different networks, they can be set up as template parameters in the pipelines file to keep things DRY (Don’t Repeat Yourself).
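A hypothetical dvc.yaml sketch of such a pipeline (stage names, dependency and output paths are all placeholders; the real paths depend on your dataset layout and network configuration, see Section 10.2.3.1):

```yaml
stages:
  prepare:
    cmd: quevedo prepare
    outs:
      - networks/default/train       # placeholder path
  train:
    cmd: quevedo train
    deps:
      - networks/default/train
    outs:
      - networks/default/weights     # placeholder path
  test:
    cmd: quevedo test
    deps:
      - networks/default/weights
    metrics:
      - networks/default/results.json  # placeholder path
```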
10.6.8 Exploitation
When doing data science, sometimes it is enough to stop at this step. Data are annotated, neural networks trained, experiments run and conclusions obtained. But often the results are useful beyond the science, and we want to put them to use. The trained neural network weights are stored in the network directory (Section 10.2.3.1), and can be used with darknet in other applications, or loaded by OpenCV, for example. If access to the dataset itself is needed, and not only the training results, Quevedo can also be used as a library (Section 10.8) from your own code.