10.6 Building a Dataset
In this guide we walk through the commands and steps needed to create and use a Quevedo dataset. It might be helpful to have an environment available where you can test the different commands as you read, and maybe some data that you can import. Creating a dataset is rarely straightforward or right on the first try, so don’t worry about making mistakes and having to repeat the process. (But please don’t delete your original data! Keep it safe in a backup or in the cloud. Quevedo only works with data it has copied, so deleting a Quevedo dataset is safe as long as you keep copies of your original data somewhere.)
In this guide we use git and DVC to manage repository versions and workflows, but they are not necessary, so you can ignore those steps if you don’t use them. We also assume Quevedo is installed and available as the quevedo command; if not, please follow the steps in Section 10.2.1.1.
10.6.1 Create repo
To initialize the directory where the data and annotations will
live, use the create
command:
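As a sketch (assuming, per the note below about the -D flag, that the target directory is passed with -D before the subcommand; dataset_name is a placeholder):

```shell
$ quevedo -D dataset_name create
```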
It will offer you the opportunity to customize the configuration
file (Section
10.4) for the repository, and set your annotation schema (Section
10.4.2) and other information. You can modify it later by editing
the config.toml
file, or using the config
command.
From this point on, we will run commands with the dataset directory as the working directory, so change to it with cd dataset_name, and we won’t need the -D flag anymore.
If you want to use git and/or DVC, initialize the repository with the commands:
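For instance, from inside the dataset directory:

```shell
$ git init
$ dvc init
$ git add .
$ git commit -m "Create Quevedo dataset"
```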
10.6.2 Add data
Once we have the structure, the first step is to import our data.
This can be done using the add_images
command, specifying both
the source directory and target subset. To specify the target subset,
use the -l
flag if it’s a logogram subset, or
-g
for graphemes. You can specify multiple source
directories, which will be imported to the same subset.
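For example, a sketch where the subset name and source directories are placeholders (check the command’s help for the exact argument order):

```shell
$ quevedo add_images -g my_graphemes path/to/images other/images
```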
To track these data with DVC, run:
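A sketch, assuming the images were imported into a grapheme subset called my_graphemes (the directory layout inside the dataset may differ; dvc add creates small .dvc pointer files that git should track):

```shell
$ dvc add graphemes/my_graphemes
$ git add .
$ git commit -m "Import source images"
```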
10.6.3 Automatically process the data
After the images are imported, we may want to use some preliminary
automatic processing, like adding some tags that can be determined by
code, preprocessing the images, etc. Create a scripts
directory if it doesn’t exist, and write your code there according to
the user script documentation (Section
10.8.2). Then you can run it on the appropriate subsets with the
run_script
command.
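A sketch of such an invocation (the script name is a placeholder, and the exact way of passing the script and subsets is an assumption; consult Section 10.8.2 or the command’s help):

```shell
$ quevedo run_script preprocess -l my_logograms
```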
10.6.4 Annotate the data
Most of the important information in a dataset, apart from the source data itself, is the human annotations on those data (otherwise, why bother, right?). Since Quevedo deals with visual data, a graphical interface is needed for annotation, and it is provided in the form of a web interface (Section 10.5). Remember to first set in the configuration file the annotation schema that you want to use; then you can launch the server with the web command. If using git and DVC, remember to add and commit any modifications.
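For example:

```shell
$ quevedo web
```

After annotating, commit the changes as usual, e.g. with git add . followed by git commit.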
10.6.5 Augment the data
Once logograms are manually annotated, Quevedo can extract the graphemes contained within them to augment the number of grapheme instances available to us, with the extract command. If what we have are graphemes, we can generate artificial examples of logograms with the generate command. With these two commands, the amount of data available for training increases, hopefully improving our algorithms.
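For example (a sketch; both commands may accept subset flags like the -l and -g seen above, so check their help for details):

```shell
$ quevedo extract
$ quevedo generate
```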
These steps can be added to a DVC pipelines file so that DVC tracks the procedure and its results, and when we distribute the dataset other people can reproduce the full process. To have DVC automatically fill in the pipelines file, run the commands with dvc run:
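A sketch using dvc run (stage names are placeholders; a real invocation would also declare dependencies with -d and outputs with -o so that DVC can cache them):

```shell
$ dvc run -n extract quevedo extract
$ dvc run -n generate quevedo generate
```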
10.6.6 Splits and folds
For experimentation, we often need to divide our files into a train split on which to train the algorithms, and a test or evaluation split which acts as a stand-in for “new” or “unknown” data. We may also want to do cross-validation, in which evaluation is done on different runs of the experiment using different train/test partitions. Or in other cases, we may want to exclude some data from all training and testing, making a heldout set which is only looked at in the very end to really evaluate performance.
To support the different needs of researchers, the strategy adopted by Quevedo is to assign annotations to “folds”. Groups of folds can then be marked as used for training, for testing, or for neither. Which folds to use, and which to assign to training or testing, is set in the configuration file (Section 10.4.3).
Quevedo can assign folds to your annotations randomly, so that different folds have approximately the same number of annotations, using the split command:
- Split all logograms into the default folds:
- Assign all graphemes in “some_set” to either fold 0 or 1:
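As a sketch of the two cases above (the subset argument and the fold-range flags here are assumptions; check the split command’s help for the actual names):

```shell
# Split all logograms into the default folds
$ quevedo split -l my_logograms

# Assign all graphemes in "some_set" to either fold 0 or 1
$ quevedo split -g some_set --start 0 --end 1
```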
10.6.7 Train and test the neural network
Now that our data are properly organized and annotated, we can try training a neural network and evaluating its results. The first step is to prepare the files needed for training, then to call the darknet binary with the train command. Finally, the test command evaluates some quick metrics on the trained neural networks, and can also print all predictions so you can use your statistical software for a more in-depth analysis.
The commands can also be chained, so it is enough to run (but it will probably take some time):
$ quevedo prepare train test
Remember that first you must have installed darknet (Section
10.2.1.1) and configured the neural network (Section
10.2.2) in the Quevedo configuration file. If you have more than
one network, specify which one to use with the -N
flag:
$ quevedo -N other_network prepare train test
To keep track of the neural network in DVC, we recommend setting preparation, training and testing up as different stages in the pipeline, so that intermediate artifacts can be cached and the expensive training process only performed when necessary, letting DVC track the produced metrics. If you have different networks, they can be set up as template parameters in the pipelines file to keep things DRY (Don’t Repeat Yourself).
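A hypothetical dvc.yaml sketch of such a pipeline (stage names, dependency and output paths are all placeholders; the real paths depend on your dataset layout and network configuration, see Section 10.2.3.1):

```yaml
stages:
  prepare:
    cmd: quevedo prepare
    outs:
      - networks/default/train       # placeholder path
  train:
    cmd: quevedo train
    deps:
      - networks/default/train
    outs:
      - networks/default/weights     # placeholder path
  test:
    cmd: quevedo test
    deps:
      - networks/default/weights
    metrics:
      - networks/default/results.json  # placeholder path
```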
10.6.8 Exploitation
When doing data science, sometimes it is enough to stop at this step. Data are annotated, neural networks trained, experiments run and conclusions obtained. But often the results are useful beyond the science, and we want to put them to use. The trained neural network weights are stored in the network directory (Section 10.2.3.1), and can be used with darknet in other applications, or loaded by OpenCV, for example. If access to the dataset itself is needed, and not only the training results, Quevedo can also be used as a library (Section 10.8) from your own code.