Small Scale Link Prediction (FB15K-237)
---------------------------------------------

In this tutorial, we use the **FB15K-237 knowledge graph** as an example to demonstrate a step-by-step walkthrough, from preprocessing the dataset, to defining the configuration file, to training **a link prediction model with the DistMult algorithm**.

1. Preprocess Dataset
^^^^^^^^^^^^^^^^^^^^^

Preprocessing a dataset is straightforward with the ``marius_preprocess`` command. This command comes with ``marius`` when ``marius`` is installed. See (TODO link) for installation information.

Assuming ``marius_preprocess`` has been built, we preprocess the FB15K-237 dataset by running the following command (assuming we are in the ``marius`` root directory):

.. code-block:: bash

   $ marius_preprocess --dataset fb15k_237 --output_directory datasets/fb15k_237_example/
   Downloading FB15K-237.2.zip to datasets/fb15k_237_example/FB15K-237.2.zip
   Reading edges
   Remapping Edges
   Node mapping written to: datasets/fb15k_237_example/nodes/node_mapping.txt
   Relation mapping written to: datasets/fb15k_237_example/edges/relation_mapping.txt
   Dataset statistics written to: datasets/fb15k_237_example/dataset.yaml

The ``--dataset`` flag specifies which of the pre-set datasets ``marius_preprocess`` will download and preprocess. The ``--output_directory`` flag specifies where the preprocessed graph will be written and is set by the user. In this example, assume we have not yet created the ``datasets/fb15k_237_example/`` directory; ``marius_preprocess`` will create it for us.

For detailed usage of ``marius_preprocess``, please execute the following command:

.. code-block:: bash

   $ marius_preprocess -h

Let's check what is inside the created directory:

.. code-block:: bash

   $ ls -l datasets/fb15k_237_example/
   dataset.yaml               # input dataset statistics
   nodes/
     node_mapping.txt         # mapping of raw node ids to integer uuids
   edges/
     relation_mapping.txt     # mapping of raw edge (relation) ids to integer uuids
     test_edges.bin           # preprocessed testing edge list
     train_edges.bin          # preprocessed training edge list
     validation_edges.bin     # preprocessed validation edge list
   train.txt                  # raw training edge list
   test.txt                   # raw testing edge list
   valid.txt                  # raw validation edge list
   text_cvsc.txt              # relation triples as used in Toutanova and Chen CVSM-2015
   text_emnlp.txt             # relation triples as used in Toutanova et al. EMNLP-2015
   README.txt                 # README of the downloaded FB15K-237 dataset

Let's check what is inside the generated ``dataset.yaml`` file:

.. code-block:: bash

   $ cat datasets/fb15k_237_example/dataset.yaml
   dataset_dir: /marius-internal/datasets/fb15k_237_example/
   num_edges: 272115
   num_nodes: 14541
   num_relations: 237
   num_train: 272115
   num_valid: 17535
   num_test: 20466
   node_feature_dim: -1
   rel_feature_dim: -1
   num_classes: -1
   initialized: false

.. note::
   If the above ``marius_preprocess`` command fails due to any missing directory errors, please create the ``/edges`` and ``/nodes`` directories as a workaround.
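Before moving on, it can be useful to sanity-check the preprocessed binaries against the statistics in ``dataset.yaml``. The sketch below is not part of Marius; it *assumes* the edge list is stored as raw int32 ``(source, relation, destination)`` triples, which may differ across versions. If the reshape fails, adjust the dtype accordingly.

.. code-block:: python

   import numpy as np

   # ASSUMPTION: preprocessed edges are raw int32 (source, relation, destination)
   # triples stored row-major; if the reshape below fails, the on-disk dtype or
   # layout differs in your Marius version.
   NUM_TRAIN = 272115  # num_train from dataset.yaml

   edges = np.fromfile(
       "datasets/fb15k_237_example/edges/train_edges.bin", dtype=np.int32
   ).reshape(NUM_TRAIN, -1)  # expect shape (272115, 3)

   print("shape:", edges.shape)
   print("max node id:", edges[:, [0, 2]].max())  # should be < num_nodes (14541)
   print("max relation id:", edges[:, 1].max())   # should be < num_relations (237)

2. Define Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To train a model, we need to define a YAML configuration file based on the information created by ``marius_preprocess``. The configuration file contains information including, but not limited to, the inputs to the model, the training procedure, and the hyperparameters to optimize. Given a configuration file, Marius assembles a model according to the given parameters.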
The configuration file is grouped into four sections:

* Model: Defines the architecture of the model, the neighbor sampling configuration, the loss, and the optimizer(s).
* Storage: Specifies the input dataset and how to store the graph, features, and embeddings.
* Training: Sets options for the training procedure and hyperparameters, e.g. batch size and negative sampling.
* Evaluation: Sets options for the evaluation procedure (if any). The options here are similar to those in the training section.

For the full configuration schema, please refer to ``docs/config_interface``.

An example YAML configuration file for the FB15K-237 dataset is given in ``examples/configuration/fb15k_237.yaml``. Note that ``dataset_dir`` is set to the preprocessing output directory, in our example, ``datasets/fb15k_237_example/``.

Let's create the same YAML configuration file for the FB15K-237 dataset from scratch. We follow the structure of the configuration file and create each of the four sections one by one. In a YAML file, indentation denotes nesting and all parameters are key-value pairs.

#. First, we define the **model**. We begin by setting all required parameters. This includes ``learning_task``, ``encoder``, ``decoder``, and ``loss``. The rest of the configuration can be fine-tuned by the user. (DistMult's scoring function is sketched after this list.)

   .. code-block:: yaml

      model:
        learning_task: LINK_PREDICTION # set the learning task to link prediction
        encoder:
          layers:
            - - type: EMBEDDING # set the encoder to be an embedding table with 50-dimensional embeddings
                output_dim: 50
        decoder:
          type: DISTMULT # set the decoder to DistMult
          options:
            input_dim: 50
        loss:
          type: SOFTMAX_CE
          options:
            reduction: SUM
        dense_optimizer: # optimizer to use for dense model parameters. In this case these are the DistMult relation (edge-type) embeddings
          type: ADAM
          options:
            learning_rate: 0.1
        sparse_optimizer: # optimizer to use for node embedding table
          type: ADAGRAD
          options:
            learning_rate: 0.1
      storage:
        # omit
      training:
        # omit
      evaluation:
        # omit

#. Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, ``dataset_dir`` is set to ``datasets/fb15k_237_example/``, which is the preprocessing output directory.

   .. code-block:: yaml

      model:
        # omit
      storage:
        device_type: cuda
        dataset:
          dataset_dir: datasets/fb15k_237_example/
        edges:
          type: DEVICE_MEMORY
        embeddings:
          type: DEVICE_MEMORY
        save_model: true
      training:
        # omit
      evaluation:
        # omit

#. Lastly, we configure **training** and **evaluation**. We begin by setting all required parameters. This includes ``num_epochs`` and ``negative_sampling``. We set ``num_epochs: 10`` (10 epochs to train) to demonstrate this example. Note that ``negative_sampling`` is required for link prediction.

   .. code-block:: yaml

      model:
        # omit
      storage:
        # omit
      training:
        batch_size: 1000
        negative_sampling:
          num_chunks: 10
          negatives_per_positive: 500
          degree_fraction: 0.0
          filtered: false
        num_epochs: 10
        pipeline:
          sync: true
        epochs_per_shuffle: 1
      evaluation:
        batch_size: 1000
        negative_sampling:
          filtered: true
        pipeline:
          sync: true
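To make the decoder choice concrete, the snippet below sketches what a DistMult decoder computes for a ``(source, relation, destination)`` triple: the score is the sum of the element-wise product of the two node embeddings and the relation embedding. This is a minimal illustration of the standard DistMult scoring function, not Marius's actual implementation; the tensor names are hypothetical.

.. code-block:: python

   import torch

   def distmult_score(src_emb: torch.Tensor,
                      rel_emb: torch.Tensor,
                      dst_emb: torch.Tensor) -> torch.Tensor:
       """DistMult: score(s, r, d) = sum_k e_s[k] * w_r[k] * e_d[k]."""
       return (src_emb * rel_emb * dst_emb).sum(dim=-1)

   # Toy example with the 50-dimensional embeddings from the config above.
   dim = 50
   src, rel, dst = torch.randn(dim), torch.randn(dim), torch.randn(dim)
   print(distmult_score(src, rel, dst))  # higher score => edge judged more plausible

Roughly speaking, with ``num_chunks: 10`` and ``negatives_per_positive: 500`` in the training section, each batch is split into 10 chunks and every positive edge in a chunk is scored against the same 500 sampled negatives, which lets the negative scores be computed with one matrix multiplication per chunk.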
3. Train Model
^^^^^^^^^^^^^^

After defining our configuration file, training is run with ``marius_train``. We can now train our example using the configuration file we just created by running the following command (assuming we are in the ``marius`` root directory):

.. code-block:: bash

   $ marius_train datasets/fb15k_237_example/fb15k_237.yaml
   [2022-04-03 14:53:15.106] [info] [marius.cpp:45] Start initialization
   [04/03/22 14:53:19.140] Initialization Complete: 4.034s
   [04/03/22 14:53:19.147] ################ Starting training epoch 1 ################
   [04/03/22 14:53:19.224] Edges processed: [28000/272115], 10.29%
   [04/03/22 14:53:19.295] Edges processed: [56000/272115], 20.58%
   [04/03/22 14:53:19.369] Edges processed: [84000/272115], 30.87%
   [04/03/22 14:53:19.447] Edges processed: [112000/272115], 41.16%
   [04/03/22 14:53:19.525] Edges processed: [140000/272115], 51.45%
   [04/03/22 14:53:19.603] Edges processed: [168000/272115], 61.74%
   [04/03/22 14:53:19.685] Edges processed: [196000/272115], 72.03%
   [04/03/22 14:53:19.765] Edges processed: [224000/272115], 82.32%
   [04/03/22 14:53:19.851] Edges processed: [252000/272115], 92.61%
   [04/03/22 14:53:19.906] Edges processed: [272115/272115], 100.00%
   [04/03/22 14:53:19.906] ################ Finished training epoch 1 ################
   [04/03/22 14:53:19.906] Epoch Runtime: 758ms
   [04/03/22 14:53:19.906] Edges per Second: 358990.75
   [04/03/22 14:53:19.906] Evaluating validation set
   [04/03/22 14:53:19.972] =================================
   Link Prediction: 35070 edges evaluated
     Mean Rank: 443.786313
     MRR: 0.233709
     Hits@1: 0.157998
     Hits@3: 0.258597
     Hits@5: 0.308640
     Hits@10: 0.382407
     Hits@50: 0.560137
     Hits@100: 0.633191
   =================================
   [04/03/22 14:53:19.972] Evaluating test set
   [04/03/22 14:53:20.043] =================================
   Link Prediction: 40932 edges evaluated
     Mean Rank: 454.272940
     MRR: 0.230645
     Hits@1: 0.155282
     Hits@3: 0.253103
     Hits@5: 0.304065
     Hits@10: 0.382073
     Hits@50: 0.559758
     Hits@100: 0.630192
   =================================
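The evaluation report ranks each test edge against its sampled negatives and summarizes the resulting ranks. As a quick reference, the sketch below computes Mean Rank, MRR, and Hits@k from a list of per-edge ranks; it mirrors the definitions of the reported metrics rather than Marius's evaluation code.

.. code-block:: python

   from typing import Dict, List

   def summarize_ranks(ranks: List[int], ks=(1, 3, 5, 10, 50, 100)) -> Dict[str, float]:
       """Compute the metrics reported above from per-edge ranks (1 = best)."""
       n = len(ranks)
       summary = {
           "Mean Rank": sum(ranks) / n,
           "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
       }
       for k in ks:
           summary[f"Hits@{k}"] = sum(r <= k for r in ranks) / n  # fraction ranked in top k
       return summary

   print(summarize_ranks([1, 2, 4, 10, 120]))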
After running this configuration for 10 epochs, we should see a result similar to the one below, with an MRR roughly equal to 0.25:

.. code-block:: bash

   =================================
   [04/03/22 14:53:27.861] ################ Starting training epoch 10 ################
   [04/03/22 14:53:27.944] Edges processed: [28000/272115], 10.29%
   [04/03/22 14:53:28.023] Edges processed: [56000/272115], 20.58%
   [04/03/22 14:53:28.115] Edges processed: [84000/272115], 30.87%
   [04/03/22 14:53:28.220] Edges processed: [112000/272115], 41.16%
   [04/03/22 14:53:28.315] Edges processed: [140000/272115], 51.45%
   [04/03/22 14:53:28.410] Edges processed: [168000/272115], 61.74%
   [04/03/22 14:53:28.506] Edges processed: [196000/272115], 72.03%
   [04/03/22 14:53:28.602] Edges processed: [224000/272115], 82.32%
   [04/03/22 14:53:28.699] Edges processed: [252000/272115], 92.61%
   [04/03/22 14:53:28.772] Edges processed: [272115/272115], 100.00%
   [04/03/22 14:53:28.772] ################ Finished training epoch 10 ################
   [04/03/22 14:53:28.772] Epoch Runtime: 911ms
   [04/03/22 14:53:28.772] Edges per Second: 298699.22
   [04/03/22 14:53:28.772] Evaluating validation set
   [04/03/22 14:53:28.834] =================================
   Link Prediction: 35070 edges evaluated
     Mean Rank: 303.712946
     MRR: 0.259462
     Hits@1: 0.173253
     Hits@3: 0.286570
     Hits@5: 0.348104
     Hits@10: 0.434474
     Hits@50: 0.626775
     Hits@100: 0.706045
   =================================
   [04/03/22 14:53:28.835] Evaluating test set
   [04/03/22 14:53:28.904] =================================
   Link Prediction: 40932 edges evaluated
     Mean Rank: 317.841664
     MRR: 0.255330
     Hits@1: 0.169794
     Hits@3: 0.281858
     Hits@5: 0.341860
     Hits@10: 0.429859
     Hits@50: 0.625208
     Hits@100: 0.703875
   =================================

Let's check again what was added in the ``datasets/fb15k_237_example/`` directory. For clarity, we only list the files that were created during training. Notice that several files have been created, including the trained model, the embedding table, a full configuration file, and output logs:

.. code-block:: bash

   $ ls datasets/fb15k_237_example/
   model.pt                 # contains the dense model parameters, the embeddings of the edge-types
   model_state.pt           # optimizer state of the trained model parameters
   full_config.yaml         # detailed config generated based on the user-defined config
   metadata.csv             # information about metadata
   logs/                    # logs containing output, error, and debug information
   nodes/
     embeddings.bin         # trained node embeddings of the graph
     embeddings_state.bin   # node embedding optimizer state
     ...
   edges/
     ...
   ...

.. note::
   ``model.pt`` contains the dense model parameters. For DistMult, these are the embeddings of the edge-types. For GNN encoders, this file will also include the GNN parameters.

4. Inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^

4.1 Command Line
""""""""""""""""

4.2 Load Into Python
""""""""""""""""""""
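As a starting point, here is a minimal sketch of one way to inspect the trained artifacts from Python. It *assumes* that ``embeddings.bin`` stores the node embedding table as raw float32 values of shape ``[num_nodes, 50]`` (matching ``output_dim`` in our config) and that ``model.pt`` can be loaded with ``torch.load``; both assumptions may need adjusting for your Marius version.

.. code-block:: python

   import numpy as np
   import torch

   NUM_NODES = 14541  # num_nodes from dataset.yaml
   EMB_DIM = 50       # output_dim from our model config

   # ASSUMPTION: the embedding table is stored as raw float32, row-major.
   emb = np.fromfile(
       "datasets/fb15k_237_example/nodes/embeddings.bin", dtype=np.float32
   ).reshape(NUM_NODES, EMB_DIM)

   # ASSUMPTION: model.pt is a torch-serialized object holding the dense
   # parameters (for DistMult, the relation embeddings).
   model = torch.load("datasets/fb15k_237_example/model.pt", map_location="cpu")
   print(type(model))

   # Example: nearest neighbors of node 0 by cosine similarity of embeddings.
   normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
   sims = normed @ normed[0]
   print("nodes most similar to node 0:", np.argsort(-sims)[:5])

Node ids here are the integer ids produced by preprocessing; ``nodes/node_mapping.txt`` maps them back to the raw FB15K-237 identifiers.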