Small Scale Link Prediction (FB15K-237)
---------------------------------------------

This example demonstrates how to use the Marius Python API to perform a link prediction task on a small-scale graph, using FB15K-237. FB15K-237 is already supported by Marius, so you won't need to write a custom dataset class for preprocessing. If you want to use a custom dataset that is not supported by Marius, please refer to the lp_custom example.

*Example file location: examples/python/fb15k_237_gpu.py*

By going through this example you will learn:

- How to use one of Marius' built-in datasets to do preprocessing
- How to define a model using the Python API and configure it as needed
- How to add different reporting metrics for accuracy
- How to initialize data loading objects for training and evaluation
- And lastly, how to run training and evaluation

Note: This is a GPU example and we set the device to GPU at the start of ``main`` using the line::

    device = torch.device("cuda")

If you want to run CPU-based training, please change *cuda* to *cpu*.

1. Create Dataset Class
^^^^^^^^^^^^^^^^^^^^^^^

In this example we use a built-in dataset class to preprocess the FB15K-237 graph. Marius already has support for a few graphs, and you can use their dataset classes directly to preprocess the data. To use a built-in class you need to import it, which is done with the following line::

    from marius.tools.preprocess.datasets.fb15k_237 import FB15K237

Once you have imported the class, all you need to do is instantiate it with the base directory where the raw and preprocessed files will be stored, then call ``download()`` and ``preprocess()`` on the object, as shown in the code::

    dataset_dir = Path("fb15k_dataset/")
    dataset = FB15K237(dataset_dir)

    if not (dataset_dir / Path("edges/train_edges.bin")).exists():
        dataset.download()
        dataset.preprocess()

Lastly, note that preprocessing produces a ``dataset.yaml`` file which is needed for the later steps, so we read it in the example code.

2. Create Model
^^^^^^^^^^^^^^^

The next step is to define a model for the task. In this example we build a model with *DistMult*. The model is defined in the function ``init_model``. There are three steps to defining a model:

1. Defining an encoder: In this example we define a single-layer encoder, where the layer is an embedding layer::

       embedding_layer = m.nn.layers.EmbeddingLayer(dimension=embedding_dim, device=device)

   To define the encoder, all you need to do is call the ``GeneralEncoder(..)`` method with all the layers, as shown below::

       encoder = m.encoders.GeneralEncoder(layers=[[embedding_layer]])

   In this example the encoder has only a single layer, but you can also have more than one. (See the node classification example for a reference on how to pass more than one layer to the ``GeneralEncoder(..)`` method.)

2. Defining a decoder: In this example we use *DistMult* as our decoder, so we call the following method::

       decoder = m.nn.decoders.edge.DistMult(num_relations=num_relations,
                                             embedding_dim=embedding_dim,
                                             use_inverse_relations=True,
                                             device=device,
                                             dtype=dtype,
                                             mode="train")

   Notice that we set the mode to ``train``, but there are other options available. Please refer to the API documentation for more details.

3. Defining a loss function: We use *SoftmaxCrossEntropy* in this example, and defining it is just a function call::

       loss = m.nn.SoftmaxCrossEntropy(reduction="sum")

   There are many other options available for the encoder, decoder and loss functions. Please refer to the API documentation for more details.

In addition to the three steps above, which define the model, we also need to specify which metrics we want reported. This is done through the following code::

    reporter = m.report.LinkPredictionReporter()
    reporter.add_metric(m.report.MeanReciprocalRank())
    reporter.add_metric(m.report.MeanRank())
    reporter.add_metric(m.report.Hitsk(1))
    reporter.add_metric(m.report.Hitsk(10))

Notice that you can add multiple metrics. Once we have defined the encoder, decoder, loss function and reporter, we can create a model object as follows::

    model = m.nn.Model(encoder, decoder, loss, reporter)

This model can then be passed to training and evaluation. Lastly, if you want to attach an optimizer to the model, you can do it as follows::

    model.optimizers = [m.nn.AdamOptimizer(model.named_parameters(), lr=.1)]
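Putting these pieces together, the model construction can be sketched as a single function. The snippet below is only a sketch assembled from the calls shown above: the function signature and the ``dtype`` default are assumptions, and the actual ``init_model`` in the example file may differ slightly::

    def init_model(embedding_dim, num_relations, device, dtype=torch.float32):
        # assumption: this signature and the float32 default mirror the values used above
        # 1. Encoder: a single embedding layer wrapped in a GeneralEncoder
        embedding_layer = m.nn.layers.EmbeddingLayer(dimension=embedding_dim, device=device)
        encoder = m.encoders.GeneralEncoder(layers=[[embedding_layer]])

        # 2. Decoder: DistMult over the relation set
        decoder = m.nn.decoders.edge.DistMult(num_relations=num_relations,
                                              embedding_dim=embedding_dim,
                                              use_inverse_relations=True,
                                              device=device,
                                              dtype=dtype,
                                              mode="train")

        # 3. Loss function
        loss = m.nn.SoftmaxCrossEntropy(reduction="sum")

        # Reporter with the link prediction metrics to track
        reporter = m.report.LinkPredictionReporter()
        reporter.add_metric(m.report.MeanReciprocalRank())
        reporter.add_metric(m.report.MeanRank())
        reporter.add_metric(m.report.Hitsk(1))
        reporter.add_metric(m.report.Hitsk(10))

        model = m.nn.Model(encoder, decoder, loss, reporter)
        model.optimizers = [m.nn.AdamOptimizer(model.named_parameters(), lr=.1)]
        return model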
3. Create Dataloader
^^^^^^^^^^^^^^^^^^^^

After defining the model we need to define two dataloader objects, one for training and one for evaluation. Dataloader objects handle all the data movement required for training. Marius supports different types of storage backends, such as fully in-memory storage, partition buffers, flat files, etc. Please refer to the documentation and the original paper for more details. In this example we use an in-memory storage backend where all the data resides in memory; the corresponding storage objects are created with the method ``tensor_from_file()``.

To define a dataloader object we need to do three things:

- First, a simple method call to define which objects need to be read::

      train_edges = m.storage.tensor_from_file(filename=dataset.train_edges_file,
                                               shape=[dataset_stats.num_train, -1],
                                               dtype=torch.int32,
                                               device=device)

- Second, for this example we want to use a negative edge sampler, so we define it as follows::

      train_neg_sampler = m.data.samplers.CorruptNodeNegativeSampler(num_chunks=10,
                                                                     num_negatives=500,
                                                                     degree_fraction=0.0,
                                                                     filtered=False)

- And last, we create the dataloader object itself, which will be used during training to fetch the data and process batches::

      train_dataloader = m.data.DataLoader(edges=train_edges,
                                           node_embeddings=embeddings,
                                           batch_size=1000,
                                           neg_sampler=train_neg_sampler,
                                           learning_task="lp",
                                           train=True)

With this we have defined the dataloader for the training task. The example defines a dataloader for evaluation in the same way.

4. Train Model
^^^^^^^^^^^^^^

Now we have everything we need to start training. In this example we run multiple epochs of training and evaluation. For training, all we need is the following function::

    def train_epoch(model, dataloader):
        dataloader.initializeBatches()

        while dataloader.hasNextBatch():
            batch = dataloader.getBatch()
            model.train_batch(batch)
            dataloader.updateEmbeddings(batch)

All this function does is the following:

- Initialize the batches before the start of the epoch
- While there is a next batch available, fetch it
- Train the model on the fetched batch
- Update the embeddings
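To give a sense of how this function is driven, below is a minimal sketch of the outer loop over epochs. It is not verbatim from the example file: the number of epochs and the ``eval_dataloader`` name are assumptions, and ``eval_epoch`` is defined in the next step::

    num_epochs = 10  # assumption: choose the number of epochs for your run

    for epoch in range(num_epochs):
        # train on all batches of the training dataloader
        train_epoch(model, train_dataloader)

        # eval_epoch is described in the next step; eval_dataloader (assumed name)
        # is built the same way as train_dataloader but from the evaluation edges
        eval_epoch(model, eval_dataloader)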
5. Inference
^^^^^^^^^^^^

Similar to training, evaluation is also quite simple and can be done with the following function::

    def eval_epoch(model, dataloader):
        dataloader.initializeBatches()

        while dataloader.hasNextBatch():
            batch = dataloader.getBatch()
            model.evaluate_batch(batch)

        model.reporter.report()

The function does the following:

- Initialize the batches before the start of every epoch
- While there is a next batch of data available, load it
- Evaluate the batch
- Once all batches are done, report the metrics we defined earlier in the reporter

6. Save Model
^^^^^^^^^^^^^

Work in progress - More details will be added soon