Small Scale Link Prediction (FB15K-237)
This example demonstrates how to use the Marius Python API to perform a link prediction task on a small-scale graph. We will use the FB15K-237 graph. FB15K-237 is already supported by Marius, so you won't need to write your own custom dataset class for preprocessing. If you want to use a custom dataset that is not supported by Marius, please refer to the lp_custom example.
Example file location: examples/python/fb15k_237_gpu.py
By going through this example we aim to help you understand the following:
- How to use one of Marius' internally supported datasets to do preprocessing
- How to define a model using the Python APIs and configure it as needed
- How to add different reporting metrics for accuracy
- How to initialize data loading objects for training and evaluation
- And lastly, how to run training and evaluation
Note: This is a GPU example and we set the device to GPU at the start of main using the line:
device = torch.device("cuda")
If you want to run CPU-based training, change cuda to cpu.
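As a small convenience (not part of the original example script), you can also select the device automatically and fall back to CPU when no GPU is available:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")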
1. Create Dataset Class
In this example we are going to use a built-in dataset class to preprocess the FB15K-237 graph. Marius already has support for a few graphs, and you can use their dataset classes directly to preprocess the data.
To use a built-in class you need to import it, which is done with the following line:
from marius.tools.preprocess.datasets.fb15k_237 import FB15K237
Once you have imported the class, instantiate it with the base directory where the dataset and preprocessed files will be stored, then call download and preprocess on the object, as shown in the code:
from pathlib import Path

dataset_dir = Path("fb15k_dataset/")
dataset = FB15K237(dataset_dir)
if not (dataset_dir / Path("edges/train_edges.bin")).exists():
    dataset.download()
    dataset.preprocess()
Lastly, note that preprocessing generates a dataset.yaml file which is needed for the later steps, so the example code reads it back in.
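As a sketch of that step, the stats can be loaded with OmegaConf; the dataset_stats name matches what the dataloader snippets later in this example expect, but treat the exact loading code as an assumption rather than a verbatim copy of the example file:

from omegaconf import OmegaConf

# Read back the stats written by preprocess(), e.g. number of train/test edges
dataset_stats = OmegaConf.load(dataset_dir / Path("dataset.yaml"))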
2. Create Model
The next step is to define a model for the task. In this example we build a model with DistMult. The model is defined in the function init_model.
There are three steps to defining a model:
1. Defining an encoder: In this example we define a single-layer encoder. The layer is an embedding layer:
embedding_layer = m.nn.layers.EmbeddingLayer(dimension=embedding_dim,
                                             device=device)
To define the encoder, all you need to do is call the GeneralEncoder(..) method with all the layers, as shown below:
encoder = m.encoders.GeneralEncoder(layers=[[embedding_layer]])
In this example the encoder only has a single layer, but you can have more than one. (See the node classification example for how to pass more than one layer to the GeneralEncoder(..) method.)
2. Defining a decoder: In this example we are using DistMult as our decoder so we are calling the following method:
decoder = m.nn.decoders.edge.DistMult(num_relations=num_relations,
                                      embedding_dim=embedding_dim,
                                      use_inverse_relations=True,
                                      device=device,
                                      dtype=dtype,
                                      mode="train")
Notice that we set mode to "train", but other options are available. Please refer to the API documentation for more details.
3. Defining a loss function: We are using SoftmaxCrossEntropy in this example, and defining it is a single function call:
loss = m.nn.SoftmaxCrossEntropy(reduction="sum")
There are many other options available for encoder, decoder and loss functions. Please refer to the API documentation for more details.
In addition to the three steps above, which define the model, we also need to specify which metrics we want reported. This is done with the following code:
reporter = m.report.LinkPredictionReporter()
reporter.add_metric(m.report.MeanReciprocalRank())
reporter.add_metric(m.report.MeanRank())
reporter.add_metric(m.report.Hitsk(1))
reporter.add_metric(m.report.Hitsk(10))
Notice that you can add multiple metrics.
Once we have defined the encoder, decoder, loss function and the reporter, we can create a model object using the following method:
model = m.nn.Model(encoder, decoder, loss, reporter)
This model can now be passed to the training and evaluation functions.
Lastly, if you want to attach an optimizer to the model, you can do it as follows:
model.optimizers = [m.nn.AdamOptimizer(model.named_parameters(), lr=.1)]
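Putting the pieces together, init_model roughly looks like the sketch below. The function signature and the argument names (embedding_dim, num_relations, dtype) are assumptions based on the snippets above, not a verbatim copy of the example file:

def init_model(embedding_dim, num_relations, device, dtype=torch.float32):
    # Encoder: a single embedding layer wrapped in a GeneralEncoder
    embedding_layer = m.nn.layers.EmbeddingLayer(dimension=embedding_dim,
                                                 device=device)
    encoder = m.encoders.GeneralEncoder(layers=[[embedding_layer]])

    # Decoder: DistMult scoring over the learned embeddings
    decoder = m.nn.decoders.edge.DistMult(num_relations=num_relations,
                                          embedding_dim=embedding_dim,
                                          use_inverse_relations=True,
                                          device=device,
                                          dtype=dtype,
                                          mode="train")

    # Loss function and the metrics we want reported
    loss = m.nn.SoftmaxCrossEntropy(reduction="sum")
    reporter = m.report.LinkPredictionReporter()
    reporter.add_metric(m.report.MeanReciprocalRank())
    reporter.add_metric(m.report.MeanRank())
    reporter.add_metric(m.report.Hitsk(1))
    reporter.add_metric(m.report.Hitsk(10))

    model = m.nn.Model(encoder, decoder, loss, reporter)
    model.optimizers = [m.nn.AdamOptimizer(model.named_parameters(), lr=.1)]
    return model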
3. Create Dataloader
After defining the model we need to define two dataloader objects, one for training and one for evaluation. Dataloader objects handle all the data movement required for training. Marius supports different types of storage backends, such as fully in-memory storage, partition buffers, and flat files. Please refer to the documentation and the original paper for more details.
In this example we are using an in-memory storage backend, where all the data resides in memory. This is set up using the method tensor_from_file(). To define a dataloader object we need to do three things:
First is a simple method call that defines which data needs to be read:
train_edges = m.storage.tensor_from_file(filename=dataset.train_edges_file, shape=[dataset_stats.num_train, -1], dtype=torch.int32, device=device)
Second, in this example we want to use a negative edge sampler, so we define it as follows:
train_neg_sampler = m.data.samplers.CorruptNodeNegativeSampler(num_chunks=10, num_negatives=500, degree_fraction=0.0, filtered=False)
And last, we create the dataloader object itself, which is used during training to fetch data and construct batches:
train_dataloader = m.data.DataLoader(edges=train_edges, node_embeddings=embeddings, batch_size=1000, neg_sampler=train_neg_sampler, learning_task="lp", train=True)
With this we have defined the dataloader for the training task. The example similarly defines a dataloader for evaluation.
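For reference, here is a hedged sketch of what the evaluation dataloader might look like, mirroring the training one. The attribute names dataset.test_edges_file and dataset_stats.num_test, the sampler settings, and filtered=True are assumptions based on the training snippet; embeddings refers to the node-embedding storage object created elsewhere in the example script:

# Evaluation edges stored in memory, analogous to train_edges above
eval_edges = m.storage.tensor_from_file(filename=dataset.test_edges_file,  # assumed attribute name
                                        shape=[dataset_stats.num_test, -1],
                                        dtype=torch.int32, device=device)

# Negative sampler for evaluation; filtered=True is an assumed setting
eval_neg_sampler = m.data.samplers.CorruptNodeNegativeSampler(num_chunks=1,
                                                              num_negatives=1000,
                                                              degree_fraction=0.0,
                                                              filtered=True)

eval_dataloader = m.data.DataLoader(edges=eval_edges,
                                    node_embeddings=embeddings,
                                    batch_size=1000,
                                    neg_sampler=eval_neg_sampler,
                                    learning_task="lp",
                                    train=False)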
4. Train Model
Now we have everything we need to start training. In this example we run multiple epochs of training and evaluation.
For training all we need is the following function:
def train_epoch(model, dataloader):
    dataloader.initializeBatches()
    while dataloader.hasNextBatch():
        batch = dataloader.getBatch()
        model.train_batch(batch)
        dataloader.updateEmbeddings(batch)
All this function does is the following:
- Initialize the batches before the start of the epoch
- While there is a next batch available, fetch it
- Train the model on the fetched batch
- Update the embeddings
5. Inference
Similar to training, evaluation is also pretty simple and can be done using the following function:
def eval_epoch(model, dataloader):
    dataloader.initializeBatches()
    while dataloader.hasNextBatch():
        batch = dataloader.getBatch()
        model.evaluate_batch(batch)
    model.reporter.report()
The function does the following:
- Initialize the batches before the start of every epoch
- While there is a next batch of data available, fetch it
- Evaluate the batch
- Once all batches are done, report the metrics we defined earlier in the reporter
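With both functions in place, the main loop of the example simply alternates them for a number of epochs. A minimal sketch, assuming the model and the two dataloaders defined in the earlier steps (the epoch count is an assumption, not a value from the example file):

num_epochs = 10  # assumed value; pick whatever works for your setup
for epoch in range(num_epochs):
    train_epoch(model, train_dataloader)  # one pass over the training edges
    eval_epoch(model, eval_dataloader)    # evaluate and report the metrics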
6. Save Model
Work in progress - More details will be added soon