Custom Dataset Link Prediction
In this tutorial, we use the OGBN_Arxiv dataset as an example to demonstrate a step-by-step walkthrough, from preprocessing a custom dataset, to defining the configuration file, to training a link prediction model with the DistMult algorithm.
1. Preprocess Dataset
Preprocessing a custom dataset is straightforward with the marius_preprocess command. This command is installed together with marius. See (TODO link) for installation information.
Let’s start by downloading and extracting the OGBN_Arxiv dataset we will use in this example, if it has not been downloaded already (assuming we are in the marius root directory):
$ wget http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip # download original dataset
$ unzip arxiv.zip -d datasets/custom_lp_example/ # extract downloaded dataset
$ gzip -dr datasets/custom_lp_example/arxiv/raw/ # extract raw dataset files
$ gzip -dr datasets/custom_lp_example/arxiv/split/time/ # extract raw split files
After the previous step, we should have the directory datasets/custom_lp_example/arxiv/raw/ created, containing the following raw files downloaded and extracted from the OGBN_Arxiv dataset:
$ ls -1 datasets/custom_lp_example/arxiv/raw/
edge.csv # raw edge list
node-feat.csv # raw node features
node-label.csv # raw node labels
node_year.csv # publication year of each node
num-edge-list.csv # number of edges in the graph
num-node-list.csv # number of nodes in the graph
$ head -5 datasets/custom_lp_example/arxiv/raw/edge.csv
104447,13091
15858,47283
107156,69161
107156,136440
107156,107366
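Note that even though the raw node IDs in edge.csv happen to be integers already, marius_preprocess will still remap them to a contiguous internal ID space during its "Remapping Edges" step; the mapping is written to nodes/node_mapping.txt, which we will see below.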
Assuming marius_preprocess has been built, we preprocess the OGBN_Arxiv dataset by running the following command from the marius root directory:
$ marius_preprocess --output_dir datasets/custom_lp_example/ \
    --edges datasets/custom_lp_example/arxiv/raw/edge.csv \
    --dataset_split 0.8 0.1 0.1 --delim="," --columns 0 1
Preprocess custom dataset
Reading edges
Remapping Edges
Node mapping written to: datasets/custom_lp_example/nodes/node_mapping.txt
Dataset statistics written to: datasets/custom_lp_example/dataset.yaml
In the above command, we set --dataset_split to the list 0.8 0.1 0.1. Under the hood, this splits edge.csv into edges/train_edges.bin, edges/validation_edges.bin, and edges/test_edges.bin based on the given list of fractions. Note that edge.csv contains two columns delimited by a comma, so we set --columns 0 1 and --delim=",".
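Conceptually, the split works like the following sketch. This is only an illustration of how the fractions are applied, not marius_preprocess's actual implementation:

import numpy as np

# Read the raw edge list (two comma-delimited columns: source, destination).
edges = np.loadtxt("datasets/custom_lp_example/arxiv/raw/edge.csv",
                   delimiter=",", dtype=np.int64)

# Shuffle, then split into 80% train / 10% validation / 10% test.
rng = np.random.default_rng(0)
rng.shuffle(edges)
n = len(edges)
n_train, n_valid = int(0.8 * n), int(0.1 * n)
train = edges[:n_train]
valid = edges[n_train:n_train + n_valid]
test = edges[n_train + n_valid:]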
The --edges flag specifies the raw edge list file that marius_preprocess will preprocess (and that we will later train on).
The --output_dir flag specifies where the preprocessed graph will be output and is set by the user. In this example, assume we have not created the datasets/custom_lp_example/ directory; marius_preprocess will create it for us.
For the detailed usage of marius_preprocess, please execute the following command:
$ marius_preprocess -h
Let’s check again what was created inside the datasets/custom_lp_example/ directory:
$ ls -1 datasets/custom_lp_example/
dataset.yaml # input dataset statistics
nodes/
node_mapping.txt # mapping of raw node ids to integer uuids
edges/
test_edges.bin # preprocessed testing edge list
train_edges.bin # preprocessed training edge list
validation_edges.bin # preprocessed validation edge list
arxiv/ # existing arxiv dir
...
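To sanity-check the preprocessed output, the binary edge files can be read back with numpy. A minimal sketch, assuming the preprocessed edges are stored as int32 (source, destination) pairs:

import numpy as np

# Assumption: train_edges.bin stores a flat array of int32 (source, destination) pairs.
train = np.fromfile(
    "datasets/custom_lp_example/edges/train_edges.bin", dtype=np.int32
).reshape(-1, 2)
print(train.shape)  # should be (932994, 2), matching num_train in dataset.yaml
print(train[:5])    # first five remapped edges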
Let’s check what is inside the generated dataset.yaml file:
$ cat datasets/custom_lp_example/dataset.yaml
dataset_dir: /marius-internal/datasets/custom_lp_example/
num_edges: 932994
num_nodes: 169343
num_relations: 1
num_train: 932994
num_valid: 116624
num_test: 116625
node_feature_dim: -1
rel_feature_dim: -1
num_classes: -1
initialized: false
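As a sanity check, the three splits sum to 932994 + 116624 + 116625 = 1,166,243 edges, i.e., the total number of rows in the raw edge.csv split according to the 0.8/0.1/0.1 fractions we passed to --dataset_split. (In this dataset.yaml, num_edges matches num_train.)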
Note
If the above marius_preprocess command fails due to any missing directory errors, please create the <output_directory>/edges and <output_directory>/nodes directories as a workaround.
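With our output directory, that is:
$ mkdir -p datasets/custom_lp_example/edges datasets/custom_lp_example/nodes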
2. Define Configuration File
To train a model, we need to define a YAML configuration file based on the information generated by marius_preprocess.
The configuration file contains information including but not limited to the inputs to the model, the training procedure, and the hyperparameters to optimize. Given a configuration file, marius assembles a model depending on the given parameters. The configuration file is grouped into four sections:
Model: Defines the architecture of the model, neighbor sampling configuration, loss, and optimizer(s).
Storage: Specifies the input dataset and how to store the graph, features, and embeddings.
Training: Sets options for the training procedure and hyperparameters, e.g., batch size and negative sampling.
Evaluation: Sets options for the evaluation procedure (if any). The options here are similar to those in the training section.
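Putting these four sections together, the top-level skeleton of the configuration file we will build below looks like this:

model:
  # architecture, loss, and optimizers
storage:
  # input dataset and storage backends
training:
  # training procedure and hyperparameters
evaluation:
  # evaluation procedure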
For the full configuration schema, please refer to docs/config_interface.
An example YAML configuration file for the OGBN_Arxiv dataset (link prediction model with DistMult) is given in examples/configuration/custom_lp.yaml. Note that dataset_dir is set to the preprocessing output directory, in our example, datasets/custom_lp_example/.
Let’s create the same YAML configuration file for the OGBN_Arxiv dataset from scratch. We follow the structure of the configuration file and create each of the four sections one by one. In a YAML file, indentation is used to denote nesting and all parameters are in the format of key-value pairs.
Note
String values in the configuration file are case insensitive, but by convention we use capital letters.
First, we define the model. We begin by setting all required parameters. This includes learning_task, encoder, decoder, and loss. The rest of the configuration can be fine-tuned by the user.

model:
  learning_task: LINK_PREDICTION # set the learning task to link prediction
  encoder:
    layers:
      - - type: EMBEDDING # set the encoder to be an embedding table with 50-dimensional embeddings
          output_dim: 50
  decoder:
    type: DISTMULT # set the decoder to DistMult
    options:
      input_dim: 50
  loss:
    type: SOFTMAX_CE
    options:
      reduction: SUM
  dense_optimizer: # optimizer to use for dense model parameters; in this case these are the DistMult relation (edge-type) embeddings
    type: ADAM
    options:
      learning_rate: 0.1
  sparse_optimizer: # optimizer to use for the node embedding table
    type: ADAGRAD
    options:
      learning_rate: 0.1
storage: # omit
training: # omit
evaluation: # omit
Next, we set the storage and dataset. We begin by setting all required parameters. This includes dataset. Here, dataset_dir is set to datasets/custom_lp_example/, which is the preprocessing output directory.

model: # omit
storage:
  device_type: cuda
  dataset:
    dataset_dir: /marius-internal/datasets/custom_lp_example/
  edges:
    type: DEVICE_MEMORY
  embeddings:
    type: DEVICE_MEMORY
  save_model: true
training: # omit
evaluation: # omit
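Here device_type: cuda places training on the GPU, and with edges and embeddings both set to DEVICE_MEMORY, the edge list and the node embedding table are kept entirely in GPU memory, which is feasible because OGBN_Arxiv is small. Setting save_model: true writes the trained parameters and embeddings to disk when training completes; we will inspect the resulting files in Section 3.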
Lastly, we configure training and evaluation. We begin by setting all required parameters. This includes num_epochs and negative_sampling. We set num_epochs: 10 (train for 10 epochs) to demonstrate this example. Note that negative_sampling is required for link prediction.

model: # omit
storage: # omit
training:
  batch_size: 1000
  negative_sampling:
    num_chunks: 10
    negatives_per_positive: 500
    degree_fraction: 0.0
    filtered: false
  num_epochs: 10
  pipeline:
    sync: true
  epochs_per_shuffle: 1
evaluation:
  batch_size: 1000
  negative_sampling:
    filtered: true
  pipeline:
    sync: true
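To make the negative sampling settings concrete: with batch_size: 1000 and num_chunks: 10, each batch of 1000 positive edges is divided into 10 chunks of 100 positives, and, as we understand the chunked sampling scheme, each chunk shares one set of 500 sampled negatives (negatives_per_positive: 500). Every positive edge is thus scored against 500 negatives while sampling far fewer than 1000 × 500 negative nodes per batch. With degree_fraction: 0.0, all negatives are drawn uniformly at random rather than from nodes appearing in the batch. Training uses unfiltered negatives (filtered: false), while evaluation uses filtered ranking (filtered: true), meaning corrupted edges that actually exist in the graph are excluded when computing ranks.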
3. Train Model
After defining our configuration file, training is run with marius_train <your_config.yaml>.
We can now train our example using the configuration file we just created by running the following command (assuming we are in the marius root directory):
$ marius_train datasets/custom_lp_example/custom_lp.yaml
[2022-04-04 17:11:53.029] [info] [marius.cpp:45] Start initialization
[04/04/22 17:11:57.581] Initialization Complete: 4.552s
[04/04/22 17:11:57.650] ################ Starting training epoch 1 ################
[04/04/22 17:11:57.824] Edges processed: [94000/932994], 10.08%
[04/04/22 17:11:57.988] Edges processed: [188000/932994], 20.15%
[04/04/22 17:11:58.153] Edges processed: [282000/932994], 30.23%
[04/04/22 17:11:58.317] Edges processed: [376000/932994], 40.30%
[04/04/22 17:11:58.484] Edges processed: [470000/932994], 50.38%
[04/04/22 17:11:58.650] Edges processed: [564000/932994], 60.45%
[04/04/22 17:11:58.817] Edges processed: [658000/932994], 70.53%
[04/04/22 17:11:59.008] Edges processed: [752000/932994], 80.60%
[04/04/22 17:11:59.200] Edges processed: [846000/932994], 90.68%
[04/04/22 17:11:59.408] Edges processed: [932994/932994], 100.00%
[04/04/22 17:11:59.408] ################ Finished training epoch 1 ################
[04/04/22 17:11:59.408] Epoch Runtime: 1758ms
[04/04/22 17:11:59.408] Edges per Second: 530713.3
[04/04/22 17:11:59.408] Evaluating validation set
[04/04/22 17:12:00.444]
=================================
Link Prediction: 116624 edges evaluated
Mean Rank: 10927.984317
MRR: 0.088246
Hits@1: 0.043936
Hits@3: 0.091285
Hits@5: 0.123697
Hits@10: 0.176499
Hits@50: 0.337538
Hits@100: 0.414872
=================================
[04/04/22 17:12:00.444] Evaluating test set
[04/04/22 17:12:01.470]
=================================
Link Prediction: 116625 edges evaluated
Mean Rank: 10928.291687
MRR: 0.088237
Hits@1: 0.043798
Hits@3: 0.091670
Hits@5: 0.123190
Hits@10: 0.176377
Hits@50: 0.337749
Hits@100: 0.414697
=================================
After running this configuration for 10 epochs, we should see a result similar to below:
[04/04/22 17:12:32.312] ################ Starting training epoch 10 ################
[04/04/22 17:12:32.475] Edges processed: [94000/932994], 10.08%
[04/04/22 17:12:32.638] Edges processed: [188000/932994], 20.15%
[04/04/22 17:12:32.800] Edges processed: [282000/932994], 30.23%
[04/04/22 17:12:32.963] Edges processed: [376000/932994], 40.30%
[04/04/22 17:12:33.126] Edges processed: [470000/932994], 50.38%
[04/04/22 17:12:33.313] Edges processed: [564000/932994], 60.45%
[04/04/22 17:12:33.500] Edges processed: [658000/932994], 70.53%
[04/04/22 17:12:33.666] Edges processed: [752000/932994], 80.60%
[04/04/22 17:12:33.835] Edges processed: [846000/932994], 90.68%
[04/04/22 17:12:33.988] Edges processed: [932994/932994], 100.00%
[04/04/22 17:12:33.988] ################ Finished training epoch 10 ################
[04/04/22 17:12:33.988] Epoch Runtime: 1676ms
[04/04/22 17:12:33.988] Edges per Second: 556679
[04/04/22 17:12:33.988] Evaluating validation set
[04/04/22 17:12:35.010]
=================================
Link Prediction: 116624 edges evaluated
Mean Rank: 5765.685716
MRR: 0.132049
Hits@1: 0.048926
Hits@3: 0.149883
Hits@5: 0.210797
Hits@10: 0.304637
Hits@50: 0.536768
Hits@100: 0.626072
=================================
[04/04/22 17:12:35.011] Evaluating test set
[04/04/22 17:12:36.034]
=================================
Link Prediction: 116625 edges evaluated
Mean Rank: 5797.073741
MRR: 0.132749
Hits@1: 0.049406
Hits@3: 0.151588
Hits@5: 0.211944
Hits@10: 0.304437
Hits@50: 0.536549
Hits@100: 0.626006
=================================
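Comparing with the results after epoch 1, ten epochs of training improved the validation MRR from 0.088 to 0.132 and reduced the mean rank from roughly 10,928 to 5,766.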
Let’s check again what was added in the datasets/custom_lp_example/ directory. For clarity, we only list the files that were created during training. Notice that several files have been created, including the trained model, the embedding table, a full configuration file, and output logs:
$ ls datasets/custom_lp_example/
model.pt # contains the dense model parameters, i.e., the embeddings of the edge-types
model_state.pt # optimizer state of the trained model parameters
full_config.yaml # detailed config generated based on user-defined config
metadata.csv # information about metadata
logs/ # logs containing output, error, and debug information, etc.
nodes/
embeddings.bin # trained node embeddings of the graph
embeddings_state.bin # node embedding optimizer state
...
edges/
...
...
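If you want to inspect the trained node embeddings outside of marius, the table can be read directly from the binary file. A minimal sketch, assuming the embeddings are stored as a contiguous float32 array of shape (num_nodes, 50), matching num_nodes: 169343 from dataset.yaml and output_dim: 50 from our config:

import numpy as np

NUM_NODES = 169343  # from dataset.yaml
EMB_DIM = 50        # from the model config (output_dim)

# Assumption: marius stores the node embeddings as a flat float32 array.
emb = np.fromfile(
    "datasets/custom_lp_example/nodes/embeddings.bin", dtype=np.float32
).reshape(NUM_NODES, EMB_DIM)

print(emb.shape)   # (169343, 50)
print(emb[0][:5])  # first 5 dimensions of node 0's embedding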
Note
model.pt contains the dense model parameters. For DistMult, these are the embeddings of the edge-types. For GNN encoders, this file will include the GNN parameters.