Batch Inference (marius_predict)
This document contains an overview of the inference module for link prediction and node classification models trained using the configuration API. The module supports both in-memory and disk-based inference.
The input test set can already be preprocessed, or it can be supplied in the raw input format and preprocessed (partitioned, remapped, and converted to binary format) before evaluation.
Link Prediction
Input
A configuration file for a previously trained link prediction model
A set of test edges (preprocessed or unpreprocessed)
A list of metrics to compute (optional)
Negative sampling configuration (optional)
Output
Text file containing a summary of metrics for the evaluation set: <output_dir>/metrics.txt
(optional) CSV file where each row denotes an edge and its corresponding link prediction score and rank: <output_dir>/scores.csv
Example Usage
marius_predict --config configs/fb15k237.yaml --metrics mrr mr hits3 hits5 hits10 hits50 hits100 hits2129 --save_ranks --save_scores --output_dir results/
This command takes in a trained configuration file, configs/fb15k237.yaml, which defines a model that has been previously trained. The listed metrics are computed over the test set and output to results/metrics.txt, and the rank and score for each test edge are output to results/scores.csv. The test set here was created during preprocessing and is stored in <storage.dataset.dataset_dir>/edges/test_edges.bin.
Contents of configs/fb15k237.yaml:
model:
  learning_task: LINK_PREDICTION
  encoder:
    layers:
      - - type: EMBEDDING
          output_dim: 10
          bias: true
          init:
            type: GLOROT_NORMAL
  decoder:
    type: DISTMULT
  loss:
    type: SOFTMAX_CE
    options:
      reduction: SUM
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.01
  sparse_optimizer:
    type: ADAGRAD
    options:
      learning_rate: 0.1
storage:
  device_type: cpu
  dataset:
    dataset_dir: ./fb15k_237_example/
  edges:
    type: HOST_MEMORY
    options:
      dtype: int
  embeddings:
    type: HOST_MEMORY
    options:
      dtype: float
training:
  batch_size: 1000
  negative_sampling:
    num_chunks: 10
    negatives_per_positive: 10
    degree_fraction: 0
    filtered: false
  num_epochs: 10
  pipeline:
    sync: true
evaluation:
  batch_size: 1000
  negative_sampling:
    filtered: true
  pipeline:
    sync: true
Since storage.model_dir is not specified in the above configuration, marius_predict will use the latest trained model present in storage.dataset.dataset_dir. When storage.model_dir is not specified, marius_train stores the model parameters in a model_x directory within storage.dataset.dataset_dir, where x increments from 0 to 10. At most 11 models are stored this way; once model_10 exists, its contents are overwritten with the latest parameters. marius_predict will use the latest model for inference and save the generated files to that directory. If storage.model_dir is specified, the model parameters will be loaded from the given directory and the generated files will be saved there.
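If a script needs to locate that directory, here is a minimal sketch in Python (picking the most recently modified model_x directory is an illustrative heuristic, not marius_predict's internal logic):

import os
import re

def latest_model_dir(dataset_dir):
    # Collect the model_x subdirectories produced by marius_train.
    candidates = [
        os.path.join(dataset_dir, d)
        for d in os.listdir(dataset_dir)
        if re.fullmatch(r"model_\d+", d)
        and os.path.isdir(os.path.join(dataset_dir, d))
    ]
    if not candidates:
        raise FileNotFoundError("no model_x directories in " + dataset_dir)
    # The highest index is not necessarily the newest once model_10 is
    # being recycled, so sort by modification time instead.
    return max(candidates, key=os.path.getmtime)

print(latest_model_dir("./fb15k_237_example/"))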
Example output
Two files are output by the above command:
- metrics.txt
Link Prediction: 40932 edges evaluated
MRR: 0.125147
Mean Rank: 426.079766
Hits@3: 0.156259
Hits@5: 0.207148
Hits@10: 0.285229
Hits@50: 0.510383
Hits@100: 0.598725
Hits@2129: 0.947987
- scores.csv
src,rel,dst,rank,score
14469,149,11486,26,32.206722
8558,74,7904,2789,5.628761
3160,73,8048,282,7.548909
7240,168,4510,149,1.634745
2393,211,10586,2,96.834641
12773,198,5262,3136,9.098152
11469,88,8946,18,15.922592
2045,166,3344,289,0.407495
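The ranking metrics in metrics.txt can be recomputed directly from the rank column of scores.csv. A minimal sketch in Python, assuming the header shown above:

import csv

# Recompute MRR and Hits@10 from the rank column of results/scores.csv.
with open("results/scores.csv") as f:
    ranks = [int(row["rank"]) for row in csv.DictReader(f)]

mrr = sum(1.0 / r for r in ranks) / len(ranks)     # mean reciprocal rank
hits10 = sum(r <= 10 for r in ranks) / len(ranks)  # fraction with rank <= 10
print(f"MRR: {mrr:.6f}  Hits@10: {hits10:.6f}")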
Input a new test set
If the dataset does not have a predefined test set (e.g. storage.dataset.num_test == 0), users can specify a separate test set with --input_file <path_to_test_set>. This test set can either be preprocessed and in binary format, or unpreprocessed.
Preprocessed input test set usage:
marius_predict --config configs/fb15k237.yaml --input_file test_edges.bin --metrics mrr --save_ranks --save_scores --output_dir results/
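For illustration, such a binary test set could be produced as follows. This sketch assumes edges are stored as a flat array of (src, rel, dst) triples in the dtype from the storage configuration above (dtype: int, i.e. int32), with ids already remapped by preprocessing; verify the layout against your dataset's preprocessing output before relying on it.

import numpy as np

# Hypothetical remapped (src, rel, dst) triples, written as a flat int32
# binary file to match the assumed edge storage layout.
edges = np.array(
    [[14469, 149, 11486],
     [8558, 74, 7904],
     [3160, 73, 8048]],
    dtype=np.int32,
)
edges.tofile("test_edges.bin")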
Unpreprocessed input test set usage:
If the input test set is unpreprocessed and in some raw input format, the --preprocess_input flag can be given. Users will need to specify the format of their input with --input_format <format>. Currently, only delimited formats are supported.
marius_predict --config configs/fb15k237.yaml --input_file test_edges.csv --preprocess_input --input_format CSV --metrics mrr --save_ranks --save_scores --output_dir results/
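A raw delimited test set is simply one edge per row. The sketch below writes a small hypothetical CSV with src, edge-type, dst columns in that order (use --columns if your file orders them differently); the ids are placeholder raw identifiers, which preprocessing will remap.

import csv

# Write a hypothetical raw test set: one (src, edge-type, dst) edge per row.
rows = [
    ("node_a", "rel_x", "node_b"),
    ("node_c", "rel_y", "node_d"),
]
with open("test_edges.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)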
Node Classification
Input
A configuration file for a previously trained node classification model
A set of test nodes (preprocessed or unpreprocessed)
A list of metrics to compute (optional)
Output
Text file containing a summary of metrics for the evaluation set: <output_dir>/metrics.txt
(optional) CSV file where each row denotes a node and its corresponding node classification label: <output_dir>/labels.csv
Example Usage
marius_predict --config configs/arxiv.yaml --metrics accuracy --save_labels --output_dir results/
This command takes in a trained configuration file, configs/arxiv.yaml, which defines the previously trained model. The listed metrics are computed over the test set and output to results/metrics.txt, and the predicted label for each test node is output to results/labels.csv.
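The saved labels can then be consumed downstream. A minimal sketch, assuming labels.csv has a header row like scores.csv and the predicted label in its last column (the exact columns are not specified in this document):

import csv
from collections import Counter

# Tally the predicted classes from results/labels.csv.
with open("results/labels.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the assumed header row
    labels = [row[-1] for row in reader]

print(Counter(labels).most_common(5))  # five most frequent predicted classes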
Command line arguments
Below is the help message for the tool, containing an overview of the tool's arguments and usage.
$ marius_predict --help
usage: predict [-h] --config config [--output_dir output_dir] [--metrics [metrics ...]] [--save_labels] [--save_scores] [--save_ranks]
               [--batch_size batch_size] [--num_nbrs num_nbrs] [--num_negs num_negs] [--num_chunks num_chunks] [--deg_frac deg_frac]
               [--filtered filtered] [--input_file input_file] [--input_format input_format] [--preprocess_input preprocess_input]
               [--columns columns] [--header_length header_length] [--delim delim] [--dtype dtype]

Tool for performing link prediction or node classification inference with trained models.

Link prediction example usage:
marius_predict <trained_config> --output_dir results/ --metrics mrr mean_rank hits1 hits10 hits50 --save_scores --save_ranks

Assuming <trained_config> contains a link prediction model, this command will perform link prediction evaluation over the test set of edges provided in the config file. Metrics are saved to results/metrics.csv and scores and ranks for each test edge are saved to results/scores.csv

Node classification example usage:
marius_predict <trained_config> --output_dir results/ --metrics accuracy --save_labels

This command will perform node classification evaluation over the test set of nodes provided in the config file. Metrics are saved to results/metrics.csv and labels for each test node are saved to results/labels.csv

Custom inputs:
The test set can be directly specified by setting --input_file <test_set_file>. If the test set has not been preprocessed, then --preprocess_input should be enabled. The default format is a binary file, but additional formats can be specified with --input_format.

optional arguments:
  -h, --help            show this help message and exit
  --config config       Configuration file for trained model
  --output_dir output_dir
                        Path to output directory
  --metrics [metrics ...]
                        List of metrics to report.
  --save_labels         (Node Classification) If true, the node classification labels of each test node will be saved to <output_dir>/labels.csv
  --save_scores         (Link Prediction) If true, the link prediction scores of each test edge will be saved to <output_dir>/scores.csv
  --save_ranks          (Link Prediction) If true, the link prediction ranks of each test edge will be saved to <output_dir>/scores.csv
  --batch_size batch_size
                        Number of examples to evaluate at a time.
  --num_nbrs num_nbrs   Number of neighbors to sample for each GNN layer. If not provided, then the module will check if the output of the encoder has been saved after training (see storage.export_encoded_nodes). If the encoder outputs exist, the module will skip the encode step (incl. neighbor sampling) and only perform the decode over the saved inputs. If encoder outputs are not saved, model.encoder.eval_neighbor_sampling will be used for the neighbor sampling configuration. If model.encoder.eval_neighbor_sampling does not exist, then model.encoder.train_neighbor_sampling will be used. If none of the above are given, then the model is assumed to not require neighbor sampling.
  --num_negs num_negs   (Link Prediction) Number of negatives to compare per positive edge for link prediction. If -1, then all nodes are used as negatives. Otherwise, num_neg*num_chunks nodes will be sampled and used as negatives. If not provided, the evaluation.negative_sampling configuration will be used. If evaluation.negative_sampling is not provided, then negative sampling will not occur and only the scores for the input edges will be computed; this means that any ranking metrics cannot be calculated.
  --num_chunks num_chunks
                        (Link Prediction) Specifies the amount of reuse of negative samples. A given set of num_neg sampled nodes will be reused to corrupt (batch_size // num_chunks) edges.
  --deg_frac deg_frac   (Link Prediction) Specifies the fraction of the num_neg nodes sampled as negatives that should be sampled according to their degree. This sampling procedure approximates degree-based sampling by sampling nodes that appear in the current batch of edges.
  --filtered filtered   (Link Prediction) If true, then false negative samples will be filtered out. This is only supported when evaluating with all nodes.
  --input_file input_file
                        Path to input file containing the test set. If not provided, then the test set described in the configuration file will be used.
  --input_format input_format
                        Format of the input file to test. Options are [BINARY, CSV, TSV, DELIMITED] files. If DELIMITED, then --delim must be specified.
  --preprocess_input preprocess_input
                        If true, the input file (if provided) will be preprocessed before evaluation.
  --columns columns     List of column ids of input delimited file which denote the src node, edge-type, and dst node of edges. E.g. columns=[0, 2, 1] means that the source nodes are found in the first column of the file, the edge-types are found in the third column, and the destination nodes are found in the second column. For graphs without edge types, only the locations of the node columns need to be provided. E.g. [0, 1]. If the input file contains node ids rather than edges, then only a single id is needed. E.g. [2]
  --header_length header_length
                        Length of the header for input delimited file
  --delim delim         Delimiter for input file
  --dtype dtype         Datatype of input file elements. Defaults to the dtype of the dataset specified in the configuration file.
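Putting the custom-input flags together, a hypothetical pipe-delimited test file (the file name and delimiter here are placeholders) could be evaluated with:

marius_predict --config configs/fb15k237.yaml --input_file test_edges.txt --preprocess_input --input_format DELIMITED --delim '|' --metrics mrr --save_ranks --save_scores --output_dir results/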