Batch Inference (marius_predict)

This document contains an overview of the inference module for link prediction and node classification models trained using the configuration API. The module supports both in-memory and disk-based inference.

The input test set can either be supplied already preprocessed, or be in the raw input format, in which case it is preprocessed (partitioned, remapped, and converted to binary format) before evaluation.

Node Classification

Input

A configuration file for a previously trained node classification model

A set of test nodes (preprocessed or unpreprocessed)

A list of metrics to compute (optional)

Output

Text file containing a summary of metrics for the evaluation set: <output_dir>/metrics.txt (optional)

CSV file where each row denotes a node and its corresponding node classification label: <output_dir>/labels.csv (optional)

Example Usage

marius_predict --config configs/arxiv.yaml --metrics accuracy --save_labels --output_dir results/

This command takes in a trained configuration file, configs/arxiv.yaml, which defines the previously trained model.

The specified metrics are computed over the test set and output to results/metrics.txt. The predicted label for each test node is output to results/labels.csv.
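The saved labels file can then be consumed with standard CSV tooling. Below is a minimal sketch of reading it back, assuming each row of labels.csv holds a single predicted class label per test node (adjust the parsing if your file also carries node ids or extra columns):

```python
import csv
from collections import Counter


def read_predicted_labels(path):
    """Read a labels.csv produced by marius_predict.

    Assumes one predicted class label per row; this column layout is an
    assumption for illustration, not a guarantee of the tool's output.
    """
    with open(path, newline="") as f:
        return [int(row[0]) for row in csv.reader(f) if row]


def label_distribution(labels):
    """Count how many test nodes were assigned each class."""
    return Counter(labels)
```

For example, `label_distribution(read_predicted_labels("results/labels.csv"))` gives a quick per-class breakdown of the predictions.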

Command line arguments

Below is the help message for the tool, containing an overview of the tool's arguments and usage.

$ marius_predict --help
usage: predict [-h] --config config [--output_dir output_dir] [--metrics [metrics ...]] [--save_labels] [--save_scores] [--save_ranks] [--batch_size batch_size] [--num_nbrs num_nbrs]
               [--num_negs num_negs] [--num_chunks num_chunks] [--deg_frac deg_frac] [--filtered filtered] [--input_file input_file] [--input_format input_format] [--preprocess_input preprocess_input]
               [--columns columns] [--header_length header_length] [--delim delim] [--dtype dtype]

Tool for performing link prediction or node classification inference with trained models.

Link prediction example usage:
marius_predict <trained_config> --output_dir results/ --metrics mrr mean_rank hits1 hits10 hits50 --save_scores --save_ranks
Assuming <trained_config> contains a link prediction model, this command will perform link prediction evaluation over the test set of edges provided in the config file. Metrics are saved to results/metrics.csv and scores and ranks for each test edge are saved to results/scores.csv

Node classification example usage:
marius_predict <trained_config> --output_dir results/ --metrics accuracy --save_labels
This command will perform node classification evaluation over the test set of nodes provided in the config file. Metrics are saved to results/metrics.csv and labels for each test node are saved to results/labels.csv

Custom inputs:
The test set can be specified directly by setting --input_file <test_set_file>. If the test set has not been preprocessed, then --preprocess_input should be enabled. The default format is a binary file, but additional formats can be specified with --input_format.

optional arguments:
  -h, --help            show this help message and exit
  --config config       Configuration file for trained model
  --output_dir output_dir
                        Path to output directory
  --metrics [metrics ...]
                        List of metrics to report.
  --save_labels         (Node Classification) If true, the node classification labels of each test node will be saved to <output_dir>/labels.csv
  --save_scores         (Link Prediction) If true, the link prediction scores of each test edge will be saved to <output_dir>/scores.csv
  --save_ranks          (Link Prediction) If true, the link prediction ranks of each test edge will be saved to <output_dir>/scores.csv
  --batch_size batch_size
                        Number of examples to evaluate at a time.
  --num_nbrs num_nbrs   Number of neighbors to sample for each GNN layer. If not provided, then the module will check whether the output of the encoder was saved after training (see
                        storage.export_encoded_nodes). If the encoder outputs exist, then the module will skip the encode step (incl. neighbor sampling) and only perform the decode over the saved
                        inputs. If encoder outputs are not saved, model.encoder.eval_neighbor_sampling will be used for the neighbor sampling configuration. If model.encoder.eval_neighbor_sampling does
                        not exist, then model.encoder.train_neighbor_sampling will be used. If none of the above are given, then the model is assumed to not require neighbor sampling.
  --num_negs num_negs   (Link Prediction) Number of negatives to compare per positive edge for link prediction. If -1, then all nodes are used as negatives. Otherwise, num_negs*num_chunks nodes will be
                        sampled and used as negatives. If not provided, the evaluation.negative_sampling configuration will be used. If evaluation.negative_sampling is not provided, then negative
                        sampling will not occur and only the scores for the input edges will be computed; this means that ranking metrics cannot be calculated.
  --num_chunks num_chunks
                        (Link Prediction) Specifies the amount of reuse of negative samples. A given set of num_negs sampled nodes will be reused to corrupt (batch_size // num_chunks) edges.
  --deg_frac deg_frac   (Link Prediction) Specifies the fraction of the num_negs nodes sampled as negatives that should be sampled according to their degree. This sampling procedure approximates
                        degree-based sampling by sampling nodes that appear in the current batch of edges.
  --filtered filtered   (Link Prediction) If true, then false negative samples will be filtered out. This is only supported when evaluating with all nodes.
  --input_file input_file
                        Path to input file containing the test set, if not provided then the test set described in the configuration file will be used.
  --input_format input_format
                        Format of the input file to test. Options are [BINARY, CSV, TSV, DELIMITED] files. If DELIMITED, then --delim must be specified.
  --preprocess_input preprocess_input
                        If true, the input file (if provided) will be preprocessed before evaluation.
  --columns columns     List of column ids of the input delimited file which denote the src node, edge-type, and dst node of edges. E.g. columns=[0, 2, 1] means that the source nodes are found in the
                        first column of the file, the edge-types are found in the third column, and the destination nodes are found in the second column. For graphs without edge types, only the locations
                        of the node columns need to be provided, e.g. [0, 1]. If the input file contains node ids rather than edges, then only a single id is needed, e.g. [2].
  --header_length header_length
                        Length of the header for input delimited file
  --delim delim         Delimiter for input file
  --dtype dtype         Datatype of input file elements. Defaults to the dataset specified in the configuration file.
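The interaction between --batch_size, --num_negs, and --num_chunks described above can be sketched with a little arithmetic. This is an illustration of the negative-reuse scheme only, not Marius code:

```python
def negative_sampling_layout(batch_size, num_negs, num_chunks):
    """Illustrate how negatives are shared across a batch.

    As described in the help text: num_negs * num_chunks nodes are sampled
    in total, and each chunk of num_negs sampled nodes is reused to corrupt
    (batch_size // num_chunks) positive edges.
    """
    total_sampled = num_negs * num_chunks
    edges_per_chunk = batch_size // num_chunks
    # Each positive edge is compared against the num_negs negatives of its
    # chunk, so every edge still sees num_negs corruptions.
    comparisons_per_edge = num_negs
    return total_sampled, edges_per_chunk, comparisons_per_edge
```

For example, with batch_size=1000, num_negs=100, and num_chunks=10, only 1000 negative nodes are sampled in total, each chunk of 100 negatives is reused across 100 edges, and each edge is still scored against 100 negatives.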