Model exporting tool (marius_postprocess)

This page describes the marius_postprocess tool, which can convert and export trained models to csv or parquet formats.

Currently this tool is in memory only, thus the embedding table(s) must fit in CPU memory to perform the export.

Example Usage

Train the fb15k_237 example model:

marius_preprocess --dataset fb15k_237 --output_directory datasets/fb15k_237_example/
marius_train examples/configuration/fb15k_237.yaml

The trained model is located at: datasets/fb15k_237_example/model_0

Export the model to CSV format:

marius_postprocess --model_dir datasets/fb15k_237_example/model_0 --format csv --output_dir my_output_dir

The output of the command should look like:

Wrote my_output_dir/embeddings.csv: shape (14541, 2)
Wrote my_output_dir/relation_embeddings.csv: shape (237, 2)
Wrote my_output_dir/inverse_relation_embeddings.csv: shape (237, 2)
Wrote my_output_dir/model.pt

From the above we can see that the node and edge-type (relation) embeddings have been written to CSV files. model.pt is a pytorch model file which may contain additional model parameters (GNN weights).

The output files contain two columns (id, embedding). The number of rows corresponds to the number of nodes or number of edge-types.

Note that the CSV format may not be ideal for exporting the embedding table(s), as the embedding vectors are converted to text representations. The model can be exported in parquet format by using --format parquet, or copied in raw format with --format bin.

Command line arguments

Below is the help message for the tool, containing an overview of the tools arguments and usage.

$ marius_postprocess --help
usage: postprocess [-h] [--model_dir model_dir] [--format format] [--delim delim] [--output_dir output_dir] [--overwrite]

Convert trained embeddings to desired output format and output to specified directory.

Example usage:
marius_postprocess --model_dir foo --format csv --output_dir bar

optional arguments:
  -h, --help            show this help message and exit
  --model_dir model_dir
                        Directory of the trained model
  --format format, -f format
                        Format of output embeddings. Choices are [csv, parquet, binary]
  --delim delim         Delimiter to use for the output CSV
  --output_dir output_dir
                        Output directory, if not provided the model directory will be used.
  --overwrite           If enabled, the output directory will be overwritten