Command Line Preprocessing

The preprocessing procedure takes datasets in their raw format and converts them to the input format required by Marius.

Built-in datasets

Preprocessing the FB15K-237 knowledge graph

$ marius_preprocess --dataset fb15k_237 --output_directory datasets/fb15k_237_example/
Downloading FB15K-237.2.zip to datasets/fb15k_237_example/FB15K-237.2.zip
Reading edges
Remapping Edges
Node mapping written to: datasets/fb15k_237_example/nodes/node_mapping.txt
Relation mapping written to: datasets/fb15k_237_example/edges/relation_mapping.txt
Dataset statistics written to: datasets/fb15k_237_example/dataset.yaml

The --dataset flag specifies which of the built-in datasets marius_preprocess will download and preprocess.

The --output_directory flag specifies where the preprocessed graph will be written and is set by the user. In this example, assume we have not created the datasets/fb15k_237_example directory; marius_preprocess will create it for us.

See Usage for detailed options.

Here are the contents of the output directory after preprocessing:

$ ls -l datasets/fb15k_237_example/
dataset.yaml                       # input dataset statistics
nodes/
  node_mapping.txt                 # mapping of raw node ids to remapped integer ids
edges/
  relation_mapping.txt             # mapping of raw edge (relation) ids to remapped integer ids
  test_edges.bin                   # preprocessed testing edge list
  train_edges.bin                  # preprocessed training edge list
  validation_edges.bin             # preprocessed validation edge list
train.txt                          # raw training edge list
test.txt                           # raw testing edge list
valid.txt                          # raw validation edge list
text_cvsc.txt                      # relation triples as used in Toutanova and Chen, CVSC-2015
text_emnlp.txt                     # relation triples as used in Toutanova et al., EMNLP-2015
README.txt                         # README of the downloaded FB15K-237 dataset
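
The .bin files are flat binary arrays rather than text. As a minimal sketch for inspecting them (assuming the preprocessed layout is contiguous int32 [source, relation, destination] triples; the exact dtype and per-split edge counts are recorded in dataset.yaml, so verify them there first):

import numpy as np

base = "datasets/fb15k_237_example"

# ASSUMPTION: each *_edges.bin file stores contiguous integer triples
# [source, relation, destination]; confirm the dtype and edge counts
# against dataset.yaml before relying on this layout.
edges = np.fromfile(f"{base}/edges/train_edges.bin", dtype=np.int32)
edges = edges.reshape(-1, 3)  # one row per edge

print("train edges:", edges.shape[0])
print(edges[:5])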

List of built-in datasets

# node classification
ogbn_arxiv
ogbn_products
ogbn_papers100m
ogb_mag240m

# link prediction
fb15k
fb15k_237
livejournal
twitter
freebase86m
ogbl_wikikg2
ogbl_citation2
ogbl_ppa
ogb_wikikg90mv2
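
Any of these names can be passed to --dataset. For example, a node classification dataset is preprocessed the same way as the link prediction example above (the output path here is just illustrative):

$ marius_preprocess --dataset ogbn_arxiv --output_directory datasets/ogbn_arxiv_example/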

Custom datasets

Datasets in delimited file formats, such as CSVs, can be preprocessed with marius_preprocess.

See this example.
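
As a rough illustration (the file name and split fractions below are hypothetical, not defaults), a custom comma-delimited edge list could be preprocessed with:

$ marius_preprocess --output_directory datasets/custom_example/ \
                    --edges my_edges.csv \
                    --delim "," \
                    --dataset_split 0.8 0.1 0.1 \
                    --columns 0 1 2

Here --dataset_split splits the single input file into the specified fractions (train/validation/test in this sketch), and --columns tells the preprocessor which columns of the delimited file hold the source node, edge type, and destination node.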

Usage

usage: marius_preprocess [-h] [--output_directory output_directory] [--edges edges [edges ...]] [--dataset dataset] [--num_partitions num_partitions] [--partitioned_eval] [--delim delim]
                  [--dataset_split dataset_split [dataset_split ...]] [--overwrite] [--spark] [--no_remap_ids] [--columns [columns [columns ...]]]

Preprocess built-in datasets and custom link prediction datasets

optional arguments:
  -h, --help            show this help message and exit
  --output_directory output_directory
                        Directory to put graph data
  --edges edges [edges ...]
                        File(s) containing the edge list(s) for a custom dataset
  --dataset dataset     Name of dataset to preprocess
  --num_partitions num_partitions
                        Number of node partitions
  --partitioned_eval    If true, the validation and/or the test set will be partitioned.
  --delim delim, -d delim
                        Delimiter to use for delimited file inputs
  --dataset_split dataset_split [dataset_split ...], -ds dataset_split [dataset_split ...]
                        Split dataset into specified fractions
  --overwrite           If true, the preprocessed dataset will be overwritten if it already exists
  --spark               If true, pyspark will be used to perform the preprocessing
  --no_remap_ids        If true, the node ids of the input dataset will not be remapped to random integer ids.
  --columns [columns [columns ...]]
                        List of column ids of input delimited files which
                        denote the src node, edge-type, and dst node of edges.
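
For example, the following invocation preprocesses the built-in freebase86m dataset and partitions its nodes (the partition count and output path are illustrative):

$ marius_preprocess --dataset freebase86m --num_partitions 16 --output_directory datasets/freebase86m_p16/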