Command Line Preprocessing
The preprocessing procedure takes datasets in their raw format and converts them to the input format required by Marius.
Built-in datasets
Preprocessing the FB15K-237 knowledge graph
$ marius_preprocess --dataset fb15k_237 --output_directory datasets/fb15k_237_example/
Downloading FB15K-237.2.zip to datasets/fb15k_237_example/FB15K-237.2.zip
Reading edges
Remapping Edges
Node mapping written to: datasets/fb15k_237_example/nodes/node_mapping.txt
Relation mapping written to: datasets/fb15k_237_example/edges/relation_mapping.txt
Dataset statistics written to: datasets/fb15k_237_example/dataset.yaml
The --dataset
flag specifies which built-in dataset marius_preprocess
will download and preprocess.
The --output_directory
flag specifies where the preprocessed graph will be written and is set by the user. In this example, assume we have not created the datasets/fb15k_237_example directory; marius_preprocess
will create it for us.
See Usage for detailed options.
Here are the contents of the output directory after preprocessing:
$ ls -l datasets/fb15k_237_example/
dataset.yaml # input dataset statistics
nodes/
node_mapping.txt # mapping of raw node ids to integer uuids
edges/
relation_mapping.txt # mapping of raw edge(relation) ids to integer uuids
test_edges.bin # preprocessed testing edge list
train_edges.bin # preprocessed training edge list
validation_edges.bin # preprocessed validation edge list
train.txt # raw training edge list
test.txt # raw testing edge list
valid.txt # raw validation edge list
text_cvsc.txt # relation triples as used in Toutanova and Chen, CVSC 2015
text_emnlp.txt # relation triples as used in Toutanova et al., EMNLP 2015
README.txt # README of the downloaded FB15K-237 dataset
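The .bin files are the binary edge lists that Marius trains on, and they can be sanity-checked directly. Below is a minimal NumPy sketch (not part of Marius itself); it assumes each edge is stored as three consecutive int32 values (source node id, relation id, destination node id), which should be verified against the Marius version you have installed.
import numpy as np

# Load the preprocessed training edges.
# Assumption: each edge is three consecutive int32 values
# [source node id, relation id, destination node id]; check this
# layout against your installed Marius version.
edges = np.fromfile("datasets/fb15k_237_example/edges/train_edges.bin", dtype=np.int32)
edges = edges.reshape(-1, 3)

print(edges.shape)  # (num_train_edges, 3)
print(edges[:5])    # first five [src, rel, dst] triples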
List of built-in datasets
# node classification
ogbn_arxiv
ogbn_products
ogbn_papers100m
ogb_mag240m
# link prediction
fb15k
fb15k_237
livejournal
twitter
freebase86m
ogbl_wikikg2
ogbl_citation2
ogbl_ppa
ogb_wikikg90mv2
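Any of these names can be passed to --dataset. For example, preprocessing the ogbn_arxiv node classification dataset works the same way (the output directory below is illustrative):
$ marius_preprocess --dataset ogbn_arxiv --output_directory datasets/ogbn_arxiv_example/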
Custom datasets
Datasets in delimited file formats such as CSV can be preprocessed with marius_preprocess.
See this example.
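As a sketch of a custom preprocessing run, the command below preprocesses a hypothetical comma-delimited edge list whose first three columns hold the src node, edge-type, and dst node, and splits it into train/validation/test fractions. The file path is made up, and the assumption that --dataset_split takes fractions in train/validation/test order should be checked against the Usage section below.
$ marius_preprocess --edges my_graph/edges.csv --delim "," --columns 0 1 2 --dataset_split 0.9 0.05 0.05 --output_directory datasets/my_graph/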
Usage
usage: marius_preprocess [-h] [--output_directory output_directory] [--edges edges [edges ...]] [--dataset dataset] [--num_partitions num_partitions] [--partitioned_eval] [--delim delim]
[--dataset_split dataset_split [dataset_split ...]] [--overwrite] [--spark] [--no_remap_ids] [--columns [columns [columns ...]]]
Preprocess built-in datasets and custom link prediction datasets
optional arguments:
-h, --help show this help message and exit
--output_directory output_directory
Directory to put graph data
--edges edges [edges ...]
File(s) containing the edge list(s) for a custom dataset
--dataset dataset Name of dataset to preprocess
--num_partitions num_partitions
Number of node partitions
--partitioned_eval If true, the validation and/or the test set will be partitioned.
--delim delim, -d delim
Delimiter to use for delimited file inputs
--dataset_split dataset_split [dataset_split ...], -ds dataset_split [dataset_split ...]
Split dataset into specified fractions
--overwrite If true, the preprocessed dataset will be overwritten if it already exists
--spark If true, pyspark will be used to perform the preprocessing
--no_remap_ids If true, the node ids of the input dataset will not be remapped to random integer ids.
--columns [columns [columns ...]]
List of column ids of input delimited files which
denote the src node, edge-type, and dst node of edges.
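Several of these flags are often combined. As an illustrative sketch, the command below preprocesses FB15K-237 into eight node partitions and also partitions the validation and test sets for partitioned evaluation (the partition count is arbitrary):
$ marius_preprocess --dataset fb15k_237 --num_partitions 8 --partitioned_eval --output_directory datasets/fb15k_237_partitioned/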