Command Line Preprocessing
================================

The preprocessing procedure takes datasets in their raw format and converts them to the input format required by Marius.

Built-in datasets
-----------------------

Preprocessing the FB15K-237 knowledge graph:

.. code-block:: bash

   $ marius_preprocess --dataset fb15k_237 --output_directory datasets/fb15k_237_example/
   Downloading FB15K-237.2.zip to datasets/fb15k_237_example/FB15K-237.2.zip
   Reading edges
   Remapping Edges
   Node mapping written to: datasets/fb15k_237_example/nodes/node_mapping.txt
   Relation mapping written to: datasets/fb15k_237_example/edges/relation_mapping.txt
   Dataset statistics written to: datasets/fb15k_237_example/dataset.yaml

The ``--dataset`` flag specifies which of the built-in datasets ``marius_preprocess`` will download and preprocess.

The ``--output_directory`` flag specifies where the preprocessed graph will be written and is set by the user. In this example, assume we have not created the ``datasets/fb15k_237_example`` directory; ``marius_preprocess`` will create it for us.

See `Usage`_ for detailed options.

Here are the contents of the output directory after preprocessing:

.. code-block:: bash

   $ ls -l datasets/fb15k_237_example/
   dataset.yaml                       # input dataset statistics
   nodes/
     node_mapping.txt                 # mapping of raw node ids to integer uuids
   edges/
     relation_mapping.txt             # mapping of raw edge (relation) ids to integer uuids
     test_edges.bin                   # preprocessed testing edge list
     train_edges.bin                  # preprocessed training edge list
     validation_edges.bin             # preprocessed validation edge list
   train.txt                          # raw training edge list
   test.txt                           # raw testing edge list
   valid.txt                          # raw validation edge list
   text_cvsc.txt                      # relation triples as used in Toutanova and Chen CVSM-2015
   text_emnlp.txt                     # relation triples as used in Toutanova et al. EMNLP-2015
   README.txt                         # README of the downloaded FB15K-237 dataset

List of built-in datasets:

.. code-block:: text

   # node classification
   ogbn_arxiv
   ogbn_products
   ogbn_papers100m
   ogb_mag240m

   # link prediction
   fb15k
   fb15k_237
   livejournal
   twitter
   freebase86m
   ogbl_wikikg2
   ogbl_citation2
   ogbl_ppa
   ogb_wikikg90mv2

Custom datasets
-----------------------

.. _custom_dataset_example: http://marius-project.org/marius/examples/config/lp_custom.html#preprocess-dataset

Datasets in delimited file formats such as CSVs can be preprocessed with ``marius_preprocess``. See this `example <custom_dataset_example_>`_.

Usage
-----------------------

.. code-block:: text

   usage: marius_preprocess [-h] [--output_directory output_directory]
                            [--edges edges [edges ...]] [--dataset dataset]
                            [--num_partitions num_partitions] [--partitioned_eval]
                            [--delim delim]
                            [--dataset_split dataset_split [dataset_split ...]]
                            [--overwrite] [--spark] [--no_remap_ids]

   Preprocess built-in datasets and custom link prediction datasets

   optional arguments:
     -h, --help            show this help message and exit
     --output_directory output_directory
                           Directory to put graph data
     --edges edges [edges ...]
                           File(s) containing the edge list(s) for a custom dataset
     --dataset dataset     Name of dataset to preprocess
     --num_partitions num_partitions
                           Number of node partitions
     --partitioned_eval    If true, the validation and/or the test set will be partitioned.
     --delim delim, -d delim
                           Delimiter to use for delimited file inputs
     --dataset_split dataset_split [dataset_split ...], -ds dataset_split [dataset_split ...]
                           Split dataset into specified fractions
     --overwrite           If true, the preprocessed dataset will be overwritten if it already exists
     --spark               If true, pyspark will be used to perform the preprocessing
     --no_remap_ids        If true, the node ids of the input dataset will not be remapped to random integer ids.
     --columns [columns [columns ...]]
                           List of column ids of input delimited files which denote the src node, edge-type, and dst node of edges.
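
The custom-dataset flags in the usage text can be combined roughly as in the sketch below. It is an illustration rather than an official recipe: the toy edge list, the temporary paths, and the variable names ``workdir``, ``edges_file`` and ``out_dir`` are all made up for this example, and only flags listed in the usage text are used. The script prints the assembled command and runs it only if ``marius_preprocess`` is found on the ``PATH``.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch locations, chosen only for this illustration.
workdir="$(mktemp -d)"
edges_file="${workdir}/toy_edges.csv"
out_dir="${workdir}/toy_dataset"

# A toy comma-delimited edge list: one "src,relation,dst" triple per line.
cat > "${edges_file}" <<'EOF'
alice,knows,bob
bob,knows,carol
carol,works_with,alice
dave,knows,alice
EOF

# Assemble the command from flags documented in the usage text.
# --columns 0 1 2 names the src node, edge-type and dst node columns.
set -- marius_preprocess \
    --edges "${edges_file}" \
    --output_directory "${out_dir}" \
    --delim "," \
    --columns 0 1 2

# Show the command that would be executed.
echo "$@"

# Run it only when the tool is actually installed.
if command -v marius_preprocess >/dev/null 2>&1; then
    "$@"
fi
```

Writing the command into the positional parameters keeps each argument as a separate word, so paths containing spaces are still passed intact.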