Custom Dataset Node Classification
---------------------------------------------

In this tutorial, we use the **Cora dataset** as an example to demonstrate a step-by-step walkthrough, from preprocessing the dataset, to defining the configuration file, to training **a node classification model with a 3-layer GraphSage encoder**.

1. Preprocess Dataset
^^^^^^^^^^^^^^^^^^^^^

Preprocessing a custom dataset is straightforward with the Marius Python API: it requires creating a custom dataset class of type ``NodeClassificationDataset`` or ``LinkPredictionDataset``. An example Python script which preprocesses, trains, and evaluates the Cora dataset is provided in ``examples/python/custom_nc_graphsage.py``. For detailed steps, please refer to (link).
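In outline, such a class only needs an ``__init__`` that records the output directory, a ``download()`` method that fetches the raw files, and a ``preprocess()`` method that converts them into Marius's binary format. The skeleton below is a sketch (the ``MyDataset`` name is hypothetical; the method signatures are copied from the full Cora example that follows):

.. code-block:: python

    from pathlib import Path

    from marius.tools.preprocess.dataset import NodeClassificationDataset


    class MyDataset(NodeClassificationDataset):
        def __init__(self, output_directory: Path, spark=False):
            super().__init__(output_directory, spark)
            self.dataset_name = "my_dataset"

        def download(self, overwrite=False):
            # Fetch the raw data and write the edge list, node features, labels,
            # and train/valid/test splits into self.output_directory.
            ...

        def preprocess(self, num_partitions=1, remap_ids=True, splits=None,
                       sequential_train_nodes=False, partitioned_eval=False):
            # Convert the raw files into Marius's binary format and dataset.yaml.
            ...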
Let's borrow the provided ``examples/python/custom_nc_graphsage.py`` and modify it to suit our purpose. We first ``download()`` the dataset to ``datasets/custom_nc_example/cora/`` and then ``preprocess()`` it. Note that the ``MYDATASET`` class is a child class of ``NodeClassificationDataset``:

.. code-block:: python

    import marius as m
    import torch
    from omegaconf import OmegaConf

    import numpy as np
    import pandas as pd
    from pathlib import Path

    from marius.tools.preprocess.dataset import NodeClassificationDataset
    from marius.tools.preprocess.utils import download_url, extract_file
    from marius.tools.preprocess.converters.torch_converter import TorchEdgeListConverter
    from marius.tools.preprocess.converters.spark_converter import SparkEdgeListConverter
    from marius.tools.configuration.constants import PathConstants
    from marius.tools.preprocess.datasets.dataset_helpers import remap_nodes


    def switch_to_num(row):
        names = ['Neural_Networks', 'Rule_Learning', 'Reinforcement_Learning', 'Probabilistic_Methods',
                 'Theory', 'Genetic_Algorithms', 'Case_Based']

        idx = 0
        for i in range(len(names)):
            if row == names[i]:
                idx = i
                break

        return idx


    class MYDATASET(NodeClassificationDataset):

        def __init__(self, output_directory: Path, spark=False):
            super().__init__(output_directory, spark)

            self.dataset_name = "cora"
            self.dataset_url = "http://www.cs.umd.edu/~sen/lbc-proj/data/cora.tgz"

        def download(self, overwrite=False):
            # These are the files that we want to have at the end of the download
            self.input_edge_list_file = self.output_directory / Path("edge.csv")
            self.input_node_feature_file = self.output_directory / Path("node-feat.csv")
            self.input_node_label_file = self.output_directory / Path("node-label.csv")
            self.input_train_nodes_file = self.output_directory / Path("train.csv")
            self.input_valid_nodes_file = self.output_directory / Path("valid.csv")
            self.input_test_nodes_file = self.output_directory / Path("test.csv")

            download = False
            if not self.input_edge_list_file.exists():
                download = True
            if not self.input_node_feature_file.exists():
                download = True
            if not self.input_node_label_file.exists():
                download = True
            if not self.input_train_nodes_file.exists():
                download = True
            if not self.input_valid_nodes_file.exists():
                download = True
            if not self.input_test_nodes_file.exists():
                download = True

            if download:
                archive_path = download_url(self.dataset_url, self.output_directory, overwrite)
                extract_file(archive_path, remove_input=False)

                # Read and process the raw csv
                df = pd.read_csv(self.output_directory / Path("cora/cora.content"), sep="\t", header=None)
                cols = df.columns[1:len(df.columns) - 1]

                # Shuffle the node indices and split them 80/10/10 into train/valid/test
                indices = np.array(range(len(df)))
                np.random.shuffle(indices)
                train_indices = indices[0:int(0.8 * len(df))]
                valid_indices = indices[int(0.8 * len(df)):int(0.8 * len(df)) + int(0.1 * len(df))]
                test_indices = indices[int(0.8 * len(df)) + int(0.1 * len(df)):]
                np.savetxt(self.output_directory / Path("train.csv"), train_indices, delimiter=",", fmt="%d")
                np.savetxt(self.output_directory / Path("valid.csv"), valid_indices, delimiter=",", fmt="%d")
                np.savetxt(self.output_directory / Path("test.csv"), test_indices, delimiter=",", fmt="%d")

                # Features
                features = df[cols]
                features.to_csv(index=False, sep=",", path_or_buf=self.output_directory / Path("node-feat.csv"), header=False)

                # Labels
                labels = df[df.columns[len(df.columns) - 1]]
                labels = labels.apply(switch_to_num)
                labels.to_csv(index=False, sep=",", path_or_buf=self.output_directory / Path("node-label.csv"), header=False)

                # Edges
                node_ids = df[df.columns[0]]
                dict_reverse = node_ids.to_dict()
                nodes_dict = {v: k for k, v in dict_reverse.items()}
                df_edges = pd.read_csv(self.output_directory / Path("cora/cora.cites"), sep="\t", header=None)
                df_edges.replace({0: nodes_dict, 1: nodes_dict}, inplace=True)
                df_edges.to_csv(index=False, sep=",", path_or_buf=self.output_directory / Path("edge.csv"), header=False)

        def preprocess(self, num_partitions=1, remap_ids=True, splits=None, sequential_train_nodes=False, partitioned_eval=False):
            train_nodes = np.genfromtxt(self.input_train_nodes_file, delimiter=",").astype(np.int32)
            valid_nodes = np.genfromtxt(self.input_valid_nodes_file, delimiter=",").astype(np.int32)
            test_nodes = np.genfromtxt(self.input_test_nodes_file, delimiter=",").astype(np.int32)

            converter = SparkEdgeListConverter if self.spark else TorchEdgeListConverter

            converter = converter(
                output_dir=self.output_directory,
                train_edges=self.input_edge_list_file,
                num_partitions=num_partitions,
                columns=[0, 1],
                remap_ids=remap_ids,
                sequential_train_nodes=sequential_train_nodes,
                delim=",",
                known_node_ids=[train_nodes, valid_nodes, test_nodes],
                partitioned_evaluation=partitioned_eval
            )

            dataset_stats = converter.convert()

            features = np.genfromtxt(self.input_node_feature_file, delimiter=",").astype(np.float32)
            labels = np.genfromtxt(self.input_node_label_file, delimiter=",").astype(np.int32)

            if remap_ids:
                node_mapping = np.genfromtxt(self.output_directory / Path(PathConstants.node_mapping_path), delimiter=",")
                train_nodes, valid_nodes, test_nodes, features, labels = remap_nodes(node_mapping, train_nodes, valid_nodes, test_nodes, features, labels)

            with open(self.train_nodes_file, "wb") as f:
                f.write(bytes(train_nodes))
            with open(self.valid_nodes_file, "wb") as f:
                f.write(bytes(valid_nodes))
            with open(self.test_nodes_file, "wb") as f:
                f.write(bytes(test_nodes))
            with open(self.node_features_file, "wb") as f:
                f.write(bytes(features))
            with open(self.node_labels_file, "wb") as f:
                f.write(bytes(labels))

            # update dataset yaml
            dataset_stats.num_train = train_nodes.shape[0]
            dataset_stats.num_valid = valid_nodes.shape[0]
            dataset_stats.num_test = test_nodes.shape[0]
            dataset_stats.node_feature_dim = features.shape[1]
            dataset_stats.num_classes = 40

            dataset_stats.num_nodes = dataset_stats.num_train + dataset_stats.num_valid + dataset_stats.num_test

            with open(self.output_directory / Path("dataset.yaml"), "w") as f:
                yaml_file = OmegaConf.to_yaml(dataset_stats)
                f.writelines(yaml_file)

            return


    if __name__ == '__main__':
        # initialize and preprocess the dataset
        dataset_dir = Path("datasets/custom_nc_example/cora/")  # note that we write to this directory
        dataset = MYDATASET(dataset_dir)
        if not (dataset_dir / Path("edges/train_edges.bin")).exists():
            dataset.download()
            dataset.preprocess()
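As a small aside, the ``switch_to_num`` helper at the top of the script maps Cora's seven label strings to integer class ids. If you adapt the script to your own dataset, the same mapping can be written more compactly; the sketch below is an equivalent drop-in version (not part of the provided example):

.. code-block:: python

    CORA_CLASSES = ['Neural_Networks', 'Rule_Learning', 'Reinforcement_Learning',
                    'Probabilistic_Methods', 'Theory', 'Genetic_Algorithms', 'Case_Based']


    def switch_to_num(row):
        # Return the integer id of a Cora label string; fall back to 0 for an
        # unknown label, matching the behaviour of the original helper.
        return CORA_CLASSES.index(row) if row in CORA_CLASSES else 0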
We preprocess the Cora dataset by running the following command (assuming we are in the ``marius`` root directory):

.. code-block:: bash

    $ python datasets/custom_nc_example/custom_nc_graphsage.py
    Downloading cora.tgz to cora/cora.tgz
    Reading edges
    Remapping Edges
    Node mapping written to: cora/nodes/node_mapping.txt
    Dataset statistics written to: cora/dataset.yaml

In this example, if we have not yet created the ``datasets/custom_nc_example/cora/`` directory, ``custom_nc_graphsage.py`` will create it for us. For detailed usages of the Marius Python API, please refer to (link).

Let's check what is inside the created directory:

.. code-block:: bash

    $ ls -1 datasets/custom_nc_example/cora/
    dataset.yaml             # input dataset statistics
    nodes/
      node_mapping.txt       # mapping of raw node ids to integer uuids
      features.bin           # preprocessed features list
      labels.bin             # preprocessed labels list
      test_nodes.bin         # preprocessed testing nodes list
      train_nodes.bin        # preprocessed training nodes list
      validation_nodes.bin   # preprocessed validation nodes list
    edges/
      train_edges.bin        # preprocessed edge list
    cora/                    # downloaded source files
      ...
    edge.csv                 # raw edge list
    train.csv                # raw training node list
    test.csv                 # raw testing node list
    valid.csv                # raw validation node list
    node-feat.csv            # node features
    node-label.csv           # node labels
    cora.tgz                 # downloaded Cora dataset

Let's check what is inside the generated ``dataset.yaml`` file:

.. code-block:: bash

    $ cat datasets/custom_nc_example/cora/dataset.yaml
    dataset_dir: /marius-internal/datasets/custom_nc_example/cora/
    num_edges: 5429
    num_nodes: 2708
    num_relations: 1
    num_train: 2166
    num_valid: 270
    num_test: 272
    node_feature_dim: 1433
    rel_feature_dim: -1
    num_classes: 40
    initialized: false
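The ``.bin`` files are simply the raw bytes of the NumPy arrays written by ``preprocess()``, so they can be sanity-checked against ``dataset.yaml`` with a few lines of Python. The snippet below is a sketch, assuming the float32/int32 dtypes used in the preprocessing script above:

.. code-block:: python

    import numpy as np

    base = "datasets/custom_nc_example/cora/"

    # The .bin files are raw arrays; the dtypes follow the preprocessing script above.
    features = np.fromfile(base + "nodes/features.bin", dtype=np.float32).reshape(-1, 1433)
    labels = np.fromfile(base + "nodes/labels.bin", dtype=np.int32)
    train_nodes = np.fromfile(base + "nodes/train_nodes.bin", dtype=np.int32)

    print(features.shape)     # expected: (2708, 1433), matching num_nodes and node_feature_dim
    print(labels.shape)       # expected: (2708,)
    print(train_nodes.shape)  # expected: (2166,), matching num_train in dataset.yaml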
2. Define Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To train a model, we need to define a YAML configuration file based on the information created by the preprocessing Python script. The configuration file contains information including, but not limited to, the inputs to the model, the training procedure, and the hyperparameters to optimize. Given a configuration file, Marius assembles a model based on the given parameters.

The configuration file is grouped into four sections:

* Model: Defines the architecture of the model, neighbor sampling configuration, loss, and optimizer(s).
* Storage: Specifies the input dataset and how to store the graph, features, and embeddings.
* Training: Sets options for the training procedure and hyperparameters, e.g. batch size and negative sampling.
* Evaluation: Sets options for the evaluation procedure (if any). The options here are similar to those in the training section.

For the full configuration schema, please refer to ``docs/config_interface``.

An example YAML configuration file for the Cora dataset is given in ``examples/configuration/custom_nc.yaml``. Note that ``dataset_dir`` is set to the preprocessing output directory, in our example, ``datasets/custom_nc_example/cora/``.

Let's create the same YAML configuration file for the Cora dataset from scratch. We follow the structure of the configuration file and create each of the four sections one by one. In a YAML file, indentation is used to denote nesting and all parameters are given as key-value pairs.

#. | First, we define the **model**. We begin by setting all required parameters. This includes ``learning_task``, ``encoder``, ``decoder``, and ``loss``.
   | Note that the output of the encoder is the output label vector for a given node. (E.g., for node classification with 5 classes, the output label vector from the encoder might look like this: [.05, .2, .8, .01, .03]. In this case, an argmax will return a class label of 2 for the node.) The rest of the configuration can be fine-tuned by the user.

   .. code-block:: yaml

      model:
        learning_task: NODE_CLASSIFICATION # set the learning task to node classification
        encoder:
          train_neighbor_sampling:
            - type: ALL
            - type: ALL
            - type: ALL
          layers: # define three layers of GNN of type GRAPH_SAGE
            - - type: FEATURE
                output_dim: 1433 # set to 1433 (to match "node_feature_dim=1433" in "dataset.yaml") for each layer except for the last
                bias: true
            - - type: GNN
                options:
                  type: GRAPH_SAGE
                  aggregator: MEAN
                input_dim: 1433 # set to 1433 (to match "node_feature_dim=1433" in "dataset.yaml") for each layer except for the last
                output_dim: 1433
                bias: true
            - - type: GNN
                options:
                  type: GRAPH_SAGE
                  aggregator: MEAN
                input_dim: 1433
                output_dim: 1433
                bias: true
            - - type: GNN
                options:
                  type: GRAPH_SAGE
                  aggregator: MEAN
                input_dim: 1433
                output_dim: 40 # set "output_dim" to 40 (to match "num_classes=40" in "dataset.yaml") for the last layer
                bias: true
        decoder:
          type: NODE
        loss:
          type: CROSS_ENTROPY
          options:
            reduction: SUM
        dense_optimizer:
          type: ADAM
          options:
            learning_rate: 0.01
      storage:
        # omit
      training:
        # omit
      evaluation:
        # omit

#. | Next, we set the **storage** and **dataset**. We begin by setting all required parameters. This includes ``dataset``. Here, ``dataset_dir`` is set to ``datasets/custom_nc_example/cora/``, which is the preprocessing output directory.

   .. code-block:: yaml

      model:
        # omit
      storage:
        device_type: cuda
        dataset:
          dataset_dir: datasets/custom_nc_example/cora/
        edges:
          type: DEVICE_MEMORY
          options:
            dtype: int
        features:
          type: DEVICE_MEMORY
          options:
            dtype: float
      training:
        # omit
      evaluation:
        # omit

#. Lastly, we configure **training** and **evaluation**. We begin by setting all required parameters. This includes ``num_epochs``. We set ``num_epochs: 10`` (train for 10 epochs) for this example.

   .. code-block:: yaml

      model:
        # omit
      storage:
        # omit
      training:
        batch_size: 1000
        num_epochs: 10
        pipeline:
          sync: true
      evaluation:
        batch_size: 1000
        pipeline:
          sync: true
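Before training, it can be handy to confirm that the assembled file parses as valid YAML. The snippet below is only a quick syntax check using OmegaConf (already a dependency of the preprocessing script above), not Marius's own configuration validation; it assumes the file was saved as ``datasets/custom_nc_example/cora/custom_nc.yaml``, the path used by the training command below:

.. code-block:: python

    from omegaconf import OmegaConf

    # Load the configuration we just wrote and print a few fields as a quick
    # syntax/structure check (assumed save path; adjust if you saved it elsewhere).
    config = OmegaConf.load("datasets/custom_nc_example/cora/custom_nc.yaml")

    print(config.model.learning_task)          # NODE_CLASSIFICATION
    print(config.storage.dataset.dataset_dir)  # datasets/custom_nc_example/cora/
    print(config.training.num_epochs)          # 10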
3. Train Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After defining our configuration file, training is run with the ``marius_train`` executable, which takes the configuration file as its argument. We can now train our example using the configuration file we just created by running the following command (assuming we are in the ``marius`` root directory):

.. code-block:: bash

    $ marius_train datasets/custom_nc_example/cora/custom_nc.yaml
    [2022-04-05 18:41:44.987] [info] [marius.cpp:45] Start initialization
    [04/05/22 18:41:49.122] Initialization Complete: 4.134s
    [04/05/22 18:41:49.135] ################ Starting training epoch 1 ################
    [04/05/22 18:41:49.161] Nodes processed: [1000/2166], 46.17%
    [04/05/22 18:41:49.180] Nodes processed: [2000/2166], 92.34%
    [04/05/22 18:41:49.199] Nodes processed: [2166/2166], 100.00%
    [04/05/22 18:41:49.199] ################ Finished training epoch 1 ################
    [04/05/22 18:41:49.199] Epoch Runtime: 63ms
    [04/05/22 18:41:49.199] Nodes per Second: 34380.953
    [04/05/22 18:41:49.199] Evaluating validation set
    [04/05/22 18:41:49.213] =================================
    Node Classification: 270 nodes evaluated
    Accuracy: 12.962963%
    =================================
    [04/05/22 18:41:49.213] Evaluating test set
    [04/05/22 18:41:49.221] =================================
    Node Classification: 272 nodes evaluated
    Accuracy: 16.176471%
    =================================

After running this configuration for 10 epochs, we should see a result similar to the one below, with accuracy roughly equal to 86%:

.. code-block:: bash

    =================================
    [04/05/22 18:41:49.820] ################ Starting training epoch 10 ################
    [04/05/22 18:41:49.833] Nodes processed: [1000/2166], 46.17%
    [04/05/22 18:41:49.854] Nodes processed: [2000/2166], 92.34%
    [04/05/22 18:41:49.872] Nodes processed: [2166/2166], 100.00%
    [04/05/22 18:41:49.872] ################ Finished training epoch 10 ################
    [04/05/22 18:41:49.872] Epoch Runtime: 51ms
    [04/05/22 18:41:49.872] Nodes per Second: 42470.59
    [04/05/22 18:41:49.872] Evaluating validation set
    [04/05/22 18:41:49.883] =================================
    Node Classification: 270 nodes evaluated
    Accuracy: 84.814815%
    =================================
    [04/05/22 18:41:49.883] Evaluating test set
    [04/05/22 18:41:49.891] =================================
    Node Classification: 272 nodes evaluated
    Accuracy: 88.970588%
    =================================

Let's check again what was added in the ``datasets/custom_nc_example/cora/`` directory. For clarity, we only list the files that were created during training. Notice that several files have been created, including the trained model, the embedding table, a full configuration file, and output logs:

.. code-block:: bash

    $ ls -1 datasets/custom_nc_example/cora/
    model.pt                 # contains the dense model parameters, including the GNN parameters
    model_state.pt           # optimizer state of the trained model parameters
    full_config.yaml         # detailed config generated based on the user-defined config
    metadata.csv             # metadata about the training run
    logs/                    # logs containing output, error, and debug information
    nodes/
      ...
    edges/
      ...
    ...

.. note::
   ``model.pt`` contains the dense model parameters. For GNN encoders, this file will include the GNN parameters.

4. Inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^

4.1 Command Line
""""""""""""""""

4.2 Load Into Python
""""""""""""""""""""