Introduction
Marius is a system for scaling graph learning on a single machine. Marius supports training and evaluation of GNNs and graph embedding models for link prediction or node classification. See our papers Marius and Marius++ for technical details.
Feature Overview
Billion-scale link prediction and node classification training and evaluation
High performance configuration-file based execution
PyTorch compatible Python API for custom training and evaluation routines
Define a 3-layer GraphSage model in Python
import torch
import marius as m

# Sample all neighbors at each of the three layers (-1 = no sampling cap)
nbr_sampler = m.nn.LayeredNeighborSampler([-1, -1, -1])

feature_dim = 128
num_classes = 40
device = torch.device("cuda")

feature_layer = m.nn.layers.FeatureLayer(dimension=feature_dim,
                                         device=device)
graph_sage_layer1 = m.nn.layers.GraphSageLayer(input_dim=feature_dim,
                                               output_dim=feature_dim,
                                               device=device)
graph_sage_layer2 = m.nn.layers.GraphSageLayer(input_dim=feature_dim,
                                               output_dim=feature_dim,
                                               device=device)
graph_sage_layer3 = m.nn.layers.GraphSageLayer(input_dim=feature_dim,
                                               output_dim=num_classes,
                                               device=device)

encoder = m.encoders.GeneralEncoder(layers=[[feature_layer],
                                            [graph_sage_layer1],
                                            [graph_sage_layer2],
                                            [graph_sage_layer3]])
decoder = m.nn.decoders.node.NoOpNodeDecoder()
loss = m.nn.CrossEntropyLoss(reduction="sum")

model = m.nn.Model(encoder, decoder, loss)
model.optimizers = [m.nn.AdamOptimizer(model.named_parameters(),
                                       lr=0.01)]
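The `LayeredNeighborSampler([-1, -1, -1])` above configures neighborhood sampling for the three GNN layers, where -1 means all neighbors are kept at that hop. As a standalone sketch of that layered fan-out idea (plain Python, not the Marius implementation; the adjacency dict and `sample_layers` helper here are hypothetical illustrations):

```python
import random

def sample_layers(adj, seeds, fanouts):
    """Expand a seed set hop by hop; a fanout of -1 keeps all neighbors."""
    frontier = set(seeds)
    visited = set(seeds)
    for fanout in fanouts:
        nxt = set()
        for node in frontier:
            nbrs = adj.get(node, [])
            if fanout != -1 and len(nbrs) > fanout:
                nbrs = random.sample(nbrs, fanout)  # cap neighbors at this hop
            nxt.update(nbrs)
        frontier = nxt - visited
        visited |= nxt
    return visited

adj = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: [5]}
print(sorted(sample_layers(adj, [0], [-1, -1, -1])))  # → [0, 1, 2, 3, 4, 5]
```

With fixed fanouts instead of -1, the sampled subgraph stays bounded regardless of node degree, which is what makes per-layer sampling attractive on large graphs.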
Or define the same model with a YAML configuration
model:
  learning_task: node_classification
  encoder:
    train_neighbor_sampling:
      - type: all
      - type: all
      - type: all
    layers:
      - - type: feature
          output_dim: 128
      - - type: gnn
          options:
            type: graph_sage
          input_dim: 128
          output_dim: 128
      - - type: gnn
          options:
            type: graph_sage
          input_dim: 128
          output_dim: 128
      - - type: gnn
          options:
            type: graph_sage
          input_dim: 128
          output_dim: 40
  decoder:
    type: node
  loss:
    type: cross_entropy
    options:
      reduction: sum
  dense_optimizer:
    type: adam
    options:
      learning_rate: 0.01
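Configuration-file execution is driven from the command line. Assuming the YAML above is saved as `nc_config.yaml` (the filename here is just an example), training and evaluation would be launched with the Marius executables:

```shell
# train with the configuration file, then evaluate the trained model
marius_train nc_config.yaml
marius_eval nc_config.yaml
```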
Preprocessing
Performant preprocessing of raw datasets in CSV format
13 built-in datasets for link prediction or node classification
Custom dataset support
Training & Evaluation
CPU-GPU pipeline to mitigate data movement overheads
Optimized neighborhood sampling and data structures for GNN aggregation
Scale beyond CPU memory with a partition buffer
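The partition buffer keeps only a subset of node-embedding partitions in memory at a time, swapping partitions in from disk as training touches them. A minimal in-memory sketch of that buffering idea (plain Python with a hypothetical `load_partition` stand-in; the real Marius buffer additionally orders swaps to maximize partition reuse):

```python
from collections import OrderedDict

class PartitionBuffer:
    """Hold at most `capacity` partitions in memory, evicting the least recently used."""
    def __init__(self, capacity, load_partition):
        self.capacity = capacity
        self.load_partition = load_partition  # fetches a partition from storage
        self.buffer = OrderedDict()
        self.misses = 0

    def get(self, pid):
        if pid in self.buffer:
            self.buffer.move_to_end(pid)         # mark as recently used
        else:
            self.misses += 1
            if len(self.buffer) >= self.capacity:
                self.buffer.popitem(last=False)  # evict the LRU partition
            self.buffer[pid] = self.load_partition(pid)
        return self.buffer[pid]

# Toy "storage": partition id -> list of embeddings
buf = PartitionBuffer(2, load_partition=lambda pid: [pid] * 4)
for pid in [0, 1, 0, 2, 0]:
    buf.get(pid)
print(buf.misses)  # → 3 (partitions 0, 1, 2 each loaded from storage once)
```

Because only `capacity` partitions are resident, total embedding storage can exceed CPU memory; the cost is the disk traffic from buffer misses, which the swap ordering tries to minimize.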
Supported Input Graphs
Formats: CSV/TSV files, PyTorch tensors, NumPy arrays
Graphs with or without edge-types or node features
Scales to graphs with billions of edges and hundreds of millions of nodes
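For a sense of what these input formats look like, here is a toy edge list as a NumPy array and the equivalent CSV text (the column layout shown, source/edge-type/destination, is an illustrative assumption; consult the preprocessing docs for the exact expected columns):

```python
import numpy as np

# Toy edge list: one row per edge. With edge types, each row is
# (source, edge_type, destination); without types, just (source, destination).
edges_typed = np.array([[0, 0, 1],
                        [1, 1, 2],
                        [2, 0, 0]], dtype=np.int64)

# The same graph serialized as a CSV edge list, one edge per line
csv_lines = "\n".join(",".join(map(str, row)) for row in edges_typed)
print(csv_lines)
```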
Supported Models
Tasks: Link prediction, node classification
GNN layers: GraphSage, GCN, RGCN, GAT
Link prediction decoders: ComplEx, DistMult, TransE
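Link-prediction decoders score a triple (subject, relation, object) from learned embeddings. For example, DistMult scores a triple as the sum of the elementwise product of the three embeddings. A minimal NumPy sketch of that scoring rule (standalone illustration, not the Marius decoder API):

```python
import numpy as np

def distmult_score(e_s, w_r, e_o):
    """DistMult: elementwise product of subject, relation, object embeddings, summed."""
    return float(np.sum(e_s * w_r * e_o))

rng = np.random.default_rng(0)
dim = 8
e_s = rng.normal(size=dim)          # subject embedding
w_r = rng.normal(size=dim)          # relation embedding
candidates = rng.normal(size=(5, dim))  # candidate object embeddings

# Link prediction: rank candidate objects by score (vectorized DistMult)
scores = candidates @ (e_s * w_r)
best = int(np.argmax(scores))
```

DistMult's score is symmetric in subject and object, which is one reason decoders like ComplEx and TransE, which can model asymmetric relations, are also supported.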
Upcoming Features
Configuration file optimizer and generator (in testing)
SQL database to graph conversion tool (in testing)
Multi-GPU training (in progress)
Model checkpointing (in progress)
KNN inference module
marius_preprocess parquet file support
Remote storage for graph data and embeddings
Additional encoder layers and decoder layers