SKIL Documentation

Skymind Intelligence Layer

The community edition of the Skymind Intelligence Layer (SKIL) is free. It takes data science projects from prototype to production quickly and easily. SKIL bridges the gap between the Python ecosystem and the JVM with a cross-team platform for Data Scientists, Data Engineers, and DevOps/IT. It is an automation tool for machine-learning workflows that enables easy training on Spark-GPU clusters, experiment tracking, one-click deployment of trained models, model performance monitoring and more.

Multi-Server Requirements

SKIL works in both single-node and multi-node configurations. In a multi-node setup you can leverage different machines to scale out model serving or training to fit the needs of your business.

SKIL Training and Inference Clusters

SKIL training clusters let different groups in your organization share compute resources for deep learning, and they provide a consistent framework for applications to access the trained models.

Training deep learning models is compute-intensive and sometimes memory-intensive, so higher-performance systems are recommended. These nodes usually have large amounts of RAM, along with GPUs, a high CPU core count, or both. SKIL can be configured to work inside a Hadoop cluster and take advantage of Spark and HDFS for training or for high-performance batch inference.

  • 64-128GB of RAM (dedicated for training)
  • 500GB-1TB disk space (SSD recommended)
  • 1-8 x NVIDIA Tesla P100 or V100 GPUs
  • 10Gbps Ethernet or Fibre Channel network connections

SKIL Inference Clusters

Inference clusters are optimized for serving your models for scoring through a simple REST API. They can also run transformations and perform KNN lookups, depending on your application's needs. You can configure SKIL to run in an inference-only mode and scale it up to meet your performance targets.

The specs required for an inference cluster depend on the complexity of the models being served. Larger models may require GPUs to ensure adequate response times for model scoring. For less complex models, a large number of smaller CPU-only machines can be sufficient. Determining the correct node size for your cluster is not covered in detail here, but here are a couple of typical configurations:

CPU-only cluster nodes:

  • Quad-core Processors
  • 16-128GB of RAM
  • Minimum 1Gbps network connection
  • 100GB-1TB disk space

Typical GPU cluster:

  • Quad-core Processors
  • 64-128GB of RAM
  • Minimum 1Gbps network connection
  • 500GB-1TB disk space
  • 1-4 x NVIDIA Tesla P100 or V100 GPUs
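
When planning a cluster, it can help to turn the figures above into a quick check against candidate machines. The following is a minimal Python sketch, not part of SKIL: the `NodeProfile` class, the `shortfalls` function, and the profile names are hypothetical, and the minimums are simply the lower ends of the ranges listed above. It assumes the 10Gbps Ethernet figure for training nodes and does not check CPU core counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    """Lower bounds for one node type (hypothetical helper, not a SKIL API)."""
    min_ram_gb: int
    min_disk_gb: int
    min_network_gbps: int
    min_gpus: int


# Lower ends of the ranges listed above; the profile names are illustrative.
PROFILES = {
    "training": NodeProfile(min_ram_gb=64, min_disk_gb=500,
                            min_network_gbps=10, min_gpus=1),
    "inference-cpu": NodeProfile(min_ram_gb=16, min_disk_gb=100,
                                 min_network_gbps=1, min_gpus=0),
    "inference-gpu": NodeProfile(min_ram_gb=64, min_disk_gb=500,
                                 min_network_gbps=1, min_gpus=1),
}


def shortfalls(profile, ram_gb, disk_gb, network_gbps, gpus):
    """Return a message for every resource below the profile's minimum."""
    checks = [
        ("RAM (GB)", ram_gb, profile.min_ram_gb),
        ("disk (GB)", disk_gb, profile.min_disk_gb),
        ("network (Gbps)", network_gbps, profile.min_network_gbps),
        ("GPUs", gpus, profile.min_gpus),
    ]
    return [
        f"{label}: have {have}, need at least {need}"
        for label, have, need in checks
        if have < need
    ]


if __name__ == "__main__":
    # A machine with plenty of RAM and GPUs but only a 1Gbps link.
    for problem in shortfalls(PROFILES["training"], ram_gb=128,
                              disk_gb=1000, network_gbps=1, gpus=2):
        print(problem)
```

An empty list means the machine meets every minimum for that profile. Treat the result as a starting point only, since the right node size still depends on the models being served.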
