SKIL Documentation

Skymind Intelligence Layer

The community edition of the Skymind Intelligence Layer (SKIL) is free. It takes data science projects from prototype to production quickly and easily. SKIL bridges the gap between the Python ecosystem and the JVM with a cross-team platform for Data Scientists, Data Engineers, and DevOps/IT. It is an automation tool for machine-learning workflows that enables easy training on Spark-GPU clusters, experiment tracking, one-click deployment of trained models, model performance monitoring and more.

Transforming Data

In practice, data rarely arrives in a form that neural networks can use directly. It is usually a mixture of strings, categories, numbers, and images in different formats, and most of the time it is not normalized. Quite often a data scientist must spend more than 50% of their time cleaning and transforming data before it is ready to feed into a deep neural network.

What is a Transform?

Data transformation is a broad term for converting data into a form that is appropriate for a given input function. In a deep learning context, this often means taking a raw file such as a CSV, dropping columns or computing new ones, grouping rows by a common key (when creating sequences), normalizing the data (usually statistically), and finally vectorizing the inputs into NDArrays.
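The steps listed above can be sketched in plain Scala, without DataVec, to show what each stage does to the data. This is only an illustration under stated assumptions: the three sample rows, the decision to drop the fourth column, and the names `TransformSketch` and `standardize` are hypothetical and are not part of SKIL or DataVec.

```scala
// Illustrative sketch only: plain Scala, no DataVec. All names and rows are hypothetical.
object TransformSketch {
  val raw = Seq(
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "7.0,3.2,4.7,1.4,Iris-versicolor",
    "6.3,3.3,6.0,2.5,Iris-virginica"
  )
  val labels = Seq("Iris-setosa", "Iris-versicolor", "Iris-virginica")

  // Standardize one column to zero mean and unit (population) standard deviation.
  // A constant column would give a zero deviation; this sketch does not guard against that.
  def standardize(col: Seq[Double]): Seq[Double] = {
    val mean = col.sum / col.size
    val std = math.sqrt(col.map(v => math.pow(v - mean, 2)).sum / col.size)
    col.map(v => (v - mean) / std)
  }

  // 1. Parse each CSV line into its fields.
  val parsed = raw.map(_.split(",").toSeq)
  // 2. Drop a column (index 3) by keeping only the first three numeric fields.
  val features = parsed.map(r => r.take(3).map(_.toDouble))
  // 3. Convert the categorical label to its integer index.
  val labelIdx = parsed.map(r => labels.indexOf(r(4)))
  // 4. Normalize column by column, then transpose back into one vector per row.
  val vectors = features.transpose.map(standardize).transpose.map(_.toArray)
}
```

Calling `TransformSketch.vectors.foreach(v => println(v.mkString(",")))` prints one normalized feature vector per input row. In SKIL, the DataVec classes described below take the place of this hand-written code.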

SKIL supports the transform implementation provided by DataVec. These transforms can also be serialized to JSON.

Types of Transforms

In SKIL, we can build and deploy two types of transforms.

CSV Transforms

CSV transforms work with textual CSV data. They require a Schema to be built first, which specifies the name and type of every column in the data source. Once the Schema is defined, you can define the transform processes you want to apply to the data. You can remove columns, change their data type, convert them to categorical values, and normalize the data.

Image Transforms

Image transforms work on images. They are used for scaling, filtering, and cropping images through image transform processes. They take an image as input, apply a list of transforms to it in order, and return the transformed image.

Transformation Workflows

You can use any Zeppelin notebook in a SKIL workspace experiment for building your own transforms.

CSV Transforms

We can build a CSV transform in the following way:

Creating a Schema

Suppose a row of the example data looks like this: 5.1,3.5,1.4,0.2,Iris-setosa, where the comma-separated values are the sepal length, sepal width, petal length, petal width, and the category label. The Schema then looks like this:

import org.datavec.api.transform.schema.Schema

val schema = new Schema.Builder()
            .addColumnsDouble("Sepal length", "Sepal width", "Petal length", "Petal width") // Defining columns of type "Double"
            .addColumnCategorical("Species", "Iris-setosa", "Iris-versicolor", "Iris-virginica") // Defining categorical data - Column name followed by the category elements
            .build();
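Before defining any transforms, it can help to confirm that the Schema describes the data as intended. The sketch below rebuilds the same Schema under the hypothetical name `irisSchema` so that it stands alone, then prints a few of its properties. It assumes the DataVec API is on the classpath, as in a SKIL Zeppelin notebook.

```scala
import org.datavec.api.transform.schema.Schema

// Same columns as the Schema above; `irisSchema` is a hypothetical name for this sketch.
val irisSchema = new Schema.Builder()
  .addColumnsDouble("Sepal length", "Sepal width", "Petal length", "Petal width")
  .addColumnCategorical("Species", "Iris-setosa", "Iris-versicolor", "Iris-virginica")
  .build()

println(irisSchema.numColumns())     // Number of columns in the Schema
println(irisSchema.getColumnNames()) // Column names, in declaration order
println(irisSchema.getType(4))       // Column type at index 4 ("Species")
println(irisSchema)                  // Summary of every column
```

Column indices are zero-based, so index 4 refers to the fifth column, "Species".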

Defining a TransformProcess

If we want to convert the category labels into their integer indices, we can do that in the following way:

import org.datavec.api.transform.TransformProcess

val tp = new TransformProcess.Builder(schema) // Starting from the initial created Schema
            .categoricalToInteger("Species") // Converting the categories to their respective integer values
            .build();

For more details, see the DataVec Javadocs for TransformProcess.
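A quick way to check a TransformProcess without Spark is to apply it to a single hand-written record. The sketch below assumes the DataVec API is on the classpath; the names `demoSchema`, `demoTp`, and `record` are hypothetical, and the record reuses the example row shown earlier.

```scala
import java.util.Arrays
import org.datavec.api.transform.TransformProcess
import org.datavec.api.transform.schema.Schema
import org.datavec.api.writable.{DoubleWritable, Text, Writable}

// Schema and TransformProcess rebuilt here so the sketch stands alone.
val demoSchema = new Schema.Builder()
  .addColumnsDouble("Sepal length", "Sepal width", "Petal length", "Petal width")
  .addColumnCategorical("Species", "Iris-setosa", "Iris-versicolor", "Iris-virginica")
  .build()

val demoTp = new TransformProcess.Builder(demoSchema)
  .categoricalToInteger("Species")
  .build()

// One record matching the Schema: four doubles followed by the category label.
val record = Arrays.asList[Writable](
  new DoubleWritable(5.1), new DoubleWritable(3.5),
  new DoubleWritable(1.4), new DoubleWritable(0.2),
  new Text("Iris-setosa"))

val transformed = demoTp.execute(record) // Applies every transform step to this one record
println(transformed)
println(demoTp.getFinalSchema())         // Schema of the data after all transforms
```

Categories are numbered in the order they were declared in the Schema, so the first category maps to 0.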

Executing Transforms

To execute a transform process, you need to define a Spark job. This is done in the following way:

/* Downloading the data */
import org.apache.commons.io.FileUtils

import java.io.File
import java.net.URL

val filename = "/tmp/iris.data"

val url = new URL("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

val irisText = new File(filename)
if (!irisText.exists()){
    FileUtils.copyURLToFile(url, irisText)
}

/* ---------------------------------------------------------------------- */

/* Executing the CSV transform */
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.api.java.JavaSparkContext
import org.datavec.api.writable.Writable
import org.datavec.api.records.reader.impl.csv.CSVRecordReader
import org.datavec.spark.transform.SparkTransformExecutor
import org.datavec.spark.transform.misc.StringToWritablesFunction

val jsc = JavaSparkContext.fromSparkContext(sc) // Creating a Java Spark context from Zeppelin's Spark context (sc)
val stringData = jsc.textFile(filename) // Reading the data as JavaRDD[String]

val rr = new CSVRecordReader()
val parsedInputData = stringData.filter((line: String) => !(line.isEmpty())).toJavaRDD().map(new StringToWritablesFunction(rr)) // Dropping empty lines, then parsing each line into a List of Writables

val processedData = SparkTransformExecutor.execute(parsedInputData, tp) // Executing the transform process

/* ---------------------------------------------------------------------- */

/* Viewing data */
import org.datavec.spark.transform.misc.WritablesToStringFunction
import scala.collection.JavaConverters._ // For explicit conversions from Java lists to Scala collections

val processedAsString = processedData.map(new WritablesToStringFunction(",")) // Converting each record (List of Writables) to a comma-delimited String
val inputDataParsed = processedAsString.collect() // Executing and collecting the processed input data

inputDataParsed.asScala.foreach(println) // Printing data

Executing Transforms Is Not Necessary

We executed the transform above only to show the basic transformation workflow. You don't have to execute a transform process in order to save and deploy it. At minimum, you can define the TransformProcess or ImageTransformProcess and save it. Only the transform process JSON is needed for deployment.

Saving Transforms

Our transform is ready to be saved now. You can use the TransformProcess#toJson function to save your implemented transform.

import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets

val transformProcessJson = tp.toJson()

Files.write(Paths.get("/tmp/transformProcess.json"), transformProcessJson.getBytes(StandardCharsets.UTF_8)) // Saving the transform process
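To confirm that the saved JSON is usable, it can be read back and turned into a TransformProcess again. The sketch below is self-contained and makes several assumptions: the DataVec API is on the classpath, the file is written to a temporary location rather than to /tmp/transformProcess.json, and the names `savedTp`, `jsonPath`, and `restoredTp` are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.datavec.api.transform.TransformProcess
import org.datavec.api.transform.schema.Schema

// Rebuild the example TransformProcess so the sketch stands alone.
val savedSchema = new Schema.Builder()
  .addColumnsDouble("Sepal length", "Sepal width", "Petal length", "Petal width")
  .addColumnCategorical("Species", "Iris-setosa", "Iris-versicolor", "Iris-virginica")
  .build()
val savedTp = new TransformProcess.Builder(savedSchema)
  .categoricalToInteger("Species")
  .build()

// Write the JSON to a temporary file (hypothetical location for this sketch).
val jsonPath = Files.createTempFile("transformProcess", ".json")
Files.write(jsonPath, savedTp.toJson().getBytes(StandardCharsets.UTF_8))

// Read the file back and rebuild the TransformProcess from its JSON.
val loadedJson = new String(Files.readAllBytes(jsonPath), StandardCharsets.UTF_8)
val restoredTp = TransformProcess.fromJson(loadedJson)

println(restoredTp.getFinalSchema().getColumnNames()) // Should list the same columns as savedTp
```

Comparing the final schemas of the original and restored processes is a simple sanity check before deploying the JSON.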

Image Transforms

An image transform process is defined by a list of ImageTransform steps that are applied in order.

Defining an ImageTransformProcess

An image transform process can be built as follows:

import org.datavec.image.transform.ImageTransformProcess

val itp = new ImageTransformProcess.Builder()
    .seed(12345) // Random seed
    .cropImageTransform(10, 10, 100, 100) // Top, Left, Bottom, Right
    .build()

More details about the available ImageTransform implementations can be found in the DataVec documentation.

Saving Image Transforms

Image transform processes can be saved in the same way as CSV transform processes, using the ImageTransformProcess#toJson function.

import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets

val imageTransformProcessJson = itp.toJson()

Files.write(Paths.get("/tmp/imageTransformProcess.json"), imageTransformProcessJson.getBytes(StandardCharsets.UTF_8)) // Saving the image transform process
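As with CSV transforms, the JSON can be turned back into an object to check it before deployment. The sketch below assumes that the DataVec image module is on the classpath and that `ImageTransformProcess.fromJson` and `getTransformList` are available in your DataVec version. The names `cropItp`, `itpJson`, and `restoredItp` are hypothetical.

```scala
import org.datavec.image.transform.ImageTransformProcess

// Same crop-only process as above, rebuilt so the sketch stands alone.
val cropItp = new ImageTransformProcess.Builder()
  .seed(12345)                          // Random seed
  .cropImageTransform(10, 10, 100, 100) // Top, Left, Bottom, Right
  .build()

// Serialize to JSON, then rebuild the process from that JSON.
val itpJson = cropItp.toJson()
val restoredItp = ImageTransformProcess.fromJson(itpJson)

println(restoredItp.getTransformList().size()) // Number of transforms in the restored process
```

If the restored process contains the expected number of transforms, the JSON is likely complete enough to deploy.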