Deep Learning Workflow


Deep Learning (DL) workflow

Both DL training and inference are computation-intensive in their own ways. On the training side, feeding a DNN large amounts of data places heavy demands on GPU computing and may require more, or more efficient, compute units. On the inference side, minimizing latency is the challenge: the system has to make decisions in real time.


Training and inference are usually carried out on two separate systems: training of deep neural networks is typically done on GPUs, while inference is typically done on CPUs. However, in some specific cases, such as video games, training and inference run on the same system, which can be more efficient because it allows the model to learn continuously.[1]


The main workflow for many data scientists today is:

  1. Create a model, such as a deep neural network, and establish all of its hyper-parameters.
  2. Train the deep neural network using a GPU.
  3. Save the weights that training on the GPU established so that the model can be deployed.
  4. Code the model into a production application with the optimal weights found in training (see the sketch after this list).
  • Neural Network: Artificial neural networks are computing systems inspired by the biological neural networks found in human and other animal brains, in which nodes (artificial neurons) are connected by links (artificial synapses) and work together.
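
A minimal PyTorch sketch of these four steps, assuming a small feed-forward network, synthetic stand-in data, and an illustrative weights file name (none of these details come from the article itself):

  import torch
  import torch.nn as nn

  # 1. Create a model and establish its hyper-parameters (illustrative values)
  hidden_size, learning_rate, epochs = 128, 1e-3, 5
  model = nn.Sequential(nn.Linear(784, hidden_size), nn.ReLU(), nn.Linear(hidden_size, 10))

  # 2. Train the network on a GPU when one is available
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.to(device)
  optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
  loss_fn = nn.CrossEntropyLoss()

  # Synthetic stand-in for a real training set
  inputs = torch.randn(1024, 784)
  labels = torch.randint(0, 10, (1024,))

  for epoch in range(epochs):
      optimizer.zero_grad()
      loss = loss_fn(model(inputs.to(device)), labels.to(device))
      loss.backward()
      optimizer.step()

  # 3. Save the weights established during training so the model can be deployed
  torch.save(model.state_dict(), "model_weights.pt")

  # 4. A production application later reloads these weights for inference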

Key elements to look for in DL training infrastructure

The training phase learns a new capability from existing data, building a data-specific neural network. During this phase, a large amount of data is fed to the GPU for model training, and the GPU accelerates training through its parallel computing capabilities. Training data is typically stored on local storage devices such as hard drives or solid-state drives and reaches the GPU through the host system.

  • The more nodes and the more numerical precision you can build into your cluster, the faster and more accurately training will be completed.
  • Training often requires the incremental addition of new data sets, which must remain clean and well structured. Huge training datasets require massive networking and storage capability to hold and transfer the data, especially if the data is image-based (see the data-loading sketch after this list).
  • Cluster scalability is one of the most important features, since doubling the amount of training data can demand a disproportionate, even exponential, expansion of compute, network, and storage capacity.
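
As an illustration of keeping the GPU fed from storage, the sketch below uses a PyTorch DataLoader with multiple worker processes and pinned memory; the synthetic dataset, batch size, and worker count are assumptions for the example, not recommendations from the article:

  import torch
  from torch.utils.data import DataLoader, TensorDataset

  # Synthetic stand-in for an image dataset held on local storage (illustrative)
  dataset = TensorDataset(torch.randn(256, 3, 224, 224),
                          torch.randint(0, 10, (256,)))

  # Several worker processes and pinned memory help keep host-to-GPU data
  # transfer from becoming the training bottleneck.
  loader = DataLoader(dataset, batch_size=64, shuffle=True,
                      num_workers=4, pin_memory=True)

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  for images, labels in loader:
      images = images.to(device, non_blocking=True)
      labels = labels.to(device, non_blocking=True)
      # ... forward/backward pass would go here ...
      break  # one batch shown for brevity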

Key elements to look for in DL inference infrastructure

In the inference phase, a trained neural network is applied to new data, usually via an application or service. During the inference phase of deep learning, the trained model is used to make predictions or to classify new data; the GPU, with its highly parallel computing capabilities, quickly processes the input data and generates results.


Inferencing, in most applications, looks for quick answers that can be arrived at in milliseconds, meaning the inference process typically requires low latency and high throughput, especially for real-time applications and large-scale inference tasks, while requiring much less processing power than training.[2]

  • High I/O bandwidth and enough memory to hold both the required trained model(s) and the input data. Keep storage and memory as close to the processor as possible, and use a low-latency network, to reduce I/O latency (see the inference sketch below).
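
A minimal sketch of low-latency inference with a trained PyTorch model, assuming the weights file saved in the training sketch above; the architecture, file name, input shape, and timing code are illustrative only:

  import time
  import torch
  import torch.nn as nn

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  # Rebuild the architecture and load the weights produced during training
  model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
  model.load_state_dict(torch.load("model_weights.pt", map_location=device))
  model.to(device).eval()

  # Inference needs no gradients, which saves memory and reduces latency
  sample = torch.randn(1, 784, device=device)
  with torch.no_grad():
      start = time.perf_counter()
      prediction = model(sample).argmax(dim=1)
      if device.type == "cuda":
          torch.cuda.synchronize()  # wait for the GPU so the timing is meaningful
      latency_ms = (time.perf_counter() - start) * 1000

  print(f"predicted class {prediction.item()} in {latency_ms:.2f} ms")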

Software and Tools Requirement Differences

ML training and inferencing also differ in the software environments they require.

Many approaches are used today for model development, training, and testing. These include popular libraries such as CUDA for NVIDIA GPUs, ML frameworks such as TensorFlow and PyTorch, cross-platform high-level model APIs such as Keras, and many more. When it comes to inferencing applications, however, a much smaller and different set of software tools is required. Inferencing tool sets are focused on running the model on a target platform. Technology such as the Open Neural Network Exchange (ONNX), an open standard managed as a Linux Foundation project, allows training and inferencing systems to be decoupled and gives developers the freedom to choose the best platform for each.
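
A sketch of this decoupling, assuming the same small PyTorch model and that the onnx and onnxruntime packages are installed (the file and tensor names are illustrative): the model is exported to the ONNX format on the training side, then executed with ONNX Runtime on whatever inference platform was chosen.

  import torch
  import torch.nn as nn
  import onnxruntime as ort

  # Train in one framework (PyTorch here), then export to the ONNX open standard
  model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
  model.eval()
  dummy_input = torch.randn(1, 784)
  torch.onnx.export(model, dummy_input, "model.onnx",
                    input_names=["input"], output_names=["logits"])

  # The inference side needs only the .onnx file and an ONNX-capable runtime
  session = ort.InferenceSession("model.onnx")
  outputs = session.run(["logits"], {"input": dummy_input.numpy()})
  print(outputs[0].shape)  # (1, 10)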

References