
Commit 48264e5

AMX bfloat16 mixed precision learning TensorFlow Transformer sample (#1317)
1 parent d5aea45 commit 48264e5

File tree

11 files changed: +832 −0 lines


AI-and-Analytics/Features-and-Functionality/IntelTensorFlow_Transformer_AMX_bfloat16_MixedPrecision/IntelTensorFlow_Transformer_AMX_bfloat16_MixedPrecision.ipynb

+536 lines (large diff not rendered by default)
@@ -0,0 +1,7 @@
Copyright Intel Corporation

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,141 @@
# `TensorFlow (TF) Transformer with Intel® Advanced Matrix Extensions (Intel® AMX) bfloat16 Mixed Precision Learning`

This sample code demonstrates optimizing a TensorFlow model with Intel® Advanced Matrix Extensions (Intel® AMX) using bfloat16 (Brain Floating Point) on 4th Gen Intel® Xeon® Scalable Processors (Sapphire Rapids).

| Area                | Description
|:---                 |:---
| What you will learn | How to use Intel® AMX bfloat16 mixed precision learning on a TensorFlow model
| Time to complete    | 15 minutes

> **Note**: The sample is based on the [*Text classification with Transformer*](https://keras.io/examples/nlp/text_classification_with_transformer/) Keras sample.
## Purpose

In this sample, you will run a transformer classification model with bfloat16 mixed precision learning on the Intel® AMX ISA and compare its performance against AVX512. You should notice that using Intel® AMX delivers a performance increase over AVX512 while retaining the expected precision.
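Mixed precision learning with bfloat16 keeps the float32 weights while running the compute-heavy matrix operations in bfloat16, which Intel® AMX accelerates in hardware. Before running the sample, you can verify that the CPU exposes the AMX bfloat16 instructions. The following is a minimal sketch, not part of the sample, assuming a Linux kernel that reports the `amx_bf16` CPU flag (as Sapphire Rapids systems do):

```
# Minimal sketch (Linux only): check /proc/cpuinfo for the amx_bf16 flag.
def cpu_supports_amx_bf16():
    with open("/proc/cpuinfo") as f:
        return "amx_bf16" in f.read()

print("Intel AMX bfloat16 available:", cpu_supports_amx_bf16())
```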
## Prerequisites

This sample code works on **Sapphire Rapids** only.

| Optimized for | Description
|:---           |:---
| OS            | Ubuntu* 20.04
| Hardware      | Sapphire Rapids
| Software      | Intel® AI Analytics Toolkit (AI Kit)

The sample assumes Intel® Optimization for TensorFlow is installed. (See the [Intel® Optimization for TensorFlow* Installation Guide](https://www.intel.com/content/www/us/en/developer/articles/guide/optimization-for-TensorFlow-installation-guide.html) for more information.)
### For Local Development Environments

You will need to download and install the following toolkits, tools, and components to use the sample.

- **Intel® AI Analytics Toolkit (AI Kit)**

  You can get the AI Kit from [Intel® oneAPI Toolkits](https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#analytics-kit). <br> See [*Get Started with the Intel® AI Analytics Toolkit for Linux*](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux) for AI Kit installation information and post-installation steps and scripts.

- **Jupyter Notebook**

  Install using pip: `pip install notebook`. <br> Alternatively, see [*Installing Jupyter*](https://jupyter.org/install) for detailed installation instructions.

- **Intel® oneAPI Data Analytics Library**

  You might need some parts of the [Intel® oneAPI Data Analytics Library](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onedal.html).
### For Intel® DevCloud

The necessary tools and components are already installed in the environment. You do not need to install additional components. See [Intel® DevCloud for oneAPI](https://devcloud.intel.com/oneapi/get_started/) for information.
## Key Implementation Details

The sample code is written in Python and targets Sapphire Rapids only.
## Run the Sample

### On Linux*

> **Note**: If you have not already done so, set up your CLI
> environment by sourcing the `setvars` script in the root of your oneAPI installation.
>
> Linux*:
> - For system-wide installations: `. /opt/intel/oneapi/setvars.sh`
> - For private installations: `. ~/intel/oneapi/setvars.sh`
> - For non-POSIX shells, like csh, use the following command: `bash -c 'source <install-dir>/setvars.sh ; exec csh'`
>
> For more information on configuring environment variables, see [Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html).
#### Activate Conda

1. Activate the Conda environment.
   ```
   conda activate tensorflow
   ```
   By default, the AI Kit is installed in the `/opt/intel/oneapi` folder and requires root privileges to manage it.

   You can choose to activate the Conda environment without root access. To bypass root access, clone and activate the Conda environment using commands similar to the following.
   ```
   conda create --name usr_tensorflow --clone tensorflow
   conda activate usr_tensorflow
   ```
#### Run the Notebook

1. Launch Jupyter Notebook.
   ```
   jupyter notebook --ip=0.0.0.0
   ```
2. Follow the instructions to open the URL with the token in your browser.
3. Locate and select the Notebook.
   ```
   IntelTensorFlow_Transformer_AMX_bfloat16_MixedPrecision.ipynb
   ```
4. Run every cell in the Notebook in sequence.
#### Troubleshooting

If you receive an error message, troubleshoot the problem using the **Diagnostics Utility for Intel® oneAPI Toolkits**. The diagnostic utility provides configuration and system checks to help find missing dependencies, permissions errors, and other issues. See the [Diagnostics Utility for Intel® oneAPI Toolkits User Guide](https://www.intel.com/content/www/us/en/develop/documentation/diagnostic-utility-user-guide/top.html) for more information on using the utility.
### Run the Sample on Intel® DevCloud

1. If you do not already have an account, request an Intel® DevCloud account at [*Create an Intel® DevCloud Account*](https://intelsoftwaresites.secure.force.com/DevCloud/oneapi).
2. On a Linux* system, open a terminal.
3. SSH into Intel® DevCloud.
   ```
   ssh DevCloud
   ```
   > **Note**: You can find information about configuring your Linux system and connecting to Intel® DevCloud at Intel® DevCloud for oneAPI [Get Started](https://devcloud.intel.com/oneapi/get_started).
4. Locate and select the Notebook.
   ```
   IntelTensorFlow_Transformer_AMX_bfloat16_MixedPrecision.ipynb
   ```
5. Run every cell in the Notebook in sequence.
## Example Output

You should see performance analysis diagrams, formatted as pie charts, showing the JIT kernel type time breakdown for both AVX512 and Intel® AMX.

The following image shows a typical example of the JIT kernel time breakdown analysis diagrams.

![jit pie chart](images/jit_breakdown_pie.png)
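The pie charts are built from the oneDNN verbose logs collected by the run scripts. As a rough illustration of that analysis (a hedged sketch, not the notebook's exact code: it assumes `ONEDNN_VERBOSE=1` `exec` lines are comma-separated with the kernel implementation named in one field and the elapsed milliseconds in the last field, which can vary between oneDNN versions):

```
# Sketch: aggregate oneDNN exec time per JIT kernel ISA and plot a pie chart.
from collections import defaultdict
import matplotlib.pyplot as plt

times = defaultdict(float)
with open("logs/dnn_logs.txt") as f:
    for line in f:
        if not line.startswith("onednn_verbose") or ",exec," not in line:
            continue
        fields = line.strip().split(",")
        # The implementation field names the ISA, e.g. brgemm:avx512_core_amx.
        impl = next((x for x in fields if "avx512" in x or "amx" in x), "other")
        try:
            times[impl] += float(fields[-1])  # elapsed time in milliseconds
        except ValueError:
            pass

plt.pie(list(times.values()), labels=list(times.keys()), autopct="%1.1f%%")
plt.title("JIT kernel time breakdown")
plt.savefig("jit_breakdown_pie.png")
```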
## Further Reading

Explore [Get Started with the Intel® AI Analytics Toolkit for Linux*](https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) to find out how you can achieve performance gains for popular deep-learning and machine-learning frameworks through Intel optimizations.

## License

Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.

Third-party program licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt).
@@ -0,0 +1,14 @@
#!/bin/bash

mkdir logs

wget https://raw.githubusercontent.com/IntelAI/models/master/benchmarks/common/platform_util.py

echo "########## Executing the run"

source /opt/intel/oneapi/setvars.sh
source activate tensorflow

ONEDNN_VERBOSE_TIMESTAMP=1 ONEDNN_VERBOSE=1 python ./text_classification_with_transformer.py > ./logs/dnn_logs.txt

echo "########## Done with the run"
@@ -0,0 +1,10 @@
#!/bin/bash

echo "########## Executing the run"

source /opt/intel/oneapi/setvars.sh
source activate tensorflow

ONEDNN_VERBOSE_TIMESTAMP=1 ONEDNN_VERBOSE=1 python ./text_classification_with_transformer.py > ./logs/dnn_logs_mixed.txt

echo "########## Done with the run"
@@ -0,0 +1,19 @@
--- text_classification_with_transformer.py	2022-09-20 02:24:42.814605146 -0700
+++ text_classification_with_transformer2.py	2022-09-20 02:24:48.489188611 -0700
@@ -27,6 +27,16 @@
 
 
 """
+## Bfloat16 mixed precision learning
+"""
+
+from tensorflow.keras import mixed_precision
+
+policy = mixed_precision.Policy('mixed_bfloat16')
+mixed_precision.set_global_policy(policy)
+
+
+"""
 ## Implement a Transformer block as a layer
 """
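This patch enables Keras mixed precision for the whole model: under the `mixed_bfloat16` global policy, layers compute in bfloat16 while keeping their variables in float32 for numerically stable weight updates. A quick way to see the effect (a small check, assuming TensorFlow 2.x with Keras mixed precision, as the sample uses):

```
import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy("mixed_bfloat16")
layer = tf.keras.layers.Dense(8)
print(layer.compute_dtype)   # bfloat16 -- used for activations and math
print(layer.variable_dtype)  # float32  -- used for the weights
```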
@@ -0,0 +1,53 @@
--- text_classification_with_transformer.py	2022-10-17 04:04:37.455448493 -0700
+++ text_classification_with_transformer2.py	2022-10-17 04:07:15.196716415 -0700
@@ -9,10 +9,38 @@
 ## Setup
 """
 
+from time import time
+import os
+
 import tensorflow as tf
 from tensorflow import keras
 from tensorflow.keras import layers
 
+from platform_util import PlatformUtil
+cpu_info = PlatformUtil("")
+
+numa_nodes = cpu_info.numa_nodes
+print("CPU count per socket:", cpu_info.cores_per_socket, "\nSocket count:", cpu_info.sockets, "\nNuma nodes:", numa_nodes)
+
+if numa_nodes > 0:
+    socket_number = 1
+    cpu_count = cpu_info.cores_per_socket
+    inter_thread = 1
+else:
+    # on a non-NUMA machine, use all the cores and do not use numactl
+    socket_number = -1
+    cpu_count = cpu_info.cores_per_socket * cpu_info.sockets
+    inter_thread = cpu_info.sockets
+
+# Intel OpenMP threads and other fine-tuning parameters
+os.environ['OMP_NUM_THREADS'] = str(cpu_count)
+os.environ['KMP_BLOCKTIME'] = str(inter_thread)
+os.environ['KMP_AFFINITY'] = "granularity=fine,verbose,compact,1,0"
+
+# Eigen threads
+tf.config.threading.set_intra_op_parallelism_threads(cpu_count)
+tf.config.threading.set_inter_op_parallelism_threads(inter_thread)
+
 
 """
 ## Implement a Transformer block as a layer
@@ -110,6 +138,11 @@
 model.compile(
     optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
 )
+
+start = time()
 history = model.fit(
     x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val)
 )
+end = time()
+
+print("time: ", end - start)
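The patch above derives thread counts from the machine topology (via `platform_util.py` from the IntelAI models repo) and pins Intel OpenMP and TensorFlow threading accordingly. To confirm what TensorFlow actually ends up using, a small check (an illustration, not part of the sample) can be run before any op executes:

```
import tensorflow as tf

# A value of 0 means TensorFlow picks the thread count itself; these must be
# queried or set before the first op runs.
print("intra-op threads:", tf.config.threading.get_intra_op_parallelism_threads())
print("inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())
```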
@@ -0,0 +1,14 @@
#!/bin/bash

echo "########## Executing the run"

source activate tensorflow

# enable verbose log
export DNNL_VERBOSE=2
# enable JIT Dump
export DNNL_JIT_DUMP=1

DNNL_MAX_CPU_ISA=AVX512_CORE_BF16 python ./text_classification_with_transformer.py cpu >> ./logs/log_cpu_bf16_avx512_bf16.csv 2>&1

echo "########## Done with the run"
@@ -0,0 +1,14 @@
#!/bin/bash

echo "########## Executing the run"

source activate tensorflow

# enable verbose log
export DNNL_VERBOSE=2
# enable JIT Dump
export DNNL_JIT_DUMP=1

DNNL_MAX_CPU_ISA=AVX512_CORE_AMX python ./text_classification_with_transformer.py cpu >> ./logs/log_cpu_bf16_avx512_amx.csv 2>&1

echo "########## Done with the run"
@@ -0,0 +1,24 @@
{
  "guid": "60A68888-6099-414E-999B-EDC7310A01EA",
  "name": "TensorFlow (TF) Transformer with Intel® Advanced Matrix Extensions (Intel® AMX) bfloat16 Mixed Precision Learning",
  "categories": ["Toolkit/oneAPI AI And Analytics/AI Getting Started Samples"],
  "description": "This sample code demonstrates optimizing a TensorFlow model with Intel® Advanced Matrix Extensions (Intel® AMX) using bfloat16 (Brain Floating Point) on Sapphire Rapids",
  "builder": ["cli"],
  "languages": [{"python":{}}],
  "os": ["linux"],
  "targetDevice": ["CPU"],
  "ciTests": {
    "linux": [
      {
        "env": [],
        "id": "Transformer_AMX_bfloat16_Mixed_Precision_Learning",
        "steps": [
          "conda activate tensorflow",
          "conda install -y jupyter",
          "jupyter nbconvert --execute IntelTensorFlow_Transformer_AMX_bfloat16_MixedPrecision.ipynb"
        ]
      }
    ]
  },
  "expertise": "Getting Started"
}
