
Commit 606c5b1

Merge pull request #2558 from krzeszew/krzeszew/numba-dpex-to-dpnp

Updated numba kNN script

2 parents aa7b6d6 + c975a6a commit 606c5b1

File tree

2 files changed: +47 −129 lines changed

AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpnp_kNN/IntelPython_Numpy_Numba_dpnp_kNN.ipynb (+3 −3)
@@ -19,7 +19,7 @@
     "source": [
      "# Simple k-NN classification with Data Parallel Extension for NumPy IDP optimization\n",
      "\n",
-     "This sample shows how to receive the same accuracy of the k-NN model classification by using numpy, numba and numba_dpex. The computation are performed using wine dataset.\n",
+     "This sample shows how to achieve the same accuracy of k-NN model classification by using numpy, numba, and dpnp. The computations are performed using the wine dataset.\n",
      "\n",
      "Let's start with general imports used in the whole sample."
     ]
@@ -73,7 +73,7 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "We are planning to compare the results of the numpy, namba and IDP numba_dpex so we need to make sure that the results are reproducible. We can do this through the use of a random seed function that initializes a random number generator."
+     "We are planning to compare the results of numpy, numba, and IDP dpnp, so we need to make sure that the results are reproducible. We can do this through the use of a random seed function that initializes a random number generator."
     ]
    },
    {
@@ -370,7 +370,7 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "Like before, let's measure the accuracy of the prepared implementation. It is measured as the number of well-assigned classes for the test set. The final result is the same for all: NumPy, numba and numba-dpex implementations."
+     "Like before, let's measure the accuracy of the prepared implementation. It is measured as the number of well-assigned classes for the test set. The final result is the same for all: NumPy, numba, and dpnp implementations."
     ]
    },
    {

AI-and-Analytics/Features-and-Functionality/IntelPython_Numpy_Numba_dpnp_kNN/IntelPython_Numpy_Numba_dpnp_kNN.py (+44 −126)
@@ -11,10 +11,10 @@
 # =============================================================


-# # Simple k-NN classification with numba_dpex IDP optimization
-#
-# This sample shows how to receive the same accuracy of the k-NN model classification by using numpy, numba and numba_dpex. The computation are performed using wine dataset.
-#
+# # Simple k-NN classification with Data Parallel Extension for NumPy IDP optimization
+#
+# This sample shows how to achieve the same accuracy of k-NN model classification by using numpy, numba, and dpnp. The computations are performed using the wine dataset.
+#
 # Let's start with general imports used in the whole sample.

 # In[ ]:
@@ -27,11 +27,11 @@


 # ## Data preparation
-#
+#
 # Then, let's download the dataset and prepare it for future computations.
-#
+#
 # We are using the wine dataset available in the sci-kit learn library. For our purposes, we will be using only 2 features: alcohol and malic_acid.
-#
+#
 # So first we need to load the dataset and create a DataFrame from it. Later we will limit the DataFrame to just the target and the 2 features we chose for this problem.

 # In[ ]:
@@ -51,7 +51,7 @@
 df.head()


-# We are planning to compare the results of the numpy, namba and IDP numba_dpex so we need to make sure that the results are reproducible. We can do this through the use of a random seed function that initializes a random number generator.
+# We are planning to compare the results of numpy, numba, and IDP dpnp, so we need to make sure that the results are reproducible. We can do this through the use of a random seed function that initializes a random number generator.

 # In[ ]:

@@ -60,7 +60,7 @@


 # The next step is to prepare the dataset for training and testing. To do this, we randomly divide the downloaded wine dataset into a training set (containing 90% of the data) and a test set (containing 10% of the data).
-#
+#
 # In addition, we take from both sets (training and test) data *X* (features) and label *y* (target).

 # In[ ]:
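The split cell itself is not shown in this diff; a hypothetical sketch consistent with the description (assuming `df` holds the two feature columns plus `target`, and `seed` is the seed value set above) would be:

train = df.sample(frac=0.9, random_state=seed)  # 90% of the rows for training
test = df.drop(train.index)                     # the remaining 10% for testing

X_train, y_train = train.drop("target", axis=1), train["target"]
X_test, y_test = test.drop("target", axis=1), test["target"]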
@@ -78,9 +78,9 @@


 # ## NumPy k-NN
-#
+#
 # Now, it's time to implement the first version of the k-NN function using NumPy.
-#
+#
 # First, let's create a simple euclidean distance function. We take the positions from the provided vectors, square the individual differences between the positions, and then take the square root of their sum over the whole vectors (remember that the vectors must be of equal length).

 # In[ ]:
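The body of `distance` falls outside this hunk; a minimal sketch matching the description above (an editorial illustration, not the committed code) would be:

import numpy as np

def distance(vector1, vector2):
    # Euclidean distance: root of the summed squared coordinate differences
    return np.sqrt(np.sum((vector1 - vector2) ** 2))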
@@ -93,7 +93,7 @@ def distance(vector1, vector2):


 # Then, the k-nearest neighbors algorithm itself.
-#
+#
 # 1. We start by defining a container for predictions the same size as the test set.
 # 2. Then, for each row in the test set, we calculate distances between it and every training record.
 # 3. We sort the training dataset based on the calculated distances.
@@ -145,19 +145,18 @@ def knn(X_train, y_train, X_test, k):


 # ## Numba k-NN
-#
+#
 # Now, let's move to the numba implementation of the k-NN algorithm. We start the same way, by defining the distance function and importing the necessary packages.
-#
+#
 # For the numba implementation, we are using the core functionality, which is the `numba.jit()` decorator.
-#
+#
 # We start by defining the distance function. Like before, it is a euclidean distance. For additional optimization we are using `np.linalg.norm`.

 # In[ ]:


 import numba

-
 @numba.jit(nopython=True)
 def euclidean_distance_numba(vector1, vector2):
     dist = np.linalg.norm(vector1 - vector2)
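For illustration (hypothetical usage, not part of the commit; it assumes the function returns `dist`), the first call to a `nopython`-jitted function triggers compilation, and later calls reuse the compiled code:

a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])
print(euclidean_distance_numba(a, b))  # compiles on first call, prints 5.0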
@@ -174,6 +173,7 @@ def knn_numba(X_train, y_train, X_test, k):
     # 1. Prepare container for predictions
     predictions = np.zeros(X_test.shape[0])
     for x in np.arange(X_test.shape[0]):
+
         # 2. Calculate distances
         inputs = X_train.copy()
         distances = np.zeros((inputs.shape[0], 1))
@@ -198,7 +198,7 @@ def knn_numba(X_train, y_train, X_test, k):
         counter = {}
         for item in neighbor_classes:
             if item in counter:
-                counter[item] = counter.get(item) + 1
+                counter[item] += 1
             else:
                 counter[item] = 1
         counter_sorted = sorted(counter)
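In plain Python the vote count above could use `collections.Counter`; the manual dict is kept because `Counter` is not supported inside `numba.jit(nopython=True)` functions. An equivalent host-side sketch (editorial, not part of the commit):

from collections import Counter

def majority_vote(neighbor_classes):
    # Most common class label among the k neighbors
    return Counter(neighbor_classes).most_common(1)[0][0]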
@@ -208,8 +208,8 @@ def knn_numba(X_train, y_train, X_test, k):
     return predictions


-# Similarly, as in the NumPy example, we are testing implemented method for the `k = 3`.
-#
+# Similarly, as in the NumPy example, we are testing the implemented method for `k = 3`.
+#
 # The accuracy of the method is the same as in the NumPy implementation.

 # In[ ]:
@@ -222,143 +222,61 @@ def knn_numba(X_train, y_train, X_test, k):
 print("Numba accuracy:", accuracy)


-# ## Numba_dpex k-NN
-#
-# Numba_dpex implementation use `numba_dpex.kernel()` decorator. For more information about programming, SYCL kernels go to: https://intelpython.github.io/numba-dpex/latest/user_guides/kernel_programming_guide/index.html.
-#
-# Calculating distance is like in the NumPy example. We are using Euclidean distance. Later, we create the queue of the neighbors by the calculated distance and count in provided *k* votes for dedicated classes of neighbors.
-#
-# In the end, we are taking a class that achieves the maximum value of votes and setting it for the current global iteration.
+# ## Data Parallel Extension for NumPy k-NN
+#
+# To take advantage of DPNP, we can leverage its vectorized operations and efficient algorithms to implement a k-NN algorithm. We will use optimized operations like `sum`, `sqrt` or `argsort`.
+#
+# Calculating distance is like in the NumPy example. We are using Euclidean distance. The next step is to find the indices of the k nearest neighbors for each test point and get their labels. At the end, we need to determine the most frequent label among the k nearest.

 # In[ ]:


-import numba_dpex
-
-
-@numba_dpex.kernel
-def knn_numba_dpex(
-    item: numba_dpex.kernel_api.Item,
-    train,
-    train_labels,
-    test,
-    k,
-    predictions,
-    votes_to_classes_lst,
-):
-    dtype = train.dtype
-    i = item.get_id(0)
-    queue_neighbors = numba_dpex.kernel_api.PrivateArray(shape=(3, 2), dtype=dtype)
+import dpnp

-    for j in range(k):
-        x1 = train[j, 0]
-        x2 = test[i, 0]
+def knn_dpnp(train, train_labels, test, k):
+    # 1. Calculate pairwise distances between test and train points
+    distances = dpnp.sqrt(dpnp.sum((test[:, None, :] - train[None, :, :])**2, axis=-1))

-        distance = dtype.type(0.0)
-        diff = x1 - x2
-        distance += diff * diff
-        dist = math.sqrt(distance)
+    # 2. Find the indices of the k nearest neighbors for each test point
+    nearest_neighbors = dpnp.argsort(distances, axis=1)[:, :k]

-        queue_neighbors[j, 0] = dist
-        queue_neighbors[j, 1] = train_labels[j]
+    # 3. Get the labels of the nearest neighbors
+    nearest_labels = train_labels[nearest_neighbors]

-    for j in range(k):
-        new_distance = queue_neighbors[j, 0]
-        new_neighbor_label = queue_neighbors[j, 1]
-        index = j
+    # 4. Determine the most frequent label among the k nearest neighbors
+    unique_labels, counts = np.unique(nearest_labels, return_counts=True)
+    predicted_labels = nearest_labels[np.argmax(counts)]

-        while index > 0 and new_distance < queue_neighbors[index - 1, 0]:
-            queue_neighbors[index, 0] = queue_neighbors[index - 1, 0]
-            queue_neighbors[index, 1] = queue_neighbors[index - 1, 1]
-
-            index = index - 1
-
-        queue_neighbors[index, 0] = new_distance
-        queue_neighbors[index, 1] = new_neighbor_label
-
-    for j in range(k, len(train)):
-        x1 = train[j, 0]
-        x2 = test[i, 0]
-
-        distance = dtype.type(0.0)
-        diff = x1 - x2
-        distance += diff * diff
-        dist = math.sqrt(distance)
-
-        if dist < queue_neighbors[k - 1, 0]:
-            queue_neighbors[k - 1, 0] = dist
-            queue_neighbors[k - 1, 1] = train_labels[j]
-            new_distance = queue_neighbors[k - 1, 0]
-            new_neighbor_label = queue_neighbors[k - 1, 1]
-            index = k - 1
-
-            while index > 0 and new_distance < queue_neighbors[index - 1, 0]:
-                queue_neighbors[index, 0] = queue_neighbors[index - 1, 0]
-                queue_neighbors[index, 1] = queue_neighbors[index - 1, 1]
-
-                index = index - 1
-
-            queue_neighbors[index, 0] = new_distance
-            queue_neighbors[index, 1] = new_neighbor_label
-
-    votes_to_classes = votes_to_classes_lst[i]
-
-    for j in range(len(queue_neighbors)):
-        votes_to_classes[int(queue_neighbors[j, 1])] += 1
-
-    max_ind = 0
-    max_value = dtype.type(0)
-
-    for j in range(3):
-        if votes_to_classes[j] > max_value:
-            max_value = votes_to_classes[j]
-            max_ind = j
-
-    predictions[i] = max_ind
+    return predicted_labels


 # Next, like before, let's test the prepared k-NN function.
-#
-# In this case, we will need to provide the container for predictions: `predictions` and the container for votes per class: `votes_to_classes_lst` (the container size is 3, as we have 3 classes in our dataset).
-#
-# We are running a prepared k-NN function on a CPU device as the input data was allocated on the CPU. Numba-dpex will infer the execution queue based on where the input arguments to the kernel were allocated. Refer: https://intelpython.github.io/oneAPI-for-SciPy/details/programming_model/#compute-follows-data
-# In[ ]:
-
+#
+# We are running the prepared k-NN function on a CPU device, as the input data was allocated on the CPU using DPNP.

-import dpnp
+# In[ ]:

-predictions = dpnp.empty(len(X_test.values), device="cpu")
-# we have 3 classes
-votes_to_classes_lst = dpnp.zeros((len(X_test.values), 3), device="cpu")

 X_train_dpt = dpnp.asarray(X_train.values, device="cpu")
 y_train_dpt = dpnp.asarray(y_train.values, device="cpu")
 X_test_dpt = dpnp.asarray(X_test.values, device="cpu")

-numba_dpex.call_kernel(
-    knn_numba_dpex,
-    numba_dpex.Range(len(X_test.values)),
-    X_train_dpt,
-    y_train_dpt,
-    X_test_dpt,
-    3,
-    predictions,
-    votes_to_classes_lst,
-)
+predictions = knn_dpnp(X_train_dpt, y_train_dpt, X_test_dpt, 3)


-# Like before, let's measure the accuracy of the prepared implementation. It is measured as the number of well-assigned classes for the test set. The final result is the same for all: NumPy, numba and numba-dpex implementations.
+# Like before, let's measure the accuracy of the prepared implementation. It is measured as the number of well-assigned classes for the test set. The final result is the same for all: NumPy, numba, and dpnp implementations.

 # In[ ]:


 predictions_numba = dpnp.asnumpy(predictions)
 true_values = y_test.to_numpy()
 accuracy = np.mean(predictions_numba == true_values)
-print("Numba_dpex accuracy:", accuracy)
+print("Data Parallel Extension for NumPy accuracy:", accuracy)


 # In[ ]:


-print("[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]")
+print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]")
+