refs:

Using vectorization, ex. for efficient kNN calculation:¶

kNN

ufuncs in NumPy (vectorized expressions instead of loops)¶

In [281]:

import numpy as np
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(values)

3.51 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]:

%timeit 1.0 / values

8.4 ms ± 988 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

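So the ufunc version is much faster (here by a factor of roughly 400). The Python loop pays for type checks and function dispatch on every element, while the ufunc pushes the loop into compiled code where the element type is known in advance. In a compiled language, that type specification would already be fixed before the code runs.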

ufuncs work with multi-dimensional arrays as well:

In [3]:

x = np.arange(9).reshape((3,3))
x

Out[3]:

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [4]:

x ** 2

Out[4]:

array([[ 0,  1,  4],
       [ 9, 16, 25],
       [36, 49, 64]], dtype=int32)

The following table lists the arithmetic operators implemented in NumPy:

Operator	Equivalent ufunc	Description
`+`	`np.add`	Addition (e.g., `1 + 1 = 2`)
`-`	`np.subtract`	Subtraction (e.g., `3 - 2 = 1`)
`-`	`np.negative`	Unary negation (e.g., `-2`)
`*`	`np.multiply`	Multiplication (e.g., `2 * 3 = 6`)
`/`	`np.divide`	Division (e.g., `3 / 2 = 1.5`)
`//`	`np.floor_divide`	Floor division (e.g., `3 // 2 = 1`)
`**`	`np.power`	Exponentiation (e.g., `2 ** 3 = 8`)
`%`	`np.mod`	Modulus/remainder (e.g., `9 % 4 = 1`)

Additionally there are Boolean/bitwise operators; we will explore these in Comparisons, Masks, and Boolean Logic.
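As a quick check of the arithmetic table above, here is a minimal sketch showing that each operator gives the same result as its ufunc. The array `x` and the output buffer `y` are made up for illustration.

```python
import numpy as np

x = np.arange(4)  # array([0, 1, 2, 3])

# Each operator is a thin wrapper around the ufunc listed in the table.
added = np.add(x, 5)              # same as x + 5
floored = np.floor_divide(x, 2)   # same as x // 2
negated = np.negative(x)          # same as -x

# ufuncs also accept an `out` argument, which writes the result into an
# existing array instead of allocating a new one.
y = np.empty(4)
np.multiply(x, 10, out=y)

print(added, floored, negated, y)
```

The `out` argument is mostly useful for large arrays, where avoiding a temporary copy can matter.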

more: https://colab.research.google.com/github/devinrourke/PythonDataScienceHandbook/blob/8a34a4f653bdbdc01415a94dc20d4e9b97438965/notebooks/02.03-Computation-on-arrays-ufuncs.ipynb#scrollTo=y4-6zm1JFBRP

Aggregations¶

Function Name	NaN-safe Version	Description
`np.sum`	`np.nansum`	Compute sum of elements
`np.prod`	`np.nanprod`	Compute product of elements
`np.mean`	`np.nanmean`	Compute mean of elements
`np.std`	`np.nanstd`	Compute standard deviation
`np.var`	`np.nanvar`	Compute variance
`np.min`	`np.nanmin`	Find minimum value
`np.max`	`np.nanmax`	Find maximum value
`np.argmin`	`np.nanargmin`	Find index of minimum value
`np.argmax`	`np.nanargmax`	Find index of maximum value
`np.median`	`np.nanmedian`	Compute median of elements
`np.percentile`	`np.nanpercentile`	Compute rank-based statistics of elements
`np.any`	N/A	Evaluate whether any elements are true
`np.all`	N/A	Evaluate whether all elements are true
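A small sketch of how the plain and NaN-safe versions in the table differ, and of the `axis` argument that most aggregates accept. The arrays `data` and `M` are made-up examples.

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

plain_sum = np.sum(data)       # NaN propagates through ordinary aggregates
safe_sum = np.nansum(data)     # NaN entries are ignored: 1 + 2 + 4
safe_mean = np.nanmean(data)   # mean over the three non-NaN values

# Aggregates take an `axis` argument on multi-dimensional arrays.
M = np.arange(6).reshape(2, 3)
col_min = M.min(axis=0)        # minimum within each column
row_sum = M.sum(axis=1)        # sum within each row

print(plain_sum, safe_sum, safe_mean, col_min, row_sum)
```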

Computation on Arrays: Broadcasting¶

In [5]:

M = np.ones((3,3))

In [6]:

a = [1,2,3]

In [7]:

M + a

Out[7]:

array([[2., 3., 4.],
       [2., 3., 4.],
       [2., 3., 4.]])

In [8]:

a = np.arange(3)
b = np.arange(3).reshape((3,1))

print(a)
print(b)

[0 1 2]
[[0]
 [1]
 [2]]

In [9]:

a + b

Out[9]:

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
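The three rules can be traced on small arrays. This is a sketch with made-up shapes; `np.broadcast` is used only to inspect the resulting shape without doing any arithmetic.

```python
import numpy as np

M = np.ones((2, 3))
a = np.arange(3)

# Rule 1: a has shape (3,), which is padded on the left to (1, 3).
# Rule 2: the size-1 dimension is stretched to match M, giving (2, 3).
shape_ok = np.broadcast(M, a).shape

# Rule 3: (3, 2) vs (3,) becomes (3, 2) vs (1, 3); the last dimensions
# are 2 and 3, they disagree and neither is 1, so an error is raised.
M2 = np.ones((3, 2))
try:
    M2 + a
    raised = False
except ValueError:
    raised = True

# Adding an explicit trailing axis makes the shapes compatible:
# (3, 2) and (3, 1) broadcast to (3, 2).
fixed = M2 + a[:, np.newaxis]

print(shape_ok, raised, fixed.shape)
```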

In [10]:

X = np.random.random((10,3))

In [11]:

Xmean = X.mean(0)
Xmean

Out[11]:

array([0.46110098, 0.49893467, 0.36013246])

In [12]:

X_centered = X - Xmean
X_centered

Out[12]:

array([[ 0.14736296,  0.21900803,  0.07023161],
       [-0.20869088,  0.4767899 , -0.24036926],
       [-0.2525321 , -0.25910342,  0.03020778],
       [ 0.15208766,  0.36143593, -0.0136786 ],
       [-0.16716128, -0.10618668,  0.33431049],
       [ 0.25124886,  0.14288258, -0.21493999],
       [ 0.45686835, -0.47092551,  0.17158316],
       [-0.11609482,  0.30802404, -0.28857307],
       [-0.25315294, -0.25342349,  0.41424483],
       [-0.0099358 , -0.41850138, -0.26301694]])

In [13]:

X_centered.mean(0)

Out[13]:

array([ 4.44089210e-17, -6.66133815e-17, -3.33066907e-17])

In [14]:

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50).reshape(50, 1)

In [15]:

# use broadcasting to compute z across the grid:
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

In [16]:

import matplotlib.pyplot as plt

In [17]:

plt.imshow(z, origin='lower', extent=[0, 5, 0, 5], cmap='viridis')
plt.colorbar();

Comparisons, Masks, Boolean Logic¶

Extract, modify, or otherwise manipulate values in an array based on some criterion. For example, count all values greater than a certain value, or remove all outliers above a threshold. In NumPy, Boolean masking is often the most efficient way to do this.

In [18]:

x = np.array([1, 2, 3, 4, 5])

The result of a ufunc comparison operator is always an array with a Boolean data type. All six standard comparisons are available: <, >, <=, >=, !=, and ==.

In [19]:

x > 3

Out[19]:

array([False, False, False,  True,  True])

In [20]:

(2 * x) == (x ** 2)

Out[20]:

array([False,  True, False, False, False])

In [21]:

rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x

Out[21]:

array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

In [22]:

x < 6

Out[22]:

array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]])

In [23]:

np.sum(x < 6)

Out[23]:

8

This works because False is interpreted as 0 and True as 1, so the sum counts the True entries.

In [24]:

# how many values less than 6 in each row?
np.sum(x < 6, axis=1)

Out[24]:

array([4, 2, 2])

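np.any() and np.all() can be used in the same way. Be careful: np.sum(), np.any(), and np.all() are different from Python's built-in sum(), any(), and all(), which do not take an axis argument and can fail or give unintended results on multi-dimensional arrays.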
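A minimal sketch of the difference between the NumPy aggregates and the built-ins, reusing the 3x4 array values shown above (typed in by hand here so the block runs on its own).

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

total_np = np.sum(x < 6)   # counts True values over the whole array
total_py = sum(x < 6)      # built-in sum iterates over rows and adds them,
                           # so it returns one count per column instead

any_large = np.any(x > 8)               # is any value greater than 8?
all_small_rows = np.all(x < 8, axis=1)  # per row: are all values less than 8?

print(total_np, total_py, any_large, all_small_rows)
```

The built-in version does not fail here, it just quietly answers a different question, which is why the NumPy versions are the safer habit.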

In [25]:

import pandas as pd

In [26]:

rainfall = pd.read_csv('Seattle2014.csv')['PRCP'].values
inches = rainfall / 254.0  # 1/10mm -> inches
inches.shape

Out[26]:

(365,)

In [27]:

np.sum((inches > 0.5) & (inches < 1))

Out[27]:

Operator	Equivalent ufunc
`&`	`np.bitwise_and`
`|`	`np.bitwise_or`
`^`	`np.bitwise_xor`
`~`	`np.bitwise_not`
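A short sketch of these operators on Boolean masks. The `inches` values below are made up for illustration and are not the Seattle data.

```python
import numpy as np

inches = np.array([0.0, 0.3, 0.6, 0.9, 1.2])  # made-up values

mask_op = (inches > 0.5) & (inches < 1)
mask_ufunc = np.bitwise_and(inches > 0.5, inches < 1)

# By De Morgan's law, the same mask can be written with | and ~.
mask_demorgan = ~((inches <= 0.5) | (inches >= 1))

# The parentheses matter: & binds more tightly than the comparisons, so
# `inches > 0.5 & inches < 1` would be parsed as
# `inches > (0.5 & inches) < 1`, which is not the intended expression.

print(mask_op, np.sum(mask_op))
```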

In [28]:

print("Number days without rain:      ", np.sum(inches == 0))
print("Number days with rain:         ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches  :", np.sum((inches > 0) &
                                                (inches < 0.2)))

Number days without rain:       215
Number days with rain:          150
Days with more than 0.5 inches: 37
Rainy days with < 0.2 inches  : 75

In [29]:

x

Out[29]:

array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

In [30]:

x < 5

Out[30]:

array([[False,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True, False, False]])

In [31]:

# masking operation to select from array
x[x < 5]

Out[31]:

array([0, 3, 3, 3, 2, 4])

In [32]:

# construct a mask of all rainy days
rainy = (inches > 0)

# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)

print("Median precip on rainy days in 2014 (inches):   ",
      np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ",
      np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
      np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
      np.median(inches[rainy & ~summer]))

Median precip on rainy days in 2014 (inches):    0.19488188976377951
Median precip on summer days in 2014 (inches):   0.0
Maximum precip on summer days in 2014 (inches):  0.8503937007874016
Median precip on non-summer rainy days (inches): 0.20078740157480315

One common point of confusion is the difference between the keywords `and` and `or` on one hand, and the operators `&` and `|` on the other. When would you use one versus the other?

The difference is this: `and` and `or` gauge the truth or falsehood of an entire object, while `&` and `|` refer to bits within each object.

When you use `and` or `or`, you are asking Python to treat the object as a single Boolean entity. In Python, all nonzero integers evaluate as True.

In [33]:

bool(42), bool(0)

Out[33]:

(True, False)

In [34]:

bool(42 and 0)

Out[34]:

False

In [35]:

bool(42 or 0)

Out[35]:

True

When you use `&` and `|` on integers, the expression operates on the bits of the element, applying the and or the or to the individual bits that make up the number. In the cells below, the corresponding bits of the binary representations are compared to yield the result:

In [36]:

# bin() is binary representation of the number
bin(42)

Out[36]:

'0b101010'

In [37]:

bin(59)

Out[37]:

'0b111011'

In [38]:

bin(42 & 59)

Out[38]:

'0b101010'

In [39]:

A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B

Out[39]:

array([ True,  True,  True, False,  True,  True])

In [40]:

A or B

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-ea2c97d9d9ee> in <module>()
----> 1 A or B

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [41]:

x = np.arange(10)
(x > 4) & (x < 8)

Out[41]:

array([False, False, False, False, False,  True,  True,  True, False,
       False])

In [42]:

(x > 4) and (x < 8)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-eecf1fdd5fb4> in <module>()
----> 1 (x > 4) and (x < 8)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

So remember this: `and` and `or` perform a single Boolean evaluation on an entire object, while `&` and `|` perform multiple Boolean evaluations on the content (the individual bits or bytes) of an object.

For Boolean NumPy arrays, the latter is nearly always the desired operation.
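When a single True/False answer really is needed (for example in an `if` statement), the array has to be reduced explicitly, as the error message suggests. A minimal sketch; the variable names are made up.

```python
import numpy as np

x = np.arange(10)
mask = (x > 4) & (x < 8)   # elementwise: one Boolean per element

# To use a mask in an `if` statement, reduce it to a single Boolean first.
first_match = None
if mask.any():
    first_match = x[mask][0]

# np.logical_and is the elementwise "and"; on Boolean arrays it agrees with &.
same = np.array_equal(mask, np.logical_and(x > 4, x < 8))

print(first_match, mask.all(), same)
```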

Fancy Indexing¶

Ways to access and modify portions of arrays seen so far:

simple indices: arr[0]
slices: arr[:5]
Boolean masks: arr[arr > 0]

Fancy indexing is just passing an array of indices to access multiple array elements at once:

In [ ]:

rand = np.random.RandomState(42)

x = rand.randint(100, size=10)
print(x)

In [ ]:

[x[3], x[7], x[2]]

In [ ]:

ind = [3, 7, 2]
x[ind]

In [ ]:

ind = np.array([[3, 7],
                [2, 6]])
x[ind]

It is always important to remember with fancy indexing that the return value reflects the broadcasted shape of the indices, rather than the shape of the array being indexed.
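The shape rule is easiest to see in two dimensions. In this sketch (with a made-up 3x4 array), two index arrays of the same shape are paired up, while a column of row indices and a row of column indices broadcast to a full grid.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))

row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Index shapes (3,) and (3,): picks elements (0, 2), (1, 1), (2, 3).
pairs = X[row, col]

# Index shapes (3, 1) and (3,) broadcast to (3, 3): every row index is
# combined with every column index.
grid = X[row[:, np.newaxis], col]

print(pairs)
print(grid)
```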

In [52]:

x = np.zeros(10)
x[[0, 0]] = [4, 6]
x

Out[52]:

array([6., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [53]:

i = [3, 3, 3]
x[i] += 1   # x[i] + 1 is computed once, then assigned 3 times, so x[3] grows by 1, not 3
x

Out[53]:

array([6., 0., 0., 1., 0., 0., 0., 0., 0., 0.])

To make the increment apply once for every repeated index, use the at() method of ufuncs:

In [55]:

np.add.at(x, i, 1)

In [56]:

x

Out[56]:

array([6., 0., 0., 4., 0., 0., 0., 0., 0., 0.])

The reduceat() method of ufuncs, e.g. np.add.reduceat(), is similarly useful.
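A minimal sketch of `np.add.reduceat()`: it applies the reduction to contiguous segments of an array, where each segment begins at one of the given indices. The array and the `starts` list are made up for illustration.

```python
import numpy as np

x = np.arange(8)    # [0, 1, 2, 3, 4, 5, 6, 7]
starts = [0, 4]     # hypothetical segment start indices

# Sums x[0:4] and x[4:], i.e. one reduction per contiguous segment.
segment_sums = np.add.reduceat(x, starts)

print(segment_sums)
```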

Efficient manual histogram with `searchsorted()` and `at()`¶

In [62]:

np.random.seed(42)
x = np.random.randn(100)
x

Out[62]:

array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337,
       -0.23413696,  1.57921282,  0.76743473, -0.46947439,  0.54256004,
       -0.46341769, -0.46572975,  0.24196227, -1.91328024, -1.72491783,
       -0.56228753, -1.01283112,  0.31424733, -0.90802408, -1.4123037 ,
        1.46564877, -0.2257763 ,  0.0675282 , -1.42474819, -0.54438272,
        0.11092259, -1.15099358,  0.37569802, -0.60063869, -0.29169375,
       -0.60170661,  1.85227818, -0.01349722, -1.05771093,  0.82254491,
       -1.22084365,  0.2088636 , -1.95967012, -1.32818605,  0.19686124,
        0.73846658,  0.17136828, -0.11564828, -0.3011037 , -1.47852199,
       -0.71984421, -0.46063877,  1.05712223,  0.34361829, -1.76304016,
        0.32408397, -0.38508228, -0.676922  ,  0.61167629,  1.03099952,
        0.93128012, -0.83921752, -0.30921238,  0.33126343,  0.97554513,
       -0.47917424, -0.18565898, -1.10633497, -1.19620662,  0.81252582,
        1.35624003, -0.07201012,  1.0035329 ,  0.36163603, -0.64511975,
        0.36139561,  1.53803657, -0.03582604,  1.56464366, -2.6197451 ,
        0.8219025 ,  0.08704707, -0.29900735,  0.09176078, -1.98756891,
       -0.21967189,  0.35711257,  1.47789404, -0.51827022, -0.8084936 ,
       -0.50175704,  0.91540212,  0.32875111, -0.5297602 ,  0.51326743,
        0.09707755,  0.96864499, -0.70205309, -0.32766215, -0.39210815,
       -1.46351495,  0.29612028,  0.26105527,  0.00511346, -0.23458713])

In [58]:

bins = np.linspace(-5, 5, 20)
bins

Out[58]:

array([-5.        , -4.47368421, -3.94736842, -3.42105263, -2.89473684,
       -2.36842105, -1.84210526, -1.31578947, -0.78947368, -0.26315789,
        0.26315789,  0.78947368,  1.31578947,  1.84210526,  2.36842105,
        2.89473684,  3.42105263,  3.94736842,  4.47368421,  5.        ])

In [67]:

counts = np.zeros_like(bins)
counts

Out[67]:

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

In [65]:

i = np.searchsorted(bins, x)
i

Out[65]:

array([11, 10, 11, 13, 10, 10, 13, 11,  9, 11,  9,  9, 10,  6,  7,  9,  8,
       11,  8,  7, 13, 10, 10,  7,  9, 10,  8, 11,  9,  9,  9, 14, 10,  8,
       12,  8, 10,  6,  7, 10, 11, 10, 10,  9,  7,  9,  9, 12, 11,  7, 11,
        9,  9, 11, 12, 12,  8,  9, 11, 12,  9, 10,  8,  8, 12, 13, 10, 12,
       11,  9, 11, 13, 10, 13,  5, 12, 10,  9, 10,  6, 10, 11, 13,  9,  8,
        9, 12, 11,  9, 11, 10, 12,  9,  9,  9,  7, 11, 10, 10, 10],
      dtype=int64)

In [68]:

np.add.at(counts, i, 1)
counts

Out[68]:

array([ 0.,  0.,  0.,  0.,  0.,  1.,  3.,  7.,  9., 23., 22., 17., 10.,
        7.,  1.,  0.,  0.,  0.,  0.,  0.])

In [69]:

bins

Out[69]:

array([-5.        , -4.47368421, -3.94736842, -3.42105263, -2.89473684,
       -2.36842105, -1.84210526, -1.31578947, -0.78947368, -0.26315789,
        0.26315789,  0.78947368,  1.31578947,  1.84210526,  2.36842105,
        2.89473684,  3.42105263,  3.94736842,  4.47368421,  5.        ])

In [86]:

import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # for plot styling

In [87]:

plt.plot(bins, counts, drawstyle='steps');
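As a sanity check, the manual counts can be compared against `np.histogram`. This sketch repeats the setup from the cells above so it runs on its own, and it assumes no sample lands exactly on a bin edge or outside [-5, 5], which holds for this seeded data.

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(100)
bins = np.linspace(-5, 5, 20)

# Manual histogram, as in the cells above:
# counts[k] holds the values in (bins[k-1], bins[k]].
counts = np.zeros_like(bins)
np.add.at(counts, np.searchsorted(bins, x), 1)

# np.histogram returns one count per interval between consecutive edges,
# so it has one entry fewer than `bins`.
hist, edges = np.histogram(x, bins)

# The two agree once the unused first slot of `counts` is dropped.
matches = np.array_equal(counts[1:], hist)

print(matches, hist.sum())
```

For everyday use `np.histogram` is the simpler choice; the manual version is mainly a demonstration of `searchsorted()` and `at()`.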

Sorting and Big-O¶

In [88]:

def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmin(x[i:])
        (x[i], x[swap]) = (x[swap], x[i])
    return x

In [104]:

x = np.array([2, 1, 4, 3, 5])
selection_sort(x)

Out[104]:

array([1, 2, 3, 4, 5])

$N$ loop iterations: for i in range(len(x)):
up to $N$ comparisons in each iteration: np.argmin()
So this sort function is slow: $\mathcal{O}[N^2]$

By default, NumPy's np.sort() function uses an $\mathcal{O}[N\log N]$ quicksort algorithm:

In [108]:

x = np.array([2, 1, 4, 3, 5])
np.sort(x)

Out[108]:

array([1, 2, 3, 4, 5])

In [109]:

x #x is unaffected

Out[109]:

array([2, 1, 4, 3, 5])

In [110]:

# sort in-place using .sort() method:
x.sort()

In [111]:

x

Out[111]:

array([1, 2, 3, 4, 5])

In [115]:

x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
i

Out[115]:

array([1, 0, 3, 2, 4], dtype=int64)

In [118]:

x[i]

Out[118]:

array([1, 2, 3, 4, 5])

In [119]:

rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)

[[6 3 7 4 6 9]
 [2 6 7 4 3 7]
 [7 2 5 4 1 7]
 [5 1 4 0 9 5]]

In [120]:

np.sort(X, axis=0) # sort each column

Out[120]:

array([[2, 1, 4, 0, 1, 5],
       [5, 2, 5, 4, 3, 7],
       [6, 3, 7, 4, 6, 7],
       [7, 6, 7, 4, 9, 9]])

In [122]:

np.sort(X, axis=1) # sort each row

Out[122]:

array([[3, 4, 6, 6, 7, 9],
       [2, 3, 4, 6, 7, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 5, 9]])

In [123]:

x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)

Out[123]:

array([2, 1, 3, 4, 6, 5, 7])

In [124]:

np.partition(X, 2, axis=1)

Out[124]:

array([[3, 4, 6, 7, 6, 9],
       [2, 3, 4, 7, 6, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 9, 5]])

The result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots.
Finally, just as there is a np.argsort that computes indices of the sort, there is a np.argpartition that computes indices of the partition:

In [126]:

np.argpartition(X, 2)

Out[126]:

array([[1, 3, 0, 2, 4, 5],
       [0, 4, 3, 2, 1, 5],
       [4, 1, 3, 2, 0, 5],
       [3, 1, 2, 0, 4, 5]], dtype=int64)
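A typical use of `np.argpartition` is to pull out the k smallest values without sorting the whole array. A minimal sketch, reusing the 1-D example values from above; `k`, `idx`, and `smallest` are names made up here.

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

# Indices of the k smallest values, in arbitrary order.
idx = np.argpartition(x, k)[:k]

# Sort only those k values if their order matters.
smallest = np.sort(x[idx])

print(smallest)
```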

Using `np.newaxis` to promote array to a higher dimension¶

In [163]:

a = np.arange(4)
a

Out[163]:

array([0, 1, 2, 3])

In [164]:

a.shape

Out[164]:

(4,)

In [168]:

row_vec = a[np.newaxis, :]
row_vec.shape

Out[168]:

(1, 4)

In [170]:

col_vec = a[:, np.newaxis]
col_vec.shape

Out[170]:

(4, 1)
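Two related facts, sketched below: indexing with `np.newaxis` gives the same result as an explicit `reshape`, and `np.newaxis` is simply an alias for `None`. Combined with broadcasting, a column times a row yields a full table, which is the pattern the kNN example relies on.

```python
import numpy as np

a = np.arange(4)

col_vec = a[:, np.newaxis]
same_as_reshape = np.array_equal(col_vec, a.reshape(4, 1))
same_as_none = np.newaxis is None   # np.newaxis is an alias for None

# Shapes (4, 1) and (1, 4) broadcast to (4, 4): a multiplication table.
table = a[:, np.newaxis] * a[np.newaxis, :]

print(same_as_reshape, same_as_none, table.shape)
```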

Example: kNN¶

In [262]:

X = rand.rand(200, 2)

In [263]:

plt.scatter(X[:, 0], X[:, 1], s=50);

Given $(x_1, y_1)$, $(x_2, y_2)$, ... $(x_n, y_n)$, first compute all pairwise coordinate differences:
$x_1 - x_1$
$x_1 - x_2$
...
$x_n - x_n$

and

$y_1 - y_1$
$y_1 - y_2$
...
$y_n - y_n$

In [264]:

differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape

Out[264]:

(200, 200, 2)

Then square them to get $(x_1 - x_i)^2$ and $(y_1 - y_i)^2$:

In [265]:

sq_differences = differences ** 2
sq_differences.shape

Out[265]:

(200, 200, 2)

Then add the squared coordinate differences, which gives the squared distance $(x_1 - x_i)^2 + (y_1 - y_i)^2$:

In [266]:

dist_sq = sq_differences.sum(-1)
dist_sq.shape

Out[266]:

(200, 200)

In [267]:

dist_sq.diagonal()

Out[267]:

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [268]:

nearest = np.argsort(dist_sq, axis=1)
print(nearest)

[[  0  75  79 ... 155  38  64]
 [  1 194  68 ...  91   3 180]
 [  2 127 141 ...  98 124   9]
 ...
 [197  99 119 ... 194 122  41]
 [198 178 121 ... 122  64  41]
 [199  51   2 ...  98 124   9]]

In [269]:

K = 3
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)

In [270]:

plt.scatter(X[:, 0], X[:, 1], s=50)

# draw lines from each point to its K nearest neighbors
# (the first K + 1 entries also include the point itself, at distance 0)
K = 3

for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color='black', lw=1)
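The steps above can be collected into one function. This is a sketch, not a tuned implementation: `k_nearest` is a name made up here, it assumes all points are distinct (so each point is its own unique nearest match), and it needs K + 1 to be smaller than the number of points. It builds the full distance matrix, so memory grows with the square of the number of points.

```python
import numpy as np

def k_nearest(X, K):
    """Return an (n, K) array of indices of each point's K nearest neighbors."""
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]   # shape (n, n, d)
    dist_sq = (diff ** 2).sum(-1)                      # shape (n, n)
    # The K + 1 smallest distances per row include the point itself (distance 0).
    part = np.argpartition(dist_sq, K + 1, axis=1)[:, :K + 1]
    # Sort those K + 1 candidates by distance, then drop the first one,
    # which is the point itself.
    order = np.argsort(np.take_along_axis(dist_sq, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)[:, 1:]

rand = np.random.RandomState(42)
X = rand.rand(200, 2)
neighbors = k_nearest(X, 3)
print(neighbors[:5])
```

Sorting only the K + 1 partitioned candidates keeps the advantage of `np.argpartition` over a full `np.argsort` of every row.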

Structured Data¶

In [271]:

data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)

[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]

Here 'U10' translates to "Unicode string of maximum length 10," 'i4' translates to "4-byte (i.e., 32 bit) integer," and 'f8' translates to "8-byte (i.e., 64 bit) float."

In [272]:

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

In [273]:

data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]

In [274]:

data['name']

Out[274]:

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

In [275]:

data[0]

Out[275]:

('Alice', 25, 55.)

In [279]:

data[-1][['name', 'age']]

Out[279]:

('Doug', 19)

In [280]:

data[data['age'] < 30]['name']

Out[280]:

array(['Alice', 'Doug'], dtype='<U10')

The shortened string format codes may seem confusing, but they are built on simple principles. The first (optional) character is < or >, which means "little endian" or "big endian," respectively, and specifies the ordering convention for significant bits. The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below). The last character or characters represents the size of the object in bytes.

Character	Description	Example
`'b'`	Byte	`np.dtype('b')`
`'i'`	Signed integer	`np.dtype('i4') == np.int32`
`'u'`	Unsigned integer	`np.dtype('u1') == np.uint8`
`'f'`	Floating point	`np.dtype('f8') == np.float64`
`'c'`	Complex floating point	`np.dtype('c16') == np.complex128`
`'S'`, `'a'`	String	`np.dtype('S5')`
`'U'`	Unicode string	`np.dtype('U') == np.str_`
`'V'`	Raw data (void)	`np.dtype('V') == np.void`
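A brief sketch of the format codes in use. The same structured dtype can be written as a comma-separated string of codes or as a list of (name, code) tuples, and `itemsize` shows the byte sizes the codes imply.

```python
import numpy as np

# Comma-separated string of format codes; fields are named f0, f1, f2 by default.
compact = np.dtype('U10,i4,f8')

# List of (name, code) tuples, giving the fields explicit names.
named = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# Sizes in bytes: 4-byte int, 8-byte float, 10 Unicode characters at 4 bytes each.
sizes = (np.dtype('i4').itemsize,
         np.dtype('f8').itemsize,
         np.dtype('U10').itemsize)

is_float64 = np.dtype('f8') == np.float64

print(compact.names, named.names, sizes, is_float64)
```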
