import numpy as np
The numpy.random module supplements(补充) the built-in Python random with functions for efficiently generating whole arrays of sample values from many kinds of probalility distributions. For example, you can get a 4x4 array of samples from the standard normal distribution using normal:
samples = np.random.normal(size=(4,4))
samples
array([[-0.49777854, 1.01894039, 0.3542692 , 1.0187122 ],
[-0.07139068, -0.44245259, -2.05535526, 0.49974435],
[ 0.80183078, -0.11299759, 1.22759314, 1.37571884],
[ 0.32086762, -0.79930024, -0.31965109, 0.23004107]])
Python‘s built-in random module, by contrast(对比), only samples one value at a time. As you can see from this benchmark, numpy.random is well over an order of magnitude faster for generating very large samples:
from random import normalvariate
n = 100000
'python 运行时间:'
%timeit samples = [normalvariate(0,1) for _ in range(n)]
'python 运行时间:'
127 ms ± 7.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
'np.random 运行时间:'
%timeit np.random.normal(size=n)
'np.random 运行时间:'
4.2 ms ± 277 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%time
# 我后来还是喜欢 %%time 这样的计时方式
from random import normalvariate
n = 1000000
'Python run time:'
samples = [normalvariate(0,1) for _ in range(n)]
Wall time: 1.08 s
We say that these are pseudorandom numbers(伪随机数) because they are generated by an algorithim with deterministic behavior(确定行为的算法生成的) You can change NumPy‘s random number generation seed number generation seed using np.random.seed:
"cj还是不理解seed(), 是让算法不改变吗? 每次都是同一个?"
np.random.seed(1234)
"现在理解了, seed(n),叫做随机种子, n表示在接下来的n次(抽样)的结果是一样的"
'cj还是不理解seed(), 是让算法不改变吗? 每次都是同一个?'
'现在理解了, seed(n),叫做随机种子, n表示在接下来的n次(抽样)的结果是一样的'
The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator islolated(单独的) from others.
rng = np.random.RandomState(1234)
rng.randn(10)
array([ 0.47143516, -1.19097569, 1.43270697, -0.3126519 , -0.72058873,
0.88716294, 0.85958841, -0.6365235 , 0.01569637, -2.24268495])
See Table 4-8 for a partial list of functions available in numpy.random. I wil give some examples of leveraging(利用) these function‘s ablility to generate large arrays of samples all at onece in the next section.
"permutaion(x), 产生0-x范围内x个随机自然数的一个排列"
np.random.permutation(6)
"shuffle(seq) 将一个序列随机打乱, in-place 哦 "
arr1 = np.arange(6)
np.random.shuffle(arr1)
arr1
'permutaion(x), 产生0-x范围内x个随机自然数的一个排列'
array([1, 2, 5, 0, 3, 4])
'shuffle(seq) 将一个序列随机打乱, in-place 哦 '
array([5, 2, 3, 0, 4, 1])
"rand(shape) 0-1的均匀分布哦"
"shape=(2x2x3)-> 2里的每个1是3,每个1里面是1 "
"[[], []]"
"[ [ [6],[6],[6] ], [ [6],[6],[6] ] ]"
np.random.rand(2,1,1)
"uniform()"
np.random.uniform(3,4,5)
'rand(shape) 0-1的均匀分布哦'
'shape=(2x2x3)-> 2里的每个1是3,每个1里面是1 '
'[[], []]'
'[ [ [6],[6],[6] ], [ [6],[6],[6] ] ]'
array([[[0.06152103]],
[[0.8725525 ]]])
'uniform()'
array([3.35219682, 3.62783057, 3.51758469, 3.37434633, 3.64026243])
"randn(shape), 标准正态分布"
np.random.randn(1,2,3)
"满足normal(loc=0, scale=1, size=None) 均值80, 标准差10"
np.random.normal(80, 20, 6)
'randn(shape), 标准正态分布'
array([[[0.49112636, 0.90638754, 0.05000051],
[1.21431522, 0.67847748, 1.3797269 ]]])
'满足normal(loc=0, scale=1, size=None) 均值80, 标准差10'
array([55.07130243, 56.34397557, 68.95608996, 31.40875572, 89.80741058,
37.38567435])
"binorma(n, p, size), n次试验, p次成功概率, 5个样本量"
np.random.binomial(100, 0.4, 5)
"chisquare(10,2) 服从于自由度为10的卡方分布下的2个样本"
np.random.chisquare(10, 2)
'binorma(n, p, size), n次试验, p次成功概率, 样本量'
array([47, 42, 41, 34, 44])
array([20.07301382, 14.54581473])
The simulation of random walks(随机漫步) provides an illustrative(实例) of utilizing(使用) array operations. Let‘s first consider a simple random walk starting at 0 with steps of 1 and -1 occuring(出现) with equal probalility. (1,-1 等概率出现)
Here is a pure Python way to implement a single random walk with 1000 steps using the built-in random module.
import random
import matplotlib.pyplot as plt
def random_walk(position=0, steps=1000, walk=[]):
"""
position=0 # 初始位置
walk = [] # 结果列表
steps = 1000
"""
for i in range(steps):
state = 1 if random.randint(0,1) else -1
position += state
# 将每次位置存入结果列表中
walk.append(position)
return walk
# test
walk_result = random_walk()
"plot the first 100 values on one of these random walks:"
"默认折线图"
plt.plot(walk_result[:100])
'plot the first 100 values on one of these random walks:'
'默认折线图'
[<matplotlib.lines.Line2D at 0x19329c95208>]
You might make the observation that walk is simply the
cumulative(累积的) sum of the random steps and could be evaluate as an array expression, Thus, I use the np.random module to draw 1000 coin flips at once, set these to 1 and -1, and compute the cumulative sum:
nsteps = 1000
"用size代替for循环, 面向数组编程哦"
draws = np.random.randint(0, 2, size=nsteps)
"np.where 三元很厉害, 代替if-else"
steps = np.where(draws > 0, 1, -1)
'cumsum 累积求和'
walk = steps.cumsum()
plt.plot(walk)
'用size代替for循环, 面向数组编程哦'
'np.where 三元很厉害, 代替if-else'
'cumsum 累积求和'
[<matplotlib.lines.Line2D at 0x1932a032278>]
From this we can begin to extract(提取) statistics like the minmun and maxmun value along the walks trajectory(轨迹)
"walk的极值"
walk.min()
walk.max()
type(walk)
'walk的极值'
-17
15
numpy.ndarray
# cj test cumsum()
cj_arr = np.array([[1,2,3],[4,5,6]])
cj_arr
"cumsum()累积求和, 返回的是narray 可看到累积的过程哦"
np.cumsum(cj_arr, axis=0) # 往下, 显示没列的和
np.cumsum(cj_arr, axis=1) # 往右, 显示每行的和
array([[1, 2, 3],
[4, 5, 6]])
'cumsum()累积求和, 返回的是narray 可看到累积的过程哦'
array([[1, 2, 3],
[5, 7, 9]], dtype=int32)
array([[ 1, 3, 6],
[ 4, 9, 15]], dtype=int32)
A more complicated(更复杂的) statistic is the first crossing time, the step at which the random walk reaches a particular value. Here we might want to know how long it took the random walk to get at least 10 steps away from the orgin 0 in either direction. (距离原点为10, 需要多少次) np.abs(walk) >= 10 gives us a boolean array indicating(指明) where(是否) the walk has reached or exceeded 10, but we want the index of the first 10 or -10, Turn out(结果是), we can compute this using argmax, which returns the first index of maximum value in the boolean array(True is the maximum): -> argmax()返回数组中, 最大值的第一个索引, 配合maximum使用
"查询累积数组中, 绝对距离大于10的,的第一个值的索引 "
(np.abs(walk) >= 10).argmax()
'查询累积数组中, 绝对距离大于10的,的第一个值的索引 '
49
Note that(注意) using argmax here is not always efficient because it always makes a full scan of the array. -> 用argmax()来找最大值索引, 效率是不高的,因为需要遍历整个数组. In this special case, once a True is observed we know it to be the maximum value.
# cj test
np.max([1,3,2,5,7])
"maximum返回一个数组, 逐个比较传入数组的值,和第二个参数比较, quit其大,替换"
np.maximum([10,1,3,2,5,7],6)
7
'maximum返回一个数组, 逐个比较传入数组的值,和第二个参数比较, quit其大,替换'
array([10, 6, 6, 6, 6, 7])
"待补充中..."
np.maximum(np.abs(walk), 10).argmax()
530
# cj_arr1 = np.array([10,6,6,6,6,7]).argmax()
# cj_arr1
0
If your goal to simulate many random walks, say 5000 of them, you can generate all of the random walks with minor modiffcations(微小的修改) to the preceding code(之前的代码). If passed a 2-tuple, the numpy.random functions will generate a two-dimensional array of draws, and we can compute the cumulative sum across the rows to compute all 5000 random walks in one shot:
# cj test
np.random.randint(0,2, size=(3,4))
array([[0, 0, 0, 1],
[0, 1, 0, 1],
[0, 0, 1, 1]])
"一次生成500次行走记录, 每次走1000不的大数据集5000 x 1000"
nwalks = 5000
nsteps = 1000
# 0 or 1
%time draws = np.random.randint(0, 2, size=(nwalks, nsteps))
"5000000万次, 这耗时也太短了,厉害了, 1秒=10^3ms"
" >0, 1, <0, -1 "
steps = np.where(draws > 0, 1, -1)
"轴1, 列方向, 右边, 按照每行累积"
walks = steps.cumsum(axis=1)
walks
'一次生成500次行走记录, 每次走1000不的大数据集5000 x 1000'
Wall time: 58 ms
'5000000万次, 这耗时也太短了,厉害了'
' >0, 1, <0, -1 '
'轴1, 列方向, 右边, 按照每行累积'
array([[ -1, 0, -1, ..., 22, 21, 22],
[ -1, -2, -3, ..., 52, 53, 54],
[ 1, 0, -1, ..., -20, -19, -20],
...,
[ -1, -2, -3, ..., -48, -47, -48],
[ 1, 2, 1, ..., -8, -7, -8],
[ -1, -2, -1, ..., -10, -11, -10]], dtype=int32)
Now, we can compute the maximun and minimum values obtained(获得) over all of the walks: ->获取最大值, 最小值
"整个数组的"
'最大值', walks.max()
'最小值', walks.min()
'整个数组的'
('最大值', 112)
('最小值', -109)
Out of these walks, let‘s compute the minimum crossing time to 30 or -30. This is slightly tricky(稍微有些棘手) because not all 5000 of them reach 30. We can check this using the any method:
hist30 = (np.abs(walks) >= 30).any(axis=1) # 按照行
hist30
'统计 number that hit 30 or -30, 有多少行'
hist30.sum()
array([ True, True, True, ..., True, True, True])
'统计 number that hit 30 or -30, 有多少行'
3441
We can use this boolean array to select out the rows of walks that actually cross the absolute 30 level an d call argmax across axis=1 to get the crossing times:
'cj待理解'
crossing_times = (np.abs(walks[hist30]) >= 30).argmax(1)
crossing_times.mean()
'cj待理解'
497.102877070619
Feel free to experiment(积极地尝试) with other distributions for the steps other than equal-sized coin flips(硬币试验). You need only use a different random number generation function, like normal to generate normally distribute steps with some mean and standard deviation(标准差)
import sys
steps = np.random.normal(loc=0, scale=0.25, size=(5000, 1000))
"查看这个对象占多少内存"
"{}的{}占用{}字节".format(steps.shape, steps.dtype, sys.getsizeof(steps))
'查看这个对象占多少内存'
'(5000, 1000)的float64占用40000112字节'
While(当然) much of the rest of the book will focus on building data wrangling(数据整理) skills with pandas, we will continue to work in a similar array-based style. In Appendix A, we will dig deeper(深入挖掘) into NumPy features to help you further develop your array computing skills.
原文:https://www.cnblogs.com/chenjieyouge/p/11863557.html