Machine Learning: Hidden Markov Models (HMM), Theory and a Python Implementation

August 8, 2020

This article introduces the theory of the hidden Markov model (HMM) and its implementation in Python, covering basic usage, practical techniques, and the underlying principles.

HMM

The hidden Markov model (HMM) is a statistical model applicable to tagging problems; it is a generative model.

This chapter follows Dr. Hang Li's *Statistical Learning Methods*,
and adds derivations for some of the results that are only stated there.

1. Starting from a Natural Language Processing Example

Suppose we have three sentences:
Sentence 1: I/noun see/verb cat/noun
Sentence 2: cat/noun is/verb cute/adjective
Sentence 3: I/noun is/verb cute/adjective
Usually we can only observe the concrete words, so "I see cat ..." is the observation sequence, while the parts of speech "noun verb adjective ..." form the state sequence.

$Q$ is the set of all possible states and $V$ is the set of all possible observations:

$$Q = \{q_1, q_2, ..., q_N\}, \quad V=\{v_1, v_2, ..., v_M\}$$

where $N$ is the number of possible states and $M$ is the number of possible observations.

For example: $Q=\{\text{noun}, \text{verb}, \text{adjective}\}$, $V=\{\text{I}, \text{see}, \text{cat}, \text{is}, \text{cute}\}$, so $N=3$ and $M=5$.

$I$ is a state sequence of length $T$, and $O$ is the corresponding observation sequence:

$$I = \{i_1, i_2, ..., i_T\}, \quad O=\{o_1, o_2, ..., o_T\}$$

For example: $I=(\text{noun}, \text{verb}, \text{noun})$, $O=(\text{I}, \text{see}, \text{cat})$.

$A$ is the state transition matrix:

$$A=[a_{ij}]_{N \times N} \tag{1}$$

where

$$a_{ij} = p(i_{t+1}=q_j \mid i_t=q_i), \quad i=1,2,...,N; \ j=1,2,...,N \tag{2}$$

For example:

Transition probability | noun | verb | adjective
noun | 0 | 1 | 0
verb | 1/3 | 0 | 2/3
adjective | 1/3 | 1/3 | 1/3

$B$ is the observation probability matrix, also called the emission matrix:

$$B=[b_j(k)]_{N \times M} \tag{3}$$

where

$$b_j(k) = p(o_t=v_k \mid i_t=q_j), \quad k=1,2,...,M; \ j=1,2,...,N \tag{4}$$

For example (frequencies counted from the three tagged sentences above):

Emission probability | I | see | cat | is | cute
noun | 1/2 | 0 | 1/2 | 0 | 0
verb | 0 | 1/3 | 0 | 2/3 | 0
adjective | 0 | 0 | 0 | 0 | 1

$\pi$ is the initial state probability vector:

$$\pi = (\pi_i) \tag{5}$$

where

$$\pi_i = p(i_1 = q_i), \quad i = 1,2,...,N \tag{6}$$

$A$, $B$, and $\pi$ are the parameters of the HMM, written collectively as $\lambda$:

$$\lambda = (A,B,\pi) \tag{7}$$

For example:

noun | verb | adjective
1 | 0 | 0
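The example parameters above can be written down directly as plain Python lists (the index order noun, verb, adjective for states and I, see, cat, is, cute for observations is a convention chosen here for illustration):

```python
# Example HMM parameters lambda = (A, B, pi) from the tables above.
# States: 0=noun, 1=verb, 2=adjective
# Observations: 0="I", 1="see", 2="cat", 3="is", 4="cute"
A = [[0.0, 1.0, 0.0],            # transition matrix, A[i][j] = p(q_j | q_i)
     [1/3, 0.0, 2/3],
     [1/3, 1/3, 1/3]]
B = [[0.5, 0.0, 0.5, 0.0, 0.0],  # emission matrix, B[j][k] = p(v_k | q_j)
     [0.0, 1/3, 0.0, 2/3, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0]]
pi = [1.0, 0.0, 0.0]             # initial state distribution

# Every row is a probability distribution and must sum to 1.
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
```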

The three basic problems of hidden Markov models:
1. Probability computation. Given the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,...,o_T)$, compute the probability of the observation sequence under the model, $p(O|\lambda)$.
2. Learning. Given an observation sequence $O=(o_1,o_2,...,o_T)$, estimate the model parameters $\lambda=(A,B,\pi)$ that maximize $p(O|\lambda)$.
3. Prediction, also called decoding. Given the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,...,o_T)$, find the state sequence $I=(i_1,i_2,...,i_T)$ that maximizes the conditional probability $p(I|O)$.

2. The Probability Computation Problem

Computing $p(O|\lambda)$ directly means enumerating all state sequences, which costs $O(TN^T)$; the forward and backward algorithms use dynamic programming to bring this down to $O(N^2T)$.
For notational convenience, write:

$$o_{1:t} = (o_1,o_2,...,o_t), \quad o_{t:T}=(o_t,o_{t+1},...,o_T)$$

2.1 The Forward Algorithm

We solve for the forward probability $p(i_t,o_{1:t}|\lambda)$:

$$\begin{aligned} p(i_t,o_{1:t}|\lambda) &=\sum_{i_{t-1}} p(i_{t-1},i_t,o_{1:t-1},o_t|\lambda) \\ &=\sum_{i_{t-1}} p(o_t|i_{t-1},i_t,o_{1:t-1},\lambda)\, p(i_t|i_{t-1},o_{1:t-1},\lambda)\, p(i_{t-1},o_{1:t-1}|\lambda) \end{aligned}$$

By the conditional independence assumptions of the HMM:

$$p(o_t|i_{t-1},i_t,o_{1:t-1},\lambda) = p(o_t|i_t,\lambda)$$

$$p(i_t|i_{t-1},o_{1:t-1},\lambda)=p(i_t|i_{t-1},\lambda)$$

so

$$p(i_t,o_{1:t}|\lambda)=\sum_{i_{t-1}} p(o_t|i_t,\lambda)\, p(i_t|i_{t-1},\lambda)\, p(i_{t-1},o_{1:t-1}|\lambda)=\Big[\sum_{i_{t-1}} p(i_{t-1},o_{1:t-1}|\lambda)\, p(i_t|i_{t-1},\lambda)\Big] p(o_t|i_t,\lambda)$$

Define:

$$\alpha_{t+1}(i) = p(o_{1:t+1},i_{t+1}=q_i|\lambda) \tag{8}$$

and note that

$$p(i_{t+1}=q_i|i_t=q_j,\lambda) = a_{ji}$$

$$p(o_{t+1}|i_{t+1}=q_i,\lambda)=b_i(o_{t+1})$$

Then:

$$\alpha_{t+1}(i)=\Big[\sum_{j=1}^N \alpha_t(j)a_{ji}\Big]b_i(o_{t+1}) \tag{9}$$

so the forward probabilities can be computed iteratively.

The forward algorithm:
1. Initialization:

$$\alpha_1(i) = \pi_i b_i(o_1)$$

2. Recursion, for $t=1,2,...,T-1$:

$$\alpha_{t+1}(i)=\Big[\sum_{j=1}^N \alpha_t(j)a_{ji}\Big]b_i(o_{t+1})$$

3. Termination:

$$p(O|\lambda) = \sum_{i=1}^N \alpha_T(i)$$
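The three steps above translate almost line for line into Python. Below is a minimal sketch using plain lists, applied to the example model from Section 1 (the index conventions are assumptions made for illustration):

```python
def forward(A, B, pi, obs):
    """Forward algorithm: returns (p(O|lambda), list of alpha_t vectors)."""
    N = len(pi)
    # 1. Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    alphas = [alpha]
    # 2. Recursion: alpha_{t+1}(i) = [sum_j alpha_t(j) a_ji] * b_i(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][o]
                 for i in range(N)]
        alphas.append(alpha)
    # 3. Termination: p(O|lambda) = sum_i alpha_T(i)
    return sum(alpha), alphas

# The example model from Section 1 (states: noun, verb, adjective;
# observations: I, see, cat, is, cute).
A = [[0.0, 1.0, 0.0], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.0, 0.5, 0.0, 0.0], [0.0, 1/3, 0.0, 2/3, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0]]
pi = [1.0, 0.0, 0.0]

prob, alphas = forward(A, B, pi, [0, 1, 2])   # O = (I, see, cat)
print(prob)  # p(O|lambda) = 1/36, about 0.0278
```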

2.2 The Backward Algorithm

The backward algorithm computes the backward probability $p(o_{t+1:T}|i_t,\lambda)$:

$$\begin{aligned} p(o_{t+1:T}|i_t, \lambda) &= \sum_{i_{t+1}} p(i_{t+1},o_{t+1},o_{t+2:T} | i_t, \lambda) \\ &= \sum_{i_{t+1}} p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda)\, p(o_{t+1}|i_{t+1}, i_t, \lambda)\, p(i_{t+1}|i_t,\lambda) \end{aligned}$$

By the conditional independence assumptions of the HMM:

$$p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda)=p(o_{t+2:T}|i_{t+1}, \lambda)$$

$$p(o_{t+1}|i_{t+1}, i_t, \lambda) = p(o_{t+1}|i_{t+1}, \lambda)$$

Define:

$$\beta_t(i) = p(o_{t+1:T}|i_t=q_i, \lambda) \tag{10}$$

and note that

$$p(i_{t+1}=q_j|i_t=q_i,\lambda) = a_{ij}$$

$$p(o_{t+1}|i_{t+1}=q_j, \lambda) = b_j(o_{t+1})$$

Then:

$$\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j) \tag{11}$$

The backward algorithm:
(1) Initialization:

$$\beta_T(i) = 1$$

(2) Recursion, for $t=T-1,T-2,...,1$:

$$\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)$$

(3) Termination:

$$p(O|\lambda) = \sum_{i=1}^N \pi_i b_i(o_1) \beta_1(i)$$
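The backward recursion can be sketched in the same style; on the example model it must of course give the same $p(O|\lambda)$ as the forward algorithm:

```python
def backward(A, B, pi, obs):
    """Backward algorithm: returns (p(O|lambda), list of beta_t vectors)."""
    N = len(pi)
    T = len(obs)
    # (1) Initialization: beta_T(i) = 1
    beta = [1.0] * N
    betas = [beta]
    # (2) Recursion for t = T-1, ..., 1:
    #     beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta = [sum(A[i][j] * B[j][obs[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
        betas.insert(0, beta)
    # (3) Termination: p(O|lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    prob = sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))
    return prob, betas

A = [[0.0, 1.0, 0.0], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.0, 0.5, 0.0, 0.0], [0.0, 1/3, 0.0, 2/3, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0]]
pi = [1.0, 0.0, 0.0]

prob, betas = backward(A, B, pi, [0, 1, 2])
print(prob)  # same 1/36 as the forward algorithm
```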

2.3 Some Probabilities and Expectations

Both quantities below are intermediate values used later by the EM algorithm.
1. The probability of being in state $q_i$ at time $t$.
The probability computation problem computes $p(O|\lambda)$, and:

$$p(O|\lambda)=\sum_{i_t}p(O,i_t|\lambda)$$

By the independence assumptions of the HMM:

$$p(o_{t+1:T}|i_t,o_{1:t}, \lambda) = p(o_{t+1:T}|i_t, \lambda)$$

Therefore:

$$\begin{aligned} p(O|\lambda) &=\sum_{i_t}p(O,i_t|\lambda) \\ &=\sum_{i_t} p(o_{t+1:T}|i_t,o_{1:t}, \lambda)\, p(i_t,o_{1:t}|\lambda) \\ &=\sum_{i_t} p(o_{t+1:T}|i_t, \lambda)\, p(i_t,o_{1:t}|\lambda) \end{aligned}$$

Recalling:

$$\alpha_t(i) = p(o_{1:t},i_t=q_i|\lambda) \tag{12}$$

$$\beta_t(i) = p(o_{t+1:T}|i_t=q_i, \lambda) \tag{13}$$

we have:

$$p(O,i_t=q_i|\lambda) = p(o_{t+1:T}|i_t=q_i, \lambda)\, p(i_t=q_i,o_{1:t}|\lambda) = \alpha_t(i) \beta_t(i)$$

$$p(O|\lambda) = \sum_{i=1}^N \alpha_t(i) \beta_t(i)$$

Define:

$$\gamma_t(i) = p(i_t=q_i|O,\lambda)$$

which gives:

$$\gamma_t(i) = p(i_t=q_i|O,\lambda) = \frac {p(i_t=q_i,O|\lambda)}{p(O|\lambda)} = \frac {\alpha_t(i) \beta_t(i)}{\sum_{j=1}^N \alpha_t(j) \beta_t(j)} \tag{14}$$

2. The probability of being in state $q_i$ at time $t$ and in state $q_j$ at time $t+1$.

$$\begin{aligned} p(O|\lambda) &=\sum_{i_t} \sum_{i_{t+1}} p(O,i_t, i_{t+1}|\lambda) \\ &=\sum_{i_t} \sum_{i_{t+1}} p(o_{1:t},o_{t+1},o_{t+2:T},i_t, i_{t+1}|\lambda) \\ &=\sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}|o_{1:t},o_{t+1},i_t, i_{t+1},\lambda)\, p(o_{t+1}|o_{1:t},i_t,i_{t+1},\lambda)\, p(i_{t+1}|i_t,o_{1:t},\lambda)\, p(i_t,o_{1:t}|\lambda) \end{aligned}$$

By the independence assumptions of the HMM:

$$p(O|\lambda) = \sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}| i_{t+1},\lambda)\, p(o_{t+1}|i_{t+1},\lambda)\, p(i_{t+1}|i_t,\lambda)\, p(i_t,o_{1:t}|\lambda)$$

Define:

$$\xi_t(i,j)=p(i_t=q_i,i_{t+1}=q_j|O,\lambda)$$

Combining with equations (2), (4), (12), and (13), we obtain:

$$\xi_t(i,j) = \frac {p(i_t=q_i,i_{t+1}=q_j,O|\lambda)}{p(O|\lambda)} =\frac {\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)} {\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)} \tag{15}$$
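Equations (14) and (15) can be computed directly from the $\alpha$ and $\beta$ vectors. A minimal sketch (the helper re-derives both passes so the block is self-contained; index conventions as before):

```python
def hmm_forward_backward(A, B, pi, obs):
    """Return the lists of alpha_t and beta_t vectors for one sequence."""
    N, T = len(pi), len(obs)
    alphas = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        ap = alphas[-1]
        alphas.append([sum(ap[j] * A[j][i] for j in range(N)) * B[i][obs[t]]
                       for i in range(N)])
    betas = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        bn = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * bn[j]
                             for j in range(N)) for i in range(N)])
    return alphas, betas

def gamma_xi(A, B, pi, obs):
    """Equations (14) and (15): posterior state and transition probabilities."""
    N, T = len(pi), len(obs)
    alphas, betas = hmm_forward_backward(A, B, pi, obs)
    # gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)
    gammas = []
    for t in range(T):
        num = [alphas[t][i] * betas[t][i] for i in range(N)]
        z = sum(num)
        gammas.append([x / z for x in num])
    # xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / normalizer
    xis = []
    for t in range(T - 1):
        num = [[alphas[t][i] * A[i][j] * B[j][obs[t + 1]] * betas[t + 1][j]
                for j in range(N)] for i in range(N)]
        z = sum(sum(row) for row in num)
        xis.append([[x / z for x in row] for row in num])
    return gammas, xis

A = [[0.0, 1.0, 0.0], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.0, 0.5, 0.0, 0.0], [0.0, 1/3, 0.0, 2/3, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0]]
pi = [1.0, 0.0, 0.0]

gammas, xis = gamma_xi(A, B, pi, [0, 1, 2])
print(gammas[0])  # [1.0, 0.0, 0.0]: at t=1 the state is certainly "noun"
```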

3. The Learning Problem
3.1 Supervised Learning

If we have samples whose state sequences are already labeled, learning is easy: estimate each entry of the matrices by counting the corresponding frequencies, taking care that each probability distribution sums to one.
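This counting procedure can be sketched on the three tagged sentences from Section 1; the uniform fallback row for a state that never transitions is an assumption (the same convention as the adjective row 1/3, 1/3, 1/3 in the example transition table):

```python
def normalize(row):
    s = sum(row)
    # A state with no counts gets a uniform row so each row still sums to 1.
    return [c / s for c in row] if s else [1 / len(row)] * len(row)

def supervised_estimate(tagged, N, M):
    """Estimate lambda = (A, B, pi) by frequency counts from sequences of
    (observation index, state index) pairs with labeled states."""
    A = [[0] * N for _ in range(N)]
    B = [[0] * M for _ in range(N)]
    pi = [0] * N
    for seq in tagged:
        pi[seq[0][1]] += 1                       # count initial states
        for o, s in seq:
            B[s][o] += 1                         # count emissions
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            A[s1][s2] += 1                       # count transitions
    return ([normalize(r) for r in A], [normalize(r) for r in B],
            normalize(pi))

# The three tagged sentences from Section 1, as (observation, state) pairs.
# Observations: 0="I", 1="see", 2="cat", 3="is", 4="cute";
# states: 0=noun, 1=verb, 2=adjective.
corpus = [
    [(0, 0), (1, 1), (2, 0)],   # I/noun see/verb cat/noun
    [(2, 0), (3, 1), (4, 2)],   # cat/noun is/verb cute/adjective
    [(0, 0), (3, 1), (4, 2)],   # I/noun is/verb cute/adjective
]
A, B, pi = supervised_estimate(corpus, N=3, M=5)
print(A[0])   # [0.0, 1.0, 0.0]: a noun is always followed by a verb
```

Running this reproduces exactly the tables of Section 1.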

3.2 Unsupervised Learning

If no labeled state sequences are available, the Baum-Welch algorithm (an instance of the EM algorithm) can be used.

Given: $S$ observation sequences of length $T$, $\{O_1,O_2,...,O_S\}$.
Goal: learn the parameters $\lambda=(A,B,\pi)$ of the hidden Markov model.

Writing $O$ for the observed data and $I$ for the hidden data, the model can be expressed as:

$$p(O|\lambda) = \sum_I p(O|I,\lambda) p(I|\lambda)$$

E-step:

Since $1/p(O|\overline \lambda)$ is a constant with respect to $\lambda$, it can be dropped:

$$\begin{aligned} Q(\lambda,\overline \lambda) &= E_I[\log p(O,I|\lambda)\mid O, \overline \lambda] \\ &= \sum_I \log p(O,I|\lambda)\, p(I|O,\overline \lambda) \\ &= \sum_I \log p(O,I|\lambda)\, \frac {p(I,O|\overline \lambda)}{p(O| \overline \lambda)} \\ &= \sum_I \log p(O,I|\lambda)\, p(I,O|\overline \lambda) \end{aligned}$$

Expanding the factorization used by the forward algorithm of Section 2.1 gives:

$$p(O,I|\lambda) = \pi_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) \cdots a_{i_{T-1} i_T} b_{i_T}(o_T) = \pi_{i_1} \Big[\prod_{t=1}^{T-1} a_{i_t i_{t+1}}\Big]\Big[\prod_{t=1}^T b_{i_t}(o_t)\Big]$$

Hence:

$$Q(\lambda, \overline \lambda) = \sum_I \log \pi_{i_1}\, p(O, I| \overline \lambda) + \sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_t i_{t+1}}\Big) p(O, I| \overline \lambda) + \sum_I \Big(\sum_{t=1}^T \log b_{i_t}(o_t)\Big) p(O, I| \overline \lambda) \tag{16}$$

A note on the hidden variables:
The hidden variable of a hidden Markov model is the state sequence underlying the observation sequence, so it can be represented with the quantity in equation (14).
The M-step below also uses equation (15); this does not mean there are two hidden variables. $\gamma$ and $\xi$ are simply convenient representations for the derivation and the implementation: in the E-step they carry redundant information about the same hidden state sequence.

M-step:

1. Solving for $\pi_i$.
From equation (16):

$$L(\pi) = \sum_I \log \pi_{i_1}\, p(O, I| \overline \lambda) = \sum_{i=1}^N \log \pi_{i}\, p(O, i_1=i| \overline \lambda)$$

Since $\pi_i$ satisfies the constraint $\sum_{i=1}^N \pi_{i}=1$, we introduce a Lagrange multiplier $\eta$ and write the Lagrangian:

$$\sum_{i=1}^N \log \pi_{i}\, p(O, i_1=i| \overline \lambda) + \eta\Big(\sum_{i=1}^N \pi_{i} - 1\Big)$$

Setting its partial derivative to zero:

$$\frac {\partial} {\partial \pi_i} \Big[\sum_{i=1}^N \log \pi_{i}\, p(O, i_1=i| \overline \lambda) + \eta\Big(\sum_{i=1}^N \pi_{i} - 1\Big)\Big]=0 \tag{17}$$

gives:

$$p(O, i_1=i| \overline \lambda) + \eta \pi_i=0$$

so:

$$\pi_i = \frac {p(O, i_1=i| \overline \lambda)} {-\eta}$$

Substituting back into $\sum_{i=1}^N \pi_{i}=1$:

$$-\eta = \sum_{i=1}^N p(O, i_1=i| \overline \lambda) = p(O|\overline \lambda)$$

Hence, using equation (14):

$$\pi_i = \frac {p(O, i_1=i| \overline \lambda)} {p(O|\overline \lambda)} = \gamma_1(i) \tag{18}$$

2. Solving for $a_{ij}$:

$$L(a_{ij})=\sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_t i_{t+1}}\Big) p(O, I| \overline \lambda) = \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t=i, i_{t+1}=j| \overline \lambda)$$

Applying the constraint $\sum_{j=1}^N a_{ij} = 1$ with a Lagrange multiplier $\eta$ gives the Lagrangian:

$$\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t=i, i_{t+1}=j| \overline \lambda) + \eta\Big(\sum_{j=1}^N a_{ij} - 1\Big)$$

Taking the partial derivative and setting it to zero:

$$\frac {\partial}{\partial a_{ij}} \Big[\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t=i, i_{t+1}=j| \overline \lambda) + \eta\Big(\sum_{j=1}^N a_{ij} - 1\Big)\Big] = 0$$

gives:

$$\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda) + \eta\, a_{ij} = 0$$

so:

$$a_{ij} = \frac {\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda)}{-\eta}$$

Substituting back into $\sum_{j=1}^N a_{ij} = 1$:

$$-\eta = \sum_{j=1}^N \sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda) = \sum_{t=1}^{T-1} p(O, i_t=i| \overline \lambda)$$

Hence:

$$a_{ij} = \frac {\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda)}{\sum_{t=1}^{T-1} p(O, i_t=i| \overline \lambda)} = \frac {\sum_{t=1}^{T-1} p(O, i_t=i, i_{t+1}=j| \overline \lambda) / p(O|\overline \lambda)} {\sum_{t=1}^{T-1} p(O, i_t=i| \overline \lambda) / p(O|\overline \lambda)}$$

Substituting (14) and (15):

$$a_{ij} = \frac {\sum_{t=1}^{T-1} \xi_t(i,j)} {\sum_{t=1}^{T-1} \gamma_t(i)} \tag{19}$$

3. Solving for $b_j(k)$:

$$L(b_j(k)) = \sum_I \Big(\sum_{t=1}^T \log b_{i_t}(o_t)\Big) p(O, I| \overline \lambda) = \sum_{j=1}^N \sum_{t=1}^T \log b_{j}(o_t)\, p(O, i_t=j| \overline \lambda)$$

Applying the constraint $\sum_{k=1}^M b_j(k) = 1$ with a Lagrange multiplier $\eta$ gives the Lagrangian:

$$\sum_{j=1}^N \sum_{t=1}^T \log b_{j}(o_t)\, p(O, i_t=j| \overline \lambda) + \eta\Big(\sum_{k=1}^M b_j(k) - 1\Big)$$

Taking the partial derivative and setting it to zero:

$$\frac {\partial}{\partial b_j(k)} \Big[\sum_{j=1}^N \sum_{t=1}^T \log b_{j}(o_t)\, p(O, i_t=j| \overline \lambda) + \eta\Big(\sum_{k=1}^M b_j(k) - 1\Big)\Big] = 0$$

Only the terms with $o_t=v_k$ have a nonzero derivative with respect to $b_j(k)$, and for those terms $b_j(o_t)=b_j(k)$; writing the indicator as $I(o_t=v_k)$:

$$\sum_{t=1}^T p(O, i_t=j| \overline \lambda)\, I(o_t=v_k) + \eta\, b_{j}(k) = 0$$

so:

$$b_{j}(k) = \frac {\sum_{t=1}^T p(O, i_t=j| \overline \lambda)\, I(o_t=v_k)} {-\eta}$$

Substituting back into $\sum_{k=1}^M b_j(k) = 1$:

$$-\eta = \sum_{k=1}^M \sum_{t=1}^T p(O, i_t=j| \overline \lambda)\, I(o_t=v_k) = \sum_{t=1}^T p(O, i_t=j| \overline \lambda)$$

which gives:

$$b_{j}(k) = \frac {\sum_{t=1}^T p(O, i_t=j| \overline \lambda)\, I(o_t=v_k)} {\sum_{t=1}^T p(O, i_t=j| \overline \lambda)}$$

With equation (14), this becomes:

$$b_{j}(k) = \frac {\sum_{t=1,\, o_t=v_k}^T \gamma_t(j)} {\sum_{t=1}^T \gamma_t(j)} \tag{20}$$

Summary of the EM algorithm:
E-step:

$$\gamma_t(i) = p(i_t=q_i|O,\lambda) = \frac {p(i_t=q_i,O|\lambda)}{p(O|\lambda)} = \frac {\alpha_t(i) \beta_t(i)}{\sum_{j=1}^N \alpha_t(j) \beta_t(j)}$$

$$\xi_t(i,j) = \frac {p(i_t=q_i,i_{t+1}=q_j,O|\lambda)}{p(O|\lambda)} =\frac {\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)} {\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}$$

M-step:

$$\pi_i = \frac {p(O, i_1=i| \overline \lambda)} {p(O|\overline \lambda)} = \gamma_1(i)$$

$$a_{ij} = \frac {\sum_{t=1}^{T-1} \xi_t(i,j)} {\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$b_{j}(k) = \frac {\sum_{t=1,\, o_t=v_k}^T \gamma_t(j)} {\sum_{t=1}^T \gamma_t(j)}$$
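The summary above fits in one function. Below is a didactic sketch of a single Baum-Welch iteration on a single observation sequence; a production implementation would additionally scale $\alpha$ and $\beta$ to avoid numerical underflow and accumulate the statistics over all $S$ sequences:

```python
def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) iteration on a single observation sequence,
    implementing update equations (18), (19) and (20).
    Assumes every state keeps nonzero posterior mass."""
    N, M, T = len(pi), len(B[0]), len(obs)
    # --- E-step: forward and backward passes, then gamma (14) and xi (15) ---
    alphas = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        ap = alphas[-1]
        alphas.append([sum(ap[j] * A[j][i] for j in range(N)) * B[i][obs[t]]
                       for i in range(N)])
    betas = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        bn = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * bn[j]
                             for j in range(N)) for i in range(N)])
    po = sum(alphas[-1])                       # p(O | lambda)
    gam = [[alphas[t][i] * betas[t][i] / po for i in range(N)]
           for t in range(T)]
    xi = [[[alphas[t][i] * A[i][j] * B[j][obs[t + 1]] * betas[t + 1][j] / po
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # --- M-step: equations (18), (19), (20) ---
    new_pi = gam[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gam[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(g[j] for t, g in enumerate(gam) if obs[t] == k) /
              sum(g[j] for g in gam)
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi

# Start from an uninformed uniform guess and run one iteration.
A0 = [[1/3] * 3 for _ in range(3)]
B0 = [[0.2] * 5 for _ in range(3)]
pi0 = [1/3] * 3
A1, B1, pi1 = baum_welch_step(A0, B0, pi0, [0, 1, 2])
# Every updated row is still a probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A1 + B1 + [pi1])
```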

4. The Prediction Problem (Decoding)

We solve it with the Viterbi algorithm.
Given: the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,...,o_T)$.
Find: the state sequence $I=(i_1,i_2,...,i_T)$ that maximizes the conditional probability $p(I|O,\lambda)$.
Since $p(O|\lambda)$ is a fixed value:

$$\max_I p(I|O,\lambda) = \max_I p(I, O|\lambda) / p(O|\lambda)$$

so maximizing $p(I|O,\lambda)$ is the same as maximizing the joint probability $p(I,O|\lambda)$.

Define $\delta_t(i)$ as the maximum probability over all single paths $(i_1,i_2,...,i_{t-1})$ that end in state $i$ at time $t$:

$$\delta_t(i) = \max_{i_1,i_2,...,i_{t-1}} p(i_t=i, i_{1:t-1}, o_{1:t}|\lambda)$$

Deriving the recursion:

$$\begin{aligned} p(i_{t+1}=i, i_{1:t}, o_{1:t+1}| \lambda) &= p(i_{t+1}=i, i_t, i_{1:t-1}, o_{t+1}, o_{1:t}| \lambda) \\ &= p(o_{t+1}|i_{t+1}=i, i_t, o_{1:t},\lambda)\, p(i_{t+1}=i|i_t, i_{1:t-1}, o_{1:t}, \lambda)\, p(i_t, i_{1:t-1}, o_{1:t}|\lambda) \\ &= p(o_{t+1}|i_{t+1}=i,\lambda)\, p(i_{t+1}=i|i_t,\lambda)\, p(i_t, i_{1:t-1}, o_{1:t}|\lambda) \end{aligned}$$

Hence:

$$\delta_{t+1}(i) = \max_{i_1,i_2,...,i_t} p(i_{t+1}=i, i_{1:t}, o_{1:t+1}| \lambda) = \max_{1 \le j \le N} [\delta_t(j) a_{ji}]\, b_i(o_{t+1}) \tag{21}$$

Define $\Psi_t(i)$ as the state at time $t-1$ of the highest-probability single path among all paths $(i_1,i_2,...,i_{t-1},i)$ ending in state $i$ at time $t$:

$$\Psi_t(i) = \arg \max_{1 \le j \le N} [\delta_{t-1}(j) a_{ji}]$$
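The decoding procedure, with $\delta$ as the running maximum and the backpointers used to recover the best path, can be sketched as follows (index conventions as in the earlier example):

```python
def viterbi(A, B, pi, obs):
    """Viterbi decoding: the most probable state sequence and its probability."""
    N, T = len(pi), len(obs)
    # delta_1(i) = pi_i * b_i(o_1); psi stores the argmax backpointers
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    psi = []
    for t in range(1, T):
        new_delta, back = [], []
        for i in range(N):
            # equation (21): maximize over the previous state j
            best_j = max(range(N), key=lambda j: delta[j] * A[j][i])
            back.append(best_j)
            new_delta.append(delta[best_j] * A[best_j][i] * B[i][obs[t]])
        delta, psi = new_delta, psi + [back]
    # backtrack from the best final state
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for back in reversed(psi):
        path.insert(0, back[path[0]])
    return path, max(delta)

A = [[0.0, 1.0, 0.0], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.0, 0.5, 0.0, 0.0], [0.0, 1/3, 0.0, 2/3, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0]]
pi = [1.0, 0.0, 0.0]

path, p = viterbi(A, B, pi, [0, 1, 2])   # O = (I, see, cat)
print(path)  # [0, 1, 0] -> noun, verb, noun
```

On the example this recovers the tagging noun, verb, noun for "I see cat", matching the state sequence $I$ given in Section 1.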