2013-01-31

Pegasos算法

本文参考了博文Online Learning in Clojure和论文Pegasos: Primal Estimated sub-GrAdient SOlver for SVM(PDF)

online learning

Online learning的算法结构是非常简单的，下面的描述是监督的online learning算法框架，其中有经验损失函数$L$，样本流$S$，样本的格式为$(x,y)$:

Initialise a starting model w
While there are more examples in S
    Get the next feature vector x
    Predict the label y' for x using the model w
    Get the true label y for x and incur a penaly L(y,y')
    Update the model w if y ≠ y'

一般来是，训练出来的模型都是一个与样本相同维度的向量。对应二分的分类器，往往涉及到的是计算内积$\langle w,x \rangle$，模型的更新是沿着损失函数的梯度下降方向的。

Pegasos

论文Pegasos: Primal Estimated sub-GrAdient SOlver for SVM是一种svm的online learning算法。

首先来看svm的经验合页损失函数：

\[ \begin{array}{l} L(w,S) = \frac{\lambda }{2}{\left\| w \right\|^2} + \frac{1}{k}\sum\limits_{(x,y) \in S} {h(w;(x,y))} \\ h(w;(x,y)) = \max \{ 0,1 - y \langle w,x \rangle \} \end{array} \]

上面式子中，$k$是训练集$S$的大小，$h()$是the hinge loss（合页损失函数），$\langle w, x\rangle$表示$w,x$的内积，$\lambda$是正则化项。

在《统计学习方法》这本书的7.2.4证明了合页损失函数与引入松弛变量后的损失函数是等价的，并证明了$\lambda\$与惩罚系数$C$是成反比的。引入松弛变量后的损失函数为:

\[ \frac{1}{2}\left \| w \right \|^{2} + C\sum_{i=1}^{N}\xi _{i} \]

训练过程中，如果遇到了一个预测错误的样本$(x,y)$, 对模型的更新方法如下：

\[ {w_{t + \frac{1}{2}}} = (1 - \frac{1}{t}){w_t} + \frac{1} { {\lambda t} } yx \]

其中$t$表示已经训练过的样本个数，$ {w_{t + }}$表示训练过$t$个的样本后的模型，${w_{t + }}$ 表示新模型。根据pegasos算法，新模型的$l\_2$范数如果超出了以 $\frac{1}{ {\sqrt \lambda } }$ 为半径的超球面，那么需要将新模型投射到这个超球面上。即：

\[ {w_{t + 1}} = \min \{ 1,\frac{1}{ {\sqrt \lambda \left\| { {w_{t + \frac{1}{2} } } } \right\|}}\} {w_{t + \frac{1}{2}}} \]

为什么需要讲新的模型投射到以$\frac{1}{ {\sqrt \lambda } }$为半径的超球面上呢？论文证明了svm的最优解是在下面这个集合中的：

\[ B = \{ w:\left\| w \right\| \le \frac{1}{ {\sqrt \lambda } }\} \]

而且在pegasos算法的推导，以及模型初始化$w$的时候，都使用了条件

\[ \left\| w \right\| \le \frac{1}{ {\sqrt \lambda } } \]

由上面模型的更新公式可以简单分析一下正则化参数$\lambda$的作用，它决定了训练过程中，后面出现的预测错误的样本，对应模型的修正程度。$\lambda$越大，修正程度越小，$\lambda$越小，修正程度越大。同时$\lambda$与惩罚系数$C$是成反比的，所以也可理解为，在训练过称中，出现预测错误样本时，对模型的惩罚程度。$\lambda$越大，惩罚越小，$\lambda$越小，惩罚越大。

Pegasos的算法描述在论文“Pegasos: Primal Estimated sub-GrAdient SOlver for SVM”也是给出了的，可以参考。

但实际上pegasos是一个线性的svm，而且还是一个没有bias的svm，训练出来的线性函数是$y=\langle w,x \rangle$，在上面的论文中的Extensions小节中也讲到了，目前pegasos还没有证明可应用于线性模型$y=\langle w,x \rangle + b$或者是非线性svm模型。

Pegasos的实现例子

前面的博客基于SVM的手写数字识别，实现了一个基于SMO算法的svm，今天就来基于Pegasos实现数字手写识别。svm用于多分类，还是一对多的方式，手写数据还是来自“Machine Learning in Action”的第二章的数据。下面是实现代码

基于pegasos的数字手写识别view raw

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Pegasos implemented in Python

import os
import sys
import math

G_WEIGHT = []

def parse_image(path):
    img_map = []
    fp = open(path, "r") 
    for line in fp:
        line = line[:-2]
        for ch in line:
            img_map.append(int(ch))
    return img_map

def predict(model, data):
    ret = 0
    for i in range(32*32):
        ret += model[i]*data[i]
    return ret

def train_one_model(data, label, sampleNum, modelNum):
    pvalue = predict(G_WEIGHT[modelNum], data)
    # the hinge loss
    if pvalue * label >= 1:
        return
    
    # update model
    lambd = 0.5
    new_weight = []
    for i in range(32*32):
        # pegasos
        a = G_WEIGHT[modelNum][i] * ( 1 - 1.0/sampleNum) + (1.0 / (lambd * sampleNum))*label*data[i]
        new_weight.append(a)

    # projection
    norm2 = 0
    for i in range(32*32):
        norm2 += math.pow(new_weight[i], 2)
    norm2 = math.sqrt(norm2)
    if norm2 > (1/math.sqrt(lambd)):
        for i in range(32*32): 
            G_WEIGHT[modelNum][i] = new_weight[i]/(norm2 * math.sqrt(lambd)) 
    else:
        G_WEIGHT[modelNum] = new_weight

def train_one_sample(data, num, sampleNum):
    for modelNum in range(10):
        label = -1
        if num == modelNum:
            label = 1
        train_one_model(data, label, sampleNum, modelNum)

if __name__== "__main__":
    for i in range(10):
        G_WEIGHT.append([])
        for j in range(32 * 32):
            G_WEIGHT[i].append(0)

    dirpath = "./trainingDigits/"
    files = os.listdir(dirpath)
    sampleNum = 0
    for file in files:
        print "training:", file
        data = parse_image(dirpath + file)
        num = int(file[0])
        sampleNum += 1
        train_one_sample(data, num, sampleNum)

    # test
    testdir = "./testDigits/"
    files = os.listdir(testdir)  
    right = 0
    wrong = 0
    can_not_classify = 0
    total = 0
    for file in files:
        total += 1
        data = parse_image(testdir + file)
        print "testing:", file
        num = int(file[0])
        classify_failed = True
        for i in range(10):
            pvalue = predict(G_WEIGHT[i], data)
            if pvalue > 0:
                classify_failed = False
                print i, "prdict:", 1
                if i == num:
                    right += 1
                else:
                    wrong += 1
            else:
                print i, "prdict:", -1
        if classify_failed:
            can_not_classify += 1
        
    print "right=", right
    print "wrong=", wrong
    print "can_not_classify=", can_not_classify
    print "total=", total

训练出来的模型测试结果如下：

right= 849
wrong= 46
can_not_classify= 72
total= 946

一共有946个测试样本，其中46个分类错误，72个没有找到分类，849个正确分类，正确分类率89.7%。$\lambda$取值为0.5。我也没有仔细调整$\lambda$的取值，不过看来结果还是慢不错的。但比起SMO算法实现的svm效果要差一些。但是pegasos的优势是快啊，同样的1934个训练样本，基于SMO的svm，花了3、4个小时训练，而pegasos算法只用了30多秒，逆天了。

实现例子的代码和数据可以在github上下载。pegasos有两个版本，pegasos2.py是pegasos.py的升级版，用了numpy库，使得代码更精简好看，同时运行效率更高。这个目录下还包含了论文的pdf文档Pegasos.pdf。

PS：发现numpy和scipy、matplotlib真是好东西啊，python数学运算离不开。另外发现了一个讲numpy/scipy文档翻译为中文的网站用Python做科学计算，好东西啊。

还发现了一个和机器学习相关的网站http://hunch.net/，有很不多不错的学术方面的东西。