Machine Learning Notes: Logistic Regression

logistic

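Despite its name, logistic regression is not a regressor but a binary classifier. Some of the training samples are labeled 1 (positive) and the rest are labeled 0 (negative). From these samples we train a classifier that, given an input feature vector x, outputs its predicted class y for x, and y can only be 0 or 1. First, look at the logistic function: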

$$ g(z) = \frac{1}{1 + e^{-z}} $$

Its curve looks like this:

[Figure: curve of the logistic function]
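To get a feel for the curve without the plot, the logistic function can be evaluated at a few points. This is a minimal NumPy sketch; the helper name `logistic` is chosen here for illustration and does not come from the original post.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Far to the left the value approaches 0, at z = 0 it is exactly 0.5,
# and far to the right it approaches 1.
values = logistic([-5.0, 0.0, 5.0])
print(values)
```

The output always lies strictly between 0 and 1, which is what allows it to be compared against a threshold.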

Here is our logistic regression model:

$$ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} $$

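The model works as follows: when h(x) is greater than or equal to 0.5, x is predicted to belong to class 1; when h(x) is less than 0.5, x is predicted to belong to class 0. In this model x is the input and theta is the parameter vector we have to learn from the training samples. Once theta is known, the model can be used for classification. So how do we obtain theta from the training samples? We need a cost function. Below is the cost function used by the LogisticRegression implementation in the sklearn library: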

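$$ \min_{w, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right) $$

$$ \min_{w, c} \; \|w\|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right) $$

(In the sklearn notation, the weights w and the intercept c play the role of theta, and the labels y_i take values in {-1, +1}.)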

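The first and second expressions use L2 and L1 regularization respectively, to keep the model from overfitting. In each expression the left-hand term is the regularization term and the right-hand term is the penalty for classification errors. The parameter C balances the two: a larger C puts more weight on the classification error and less on regularization. Its value is chosen empirically, and how to choose it in practice is covered later. With the cost function in hand, the goal is to find the parameters theta that minimize it, which can be done with gradient descent.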
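The minimization step can be sketched with plain batch gradient descent on the L2-regularized objective, written in the form sklearn documents: 1/2 w·w + C Σ log(1 + exp(-y_i (x_i·w + c))) with labels recoded to -1/+1. This is only an illustration under those assumptions. The function names, learning rate, iteration count and toy data are all made up here, and sklearn itself uses dedicated solvers rather than this loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, c, X, y_pm, C):
    """L2-regularized logistic cost; y_pm holds labels in {-1, +1}."""
    margins = y_pm * (X @ w + c)
    # logaddexp(0, -m) computes log(1 + exp(-m)) in a numerically stable way.
    return 0.5 * (w @ w) + C * np.sum(np.logaddexp(0.0, -margins))

def fit_gradient_descent(X, y01, C=1.0, lr=0.01, n_iter=2000):
    """Minimize the cost by batch gradient descent; y01 holds labels in {0, 1}."""
    y_pm = 2.0 * y01 - 1.0              # recode 0/1 labels as -1/+1
    w = np.zeros(X.shape[1])
    c = 0.0
    for _ in range(n_iter):
        margins = y_pm * (X @ w + c)
        # Derivative of each sample's loss with respect to its score x_i.w + c.
        coeff = -y_pm * sigmoid(-margins)
        w -= lr * (w + C * (X.T @ coeff))   # regularization gradient + loss gradient
        c -= lr * (C * np.sum(coeff))       # the intercept is not regularized
    return w, c

def predict(w, c, X):
    """Predict class 1 when h(x) >= 0.5, otherwise class 0."""
    return (sigmoid(X @ w + c) >= 0.5).astype(int)

# Hypothetical one-dimensional toy data: negatives on the left, positives on the right.
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])
w_fit, c_fit = fit_gradient_descent(X_toy, y_toy)
print(predict(w_fit, c_fit, X_toy))
```

Swapping the L2 term for the L1 term would need a different update rule, because the absolute value is not differentiable at zero.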

polynomial

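Often the feature dimension of the data is low (for example 2), yet the samples are not linearly separable. In that case we have to add feature dimensions by hand, for example: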

[Figure: polynomial feature mapping]

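The decision boundary is then no longer restricted to a straight line. It can be a curve of many different shapes, and such curves can separate the samples better than a straight line can.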

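from sklearn.preprocessing import PolynomialFeatures

def poly_data(x_data, degree):
    # expand the input columns into all polynomial terms up to `degree`
    poly = PolynomialFeatures(degree)
    return poly.fit_transform(x_data)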
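To make "adding feature dimensions" concrete, the sketch below builds the polynomial terms by hand for a single two-feature sample. It is a hand-written illustration, not the library's code. The helper name `polynomial_terms` is hypothetical, and it is meant to produce the same kind of monomials (bias term, original features, squares and cross terms) that a polynomial expansion generates.

```python
from itertools import combinations_with_replacement
import numpy as np

def polynomial_terms(x_data, degree):
    """Return all monomials of the input columns up to `degree`, bias column first."""
    x_data = np.asarray(x_data, dtype=float)
    n_features = x_data.shape[1]
    columns = []
    for d in range(degree + 1):
        for idx in combinations_with_replacement(range(n_features), d):
            # Product of the chosen columns; the empty product (d == 0) is the bias 1.
            columns.append(np.prod(x_data[:, list(idx)], axis=1))
    return np.column_stack(columns)

sample = np.array([[2.0, 3.0]])            # one sample with x1 = 2, x2 = 3
expanded = polynomial_terms(sample, 2)     # columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(expanded)
print(polynomial_terms(sample, 6).shape)   # number of columns grows quickly with the degree
```

With two input features and degree 6, the expansion has 28 columns, so the classifier has far more parameters to fit and regularization matters more.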

Cross validation

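import numpy as np
from sklearn import linear_model
from sklearn import model_selection as cv  # replaces the removed sklearn.cross_validation module
from sklearn.model_selection import GridSearchCV  # formerly in sklearn.grid_search

def grid_search(x_data, y_data):
    """
    y_data:label
    :return: searched grid
    """
    logisReg = linear_model.LogisticRegression()
    C_range = 10.0 ** np.arange(-4,3,1)
    grid_parame = dict(C=C_range)
    # the splitter no longer takes the labels; GridSearchCV passes them in fit()
    cvk = cv.StratifiedKFold(n_splits=10)
    grid = GridSearchCV(logisReg, param_grid=grid_parame, cv=cvk)
    grid.fit(x_data, y_data)
    return grid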

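The samples are split into 10 parts, and we iterate over every value of C in the search range. For the current C, one part is held out as the test set and the remaining 9 parts form the train set, which gives 10 different splits. The classifier trained on the train set is evaluated on the test set to get a score, so the 10 splits give 10 scores, and their mean is the score of the current C. After all values of C have been tried, the C with the highest score is the one used in our model.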
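The procedure just described can be written out as an explicit loop, which may make it easier to see what the grid search does. This is a sketch under assumptions: the function name `manual_grid_search` and the synthetic two-cluster data are invented for illustration, and the loop is not claimed to reproduce GridSearchCV exactly (tie-breaking and refitting, for instance, are left out).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def manual_grid_search(x_data, y_data, c_values, n_splits=10):
    """Return (best_C, mean_scores) using stratified k-fold cross validation."""
    # Use the same folds for every C so the scores are comparable.
    folds = list(StratifiedKFold(n_splits=n_splits).split(x_data, y_data))
    mean_scores = {}
    for c in c_values:
        fold_scores = []
        for train_idx, test_idx in folds:
            clf = LogisticRegression(C=c)
            clf.fit(x_data[train_idx], y_data[train_idx])
            # Accuracy on the held-out part.
            fold_scores.append(clf.score(x_data[test_idx], y_data[test_idx]))
        mean_scores[c] = float(np.mean(fold_scores))
    best_c = max(mean_scores, key=mean_scores.get)
    return best_c, mean_scores

# Synthetic data for illustration only: two Gaussian clusters, 50 samples each.
rng = np.random.RandomState(0)
x_demo = np.vstack([rng.randn(50, 2) - 2.0, rng.randn(50, 2) + 2.0])
y_demo = np.array([0] * 50 + [1] * 50)
c_grid = 10.0 ** np.arange(-4, 3, 1)

best_c, mean_scores = manual_grid_search(x_demo, y_demo, c_grid)
print(best_c, mean_scores[best_c])
```

Each value of C is scored only on samples the classifier did not see during training, so the chosen C is less likely to reflect overfitting to the training set.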

python

import numpy as np
from sklearn import linear_model
from sklearn import model_selection as cv  # replaces the removed sklearn.cross_validation module
from sklearn.model_selection import GridSearchCV  # formerly in sklearn.grid_search
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

def load_data(filename):
    """
    read data into numpy array
    return a numpy array
    """
    return np.genfromtxt(filename, delimiter=',')

def grid_search(x_data, y_data):
    """
    y_data:label
    :return: searched grid
    """
    logisReg = linear_model.LogisticRegression()
    C_range = 10.0 ** np.arange(-4,3,1)
    grid_parame = dict(C=C_range)
    # the splitter no longer takes the labels; GridSearchCV passes them in fit()
    cvk = cv.StratifiedKFold(n_splits=10)
    grid = GridSearchCV(logisReg, param_grid=grid_parame, cv=cvk)
    grid.fit(x_data, y_data)
    return grid

def poly_data(x_data, degree):
    poly = PolynomialFeatures(degree)
    return poly.fit_transform(x_data)

def main():
    data = load_data('logisticRegData.txt')
    X = data[:, 0:-1]
    y = data[:, -1]
    plt.plot(X[np.nonzero(y==1)[0],0], X[np.nonzero(y==1)[0], 1], 'ro',X[np.nonzero(y==0)[0],0], X[np.nonzero(y==0)[0], 1], 'rx')

    degree = 6
    X_poly = poly_data(X, degree)

    print("with cross validation-----------------------------------------:")
    grid = grid_search(X_poly, y)
    print("the best classifier is :")
    print(grid.best_estimator_)
    #print("logistic regression's coefficient: ")
    #print(grid.best_estimator_.coef_)
    print("score on train data: ")
    print(grid.best_estimator_.score(X_poly, y))
    scores = cv.cross_val_score(grid.best_estimator_, X_poly, y, cv=10)
    print('Estimated score: %0.5f (+/- %0.5f)' % (scores.mean(), scores.std() / 2))

    # .plot decision boundary,create a mesh to plot in
    h = 0.02  #step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    xxyy_poly = poly_data(np.c_[xx.ravel(), yy.ravel()], degree)
    Z = grid.best_estimator_.predict(xxyy_poly)
    Z = Z.reshape(xx.shape)
    plt.contourf(xx,yy,Z,cmap=plt.cm.Paired)
    plt.show()

'''
    print("without cross validation------------------------------------------:")
    logisReg = linear_model.LogisticRegression()
    logisReg.fit(X_poly, y)
    print("logistic regression's coefficient: ")
    print(logisReg.coef_)
    print("score on train data: ")
    print(logisReg.score(X_poly, y))
    scores = cv.cross_val_score(logisReg, X_poly, y, cv=10)
    print('Estimated score: %0.5f (+/- %0.5f)' % (scores.mean(), scores.std() / 2))
'''


if __name__ == "__main__":
    main()

Result:

[Figure: result plot]

The data is here: logisticRegression.txt (note that the script above loads a file named logisticRegData.txt)
