Machine Learning with Python and Scikit-Learn

2015-03-15

Original article: Introduction to Machine Learning with Python and Scikit-Learn

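This is really an introduction to data science, a field that keeps growing in popularity. The most widely used data science tools today are R and Python. Each has its strengths and weaknesses, but Python has recently been pulling ahead of R in many respects (I use both, for the record), largely thanks to Scikit-Learn, a well-documented Python library that implements a large number of machine learning algorithms.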

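Note that this article focuses on machine learning algorithms. Basic exploratory data analysis is usually handled well with the Pandas library, so we skip that step and go straight to getting results. To be concrete, we assume the input is a feature-object matrix stored in a .csv file.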

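Loading the data

First, the data has to be loaded into memory before it can be processed. Scikit-Learn works on NumPy arrays, so we use NumPy to load the *.csv file. The data comes from the UCI Machine Learning Repository.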

import urllib.request
import numpy as np
# URL of the dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file (Python 3; in Python 2 this call was urllib.urlopen)
raw_data = urllib.request.urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the data from the target attribute:
# the first 7 columns are used as features, column 8 is the class label
X = dataset[:,0:7]
Y = dataset[:,8]

All the examples below use this data set, where X is a matrix containing 7 features and Y holds the true class label of each sample.

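Data normalization

Almost every machine learning algorithm based on gradient methods is sensitive to the scale of the data. Before running an algorithm, the data should therefore be normalized or standardized. Normalization rescales values into the range [0, 1], while standardization gives each feature zero mean and unit variance. Scikit-Learn has ready-made functions for this: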

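from sklearn import preprocessing
# normalize the data attributes (each sample is rescaled to unit norm)
normalized_X = preprocessing.normalize(X)
# standardize the data attributes (zero mean, unit variance per feature)
standardized_X = preprocessing.scale(X)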
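To make the difference between these operations concrete, here is a minimal sketch on a made-up 3x2 array (the array `demo` is hypothetical, not the diabetes data). It also uses `MinMaxScaler`, which does not appear in the original text; it is included here because it is the scikit-learn transformer that maps each feature column onto [0, 1].

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples, 2 features on very different scales
demo = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# normalize: rescales each ROW (sample) to unit L2 norm
row_normalized = preprocessing.normalize(demo)

# scale: standardizes each COLUMN (feature) to zero mean and unit variance
standardized = preprocessing.scale(demo)

# MinMaxScaler: maps each COLUMN (feature) onto the range [0, 1]
min_max_scaled = preprocessing.MinMaxScaler().fit_transform(demo)

print(np.linalg.norm(row_normalized, axis=1))                # row norms
print(standardized.mean(axis=0), standardized.std(axis=0))   # column stats
print(min_max_scaled.min(axis=0), min_max_scaled.max(axis=0))
```

Which of the three is appropriate depends on the algorithm; the point of the sketch is only that `normalize` works per sample while the other two work per feature.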

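Feature selection

The most important part of solving a problem is often the ability to select or extract good features. These tasks are known as feature selection and feature extraction. Scikit-Learn ships with many ready-made feature selection algorithms. Tree-based methods, for example, can be used to estimate how informative each feature is: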

from sklearn.ensemble import ExtraTreesClassifier
# fit an ensemble of randomized trees to the data
model = ExtraTreesClassifier()
model.fit(X, Y)
# display the relative importance of each attribute
print(model.feature_importances_)
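The raw importance array is easier to read once it is sorted. The sketch below shows one way to do that. It runs on synthetic stand-in data from `make_classification` (an assumption, so that it works without downloading anything), and the `n_estimators` and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 200 samples, 7 features, 3 informative
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# importances are non-negative and sum to 1 across features
importances = model.feature_importances_

# feature indices ordered from most to least important
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print("feature %d: %.3f" % (idx, importances[idx]))
```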

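Another approach uses an efficient search over subsets of the features to find the best subset, that is, the one that yields the best model. Scikit-Learn provides such an algorithm as well, called Recursive Feature Elimination (RFE):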

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, Y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

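Algorithm overview

Scikit-Learn has many basic machine learning algorithms built in. Let's take a quick look at some of them.

Logistic Regression

This algorithm is mainly used for binary classification, though it is also applied to multi-class problems. Its advantage is that the output is the probability of each object belonging to each class.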

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# fit a logistic regression model to the data
model = LogisticRegression()
model.fit(X,Y)
print(model)
# make predictions
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
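Since the appeal of logistic regression is its probabilistic output, here is a minimal sketch of reading those probabilities with `predict_proba`. The data is synthetic (generated with `make_classification`, an assumption made so the snippet runs offline), and `max_iter=1000` is an arbitrary setting chosen to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): binary labels, 7 features
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# one row per sample, one column per class (column order = model.classes_)
proba = model.predict_proba(X_demo[:5])
labels = model.predict(X_demo[:5])

print(model.classes_)
print(proba)
print(labels)
```

Each row of `proba` sums to 1, and the predicted label is the class whose column holds the larger probability.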

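Naive Bayes

Naive Bayes is another well-known machine learning algorithm. It estimates the probability of each class from the features of the training samples and is often used for multi-class problems.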

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
# fit a Gaussian naive Bayes model to the data
model = GaussianNB()
model.fit(X, Y)
print(model)
# make predictions
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

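k-Nearest Neighbours

kNN (k-Nearest Neighbours) is often used as part of a more complex classification approach. For example, its estimate can be used as one of an object's features. When the features are well chosen, plain kNN can also perform well on its own, and with suitable parameters it does well on regression problems too.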

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, Y)
print(model)
# make predictions
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
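The text mentions that kNN can also be used for regression. A minimal sketch of that, using `KNeighborsRegressor` on a made-up noisy sine curve, could look like this. The data, the query points, and the parameter values (`n_neighbors=5`, distance weighting, Euclidean metric) are all illustrative assumptions, not settings from the original article.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D regression data: noisy samples of sin(x) on [0, 5]
rng = np.random.RandomState(0)
X_demo = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=80)

# Predict with a distance-weighted average of the 5 nearest neighbours
knn_reg = KNeighborsRegressor(n_neighbors=5, weights="distance",
                              metric="minkowski", p=2)
knn_reg.fit(X_demo, y_demo)

X_query = np.array([[1.0], [2.5], [4.0]])
predictions = knn_reg.predict(X_query)
print(predictions)
print(np.sin(X_query).ravel())  # noise-free target values for comparison
```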

Decision Trees

Classification and Regression Trees (CART) are widely used in practice for both regression and classification, and they work well on multi-class problems.

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, Y)
print(model)
# make predictions
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Support Vector Machines

SVM (Support Vector Machines) is one of the most popular and effective algorithms for classification problems, and it can also be used for multi-class problems.

from sklearn import metrics
from sklearn.svm import SVC
# fit an SVM model to the data
model = SVC()
model.fit(X, Y)
print(model)
# make predictions
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

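Beyond classification and regression, Scikit-Learn offers many more sophisticated algorithms, including clustering and ensemble techniques such as Bagging and Boosting.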
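As a rough illustration of those families, the sketch below fits one bagging ensemble, one boosting ensemble, and one clustering model. The specific estimators (`BaggingClassifier`, `GradientBoostingClassifier`, `KMeans`), their parameter values, and the synthetic data are assumptions chosen for the example; the original article does not say which ones it had in mind.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Synthetic stand-in data (assumption): binary labels, 7 features
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     random_state=0)

# Bagging: many base models trained on bootstrap samples, then combined
bagging = BaggingClassifier(n_estimators=20, random_state=0)
bagging.fit(X_demo, y_demo)

# Boosting: trees added one after another, each correcting earlier errors
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)
boosting.fit(X_demo, y_demo)

# Clustering: unsupervised, so the labels y_demo are not used
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X_demo)

bagging_acc = bagging.score(X_demo, y_demo)    # accuracy on training data
boosting_acc = boosting.score(X_demo, y_demo)  # accuracy on training data
print(bagging_acc, boosting_acc)
print(kmeans.labels_[:10])  # cluster index assigned to the first 10 samples
```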

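How to find the best parameters for an algorithm

An important step in building an effective algorithm is choosing the right parameters. This comes more easily with experience; without it, some reading may be needed. Fortunately, Scikit-Learn has built-in tools for this problem. For example, the value of a regularization parameter can be picked from a set of candidate values.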
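One built-in tool for this kind of search is `GridSearchCV`, which tries every candidate value with cross-validation and keeps the best one. The following is a minimal sketch under stated assumptions: the data is synthetic, the model is a `LogisticRegression` whose regularization parameter `C` is being tuned, and the candidate grid and `cv=5` are arbitrary. It is not necessarily the example the original article used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): binary labels, 7 features
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     random_state=0)

# Candidate values for the regularization parameter C (illustrative only)
param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 10.0]}

# Evaluate every candidate with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the candidate with the best mean CV score
print(search.best_score_)   # that mean cross-validated accuracy
```

Because the score comes from cross-validation rather than from the training data, it gives a less optimistic estimate than the training-set reports used in the listings above.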

Some unfamiliar concepts

Confusion Matrix
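A confusion matrix tabulates, for each true class, how many samples were predicted as each class. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes. The small sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical) to show how to read one.

```python
from sklearn import metrics

# Hypothetical labels: 2 samples of class 0 and 3 samples of class 1
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]    row 0: true class 0 -> 1 predicted as 0, 1 predicted as 1
#  [1 2]]   row 1: true class 1 -> 1 predicted as 0, 2 predicted as 1
```

The diagonal holds the correctly classified samples, so here 3 of the 5 predictions are correct.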
