## 1 Obtaining the Dataset

```shell
$ cd ~/code/datasets/mlcomp
$ ls 379
metadata  raw  test  train
```


Each document in the dataset is a raw newsgroup posting. For example, one training document reads (quoted verbatim, typos included):

```
Hahahahahaha. gasp pant Hm, I’m not sure whether the above
was just a silly remark or a serious remark. But in case there are
some misconceptions, I think Henry Robertson hasn’t updated his data
file on Korea since…mid 1970s. Owning a car in Korea is no longer
a luxury. Most middle class people in Korea can afford a car and do
have at least one car. The problem in Korea, especially in Seoul, is
that there are just so many privately-owned cars, as well as taxis and
buses, the rush-hour has become a 24 hour phenomenon and that there is
no place to park. Last time I heard, back in January, the Kim Administration
wanted to legislate a law requireing a potential car owner to provide
his or her own parking area, just like they do in Japan.

Also, Henry would be glad to know that Hyundai isn’t the only
car manufacturer in Korea. Daewoo has always manufactured cars and
I believe Kia is back in business as well. Imported cars, such as
Mercury Sable are becoming quite popular as well, though they are still
quite expensive.

Finally, please ignore Henry’s posting about Korean politics
and bureaucracy. He’s quite uninformed.
```

## 2 Mathematical Representation of Documents

TF-IDF is a statistical method for evaluating how important a word is to a document. TF stands for term frequency: for a given document, the term frequency of a word is the number of times it appears in that document divided by the total number of words in the document. For example, in a document of 1,000 words where "朴素贝叶斯" (naive Bayes) appears 5 times, "的" (a common function word) appears 25 times, and "应用" (application) appears 12 times, their term frequencies are 0.005, 0.025, and 0.012 respectively.

IDF stands for inverse document frequency: divide the total number of documents by the number of documents that contain the word, then take the (base-10) logarithm of the quotient. It expresses the weight a word carries. For example, if our corpus contains 10,000 documents in total and "朴素贝叶斯" appears in only 10 of them, its weight is $IDF = \log_{10}(\frac{10000}{10}) = 3$. "的" appears in every document, so its weight is $IDF = \log_{10}(1) = 0$. "应用" appears in 1,000 documents, so its weight is $IDF = \log_{10}(\frac{10000}{1000}) = 1$.
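The arithmetic above can be reproduced in a few lines of plain Python (using only the standard library, with the example counts from the text):

```python
from math import log10

# TF: occurrences of a term divided by the document's total word count.
doc_total_words = 1000
tf_naive_bayes = 5 / doc_total_words    # "朴素贝叶斯" -> 0.005
tf_de = 25 / doc_total_words            # "的"        -> 0.025
tf_application = 12 / doc_total_words   # "应用"      -> 0.012

# IDF: log10(total documents / documents containing the term).
n_docs = 10000
idf_naive_bayes = log10(n_docs / 10)     # 3.0
idf_de = log10(n_docs / 10000)           # 0.0
idf_application = log10(n_docs / 1000)   # 1.0

# TF-IDF is the product of the two; common words like "的" score 0.
print(tf_naive_bayes * idf_naive_bayes)  # 0.015
print(tf_de * idf_de)                    # 0.0
```

Note that "朴素贝叶斯" ends up with the highest TF-IDF despite being rarer in the document than "的", because its IDF weight is much larger.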

```python
from time import time
from sklearn.datasets import load_files

print("loading train dataset ...")
t = time()
# Each subdirectory name under train/ is used as the category label.
news_train = load_files('datasets/mlcomp/379/train')
print("summary: {0} documents in {1} categories.".format(
    len(news_train.data), len(news_train.target_names)))
print("done in {0} seconds".format(time() - t))
```


```
loading train dataset ...
summary: 13180 documents in 20 categories.
done in 0.212177991867 seconds
```


```python
from sklearn.feature_extraction.text import TfidfVectorizer

print("vectorizing train dataset ...")
t = time()
vectorizer = TfidfVectorizer(encoding='latin-1')
X_train = vectorizer.fit_transform((d for d in news_train.data))
print("n_samples: %d, n_features: %d" % X_train.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    news_train.filenames[0], X_train[0].getnnz()))
print("done in {0} seconds".format(time() - t))
```


```
vectorizing train dataset ...
n_samples: 13180, n_features: 130274
number of non-zero features in sample
[datasets/mlcomp/379/train/talk.politics.misc/17860-178992]: 108
done in 4.15024495125 seconds
```


## 3 Training the Model

```python
from sklearn.naive_bayes import MultinomialNB

print("training models ...")
t = time()
y_train = news_train.target
clf = MultinomialNB(alpha=0.0001)
clf.fit(X_train, y_train)
train_score = clf.score(X_train, y_train)
print("train score: {0}".format(train_score))
print("done in {0} seconds".format(time() - t))
```


```
training models ...
train score: 0.997875569044
done in 0.274363040924 seconds
```


```python
print("loading test dataset ...")
t = time()
news_test = load_files('datasets/mlcomp/379/test')
print("summary: {0} documents in {1} categories.".format(
    len(news_test.data), len(news_test.target_names)))
print("done in {0} seconds".format(time() - t))
```


```
loading test dataset ...
summary: 5648 documents in 20 categories.
done in 0.117918014526 seconds
```


```python
print("vectorizing test dataset ...")
t = time()
X_test = vectorizer.transform((d for d in news_test.data))
y_test = news_test.target
print("n_samples: %d, n_features: %d" % X_test.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    news_test.filenames[0], X_test[0].getnnz()))
print("done in %fs" % (time() - t))
```


```
vectorizing test dataset ...
n_samples: 5648, n_features: 130274
number of non-zero features in sample
[datasets/mlcomp/379/test/rec.autos/7429-103268]: 61
done in 2.915759s
```


```python
pred = clf.predict(X_test[0])
print("predict: {0} is in category {1}".format(
    news_test.filenames[0], news_test.target_names[pred[0]]))
print("actually: {0} is in category {1}".format(
    news_test.filenames[0], news_test.target_names[news_test.target[0]]))
```


```
predict: datasets/mlcomp/379/test/rec.autos/7429-103268 is in category rec.autos
actually: datasets/mlcomp/379/test/rec.autos/7429-103268 is in category rec.autos
```


## 4 Model Evaluation

```python
print("predicting test dataset ...")
t0 = time()
pred = clf.predict(X_test)
print("done in %fs" % (time() - t0))
```


```
predicting test dataset ...
done in 0.090978s
```


```python
from sklearn.metrics import classification_report

print("classification report on test set for classifier:")
print(clf)
print(classification_report(y_test, pred,
    target_names=news_test.target_names))
```


```
classification report on test set for classifier:
MultinomialNB(alpha=0.0001, class_prior=None, fit_prior=True)
                          precision    recall  f1-score   support

             alt.atheism       0.90      0.91      0.91       245
           comp.graphics       0.80      0.90      0.85       298
 comp.os.ms-windows.misc       0.82      0.79      0.80       292
comp.sys.ibm.pc.hardware       0.81      0.80      0.81       301
   comp.sys.mac.hardware       0.90      0.91      0.91       256
          comp.windows.x       0.88      0.88      0.88       297
            misc.forsale       0.87      0.81      0.84       290
               rec.autos       0.92      0.93      0.92       324
         rec.motorcycles       0.96      0.96      0.96       294
      rec.sport.baseball       0.97      0.94      0.96       315
        rec.sport.hockey       0.96      0.99      0.98       302
               sci.crypt       0.95      0.96      0.95       297
         sci.electronics       0.91      0.85      0.88       313
                 sci.med       0.96      0.96      0.96       277
               sci.space       0.94      0.97      0.96       305
  soc.religion.christian       0.93      0.96      0.94       293
      talk.politics.guns       0.91      0.96      0.93       246
   talk.politics.mideast       0.96      0.98      0.97       296
      talk.politics.misc       0.90      0.90      0.90       236
      talk.religion.misc       0.89      0.78      0.83       171

             avg / total       0.91      0.91      0.91      5648
```


```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)
print("confusion matrix:")
print(cm)
```


```
confusion matrix:
[[224   0   0   0   0   0   0   0   0   0   0   0   0   0   2   5   0   0   1  13]
 [  1 267   5   5   2   8   1   1   0   0   0   2   3   2   1   0   0   0   0   0]
 [  1  13 230  24   4  10   5   0   0   0   0   1   2   1   0   0   0   0   1   0]
 [  0   9  21 242   7   2  10   1   0   0   1   1   7   0   0   0   0   0   0   0]
 [  0   1   5   5 233   2   2   2   1   0   0   3   1   0   1   0   0   0   0   0]
 [  0  20   6   3   1 260   0   0   0   2   0   1   0   0   2   0   2   0   0   0]
 [  0   2   5  12   3   1 235  10   2   3   1   0   7   0   2   0   2   1   4   0]
 [  0   1   0   0   1   0   8 300   4   1   0   0   1   2   3   0   2   0   1   0]
 [  0   1   0   0   0   2   2   3 283   0   0   0   1   0   0   0   0   0   1   1]
 [  0   1   1   0   1   2   1   2   0 297   8   1   0   1   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   2   2 298   0   0   0   0   0   0   0   0   0]
 [  0   1   2   0   0   1   1   0   0   0   0 284   2   1   0   0   2   1   2   0]
 [  0  11   3   5   4   2   4   5   1   1   0   4 266   1   4   0   1   0   1   0]
 [  1   1   0   1   0   2   1   0   0   0   0   0   1 266   2   1   0   0   1   0]
 [  0   3   0   0   1   1   0   0   0   0   0   1   0   1 296   0   1   0   1   0]
 [  3   1   0   1   0   0   0   0   0   0   1   0   0   2   1 280   0   1   1   2]
 [  1   0   2   0   0   0   0   0   1   0   0   0   0   0   0   0 236   1   4   1]
 [  1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   3   0 290   1   0]
 [  2   1   0   0   1   1   0   1   0   0   0   0   0   0   0   1  10   7 212   0]
 [ 16   0   0   0   0   0   0   0   0   0   0   0   0   0   0  12   4   1   4 134]]
```


```python
# Show confusion matrix as a grayscale image: the bright diagonal
# indicates that most documents are classified correctly.
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8), dpi=144)
plt.title('Confusion matrix of the classifier')
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
ax.spines['left'].set_color('none')
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.matshow(cm, fignum=1, cmap='gray')
plt.colorbar()
plt.show()
```


## 5 Parameter Selection

An important parameter of MultinomialNB is alpha, which controls the amount of smoothing applied when fitting the model. Above we chose the value 0.0001. A more principled approach is to use scikit-learn's sklearn.model_selection.GridSearchCV to select it automatically: specify a range of candidate alpha values, and let the code pick the value that performs best within that range. Interested readers can consult the GridSearchCV documentation and try it out.
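As a minimal sketch of how such a search might look, the snippet below runs GridSearchCV over a small grid of alpha values. The count matrix `X` and labels `y` here are a hypothetical toy dataset (8 documents, 4 word-count features), not the newsgroup features above, so the code is self-contained and fast:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# Hypothetical toy word-count matrix: 8 documents, 4 vocabulary terms.
X = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [4, 1, 0, 0],
    [3, 0, 2, 0],
    [0, 2, 0, 3],
    [1, 3, 0, 2],
    [0, 2, 1, 4],
    [0, 3, 0, 3],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Try several smoothing strengths with 4-fold cross-validation;
# GridSearchCV refits the best model on the full data afterwards.
param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=4)
search.fit(X, y)
print("best alpha:", search.best_params_['alpha'])
print("best CV score:", search.best_score_)
```

On the real task you would pass `X_train` and `y_train` instead; the resulting `search.best_estimator_` can then be evaluated on the test set exactly as `clf` was above.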

(End)