0. Data Preparation
First, download the training corpus from the Sogou news dataset (SogouCA). After unpacking it, generate a list of all the text files and save it to "sougouCA_list.csv" with the following command:
find `pwd` -name "*.txt" > sougouCA_list.csv
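As a quick sanity check (just a sketch; "sougouCA_list.csv" is the file produced by the command above), you can count how many corpus files were collected:

# count the corpus files listed in sougouCA_list.csv
with open('sougouCA_list.csv') as f:
    fns = [line.strip() for line in f]
print len(fns)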
Then prepare a Chinese stop-word file named "all_stopword.txt", containing more than 2,000 Chinese stop words such as 的, 其中, and 例如, one word per line.
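For reference, the stop-word file is plain text with one word per line, for example:

的
其中
例如
...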
1. Training the Model
1.1 Loading the Stop Words
Read all the stop words from the file into a set, which will later be used to check whether a given word is a stop word.
def get_stopWords(stopWords_fn):
    """Load the stop words into a set for fast membership tests."""
    with open(stopWords_fn, 'rb') as f:
        # strip the trailing newline/whitespace of each line before decoding
        stopWords_set = {line.strip().decode('utf-8') for line in f}
    return stopWords_set
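The returned set can then be queried directly; for instance (assuming "all_stopword.txt" is in the working directory):

stopWords_set = get_stopWords('all_stopword.txt')
print len(stopWords_set)       # roughly 2000+ stop words
print u'的' in stopWords_set   # True if the word is listed in the file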
1.2 Word Segmentation
Jieba already handles Chinese word segmentation very well, so here we only need to filter the stop words out of jieba's output.
import jieba

def sentence2words(sentence, stopWords=False, stopWords_set=None):
    """Split a sentence into words with jieba, optionally dropping stop words."""
    # jieba.cut returns a generator of tokens
    seg_words = jieba.cut(sentence)
    if stopWords:
        words = [word for word in seg_words if word not in stopWords_set and word != ' ']
    else:
        words = list(seg_words)
    return words
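A quick illustration (the sample sentence below is made up for demonstration purposes):

stopWords_set = get_stopWords('all_stopword.txt')
for w in sentence2words(u'互联网改变了我们的生活', True, stopWords_set):
    print w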
1.3 Training and Saving
Since training the model is iterative and each iteration only needs a small portion of the data, there is no need to load the whole corpus into memory at once. Instead, we write a generator that yields the data of one document at a time:
import re

class MySentences(object):
    """Iterate over the corpus file list and yield one tokenized document per <content> line."""

    def __init__(self, list_csv):
        stopWords_fn = 'all_stopword.txt'
        self.stopWords_set = get_stopWords(stopWords_fn)
        # extract the text between <content> and </content> tags
        self.pattern = re.compile(u'<content>(.*?)</content>')
        with open(list_csv, 'r') as f:
            self.fns = [line.strip() for line in f]

    def __iter__(self):
        for fn in self.fns:
            with open(fn, 'r') as f:
                for line in f:
                    if line.startswith('<content>'):
                        content = self.pattern.findall(line)
                        if len(content) != 0:
                            # yield the segmented, stop-word-filtered token list
                            yield sentence2words(content[0].strip(), True, self.stopWords_set)
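Before a long training run, it is worth checking that the iterator behaves as expected by pulling out the first tokenized document (a small sketch, assuming "sougouCA_list.csv" and "all_stopword.txt" are in place):

sentences = MySentences('sougouCA_list.csv')
first_doc = next(iter(sentences))
print len(first_doc)   # number of tokens kept from the first <content> line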
Then train a gensim Word2Vec model and save it:
from gensim.models import Word2Vec

def train_save(list_csv, model_fn):
    sentences = MySentences(list_csv)
    num_features = 200      # dimensionality of the word vectors
    min_word_count = 10     # ignore words occurring fewer than 10 times
    num_workers = 48        # number of worker threads
    context = 20            # window size
    epoch = 20              # number of training iterations over the corpus
    sample = 1e-5           # downsampling threshold for frequent words
    # note: `size` and `iter` are the pre-4.0 gensim parameter names
    # (renamed to `vector_size` and `epochs` in gensim 4.x)
    model = Word2Vec(
        sentences,
        size=num_features,
        min_count=min_word_count,
        workers=num_workers,
        sample=sample,
        window=context,
        iter=epoch,
    )
    model.save(model_fn)
    return model
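Because the model is persisted with model.save, it can be reloaded later without retraining:

from gensim.models import Word2Vec

model = Word2Vec.load('word2vec_model_CA')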
Once the model is trained, it can be used to find similar words, compute the similarity between two words, and retrieve word vectors:
model = train_save('sougouCA_list.csv', 'word2vec_model_CA')
# words most similar to u'互联网' (internet), with their cosine similarities
for w in model.most_similar(u'互联网'):
    print w[0], w[1]
# cosine similarity between two words
print model.similarity(u'网络', u'互联网')
# get the word vector of u'国家' (country)
country_vec = model[u"国家"]
print country_vec
Building on this trained model, you can go further and do text classification or text clustering.
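As a minimal sketch in that direction (not part of the original code; doc_vector is a hypothetical helper, and model, stopWords_set and sentence2words come from the steps above), a document can be represented by the average of its word vectors and then fed to any classifier or clustering algorithm:

import numpy as np

def doc_vector(words, model, num_features=200):
    # average the vectors of in-vocabulary words as a simple document representation
    vecs = [model[w] for w in words if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(num_features)

doc_vec = doc_vector(sentence2words(u'互联网改变了我们的生活', True, stopWords_set), model)
print doc_vec.shape    # (200,)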
The source code used in this article is available here.