
Trying Out Word2Vec in Python

0. Data Preparation


First, download the training data from the Sogou full-network news corpus (SogouCA). After decompressing it, generate a list of the text files and save it to "sougouCA_list.csv" with the following command:

find `pwd` -name "*.txt" > sougouCA_list.csv

Then prepare a Chinese stop-word file named "all_stopword.txt", which contains more than 2,000 Chinese stop words such as "的", "其中", and "例如", one per line.
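For reference, each line of the stop-word file holds exactly one word; the first few lines might look like this (using the words mentioned above as an illustration):

的
其中
例如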

1. Training the Model


1.1 Reading the Stop Words

Read all the stop words from the file into a set, which will later be used to check whether a given word is a stop word.

def get_stopWords(stopWords_fn):
    # read the stop words into a set for fast membership checks
    with open(stopWords_fn, 'rb') as f:
        stopWords_set = {line.strip().decode('utf-8') for line in f}
    return stopWords_set
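As a quick sanity check (a hypothetical usage, assuming all_stopword.txt sits in the working directory), the returned set can be queried directly:

stopWords_set = get_stopWords('all_stopword.txt')
# number of stop words loaded
print len(stopWords_set)
# membership test: is "的" a stop word?
print u'的' in stopWords_set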

1.2 Word Segmentation

jieba already handles Chinese word segmentation very well; here we only need to filter the stop words out of jieba's output.

import jieba

def sentence2words(sentence, stopWords=False, stopWords_set=None):
    """
    Split a sentence into words with jieba, optionally filtering stop words.
    """
    # jieba.cut returns a generator of tokens
    seg_words = jieba.cut(sentence)
    if stopWords:
        words = [word for word in seg_words if word not in stopWords_set and word != ' ']
    else:
        words = [word for word in seg_words]
    return words
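For example, segmenting a made-up sentence with and without stop-word filtering (reusing the stopWords_set loaded above):

sentence = u'互联网改变了我们获取新闻的方式'
# keep every token jieba produces
print '/'.join(sentence2words(sentence))
# drop stop words such as "的" and "了"
print '/'.join(sentence2words(sentence, True, stopWords_set))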

1.3 Training and Saving the Model

Training proceeds iteratively and each step only needs a small batch of data, so there is no need to load the entire corpus into memory at once. Instead, we write a generator that yields one document's worth of data at a time:

import re

class MySentences(object):
    def __init__(self, list_csv):
        stopWords_fn = 'all_stopword.txt'
        self.stopWords_set = get_stopWords(stopWords_fn)
        self.pattern = re.compile(u'<content>(.*?)</content>')
        with open(list_csv, 'r') as f:
            self.fns = [line.strip() for line in f]

    def __iter__(self):
        # yield one segmented document at a time, so the whole corpus
        # never has to sit in memory
        for fn in self.fns:
            with open(fn, 'r') as f:
                for line in f:
                    if line.startswith('<content>'):
                        content = self.pattern.findall(line)
                        if len(content) != 0:
                            yield sentence2words(content[0].strip(), True, self.stopWords_set)
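Because __iter__ streams the documents from disk, a quick way to peek at the first couple of them (purely for inspection) is:

sentences = MySentences('sougouCA_list.csv')
for i, words in enumerate(sentences):
    print ' '.join(words)
    if i >= 1:  # only inspect the first two documents
        break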

Then train a gensim Word2Vec model and save it:

from gensim.models import Word2Vec

def train_save(list_csv, model_fn):
    sentences = MySentences(list_csv)
    num_features = 200      # dimensionality of the word vectors
    min_word_count = 10     # ignore words that occur fewer times than this
    num_workers = 48        # parallel worker threads
    context = 20            # context window size
    epoch = 20              # passes over the corpus
    sample = 1e-5           # downsampling threshold for frequent words
    model = Word2Vec(
        sentences,
        size=num_features,
        min_count=min_word_count,
        workers=num_workers,
        sample=sample,
        window=context,
        iter=epoch,
    )
    model.save(model_fn)
    return model

Once the model is trained, it can be used to find similar words, compute the similarity between two words, and retrieve word vectors:

model = train_save('sougouCA_list.csv', 'word2vec_model_CA')

# words most similar to "互联网" (internet), with their similarity scores
for w in model.most_similar(u'互联网'):
    print w[0], w[1]

# similarity between "网络" (network) and "互联网" (internet)
print model.similarity(u'网络', u'互联网')

# the word vector for "国家" (country)
country_vec = model[u'国家']
print country_vec
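The saved model can be reloaded later without retraining; with this version of gensim that is done via Word2Vec.load:

from gensim.models import Word2Vec

# reload the model written by train_save()
model = Word2Vec.load('word2vec_model_CA')
print model.similarity(u'网络', u'互联网')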

Building on this trained model, one can go further and do text classification or text clustering.
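One common, if simplistic, way to feed such downstream tasks is to average a document's word vectors into a single fixed-length vector; the helper below is only a sketch of that idea (doc_vector is not part of the original code):

import numpy as np

def doc_vector(words, model, num_features=200):
    # average the vectors of the words that are in the model's vocabulary
    vecs = [model[w] for w in words if w in model]
    if not vecs:
        return np.zeros(num_features)
    return np.mean(vecs, axis=0)

The resulting document vectors can then be clustered (e.g. with KMeans) or used as features for a text classifier.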

The source code used in this post is available via the link here.

Please credit the source when reposting: BackNode
