
cs224n Notes 05: Exploring Word Vectors

2020-07-07

1. Loading the word vectors

from gensim.models import KeyedVectors

def load_word2vec(embeddings_fp="./GoogleNews-vectors-negative300.bin"):
    """ Load Word2Vec Vectors
        Param:
            embeddings_fp (string) - path to .bin file of pretrained word vectors
        Return:
            wv_from_bin: All 3 million embeddings, each length 300
                This is the KeyedVectors format: https://radimrehurek.com/gensim/models/deprecated/keyedvectors.html
    """
    print("Loading 3 million word vectors from file...")
    # This file was downloaded manually beforehand.
    wv_from_bin = KeyedVectors.load_word2vec_format(embeddings_fp, binary=True)
    # gensim >= 4.0 exposes the vocabulary as key_to_index; gensim 3.x used .vocab.
    if hasattr(wv_from_bin, "key_to_index"):
        vocab = list(wv_from_bin.key_to_index)
    else:
        vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin

wv_from_bin = load_word2vec()
print()

2. Dimensionality reduction

import random

import numpy as np

def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
            required_words: words that are always appended after the random sample
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    # gensim >= 4.0 exposes the vocabulary as key_to_index; gensim 3.x used .vocab.
    if hasattr(wv_from_bin, "key_to_index"):
        words = list(wv_from_bin.key_to_index)
    else:
        words = list(wv_from_bin.vocab.keys())
    print("Shuffling words ...")
    random.shuffle(words)
    # Keep a random sample of 10,000 words so that M stays small.
    words = words[:10000]
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    # Sampled words first, then the required words; skip anything without a vector.
    for w in words + list(required_words):
        try:
            M.append(wv_from_bin.get_vector(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind
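
The function above only assembles the matrix M; the reduction step itself is not shown in these notes. A minimal sketch of that step, assuming a truncated SVD is what is intended: the name `reduce_to_k_dim` and the toy matrix are hypothetical stand-ins, not code from the original post. Each row of M is projected onto the top k singular directions, which gives one k-dimensional point per word.

```python
import numpy as np

def reduce_to_k_dim(M, k=2):
    """Hypothetical helper: reduce (num_words, dim) rows to (num_words, k) via truncated SVD."""
    M = np.asarray(M, dtype=np.float64)
    # Thin SVD: M = U @ diag(S) @ Vt, singular values sorted in descending order.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Keeping the first k columns of U * S equals projecting M onto the top-k rows of Vt.
    return U[:, :k] * S[:k]

# Toy stand-in for M (4 "words", 3 dimensions); real word2vec rows have 300 dimensions.
toy_M = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 1.0, 1.0],
])
M_reduced = reduce_to_k_dim(toy_M, k=2)
print(M_reduced.shape)
```

With the real data, the rows of the reduced matrix can be looked up through word2Ind and plotted as 2-D points.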

3. Word analogy test

import pprint

# man is to woman as king is to queen
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))
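
To make the arithmetic behind this query concrete, here is a rough sketch of the idea: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The 3-dimensional `toy_vectors` and the `analogy` helper are invented for illustration; they are not gensim's implementation and not real word2vec values.

```python
import numpy as np

# Hypothetical toy embeddings; the axes loosely mean (royalty, male, female).
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = np.sum([unit(vectors[w]) for w in positive], axis=0)
    target = target - np.sum([unit(vectors[w]) for w in negative], axis=0)
    target = unit(target)
    excluded = set(positive) | set(negative)
    # Cosine similarity is the dot product of unit vectors; query words are skipped.
    scores = [(w, float(unit(v) @ target)) for w, v in vectors.items() if w not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result)
```

On these toy vectors, "queen" ranks first because it shares the royalty component with "king" and the female component with "woman".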

Original article: https://blog.csdn.net/z1103757047/article/details/107167967