How to map the words in a data frame to integer IDs with python-pandas and gensim?


Question

Given a data frame like this, containing items and their review texts:

item_id          review_text
B2JLCNJF16       i was attracted to this...
B0009VEM4U       great snippers...

I want to map the 5000 most frequent words in review_text to integer IDs, so the resulting data frame should look like this:

item_id            review_text
B2JLCNJF16         1  2  3  4  5...
B0009VEM4U         6... # "snippers" is dropped because it is not among the 5000 most frequent words

Alternatively, a bag-of-words vector would be preferred:

item_id            review_text
B2JLCNJF16         [1,1,1,1,1....]
B0009VEM4U         [0,0,0,0,0,1....] 

How can I do that? Thanks a lot!

Edit: I tried @ayhan's answer. I have now successfully converted the review text into doc2bow form:

item_id            review_text
B2JLCNJF16         [(123,2),(130,3),(159,1)...]
B0009VEM4U         [(3,2),(110,2),(121,5)...]

This means that the word with ID 123 occurs 2 times in that document. Now I would like to convert it into a vector like this:

[0,0,0,.....,2,0,0,0,....,3,0,0,0,......1...]
        #123rd         130th        159th

How can I do that? Thanks in advance!

Recommended answer

First, get the list of words for each row:

df["review_text"] = df["review_text"].map(lambda x: x.split(' '))

Now you can pass df["review_text"] to gensim's Dictionary:

from gensim import corpora
dictionary = corpora.Dictionary(df["review_text"])

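To keep only the 5000 most frequent words, use the filter_extremes method. With no_below=1 and no_above=1 the other two filters let every token through, so only keep_n limits the vocabulary: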

dictionary.filter_extremes(no_below=1, no_above=1, keep_n=5000)
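
To make the effect of keep_n concrete, here is a minimal pure-Python sketch of the same idea, without gensim. It ranks tokens by the number of documents they occur in (to my understanding this is also what gensim's keep_n ranks by, but treat that as an assumption), keeps the top keep_n, and maps each review to a sequence of integer IDs, dropping unknown words. The helper names build_vocab and to_ids are made up for illustration, and the IDs gensim assigns will generally differ from these.

```python
from collections import Counter

def build_vocab(docs, keep_n):
    """Map the keep_n tokens that occur in the most documents to integer IDs."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    # Sort by descending document frequency, then alphabetically to break ties
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return {token: idx for idx, token in enumerate(ranked[:keep_n])}

def to_ids(tokens, token2id):
    """Replace each known token by its ID; drop tokens outside the vocabulary."""
    return [token2id[t] for t in tokens if t in token2id]

# Toy reviews (hypothetical data), already split into words
docs = [
    "i was attracted to this".split(),
    "great snippers to this".split(),
    "i was great".split(),
]
token2id = build_vocab(docs, keep_n=5)
id_sequences = [to_ids(d, token2id) for d in docs]
print(token2id)
print(id_sequences)
```

With keep_n=5, the words "attracted" and "snippers" fall outside the vocabulary and disappear from the ID sequences, which matches the first output format asked for in the question.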

The doc2bow method then gives you the bag-of-words representation as (word_id, frequency) pairs:

df["bow"] = df["review_text"].map(dictionary.doc2bow)

0     [(1, 2), (3, 1), (5, 1), (11, 1), (12, 3), (18...
1     [(0, 3), (24, 1), (28, 1), (30, 1), (56, 1), (...
2     [(8, 1), (15, 1), (18, 2), (29, 1), (36, 2), (...
3     [(69, 1), (94, 1), (115, 1), (123, 1), (128, 1...
4     [(2, 1), (18, 4), (26, 1), (32, 1), (55, 1), (...
5     [(6, 1), (18, 1), (30, 1), (61, 1), (71, 1), (...
6     [(0, 5), (13, 1), (18, 6), (31, 1), (42, 1), (...
7     [(0, 10), (5, 1), (18, 1), (35, 1), (43, 1), (...
8     [(0, 24), (1, 4), (4, 2), (7, 1), (10, 1), (14...
9     [(0, 7), (18, 3), (30, 1), (32, 1), (34, 1), (...
10    [(0, 5), (9, 1), (18, 3), (19, 1), (21, 1), (2...
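
If it helps to see what these pairs contain, the following is a rough pure-Python sketch of the counting step, assuming a plain dict from token to ID. It is not gensim's implementation; doc_to_bow and the toy token2id mapping are hypothetical names and data used only for illustration.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Count how often each known token ID occurs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())  # list of (word_id, frequency), ordered by ID

# Hypothetical vocabulary and review
token2id = {"great": 0, "snippers": 1, "these": 2}
bow = doc_to_bow("great snippers great price".split(), token2id)
print(bow)
```

Here "price" is not in the vocabulary, so it contributes nothing, and "great" appears with a count of 2.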

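Once you have the bag-of-words representation, you can concatenate the per-row series into one wide data frame with one column per word ID (this may not be efficient):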

import pandas as pd
df2 = pd.concat([pd.DataFrame(s).set_index(0) for s in df["bow"]], axis=1).fillna(0).T.set_index(df.index)


    0   1   2   3   4   5   6   7   8   9   ... 728 729 730 731 732 733 734 735 736 737
0   0   2   0   1   0   1   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
1   3   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0   1   0   ... 0   0   0   0   0   1   1   0   0   0
3   0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
4   0   0   1   0   0   0   0   0   0   0   ... 0   0   0   0   0   1   0   0   1   0
5   0   0   0   0   0   0   1   0   0   0   ... 0   0   0   1   0   0   0   0   0   0
6   5   0   0   0   0   0   0   0   0   0   ... 0   0   0   1   0   0   0   0   0   0
7   10  0   0   0   0   1   0   0   0   0   ... 0   0   0   0   0   0   0   1   0   0
8   24  4   0   0   2   0   0   1   0   0   ... 1   1   2   0   1   3   1   0   1   0
9   7   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   1
10  5   0   0   0   0   0   0   0   0   1   ... 0   0   0   0   0   0   0   0   0   0
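
For the dense vector asked about in the question's edit, a simpler route may be to expand each list of (word_id, frequency) pairs directly. The sketch below does this in plain Python; bow_to_dense is a made-up helper name, and num_terms is assumed to be the vocabulary size (for example len(dictionary)). gensim also ships gensim.matutils.sparse2full(doc, length), which performs the same expansion and returns a NumPy array.

```python
def bow_to_dense(bow, num_terms):
    """Expand (word_id, frequency) pairs into a list of length num_terms."""
    vec = [0] * num_terms
    for word_id, count in bow:
        vec[word_id] = count  # position word_id holds that word's frequency
    return vec

# Hypothetical bag-of-words for one review, with a vocabulary of 8 words
bow = [(1, 2), (3, 3), (5, 1)]
dense = bow_to_dense(bow, num_terms=8)
print(dense)
```

Applied to the data frame, something like df["bow"].map(lambda b: bow_to_dense(b, len(dictionary))) should produce one such vector per row, assuming df and dictionary are defined as in the steps above.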
