Java: finding the three most frequently occurring words in a document

Write a main class Word2Vec that, when invoked on a properties file (the file name is passed as an argument), vectorizes every word by processing the given corpus (i.e., computes the multiset of each word's occurring contexts, as shown in the two boxes above). You then need to save the vectors on secondary storage. You are free to use a file format of your choice (text or random access; refer to the lecture slides). For marking purposes, we want to see the output of the top 3 most frequent words in the context of the word 'venice'.
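
A minimal sketch of what the entry point could look like, assuming the properties file contains a `corpus` key pointing at the text file and an `output` key for the vector file (both key names are assumptions for illustration, not prescribed by the exercise):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class Word2Vec {
    public static void main(String[] args) throws IOException {
        // The name of the properties file is passed as the first command-line argument.
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            props.load(in);
        }
        // "corpus" and "output" are assumed key names, not fixed by the exercise.
        String corpusPath = props.getProperty("corpus");
        String outputPath = props.getProperty("output");

        // 1. Tokenize the corpus and build one context multiset (sparse vector) per word.
        // 2. Save the vectors on secondary storage (text or random-access file).
        // 3. Print the 3 most frequent words in the context of "venice".
    }
}
```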

Note that this multiset is a sparse vector representation of a word (in our example, ‘venice’).
The first part of the pair indicates the string value of the word itself, whereas the second part is the
word’s frequency. In the example the frequency of ‘of’ is 2 because it occurs twice (make sure you thoroughly
understand this example before commencing with the exercise).
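
One possible in-memory representation of these sparse vectors is a map from each word to its context multiset, itself a map from context word to frequency. This is only a sketch of one such data structure, not the required one:

```java
import java.util.HashMap;
import java.util.Map;

class ContextVectors {
    // word -> (context word -> frequency), i.e. one sparse vector per word
    private final Map<String, Map<String, Integer>> vectors = new HashMap<>();

    // Record one occurrence of contextWord inside the context window of word.
    void addContext(String word, String contextWord) {
        vectors.computeIfAbsent(word, w -> new HashMap<>())
               .merge(contextWord, 1, Integer::sum);
    }

    // For the example above, get("venice").get("of") would be 2.
    Map<String, Integer> get(String word) {
        return vectors.getOrDefault(word, Map.of());
    }
}
```
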
To define the contexts precisely, you need to decide on a few things –

  1. The set of punctuation symbols, i.e. during the process of tokenization, the token ‘Venice,’ should be
    converted into ‘Venice’ (a simple way to do this is to remove the punctuation symbols from the text
    file as a pre-processing step or invoke Java’s inbuilt StringTokenizer class with the appropriate set
    of delimiters).

2. Apply case normalization, e.g. treat Scene and scene in an identical manner.

3. Decide on the length of the context (on both the left and right sides) of a word. Let this number (a
parameter) be k. Our example used k = 2. Too small a k may not be able to capture the informative
words, while too large a k could add noise to the contexts.
For your exercise, use k = 5 (i.e. 5 words on both the left and the right of a word), and use the following
set of symbols as delimiters – '"{}[].;,! plus the whitespaces (i.e., carriage return, new line, tab and space). You should also use case normalization, i.e. convert every token to its lowercase form, so that Scene and SCENE are both converted to scene. A sketch that combines these three decisions is given below.
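
Putting the three decisions together, a tokenization and context-extraction step could be sketched as follows, using Java's StringTokenizer with the delimiter set above and k = 5 (class and method names here are only placeholders):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class ContextExtractor {
    // Delimiters from the exercise: ' " { } [ ] . ; , ! plus whitespace characters.
    private static final String DELIMITERS = "'\"{}[].;,! \t\r\n";
    private static final int K = 5; // context window size on each side

    static Map<String, Map<String, Integer>> buildVectors(String text) {
        // Tokenize and case-normalize.
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase());
        }

        // For every token, count the tokens at most K positions to its left and right.
        Map<String, Map<String, Integer>> vectors = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            Map<String, Integer> context =
                    vectors.computeIfAbsent(tokens.get(i), w -> new HashMap<>());
            int from = Math.max(0, i - K);
            int to = Math.min(tokens.size() - 1, i + K);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    context.merge(tokens.get(j), 1, Integer::sum);
                }
            }
        }
        return vectors;
    }
}
```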

Use whitespace as the delimiter to turn the text of the document into an array of strings, then count the word occurrences.
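
For the final output, one simple option is to sort the context map of "venice" by frequency and print the first three entries, along the lines of this sketch (printTop3 is just an illustrative helper name):

```java
import java.util.Comparator;
import java.util.Map;

class TopContextWords {
    // Print the 3 most frequent words in the context of "venice",
    // given the word -> (context word -> frequency) map built earlier.
    static void printTop3(Map<String, Map<String, Integer>> vectors) {
        vectors.getOrDefault("venice", Map.of())
               .entrySet().stream()
               .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
               .limit(3)
               .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```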

The default HDFS example does exactly this.