How can Java count word occurrences while grouping the different forms of a word?
By "different forms" I mean that the past tense, present tense, singular, plural, etc. of a word should all be counted as one word.
What is the approach, and how can it be implemented?
Answer: You can use the Stanford CoreNLP toolkit, which lemmatizes tokens so that different inflected forms map to a single base form. The steps are as follows:
1. Download the CoreNLP package for your Java version from the Stanford CoreNLP website and unpack it into a local directory.
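If your project uses Maven, the manual download can instead be replaced by declaring CoreNLP as a dependency (the version number below is an assumption; check the CoreNLP release page for the current one — the `models` artifact is what provides the POS and lemma models):

```xml
<!-- Stanford CoreNLP core library (version is an assumption) -->
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.5.4</version>
</dependency>
<!-- English models, required by the pos and lemma annotators -->
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.5.4</version>
  <classifier>models</classifier>
</dependency>
```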
2. Import the relevant CoreNLP classes in your Java code (note that `CoreLabel` is also needed for iterating over tokens):
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
3. Create a StanfordCoreNLP pipeline with the tokenize, ssplit, pos, and lemma annotators:
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
4. Wrap the input text in an Annotation object and run the pipeline on it:
Annotation document = new Annotation(text);
pipeline.annotate(document);
5. Iterate over the sentences in the Annotation and count each token under its lemma:
Map<String, Integer> wordCount = new HashMap<>();
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
        wordCount.merge(lemma, 1, Integer::sum);
    }
}
6. The resulting wordCount map holds each lemma and the total number of occurrences across all of its forms.
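Once you have the counts, a common follow-up is listing the most frequent lemmas first. This is a plain-Java sketch of that step (the `topLemmas` helper is hypothetical, not part of CoreNLP); it works on any `Map<String, Integer>` such as the `wordCount` built above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopLemmas {
    // Returns "lemma=count" strings sorted by descending count.
    static List<String> topLemmas(Map<String, Integer> wordCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordCount.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            result.add(e.getKey() + "=" + e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new HashMap<>();
        wordCount.put("cat", 3);
        wordCount.put("be", 2);
        wordCount.put("white", 1);
        // Highest-frequency lemma comes first.
        System.out.println(topLemmas(wordCount));
    }
}
```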
Complete code example:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class WordCount {
    public static void main(String[] args) {
        // Build a pipeline that tokenizes, splits sentences, POS-tags, and lemmatizes.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "I have two cats. One is black, and the other is white.";
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Count each token under its lemma, so that different forms
        // (e.g. "is" -> "be", "cats" -> "cat") share one entry.
        Map<String, Integer> wordCount = new HashMap<>();
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                wordCount.merge(lemma, 1, Integer::sum);
            }
        }
        System.out.println(wordCount);
    }
}
Sample output (abridged and reformatted for readability; the printed map also contains tokens such as "I", "and", "other", and punctuation):
{
  "have": 1,
  "two": 1,
  "cat": 1,
  "one": 1,
  "be": 2,
  "black": 1,
  "the": 1,
  "white": 1
}
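One caveat: the lemmatizer may preserve the original capitalization, so a sentence-initial "One" and a mid-sentence "one" could end up as separate map keys. If you want case-insensitive counts, normalize the lemma before counting; this is a minimal sketch of that normalization step in plain Java, independent of CoreNLP:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LemmaNormalize {
    // Counts lemmas case-insensitively by lowercasing each key first.
    static Map<String, Integer> countNormalized(Iterable<String> lemmas) {
        Map<String, Integer> counts = new HashMap<>();
        for (String lemma : lemmas) {
            String key = lemma.toLowerCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countNormalized(java.util.Arrays.asList("One", "one", "Cat"));
        System.out.println(counts.get("one")); // "One" and "one" share one entry
    }
}
```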