This script compiles without errors on play.golang.org: http://play.golang.org/p/Hlr-IAc_1f
But when I run it on my machine, it takes much longer than I expect, and nothing is printed in the terminal.
What I am trying to build is a PartOfSpeech Tagger.
I think the slowest part is loading lexicon.txt into a map and then comparing each word with every word there to see if it has already been tagged in the lexicon. The lexicon only contains verbs. But doesn't every word need to be checked to see if it is a verb?
The larger problem is that I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
You've got a large array argument in this function:
func stringInArray(a string, list [214]string) bool {
    for _, b := range list {
        if b == a {
            return true
        }
    }
    return false
}
The array of stopwords gets copied each time you call this function.
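To make the copying concrete, here is a minimal sketch (the function names and the 3-element array are made up for illustration, they are not from your script). An array parameter is passed by value, so a write inside the function never reaches the caller; a slice parameter only copies a small header that points at the same storage. On a 64-bit platform a string header is 16 bytes, so a [214]string is roughly 3.4 KB that may be copied on every call.

```go
package main

import "fmt"

// setFirstArray receives a copy of the whole array, so the caller's
// array is left untouched.
func setFirstArray(list [3]string) {
	list[0] = "changed"
}

// setFirstSlice receives a slice header that refers to the caller's
// backing storage, so the caller sees the write.
func setFirstSlice(list []string) {
	list[0] = "changed"
}

func main() {
	arr := [3]string{"and", "or", "the"}
	setFirstArray(arr)
	fmt.Println(arr[0]) // prints "and"

	sl := []string{"and", "or", "the"}
	setFirstSlice(sl)
	fmt.Println(sl[0]) // prints "changed"
}
```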
In Go, you should usually use slices rather than arrays. Change the parameter to list []string and define stopWords as a slice rather than an array:
stopWords := []string{
    "and", "or", ...
}
Probably an even better approach would be to build a map of the stopWords:
isStopWord := map[string]bool{}
for _, sw := range stopWords {
    isStopWord[sw] = true
}
and then you can check if a word is a stopword quickly:
if isStopWord[word] { ... }
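Put together, the map approach might look like the sketch below. The five-word stopWords list is a stand-in for your 214 entries, and buildStopWordSet and removeStopWords are hypothetical helper names, not something from your script.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a tiny stand-in for the full list in the question.
var stopWords = []string{"and", "or", "the", "a", "of"}

// buildStopWordSet turns the slice into a map so that each membership
// check is a single lookup instead of a scan over the whole list.
func buildStopWordSet(words []string) map[string]bool {
	set := make(map[string]bool, len(words))
	for _, w := range words {
		set[w] = true
	}
	return set
}

// removeStopWords returns the whitespace-separated tokens of text that
// are not stopwords. The comparison is case-insensitive.
func removeStopWords(text string, isStopWord map[string]bool) []string {
	var kept []string
	for _, word := range strings.Fields(text) {
		if !isStopWord[strings.ToLower(word)] {
			kept = append(kept, word)
		}
	}
	return kept
}

func main() {
	isStopWord := buildStopWordSet(stopWords)
	fmt.Println(removeStopWords("The cat and the dog", isStopWord)) // [cat dog]
}
```

Build the map once at startup and reuse it; rebuilding it per word would throw away the benefit.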
(Quoting):
I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:
State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. The code is in Haskell, but it will help you learn the concepts and issues in rule-based tagging.
That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches are 'bigram', in which we consider the previous word when tagging word n, and 'trigram' (usually the previous 2 words, or the previous word, the current word, and the following word). More generally, 'n-gram' refers to considering a sequence of n words (often a sliding window around the word we're currently tagging). That context can help us disambiguate 'fish', 'last', 'flies', etc.
E.g., in
We fish
we probably want to tag fish as a verb, whereas in
ate fish
it's certainly a noun.
The NLTK tutorial might be a good reference here. A solid n-gram tagger should get you above 90% accuracy; likely above 95% (again on newswire text).
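Since you're working in Go, here is a rough sketch of the bigram idea applied to the 'fish' example. Everything in it is an assumption for illustration: the two lookup tables are hand-written toys (a real tagger would estimate them from a tagged corpus), the tag names are arbitrary, and tagBigram is a made-up function. It conditions on the previous word, as described above, and backs off to a unigram table when the pair is unknown.

```go
package main

import (
	"fmt"
	"strings"
)

// bigramKey pairs the previous word with the current word.
type bigramKey struct{ prev, word string }

// Toy, hand-written tables for illustration only.
var bigramTags = map[bigramKey]string{
	{"we", "fish"}:  "VERB",
	{"ate", "fish"}: "NOUN",
}

var unigramTags = map[string]string{
	"we":   "PRON",
	"ate":  "VERB",
	"fish": "NOUN",
}

// tagBigram tags each word using the previous word as context. It backs
// off to the unigram table, then to "UNK", when no entry exists.
func tagBigram(sentence string) []string {
	words := strings.Fields(strings.ToLower(sentence))
	tags := make([]string, len(words))
	prev := "<s>" // sentence-start marker
	for i, w := range words {
		if t, ok := bigramTags[bigramKey{prev, w}]; ok {
			tags[i] = t
		} else if t, ok := unigramTags[w]; ok {
			tags[i] = t
		} else {
			tags[i] = "UNK"
		}
		prev = w
	}
	return tags
}

func main() {
	fmt.Println(tagBigram("We fish"))  // [PRON VERB]
	fmt.Println(tagBigram("ate fish")) // [VERB NOUN]
}
```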
More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.
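To give a feel for what "most probable sequence of tags" means, here is a small sketch of Viterbi decoding over a toy hidden Markov model. Note that this is a simpler stand-in, not the conditional random fields that Sutton and McCallum cover, and every number in the start, trans and emit tables is invented for illustration rather than estimated from data.

```go
package main

import "fmt"

// Toy HMM parameters, invented for illustration.
var (
	tags  = []string{"NOUN", "VERB"}
	start = map[string]float64{"NOUN": 0.6, "VERB": 0.4}
	trans = map[string]map[string]float64{ // trans[prev][next]
		"NOUN": {"NOUN": 0.3, "VERB": 0.7},
		"VERB": {"NOUN": 0.8, "VERB": 0.2},
	}
	emit = map[string]map[string]float64{ // emit[tag][word]
		"NOUN": {"fish": 0.5, "time": 0.4, "flies": 0.1},
		"VERB": {"fish": 0.3, "time": 0.1, "flies": 0.6},
	}
)

// viterbi returns the single most probable tag sequence for words under
// the toy model above.
func viterbi(words []string) []string {
	if len(words) == 0 {
		return nil
	}
	n := len(words)
	score := make([]map[string]float64, n) // best path probability ending in each tag
	back := make([]map[string]string, n)   // predecessor tag on that best path
	score[0] = map[string]float64{}
	for _, t := range tags {
		score[0][t] = start[t] * emit[t][words[0]]
	}
	for i := 1; i < n; i++ {
		score[i] = map[string]float64{}
		back[i] = map[string]string{}
		for _, t := range tags {
			bestPrev, best := tags[0], -1.0
			for _, p := range tags {
				if s := score[i-1][p] * trans[p][t]; s > best {
					bestPrev, best = p, s
				}
			}
			score[i][t] = best * emit[t][words[i]]
			back[i][t] = bestPrev
		}
	}
	// Pick the best final tag, then follow the back-pointers.
	last := tags[0]
	for _, t := range tags {
		if score[n-1][t] > score[n-1][last] {
			last = t
		}
	}
	path := make([]string, n)
	path[n-1] = last
	for i := n - 1; i > 0; i-- {
		path[i-1] = back[i][path[i]]
	}
	return path
}

func main() {
	fmt.Println(viterbi([]string{"time", "flies"})) // [NOUN VERB]
}
```

For long sentences you would normally work with log probabilities to avoid underflow; plain products are used here only to keep the sketch short.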