I'm using filepath.Walk() to search through all the files in a directory. I'm implementing a search tool, so I'm only interested in opening files with text in them. I'm wondering if there's a way to ignore stuff like binary files that I wouldn't want to search through. I'm trying to minimize os calls, so it would be great if this could be done with just os.FileInfo.
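os.FileInfo exposes only metadata (name, size, mode, modification time), so it cannot tell you whether a file holds text. The only way to know if a file (or any byte stream) contains only "text" is to read the entire contents of the stream and determine if every rune is a "text" character according to your definition.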
For example, one might consider a file "ASCII text" if every rune has an integer value in [0, 127] and is either whitespace or not a control character:
    import (
        "bufio"
        "io"
        "unicode"
    )

    // isASCIITextStream reads rd to EOF and reports whether every rune
    // in it counts as ASCII text.
    func isASCIITextStream(rd io.Reader) (bool, error) {
        reader := bufio.NewReader(rd)
        for {
            r, _, err := reader.ReadRune()
            if err == io.EOF {
                return true, nil // Every rune was text.
            }
            if err != nil {
                return false, err // Unexpected error.
            }
            if !isASCIIText(r) {
                return false, nil // At least one rune was not text.
            }
        }
    }

    // isASCIIText reports whether r is in the ASCII range [0, 127] and is
    // either whitespace or not a control character.
    func isASCIIText(r rune) bool {
        return r >= 0 && r <= unicode.MaxASCII &&
            (!unicode.IsControl(r) || unicode.IsSpace(r))
    }
Of course, most people would consider many other Unicode character classes as containing "text", so whatever your approach is, the unicode package will likely be helpful for classifying runes.