Go: unique slice of structs, efficient and reusable

I often need to remove duplicates from a slice based on an arbitrary equality function. I need an implementation that:

  1. is fast and memory efficient (does not create a map)
  2. is reusable and easy to use, like slice.Sort() (github.com/bradfitz/slice)
  3. does not need to keep the order of the original slice or preserve the original slice
  4. ideally minimizes copying

Can this be implemented in Go? Why is such a function not part of any library I am aware of? I looked at, e.g., godash (github.com/zillow/godash), but its implementation uses a map and does not allow arbitrary less and equal functions.

Here is approximately how it should look. Test:

import (
    "reflect"
    "testing"
)

type bla struct {
    ID string
}

type blas []bla

func (slice blas) Less(i, j int) bool {
    return slice[i].ID < slice[j].ID
}

func (slice blas) EqualID(i, j int) bool {
    return slice[i].ID == slice[j].ID
}

func Test_Unique(t *testing.T) {
    input := []bla{bla{ID: "d"}, bla{ID: "a"}, bla{ID: "b"}, bla{ID: "a"}, bla{ID: "c"}, bla{ID: "c"}}
    expected := []bla{bla{ID: "a"}, bla{ID: "b"}, bla{ID: "c"}, bla{ID: "d"}}
    Unique(input, blas(input).Less, blas(input).EqualID)
    if !reflect.DeepEqual(expected, input) {
        t.Errorf("2: Expected: %v but was %v\n", expected, input)
    }
}

What I think will need to be used to implement this:

  • Only slices as the data structure, to keep it simple and for easy sorting.
  • Some reflection, which is the hard part for me since I am new to Go.

Options

  • You can sort the slice and check adjacent elements: creation = O(n log n), lookup = O(log n), insertion = O(n), deletion = O(n)
  • You can use a tree together with the original slice: creation = O(n log n), lookup = O(log n), insertion = O(log n), deletion = O(log n), provided the tree stays balanced

In the tree implementation you can store only the index in each tree node; nodes are then compared using the Equal/Less functions defined on the interface.

Here is an example with a tree; here is the play link

You have to add more functions to make it usable, and the code is not cache friendly, so you may want to improve it in that respect

How to use

  1. Make the type representing the slice implement the Setter interface
  2. set := NewSet(slice) creates a set
  3. set.T now holds only the indexes of the unique values
  4. Implement more functions on Set for other set operations

Code

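// Set holds a tree of indexes into Slice, one per distinct element.
type Set struct {
    T     *Tree
    Slice Setter
}

// NewSet builds the index tree for slice, skipping every element that
// Equal reports as a duplicate of one already inserted.
func NewSet(slice Setter) *Set {
    set := &Set{Slice: slice}
    // Start from a nil root so that an empty slice yields an empty tree.
    for i := 0; i < slice.Len(); i++ {
        set.T = insert(set.T, slice, i)
    }
    return set
}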

// Setter is what a slice type has to implement to be used with Set.
type Setter interface {
    Len() int
    At(int) (interface{}, error)
    Less(int, int) bool
    Equal(int, int) bool
}

// A Tree is a binary search tree whose values are indexes into the slice.
type Tree struct {
    Left  *Tree
    Value int
    Right *Tree
}

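// insert adds index to the tree rooted at t unless an equal element is
// already present, and returns the (possibly new) root.
func insert(t *Tree, s Setter, index int) *Tree {
    if t == nil {
        return &Tree{nil, index, nil}
    }
    if s.Equal(t.Value, index) {
        return t // duplicate: keep the first index seen
    }

    if s.Less(index, t.Value) {
        // Smaller elements go left, so an in-order walk is ascending.
        t.Left = insert(t.Left, s, index)
        return t
    }
    t.Right = insert(t.Right, s, index)
    return t
}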

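A Bloom filter is frequently used for membership tests. There is https://github.com/willf/bloom, for example, which has earned some stars on GitHub. This particular implementation uses murmur3 for hashing and a bitset for the filter, so it can be more memory efficient than a map. Keep in mind that a Bloom filter can report false positives, so on its own it may discard an element that is not actually a duplicate.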