I am trying to figure out how I can align slightly imperfect binary slices using Go. The following four slices all align correctly at different offsets. However, not every bit is the same (mismatches are marked below), so I can't just compare raw chunks.
func main() {
// Match all four slices up (ignoring occasional bit errors)
s1 := []int16{0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1}
s2 := []int16{ /* */ 0, 1, 1, 0, 0, 0, 1, 1, 1, 1}
// ^ ^
s3 := []int16{0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1}
// ^
s4 := []int16{ /* */ 0, 0, 0, 1, 1, 1, 0, 0}
slices := [][]int16{s1, s2, s3, s4}
offsets := forgivingSyncHere(slices)
}
Here is a Go Playground link: https://play.golang.org/p/zqJ_4qLc8O
It depends on what your "cost" function is; the goal is then to find the offsets that minimize that cost.
A cost function could be something like this. The idea is that a "mismatch" is more costly than having nothing to match against at all, which we'll call an "overrun" (say, twice as costly). Take the number of positions where a[i] != b[i + offset] for each pair a, b drawn from s1, s2, s3, s4, and double it. Then, for each pairing (in this case 6 pairings for 4 arrays), add the absolute value of the offset, which counts the overrun at the beginning. Finally, add the overrun at the end.
Sample cost function:
func cost(sn [][]int16, offsets [][]int) int {
	// cost accumulator
	c := 0
	// the factor of how much more costly a mismatch is than an overrun
	mismatchFactor := 2
	// define what you want, but here is an example of what I said above
	for i1 := 0; i1 < len(sn); i1++ {
		for i2 := i1 + 1; i2 < len(sn); i2++ {
			// mismatched bits in the overlapping region
			c += mismatchFactor * diff(sn[i1], sn[i2], offsets[i1][i2])
			// overrun at the beginning
			c += abs(offsets[i1][i2])
			// overrun at the end
			c += abs(offsets[i1][i2] + len(sn[i2]) - len(sn[i1]))
		}
	}
	return c
}

// abs is an integer absolute value (math.Abs only works on float64)
func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}
// diff counts mismatching bits, where offset is the position of s2 relative to the start of s1
func diff(s1 []int16, s2 []int16, offset int) int {
	// index into s1, index into s2, mismatch total
	i1, i2, d := 0, 0, 0
	if offset >= 0 {
		i1 += offset
	} else {
		i2 -= offset
	}
	// walk the overlapping region and count differing bits
	for i1 < len(s1) && i2 < len(s2) {
		if s1[i1] != s2[i2] {
			d++
		}
		i1++
		i2++
	}
	return d
}
Make your cost function however you want; this is just an example. However, assuming you have a cost function, a brute force algorithm is pretty easy to come up with: try candidate offsets, score each with the cost function, and keep the cheapest (see the sketch below). You can try to optimize the algorithm, though :). There are many ideas. This is very similar to string search algorithms with edit distances; Wikipedia and Google have many results.
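For instance, here is a minimal sketch of that brute force, reusing diff and abs from above. The names bestOffset and forgivingSyncHere are mine, and to keep things simple it aligns every slice against the first one pairwise rather than jointly minimizing the cost over all 6 pairings:

// bestOffset tries every offset of b relative to a and returns the one
// that minimizes the pairwise cost described above
func bestOffset(a, b []int16) int {
	best, bestCost := 0, -1
	for off := -len(b); off <= len(a); off++ {
		// mismatches are twice as costly as overruns
		c := 2*diff(a, b, off) + abs(off) + abs(off+len(b)-len(a))
		if bestCost < 0 || c < bestCost {
			best, bestCost = off, c
		}
	}
	return best
}

func forgivingSyncHere(slices [][]int16) []int {
	// offsets[i] is slice i's position relative to the first slice
	offsets := make([]int, len(slices))
	for i := 1; i < len(slices); i++ {
		offsets[i] = bestOffset(slices[0], slices[i])
	}
	return offsets
}

A full brute force would instead enumerate offset combinations for all slices and pick the combination with the lowest total cost; the pairwise version is just cheaper and usually a good enough starting point.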
Disclaimer: all of this is untested :), but it should get you started
Similarity can be described using Levenshtein distance, aka edit distance - the number of insertions, deletions and substitutions a string has to undergo to become another.
This allows you to talk quantitatively about the degree of similarity; for example, the ratio of the edit distance to the shorter of the two string lengths might be a reasonable metric, but this depends on what exactly you mean by similar.
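For reference, a straightforward dynamic-programming edit distance over int16 slices might look like this (a sketch; levenshtein is my name for it):

// levenshtein returns the edit distance between a and b: the minimum
// number of insertions, deletions and substitutions to turn a into b
func levenshtein(a, b []int16) int {
	// prev holds the previous row of the DP table
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // empty prefix of a becomes b[:j] via j insertions
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i // a[:i] becomes empty via i deletions
		for j := 1; j <= len(b); j++ {
			sub := prev[j-1] // substitution (free if the bits match)
			if a[i-1] != b[j-1] {
				sub++
			}
			cur[j] = min(sub, min(prev[j]+1, cur[j-1]+1)) // vs deletion, insertion
		}
		prev = cur
	}
	return prev[len(b)]
}

func min(x, y int) int {
	if x < y {
		return x
	}
	return y
}

The similarity ratio mentioned above would then be levenshtein(a, b) divided by the shorter of the two lengths.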
From there you can find a lower bound for the length and number of identical substrings in each pair of strings to be compared. In your case, eyeballing your example, it looks like 4 bits are what you consider a match. IOW, if you use 4-bit chunks and check for exact matches for a given pair of sequences, what does the number of matching chunks tell you about their similarity?
If you do exact comparisons on small enough substrings, you're guaranteed at least a few matches even if the differences are evenly spaced; if they're clustered, larger sections will match exactly. By the pigeonhole principle, a chunk size of roughly the length of the shorter string divided by the maximum edit distance guarantees at least one exact chunk match.
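As a concrete sketch of that chunked comparison (the names countMatchingChunks and equal are mine, and 4 is just the eyeballed chunk size from above):

// countMatchingChunks splits a into non-overlapping chunks of the given
// size and counts how many occur exactly, anywhere, in b
func countMatchingChunks(a, b []int16, size int) int {
	matches := 0
	for i := 0; i+size <= len(a); i += size {
		// slide the chunk over b looking for an exact occurrence
		for j := 0; j+size <= len(b); j++ {
			if equal(a[i:i+size], b[j:j+size]) {
				matches++
				break
			}
		}
	}
	return matches
}

// equal assumes len(a) == len(b), which holds for the fixed-size chunks here
func equal(a, b []int16) bool {
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

A high ratio of matching chunks to total chunks then suggests the sequences are similar, without computing a full edit distance.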
An exact solution for generic approximate matching of strings that takes the position of the similarities into account is computationally intensive (and a related and well studied problem is the longest common subsequence), and furthermore has to be applied for each pair of sequences to be compared, not each sequence independently. In contrast, choosing an appropriate chunk size and indexing each string by its possible substrings only depends on each string in isolation, and may provide an approximate answer. You could of course note the position and further constrain that as well.
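Here is what that per-sequence indexing could look like (a sketch under my own naming; each n-bit window is packed into an integer key, and positions are recorded so you can also reason about offsets):

// indexChunks maps every sliding n-bit window of s, packed into an
// integer key, to the positions where that window starts. The index
// depends only on s itself; comparing two sequences is then a matter
// of intersecting their indexes, and the position differences of
// shared chunks are candidate alignment offsets.
func indexChunks(s []int16, n int) map[int][]int {
	idx := make(map[int][]int)
	for i := 0; i+n <= len(s); i++ {
		key := 0
		for _, bit := range s[i : i+n] {
			key = key<<1 | int(bit) // pack the window's bits into an int
		}
		idx[key] = append(idx[key], i)
	}
	return idx
}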
To sum it up, a naive solution for the problem you're describing is likely to be intractable, and in practice sequence alignment and approximate string matching are done by solving a simpler problem instead, and then double-checking those solutions with a more naive/expensive approach.
Rabin fingerprinting is a more sophisticated method of breaking up strings into substrings than n-grams (a sliding window). The cut points look random but are a deterministic function of the content, and the resulting chunks are variable-sized, so working out the bounds is a bit more involved.
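To give a flavor of the idea, here is a content-defined chunking sketch using a generic polynomial rolling hash rather than a true Rabin fingerprint over GF(2); all names and parameters here are mine:

// chunkBoundaries scans s with a rolling hash over a fixed window and
// cuts a chunk wherever the hash's low bits (selected by mask) are all
// zero, so boundaries depend only on local content
func chunkBoundaries(s []int16, window int, mask uint64) []int {
	const base = 257
	// basePow = base^(window-1), used to drop the element leaving the window
	basePow := uint64(1)
	for i := 1; i < window; i++ {
		basePow *= base
	}
	var hash uint64
	var cuts []int
	for i := range s {
		if i >= window {
			// remove the contribution of the element sliding out
			hash -= (uint64(s[i-window]) + 1) * basePow
		}
		hash = hash*base + uint64(s[i]) + 1 // +1 so zero bits still contribute
		// once the window is full, cut where the masked hash is zero
		if i+1 >= window && hash&mask == 0 {
			cuts = append(cuts, i+1)
		}
	}
	return cuts
}

Because a boundary depends only on the window's content, two shifted copies of the same data cut in the same places, which is what makes the resulting chunks comparable across sequences.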