Here is a typical example of what I need to do
$testArr = array(2.05080E6,29400,420);
$stockArrays = array(
array(2.05080E6,29400,0),
array(2.05080E6,9800,420),
array(1.715E6,24500,280),
array(2.05080E6,29400,140),
array(2.05080E6,4900,7));
I need to identify the stockArray that is the least different. A few clarifications
The stock arrays will always have the same number of elements. The test array will have the same or fewer elements. However, when fewer testArr would be padded out so that potentially matching elements are always in the same place as the stockArray. e.g.
$testArray(29400,140)
would be transformed to
$testArray(0,29400,140);
prior to being subjected to difference testing.
In my example the result would be
$result = array(0=>array(0,0,1),3=>array(0,0,1));
indicating that the least different stock arrays are at indices 0 & 3 with the differences being at position 2.
In PHP I would handle all of this with array_diff as my starting point. For Node/JavaScript I would probably be tempted to the php.js array_diff port though I would be inclined to explore a bit given that in the worst cast scenario it is an O(n2) affair.
I am a newbie when it comes to Golang so I am not sure how I would implement this problem there. I have noted that Node does have an array_diff npm module.
One off-beat idea I have had is converting the array to a padded string (smaller array elements are 0 padded) and effectively do an XOR on the ordinal value of each character but have dismissed that as probably a rather nutty thing to do.
I am concerned with speed but not at all costs. In an ideal world the same solution (algorithm) would be used in each target language though in reality the differences between them might mean that is not possible/not a good idea.
Perhaps someone here might be able to point me to less pedestrian ways of accomplishing this - i.e. not just array_diff ports.
Here's the equivalent of the array_diff solution: (assuming I didn't make a mistake)
package main
import "fmt"
func FindLeastDifferent(needle []float64, haystack [][]float64) int {
if len(haystack) == 0 {
return -1
}
var currentIndex, currentDiff int
for i, arr := range haystack {
diff := 0
for j := range needle {
if arr[j] != needle[j] {
diff++
}
}
if i == 0 || diff < currentDiff {
currentDiff = diff
currentIndex = i
}
}
return currentIndex
}
func main() {
idx := FindLeastDifferent(
[]float64{2.05080E6, 29400, 420},
[][]float64{
{2.05080E6, 29400, 0},
{2.05080E6, 9800, 420},
{1.715E6, 24500, 280},
{2.05080E6, 29400, 140},
{2.05080E6, 4900, 7},
{2.05080E6, 29400, 420},
},
)
fmt.Println(idx)
}
Like you said its O(n * m)
where n
is the number of elements in the needle array, and m
is the number of arrays in the haystack.
If you don't know the haystack ahead of time, then there's probably not much you can do to improve this. But if, instead, you're storing this list in a database, I think your intuition about string search has some potential. PostgreSQL for example supports string similarity indexes. (And here's an explanation of a similar idea for regular expressions: http://swtch.com/~rsc/regexp/regexp4.html)
One other idea: if your arrays are really big you can calculate fuzzy hashes (http://ssdeep.sourceforge.net/) which would make your n
smaller.