As you can see in the following pprof output, these nested for loops take most of my program's running time. The source is in Go, but the code is explained below:
8.55mins 1.18hrs 20: for k := range mapSource {
4.41mins 1.20hrs 21: if positions, found := mapTarget[k]; found {
. . 22: // save all matches
1.05mins 1.05mins 23: for _, targetPos := range positions {
2.25mins 2.33mins 24: for _, sourcePos := range mapSource[k] {
1.28s 15.78s 25: matches = append(matches, match{int32(targetPos), int32(sourcePos)})
. . 26: }
. . 27: }
. . 28: }
. . 29: }
At the moment the structures I'm using are two map[int32][]int32, targetMap and sourceMap. These maps contain, for a given key, a slice of int32s. Now I want to find the keys that match in both maps, and save the combinations of the elements in the slices.
So for example:
sourceMap[1] = [3,4]
sourceMap[5] = [9,10]
targetMap[1] = [1,2,3]
targetMap[2] = [2,3]
targetMap[3] = [1,2]
The only key in common is 1, and the result will be [(3,1), (3,2), (3,3), (4,1), (4,2), (4,3)].
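For reference, here is a minimal runnable version of the hot loop with this example data (match is just a small struct holding the two positions; the field names are illustrative):

package main

import "fmt"

type match struct {
    target, source int32
}

func main() {
    mapSource := map[int32][]int32{1: {3, 4}, 5: {9, 10}}
    mapTarget := map[int32][]int32{1: {1, 2, 3}, 2: {2, 3}, 3: {1, 2}}

    var matches []match
    for k := range mapSource {
        if positions, found := mapTarget[k]; found {
            for _, targetPos := range positions {
                for _, sourcePos := range mapSource[k] {
                    matches = append(matches, match{targetPos, sourcePos})
                }
            }
        }
    }
    // Only key 1 is common; pair order depends on map iteration order,
    // and pairs come out as (target, source) here, matching the loop above.
    fmt.Println(matches)
}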
Is there any possible way (a more appropriate data structure or whatever) that could improve the speed of my program?
In my case, maps can contain somewhere between 1000 and 150000 keys, while the arrays inside are usually pretty small.
EDIT: Concurrency is not an option as this is already being run several times in several threads at the same time.
I would probably do it like this so that I can do some of the work concurrently:
https://play.golang.org/p/JHAmPRh7jr
package main

import (
    "fmt"
    "sync"
)

var final [][]int32
var wg sync.WaitGroup
var receiver chan []int32

func main() {
    final = [][]int32{}
    mapTarget := make(map[int32][]int32)
    mapSource := make(map[int32][]int32)

    mapSource[1] = []int32{3, 4}
    mapSource[5] = []int32{9, 10}
    mapTarget[1] = []int32{1, 2, 3}
    mapTarget[2] = []int32{2, 3}
    mapTarget[3] = []int32{1, 2}

    wg = sync.WaitGroup{}
    receiver = make(chan []int32)

    // Single collector goroutine, so only one goroutine ever touches final.
    go func() {
        for elem := range receiver {
            final = append(final, elem)
            wg.Done()
        }
    }()

    for k := range mapSource {
        if _, ok := mapTarget[k]; ok {
            wg.Add(1)
            go permutate(mapSource[k], mapTarget[k])
        }
    }

    wg.Wait()
    fmt.Println(final)
}

// permutate sends every (a[i], b[j]) combination to the collector.
func permutate(a, b []int32) {
    for i := 0; i < len(a); i++ {
        for j := 0; j < len(b); j++ {
            wg.Add(1)
            receiver <- []int32{a[i], b[j]}
        }
    }
    wg.Done()
}
You may even want to see if you get any benefit from this:
for k := range mapSource {
    wg.Add(1)
    go func(k int32) {
        if _, ok := mapTarget[k]; ok {
            wg.Add(1)
            go permutate(mapSource[k], mapTarget[k])
        }
        wg.Done()
    }(k)
}
Can I optimise this further so that it runs faster?
Is there any possible way (a more appropriate data structure or whatever) that could improve the speed of my program?
Probably.
The XY problem is asking about your attempted solution rather than your actual problem. This leads to enormous amounts of wasted time and energy, both on the part of people asking for help, and on the part of those providing help.
We don't have even the most basic information about your problem: a description of the form, content, and frequency of your original input data, or of your desired output. What original data should drive a benchmark?
I created some fictional original data, which produced some fictional output and results:
BenchmarkPeterSO-4    30     44089894 ns/op    5776666 B/op      31 allocs/op
BenchmarkIvan-4       10    152300554 ns/op   26023924 B/op    6022 allocs/op
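For illustration, a harness along these lines (saved as a _test.go file); the generator, key counts, and slice lengths here are fictional stand-ins, not data from the question:

package main

import (
    "math/rand"
    "testing"
)

type match struct {
    target, source int32
}

// genMap builds fictional input: up to `keys` entries, each holding a
// small slice, mirroring "1000 to 150000 keys" with small arrays inside.
func genMap(keys, maxLen int, r *rand.Rand) map[int32][]int32 {
    m := make(map[int32][]int32, keys)
    for i := 0; i < keys; i++ {
        s := make([]int32, 1+r.Intn(maxLen))
        for j := range s {
            s[j] = r.Int31()
        }
        m[int32(r.Intn(2*keys))] = s // overlapping key ranges so some keys match
    }
    return m
}

func BenchmarkNestedLoops(b *testing.B) {
    r := rand.New(rand.NewSource(1))
    mapSource := genMap(100000, 4, r)
    mapTarget := genMap(100000, 4, r)
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        var matches []match
        for k := range mapSource {
            if positions, found := mapTarget[k]; found {
                for _, targetPos := range positions {
                    for _, sourcePos := range mapSource[k] {
                        matches = append(matches, match{targetPos, sourcePos})
                    }
                }
            }
        }
        _ = matches
    }
}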
It is possible that your algorithms are slow.
At first I was thinking:
1. Calculate the keys in common in one batch, and from that the final slice size.
2. Make a slice with the capacity computed in step 1.
3. Append one by one (see the sketch after this list).
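A sketch of those three steps, borrowing the names from the question:

// Step 1: one pass to find the common keys and the exact final size.
size := 0
for k, sourcePositions := range mapSource {
    if targetPositions, found := mapTarget[k]; found {
        size += len(sourcePositions) * len(targetPositions)
    }
}

// Step 2: a single allocation with the computed capacity.
matches := make([]match, 0, size)

// Step 3: append one by one; the slice never has to grow.
for k, sourcePositions := range mapSource {
    if targetPositions, found := mapTarget[k]; found {
        for _, targetPos := range targetPositions {
            for _, sourcePos := range sourcePositions {
                matches = append(matches, match{targetPos, sourcePos})
            }
        }
    }
}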
Then I considered the next structure; it would not generate the final result as an array, since all of the append work would just be linking nodes:
type node struct {
    val    int
    parent *node
    next   *node
    child  *node
}

type tree struct {
    root  *node
    level int
}

var sourceMap map[int]*tree
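One way the linking could work (this is my reading of the idea, the helper below is illustrative, and the parent field is left unused): build each common key's source chain once and share it as the child list of every target node, so adding a key is O(len(targets) + len(sources)) pointer writes and no (target, source) pair is ever materialized.

func addKey(t *tree, targets, sources []int32) {
    // Build the chain of source positions once per key.
    var srcHead *node
    for _, s := range sources {
        srcHead = &node{val: int(s), next: srcHead}
    }
    // Every target node shares the same child chain; appending is just linking.
    for _, tv := range targets {
        t.root = &node{val: int(tv), child: srcHead, next: t.root}
    }
}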
The best optimization probably involves changing the source and target data structures to begin with so you don't have to iterate as much, but it's hard to be sure without knowing more about the underlying problem you're solving and how the maps are generated.
However, there is an optimization that should get you a roughly 2x boost (just an educated guess), depending on the exact numbers.
// Gather the matching slices first, keeping the per-key pairing
// (crossing *all* sources with *all* targets would pair positions
// from different keys), then compute the exact result size so the
// matches slice is allocated exactly once.
type group struct {
    src, tgt []int32
}

var groups []group
size := 0
for k, srcPositions := range mapSource {
    if tgtPositions, found := mapTarget[k]; found {
        groups = append(groups, group{srcPositions, tgtPositions})
        size += len(srcPositions) * len(tgtPositions)
    }
}

// Fill the preallocated slice; append never has to grow it.
matches = make([]match, 0, size)
for _, g := range groups {
    for _, s := range g.src {
        for _, t := range g.tgt {
            matches = append(matches, match{s, t})
        }
    }
}
The general idea is to minimize the amount of copying that has to be done and improve the locality of memory references. I think this is about the best you can do with this data structure. My hunch is that this isn't the best data structure to begin with for the underlying problem, and there are much bigger gains to be had.