If you have a good understanding of algorithms and data structures, fluency with the intricacies of your language, and experience with your compiler, you should be capable of writing functions that deliver excellent performance. However, some functions in the Go standard library will still do better. They take advantage of Go internals and low-level, architecture-specific instructions beyond the reach of most experienced Go developers.

Count

One such function is Count in the strings package. It returns the number of non-overlapping occurrences of string substr in string s:

func Count(s, substr string) int
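
To make the contract concrete, here are a couple of calls with the values I would expect from the documented behavior (note that when substr is empty, Count is documented to return 1 plus the number of Unicode code points in s):

strings.Count("cheese", "e") // 3
strings.Count("cheese", "")  // 7: 1 + the 6 runes in "cheese"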

If you follow the code down the rabbit hole, you see it check whether a particular x86 instruction set is available. If it is, you end up in some low-level assembly:

TEXT runtime·countByte(SB),NOSPLIT,$0
    // Shuffle X0 around so that each byte contains
    // the character we're looking for.
    MOVD AX, X0
    PUNPCKLBW X0, X0
    PUNPCKLBW X0, X0
    PSHUFL $0, X0, X0
    // [bunch more low-level instructions...]

and so on.
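
Conceptually, what that assembly computes for a single-byte substring is nothing more than a byte-counting loop; the SIMD registers just let it examine many bytes per instruction. A rough sketch of the plain-Go equivalent (my own illustration, not the actual runtime code):

func countByteNaive(s string, c byte) int {
    // The simple, portable approach: look at one byte per iteration.
    n := 0
    for i := 0; i < len(s); i++ {
        if s[i] == c {
            n++
        }
    }
    return n
}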

User-Defined Count

For comparison purposes, I wrote my own count function. Perhaps it’s not perfectly optimized, but it should perform quite well if substr isn’t large, and the code is relatively readable and straightforward:

func count(s, substr string) int {
    if s == "" || substr == "" {
        return 0
    }
    if len(substr) > len(s) {
        return 0
    }
    result := 0
    j := 0
    for i := 0; i < len(s); {
        if s[i] == substr[j] {
            j++
            i++
            if j == len(substr) {
                result++
                j = 0
            }
        } else {
            i = i - j + 1
            j = 0
        }
    }
    return result
}

Here i indexes s and j indexes substr. On each iteration I check whether the current byte of s matches the expected byte of substr. If it does, I advance both indexes, and if j has just reached the end of substr I have found a complete match, so I increment the result count and reset j to the start of substr. If it doesn't match, i goes back to one position past the point where the current attempt started matching, j resets, and the next iteration starts looking for a fresh match against the start of substr. This continues until i reaches the end of s.
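
To make the backtracking step concrete, here is a quick trace of count("aab", "ab"), an input of my own choosing rather than one of the test cases below:

count("aab", "ab")
    i=0, j=0: s[0]='a' == substr[0]='a' -> advance both (i=1, j=1)
    i=1, j=1: s[1]='a' != substr[1]='b' -> backtrack: i = 1-1+1 = 1, j = 0
    i=1, j=0: s[1]='a' == substr[0]='a' -> advance both (i=2, j=1)
    i=2, j=1: s[2]='b' == substr[1]='b' -> j == len(substr): result=1, j=0, i=3
    i=3: i has reached len(s), so the function returns 1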

Then I wrote a basic test to make sure the function worked and returned the same results as the strings.Count function for a few examples:

func TestCountBasic(t *testing.T) {
    tests := []struct{
        s, substr   string
        r           int
    }{
        {"mississippi", "is", 2},
        {"mississippi", "i", 4},
        {"mississippi", "sis", 1},
        {"mississippi", "pie", 0},
        {"mississippi", "issi", 1},
        {"mississippi", "pi", 1},
        {"powowpowopowowp", "powowp", 2},
        {Multiply("foob", 10), "foo", 10},
    }
    for _, test := range tests {
        result := count(test.s, test.substr)
        if result != test.r {
            t.Errorf("count(%s, %s) yielded %d, expected %d.\n",
                test.s, test.substr, result, test.r)
        }
    }
    for _, test := range tests {
        result := strings.Count(test.s, test.substr)
        if result != test.r {
            t.Errorf("strings.Count(%s, %s) yielded %d, expected %d.\n",
                test.s, test.substr, result, test.r)
        }
    }
}

The result was the same for all tested examples. Multiply is a function I wrote that takes a string s and an int count and returns a string made of s repeated count times. The point is to generate longer strings:

func Multiply(s string, count int) string
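
For example, an illustrative call consistent with that description:

Multiply("foob", 3) // "foobfoobfoob"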

Benchmarking

The next step was to write some benchmark functions that ran my custom function and strings.Count on the same inputs. Each benchmark assigns its result to a package-level variable so the compiler can't optimize away the calls under test:

var globalCountResult int

var foob10, foob100, foob1000, foob10000 string

func init() {
    foob10 = Multiply("foob", 10) + "bar"
    foob100 = Multiply("foob", 100) + "bar"
    foob1000 = Multiply("foob", 1000) + "bar"
    foob10000 = Multiply("foob", 10000) + "bar"
}

func benchMyCount(s, substr string, b *testing.B) {
    var localCountResult int
    for n := 0; n < b.N; n++ {
        localCountResult = count(s, substr)
    }
    globalCountResult = localCountResult
}

func BenchmarkMyCount10(b *testing.B) { benchMyCount(foob10, "foo", b) }
func BenchmarkMyCount100(b *testing.B) { benchMyCount(foob100, "foo", b) }
func BenchmarkMyCount1000(b *testing.B) { benchMyCount(foob1000, "foo", b) }
func BenchmarkMyCount10000(b *testing.B) { benchMyCount(foob10000, "foo", b) }
func BenchmarkMyCountLast10(b *testing.B) { benchMyCount(foob10, "bar", b) }
func BenchmarkMyCountLast100(b *testing.B) { benchMyCount(foob100, "bar", b) }
func BenchmarkMyCountLast1000(b *testing.B) { benchMyCount(foob1000, "bar", b) }
func BenchmarkMyCountLast10000(b *testing.B) { benchMyCount(foob10000, "bar", b) }

func benchStringsCount(s, substr string, b *testing.B) {
    var localCountResult int
    for n := 0; n < b.N; n++ {
        localCountResult = strings.Count(s, substr)
    }
    globalCountResult = localCountResult
}

func BenchmarkStringsCount10(b *testing.B) { benchStringsCount(foob10, "foo", b) }
func BenchmarkStringsCount100(b *testing.B) { benchStringsCount(foob100, "foo", b) }
func BenchmarkStringsCount1000(b *testing.B) { benchStringsCount(foob1000, "foo", b) }
func BenchmarkStringsCount10000(b *testing.B) { benchStringsCount(foob10000, "foo", b) }
func BenchmarkStringsCountLast10(b *testing.B) { benchStringsCount(foob10, "bar", b) }
func BenchmarkStringsCountLast100(b *testing.B) { benchStringsCount(foob100, "bar", b) }
func BenchmarkStringsCountLast1000(b *testing.B) { benchStringsCount(foob1000, "bar", b) }
func BenchmarkStringsCountLast10000(b *testing.B) { benchStringsCount(foob10000, "bar", b) }

The benchmarks with ‘Last’ in the name have exactly one match, hidden at the very end of the string, with no matches before it. The others have matches packed one after the other the whole way down; in foob10, for example, "foo" matches in every repetition of "foob" (10 times), while "bar" matches exactly once, at the very end. These two extremes highlight the different strengths of the two count implementations:

$ go test -bench=Count
BenchmarkMyCount10              20000000               110 ns/op
BenchmarkMyCount100              2000000               995 ns/op
BenchmarkMyCount1000              200000              9843 ns/op
BenchmarkMyCount10000              10000            103432 ns/op
BenchmarkMyCountLast10          10000000               144 ns/op
BenchmarkMyCountLast100          1000000              1285 ns/op
BenchmarkMyCountLast1000          100000             12645 ns/op
BenchmarkMyCountLast10000          10000            125480 ns/op
BenchmarkStringsCount10         10000000               194 ns/op
BenchmarkStringsCount100         1000000              2409 ns/op
BenchmarkStringsCount1000          50000             24506 ns/op
BenchmarkStringsCount10000          5000            252444 ns/op
BenchmarkStringsCountLast10     50000000                28.5 ns/op
BenchmarkStringsCountLast100    10000000               194 ns/op
BenchmarkStringsCountLast1000    2000000               979 ns/op
BenchmarkStringsCountLast10000    200000              8862 ns/op

Conclusions

The above results demonstrate that strings.Count is incredibly fast at finding instances of substr in s. My function only pulls ahead when s is packed with back-to-back matches, and that's an edge case. So unless that's what you expect the vast majority of your inputs to look like, you should use the standard library function.

More broadly, unless you know better, you should try to use the Go standard library where practicable. A little creativity may be required to make it fit your application, but as long as you don’t have to twist into a pretzel to get to a solution it’s probably the right thing to do. If you want to be sure, benchmark a comparison – but in general it’s a better investment to write more application code that takes advantage of the standard library than to write benchmarks to show that you should.

This brief study also reinforces something I have learned in my experience as a software engineer: to be really good at development in a given language, you have to gain fluency with the standard library in addition to the language itself (and any packages you rely on). As one of my mentors concisely put it: “Know your APIs.”

I can’t say that every function in the standard library has the kind of performance advantage you see with strings.Count. The more important thing to keep in mind is that a package you download from GitHub (no matter how popular) is not the same thing as the standard library, and you may need to verify its performance before relying on it.

Postscript: Multiply Implementations

My implementation of the Multiply function you saw above:

func Multiply(s string, count int) string {
    length := len(s) * count
    bresult := make([]byte, length, length)
    bs := []byte(s)
    for x := 0; x < length; x += len(s) {
        ruler := bresult[x:x+len(s)]
        copy(ruler, bs)
    }
    return string(bresult)
}

Here I used my understanding of how slices and strings work in Go to write a function that performs slightly better than a more naive implementation, such as one that uses strings.Join:

func MultiplyJoin(s string, count int) string {
    collection := make([]string, count, count)
    for x := 0; x < count; x++ {
        collection[x] = s
    }
    return strings.Join(collection, "")
}

For small inputs, say 10 < count < 150, the difference is trivial:

BenchmarkMultiply            500000              2696 ns/op
BenchmarkMultiplyJoin        500000              3824 ns/op

At larger inputs, say count = 500, the difference becomes noticeable:

BenchmarkMultiply            300000              4506 ns/op
BenchmarkMultiplyJoin        200000             10623 ns/op

But even then the absolute difference is so small that for most applications you just can't justify adding any complexity, and you should implement the function in whatever way you find most readable and straightforward. The exception: you really shouldn't use repeated string concatenation. This is just bad:

func MultiplyConcatenations(s string, count int) string {
    result := ""
    for x := 0; x < count; x++ {
        result += s
    }
    return result
}

Each += allocates a brand-new string and copies everything accumulated so far, so the total work grows quadratically with count. The terrible results speak for themselves:

$ go test -bench=Multiply
BenchmarkMultiplyBasic                    300000              4240 ns/op
BenchmarkMultiplyJoinBasic                200000              9942 ns/op
BenchmarkMultiplyConcatenationsBasic       10000            170523 ns/op

Multiply Buffer

I wondered if the bytes.Buffer type from the standard library might perform better than my homemade byte-slice Multiply implementation, so I tested it. First, a relatively naive use of bytes.Buffer:

func MultiplyBuffer1(s string, count int) string {
    var b bytes.Buffer
    for x := 0; x < count; x++ {
        b.WriteString(s)
    }
    return b.String()
}

The expected performance issue here is that the buffer's default capacity won't be enough for any significant count, so the buffer has to grow, reallocating and copying its contents, several times as the result is built. That's why I put a "1" in the function name and added a second version that allocates the entire buffer up front:

func MultiplyBuffer2(s string, count int) string {
    bresult := make([]byte, 0, len(s) * count)
    bs := []byte(s)
    b := bytes.NewBuffer(bresult)
    for x := 0; x < count; x++ {
        b.Write(bs)
    }
    return b.String()
}

This looks much better. Converting s to a byte slice has a trivial performance cost, but even so I did it outside the loop so it doesn't have to be repeated. I gave bresult a length of 0 because the package documentation says the argument passed to NewBuffer “should have the desired capacity but a length of zero.”

So: can using bytes.Buffer in this way outperform my homemade Multiply? Here are the benchmark results for count = 500:

$ go test -bench=Multiply
BenchmarkMultiplyBasic                    300000              4422 ns/op
BenchmarkMultiplyJoinBasic                200000              9904 ns/op
BenchmarkMultiplyConcatenationsBasic       10000            158656 ns/op
BenchmarkMultiplyBuffer1Basic             200000             12336 ns/op
BenchmarkMultiplyBuffer2Basic             200000              9667 ns/op

It turns out my homemade implementation still comes out ahead. But the absolute difference is too small to be compelling unless this is a major activity of your application.