@wz1000
Contributor

@wz1000 wz1000 commented Jul 7, 2022

Also avoid linking against gcc/gcc_s on all platforms.

This works around https://gitlab.haskell.org/ghc/ghc/-/issues/21787
and https://gitlab.haskell.org/ghc/ghc/-/issues/19900 which cause
problems when GHC's RTS linker tries to load text, which occurs if
you use a statically linked GHC to compile a file with a TH splice that
depends on text.

Since we don't require SSE4.2 to build text -simdutf, this shouldn't
be much of a pessimisation.

Fixes #450

This is meant to be a temporary workaround for https://gitlab.haskell.org/ghc/ghc/-/issues/21787 while we work on a robust method for properly exposing all GCC symbols from the RTS linker.

However, older versions of GHC (particularly the 9.0 series and earlier) which won't be patched still need a workaround so that text-2.0 is usable under all configurations.

It is also problematic to include extra-libraries: gcc_s when compiling with clang.

@Bodigrim
Contributor

Bodigrim commented Jul 7, 2022

Thanks @wz1000. First of all, before we discuss the chosen approach, I'd like to see reproducible evidence of the issue in the form of an additional CI job. This would help us validate the solution and prevent future regressions.

@wz1000
Contributor Author

wz1000 commented Jul 11, 2022

@Bodigrim I've modified the -simdutf CI job to test on Windows (the only platform for which we ship a statically linked GHC distribution). I also changed it to test GHC 9.2.1 instead of latest, because latest currently resolves to 9.2.2, and Chocolatey includes a workaround for that version which also masks the text issue.

See https://github.com/haskell/text/runs/7279337599 (which is a run from #454) for an example of the failure without this patch.

wz1000 force-pushed the wip/no-popcount branch from e85f4e0 to 9524ab3 on July 11, 2022 09:33
@Bodigrim

@wz1000

text/text.cabal

Lines 197 to 198 in 971051b

if os(windows) && impl(ghc < 9.3)
  extra-libraries: gcc_s

is there for a reason indeed. If you remove it, the build breaks. But this does not motivate me to avoid __builtin_popcountll, because extra-libraries: gcc_s is a much simpler workaround. What I asked for was a demonstration that the existing setup is not enough, that is, a reproducible configuration in which the build fails.

wz1000 force-pushed the wip/no-popcount branch from 9524ab3 to 249eb46 on July 12, 2022 11:03

wz1000 commented Jul 12, 2022

I've improved the implementation to only use 2 bit shifts and a multiplication to compute the popcount and added a CI job that tests it on alpine.
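
For readers unfamiliar with the trick, here is a minimal sketch of how a population count can be reduced to two shifts and one multiplication. It is not the code from this PR. It assumes the input word has at most one set bit per byte, located at bit 7 (for example after masking with 0x8080808080808080), and both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper for illustration; not the PR's implementation.
 * Precondition: only bit 7 of each byte of x may be set. */
static inline unsigned popcount_high_bits(uint64_t x)
{
    /* Shift 1: move each byte's bit 7 down to bit 0, so every byte is 0 or 1. */
    uint64_t ones = x >> 7;
    /* Multiplication: the top byte of the product becomes the sum of all
     * eight bytes. The sum is at most 8, so no carry crosses a byte lane. */
    uint64_t sums = ones * 0x0101010101010101ULL;
    /* Shift 2: bring the top byte down to the low bits. */
    return (unsigned)(sums >> 56);
}

/* Bit-by-bit reference loop, used only to cross-check the result. */
static inline unsigned popcount_naive(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        n += (unsigned)(x & 1u);
        x >>= 1;
    }
    return n;
}
```

The precondition matters: for arbitrary input the byte lanes would have to be reduced first, which costs additional shifts and masks.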

wz1000 force-pushed the wip/no-popcount branch 2 times, most recently from 72e91da to 4820609, on July 12, 2022 11:07

wz1000 commented Jul 12, 2022

Unfortunately the static alpine GHC binaries are not usable because of https://gitlab.haskell.org/ghc/ghc/-/issues/21844

I'm stumped by this, not sure how to add a non windows CI job that demonstrates the problem.

wz1000 force-pushed the wip/no-popcount branch 3 times, most recently from a777146 to f6c3cd4, on July 12, 2022 11:40

wz1000 commented Jul 12, 2022

I guess I can try to run it in an alpine container and it should work. I'll try this tomorrow.

@Bodigrim

I'm stumped by this, not sure how to add a non windows CI job that demonstrates the problem.

As I suggested in #450 (comment), try to replicate the bytestring CI setup, which runs the Windows job with a clean PATH: https://github.com/haskell/bytestring/blob/22b36125ac52605e807b7b96ef31e8f087248f17/.github/workflows/ci.yml#L93-L99
If this proves that extra-libraries: gcc_s is not a good solution, we are all set, no need for Alpine.

wz1000 force-pushed the wip/no-popcount branch from f6c3cd4 to ed612df on July 14, 2022 09:43

wz1000 commented Jul 14, 2022

I've fixed the Alpine job and also attempted to add a Windows job that performs the check you added in bytestring, but that job seems to fail in a different way from the intended failure mode. I'm not sure what to make of this. See https://github.com/haskell/text/runs/7337239414?check_suite_focus=true

@Bodigrim

The simplest setup I can come up with is https://github.com/Bodigrim/text/tree/purge-path. After the first commit the build succeeds, but as soon as we purge PATH, it fails (the error message is just error code 1, that's fine). If this setup is enough, I'd rather avoid an Alpine job with an unreleased GHC version.

Now, I don't quite understand what this has to do with the simdutf flag. The code is used unconditionally for any length / drop / take, so its performance is absolutely crucial. Please accompany your patch with benchmark results (cabal bench --benchmark-options='-p length').

@bgamari
Contributor

bgamari commented Jul 15, 2022

The failure in #454 is due not to text but rather bytestring, which text links against. See haskell/bytestring#497.

This should be fixed in the bytestring shipped with 9.2.3.

bgamari commented Jul 15, 2022

#454 also neglects to disable the simdutf8 flag, which introduces a dependency on libstdc++, which will of course fail for the same reason described in the last paragraph of GHC #20878.

wz1000 commented Jul 15, 2022

Like Ben said, there are two mostly independent issues here:

  1. The RTS linker does not know about __popcountdi2, so it is unable to link any code referencing it. This is tracked by "text-2.* -simdutf8 is broken with statically linked GHC" (#450).
  2. extra-libraries: gcc_s is in itself problematic. This is tracked by "Linking against gcc_s is problematic on windows" (#456).

#450 is only triggered if we don't link against extra-libraries: gcc_s. So it is triggered in the Alpine job, and in the Windows job if we remove the extra-libraries section.

The alpine CI job I added demonstrates issue #450. The windows CI job added by this PR demonstrates #456.
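
To make issue #450 concrete: when GCC targets a CPU without a native popcount instruction, __builtin_popcountll is typically compiled to a call to the libgcc helper __popcountdi2, which is the symbol the RTS linker does not know about. Below is a minimal sketch of a dependency-free alternative using the well-known SWAR bit-counting technique. It is an illustration only, not the code in this PR, and `popcount64_portable` and `popcount64_builtin` are hypothetical names.

```c
#include <stdint.h>

/* Portable 64-bit popcount: plain integer arithmetic, no libgcc helper.
 * Hypothetical name; illustration only. */
static inline unsigned popcount64_portable(uint64_t x)
{
    /* Count bits within each 2-bit field. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit counts into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit counts into bytes (each byte is at most 8). */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* Add all bytes together; the total lands in the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* GCC/Clang builtin, shown for comparison. Without a popcount
 * instruction enabled, GCC may lower this to a call to __popcountdi2. */
static inline unsigned popcount64_builtin(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}
```

The portable version costs more instructions per word than a hardware popcount, which is why the benchmark numbers requested in this thread matter.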

@Bodigrim

@wz1000 could you please check performance impact of your change?

wz1000 added 3 commits July 18, 2022 17:45
and avoid linking against gcc/gcc_s on all platforms.

This works around https://gitlab.haskell.org/ghc/ghc/-/issues/21787
and https://gitlab.haskell.org/ghc/ghc/-/issues/19900 which cause
problems when GHC's RTS linker tries to load `text`, which occurs if
you use a statically linked GHC to compile a file with a TH splice that
depends on `text`.

Fixes haskell#450
wz1000 force-pushed the wip/no-popcount branch from 8021973 to 17c4ab1 on July 18, 2022 12:15

wz1000 commented Jul 18, 2022

Here are the results of running cabal bench --benchmark-options='-p length' with GHC 9.2.3

Before: https://gist.github.com/wz1000/7604ab1663d8612bbd8485b13ae22f86#file-before-csv
After: https://gist.github.com/wz1000/7604ab1663d8612bbd8485b13ae22f86#file-after-csv

wz1000 commented Jul 18, 2022

Comparisons using --baseline:

Running 1 benchmarks...
Benchmark text-benchmarks: RUNNING...
All
  Pure
    tiny
      length
        cons
          Text:     OK (0.23s)
            22.2 ns ± 1.3 ns
          LazyText: OK (0.24s)
            25.4 ns ± 1.4 ns
        decode
          Text:     OK (0.21s)
            40.6 ns ± 2.7 ns
          LazyText: OK (0.29s)
            125  ns ±  10 ns
        drop
          Text:     OK (0.17s)
            30.7 ns ± 2.6 ns,  8% faster than baseline
          LazyText: OK (0.17s)
            31.3 ns ± 3.0 ns, 10% faster than baseline
        filter
          Text:     OK (0.58s)
            15.6 ns ± 382 ps,  6% faster than baseline
          LazyText: OK (0.96s)
            26.3 ns ± 638 ps,  7% faster than baseline
        filter.filter
          Text:     OK (0.17s)
            15.0 ns ± 1.4 ns,  9% faster than baseline
          LazyText: OK (0.26s)
            26.2 ns ± 1.4 ns,  9% faster than baseline
        init
          Text:     OK (0.37s)
            19.4 ns ± 1.9 ns
          LazyText: OK (0.25s)
            24.1 ns ± 2.0 ns, 10% faster than baseline
        intercalate
          Text:     OK (0.24s)
            24.2 ns ± 2.0 ns, 18% faster than baseline
          LazyText: OK (0.26s)
            26.1 ns ± 2.1 ns, 10% faster than baseline
        intersperse
          Text:     OK (0.48s)
            26.4 ns ± 1.4 ns,  6% faster than baseline
          LazyText: OK (0.16s)
            27.5 ns ± 2.6 ns, 10% faster than baseline
        map
          Text:     OK (0.25s)
            23.8 ns ± 1.6 ns,  7% faster than baseline
          LazyText: OK (0.27s)
            26.3 ns ± 2.1 ns, 12% faster than baseline
        map.map
          Text:     OK (0.25s)
            23.8 ns ± 1.7 ns
          LazyText: OK (0.27s)
            26.3 ns ± 1.3 ns
        replicate char
          Text:     OK (0.23s)
            21.6 ns ± 1.4 ns, 12% faster than baseline
          LazyText: OK (0.21s)
            16.8 ns ± 1.3 ns, 13% faster than baseline
        replicate string
          Text:     OK (0.24s)
            23.1 ns ± 1.9 ns, 20% faster than baseline
          LazyText: OK (0.21s)
            19.5 ns ± 1.5 ns, 19% faster than baseline
        take
          Text:     OK (0.23s)
            21.4 ns ± 1.3 ns, 20% faster than baseline
          LazyText: OK (0.27s)
            26.6 ns ± 1.5 ns, 19% faster than baseline
        tail
          Text:     OK (0.25s)
            24.0 ns ± 1.6 ns,  6% faster than baseline
          LazyText: OK (0.29s)
            28.5 ns ± 1.4 ns
        toLower
          Text:     OK (0.28s)
            111  ns ± 5.9 ns
          LazyText: OK (0.86s)
            193  ns ± 4.5 ns, 11% faster than baseline
        toUpper
          Text:     OK (0.17s)
            120  ns ±  11 ns,  9% faster than baseline
          LazyText: OK (0.28s)
            222  ns ±  10 ns,  9% faster than baseline
        words
          Text:     OK (0.22s)
            20.5 ns ± 2.0 ns, 13% faster than baseline
          LazyText: OK (0.20s)
            35.4 ns ± 2.6 ns, 11% faster than baseline
        zipWith
          Text:     OK (0.25s)
            25.4 ns ± 1.4 ns, 13% faster than baseline
          LazyText: OK (0.18s)
            31.5 ns ± 2.7 ns
    ascii-small
      length
        cons
          Text:     OK (0.19s)
            21.9 μs ± 1.5 μs, 81% slower than baseline
          LazyText: OK (0.21s)
            22.1 μs ± 2.1 μs, 79% slower than baseline
        decode
          Text:     OK (0.51s)
            28.5 μs ± 2.1 μs, 58% slower than baseline
          LazyText: OK (1.02s)
            28.2 μs ± 595 ns, 41% slower than baseline
        drop
          Text:     OK (0.20s)
            22.7 μs ± 1.7 μs, 105% slower than baseline
          LazyText: OK (0.20s)
            22.4 μs ± 2.1 μs, 89% slower than baseline
        filter
          Text:     OK (0.28s)
            128  μs ± 6.2 μs
          LazyText: OK (0.16s)
            133  μs ±  11 μs
        filter.filter
          Text:     OK (0.12s)
            128  μs ±  11 μs
          LazyText: OK (0.16s)
            132  μs ±  11 μs
        init
          Text:     OK (0.19s)
            21.7 μs ± 1.6 μs, 114% slower than baseline
          LazyText: OK (0.40s)
            23.1 μs ± 1.5 μs, 124% slower than baseline
        intercalate
          Text:     OK (0.17s)
            33.6 μs ± 2.8 μs, 53% slower than baseline
          LazyText: OK (0.17s)
            35.2 μs ± 2.9 μs, 37% slower than baseline
        intersperse
          Text:     OK (0.22s)
            23.4 μs ± 1.6 μs, 129% slower than baseline
          LazyText: OK (1.59s)
            22.8 μs ± 779 ns, 122% slower than baseline
        map
          Text:     OK (0.24s)
            24.5 μs ± 1.3 μs, 110% slower than baseline
          LazyText: OK (0.84s)
            24.8 μs ± 539 ns, 137% slower than baseline
        map.map
          Text:     OK (0.23s)
            24.0 μs ± 2.0 μs, 133% slower than baseline
          LazyText: OK (0.23s)
            24.6 μs ± 1.5 μs, 134% slower than baseline
        replicate char
          Text:     OK (0.26s)
            23.4 ns ± 1.6 ns,  7% slower than baseline
          LazyText: OK (0.22s)
            19.5 ns ± 1.5 ns
        replicate string
          Text:     OK (0.47s)
            24.3 ns ± 1.4 ns
          LazyText: OK (0.23s)
            20.3 ns ± 1.3 ns, 11% faster than baseline
        take
          Text:     OK (0.22s)
            21.4 μs ± 1.3 μs, 170% slower than baseline
          LazyText: OK (0.20s)
            21.5 μs ± 1.3 μs, 167% slower than baseline
        tail
          Text:     OK (0.18s)
            32.6 μs ± 2.7 μs, 175% slower than baseline
          LazyText: OK (0.18s)
            32.7 μs ± 2.7 μs, 186% slower than baseline
        toLower
          Text:     OK (0.70s)
            1.20 ms ± 101 μs, 11% faster than baseline
          LazyText: OK (0.12s)
            1.75 ms ± 170 μs
        toUpper
          Text:     OK (0.22s)
            1.71 ms ±  99 μs
          LazyText: OK (0.29s)
            2.12 ms ±  99 μs
        words
          Text:     OK (0.17s)
            291  μs ±  28 μs
          LazyText: OK (0.32s)
            560  μs ±  27 μs,  9% faster than baseline
        zipWith
          Text:     OK (0.22s)
            22.7 μs ± 1.7 μs, 104% slower than baseline
          LazyText: OK (0.21s)
            23.2 μs ± 1.6 μs, 102% slower than baseline
    ascii
      length
        cons
          Text:     OK (0.64s)
            18.4 ms ± 1.1 ms, 78% slower than baseline
          LazyText: OK (0.31s)
            19.0 ms ± 1.6 ms, 90% slower than baseline
        decode
          Text:     OK (1.34s)
            27.2 ms ± 707 μs, 56% slower than baseline
          LazyText: OK (1.32s)
            26.9 ms ± 2.0 ms, 43% slower than baseline
        drop
          Text:     OK (0.40s)
            18.7 ms ± 1.4 ms, 106% slower than baseline
          LazyText: OK (0.40s)
            18.9 ms ± 1.6 ms, 106% slower than baseline
        filter
          Text:     OK (0.51s)
            110  ms ± 2.9 ms
          LazyText: OK (0.52s)
            115  ms ± 3.0 ms, 13% faster than baseline
        filter.filter
          Text:     OK (0.51s)
            110  ms ± 3.7 ms
          LazyText: OK (0.46s)
            121  ms ± 9.8 ms
        init
          Text:     OK (0.39s)
            18.4 ms ± 1.4 ms, 77% slower than baseline
          LazyText: OK (0.40s)
            18.6 ms ± 1.4 ms, 75% slower than baseline
        intercalate
          Text:     OK (0.37s)
            26.9 ms ± 1.5 ms, 23% slower than baseline
          LazyText: OK (0.37s)
            27.8 ms ± 1.4 ms, 36% slower than baseline
        intersperse
          Text:     OK (0.40s)
            18.4 ms ± 1.8 ms, 83% slower than baseline
          LazyText: OK (0.40s)
            18.9 ms ± 1.5 ms, 79% slower than baseline
        map
          Text:     OK (0.40s)
            18.4 ms ± 1.4 ms, 89% slower than baseline
          LazyText: OK (0.67s)
            20.4 ms ± 921 μs, 111% slower than baseline
        map.map
          Text:     OK (0.31s)
            19.9 ms ± 1.6 ms, 117% slower than baseline
          LazyText: OK (0.43s)
            20.5 ms ± 1.5 ms, 106% slower than baseline
        replicate char
          Text:     OK (2.46s)
            22.4 ns ± 1.9 ns
          LazyText: OK (2.38s)
            17.7 ns ± 1.4 ns
        replicate string
          Text:     OK (2.35s)
            24.5 ns ± 1.7 ns
          LazyText: OK (2.36s)
            28.0 ns ± 2.7 ns, 11% slower than baseline
        take
          Text:     OK (0.90s)
            12.3 ms ± 471 μs, 78% slower than baseline
          LazyText: OK (0.46s)
            12.7 ms ± 673 μs, 69% slower than baseline
        tail
          Text:     OK (0.30s)
            19.1 ms ± 1.7 ms, 92% slower than baseline
          LazyText: OK (0.43s)
            20.6 ms ± 1.5 ms, 94% slower than baseline
        toLower
          Text:     OK (3.35s)
            1.056 s ±  23 ms, 19% faster than baseline
          LazyText: OK (4.49s)
            1.426 s ± 107 ms, 19% faster than baseline
        toUpper
          Text:     OK (4.16s)
            1.328 s ±  33 ms, 17% faster than baseline
          LazyText: OK (5.27s)
            1.733 s ± 136 ms, 12% faster than baseline
        words
          Text:     OK (0.86s)
            227  ms ± 7.9 ms, 10% faster than baseline
          LazyText: OK (1.65s)
            490  ms ±  22 ms, 18% faster than baseline
        zipWith
          Text:     OK (0.41s)
            17.7 ms ± 1.5 ms, 67% slower than baseline
          LazyText: OK (0.41s)
            17.8 ms ± 1.4 ms, 71% slower than baseline
    english
      length
        cons
          Text:     OK (0.23s)
            1.15 ms ±  87 μs, 67% slower than baseline
          LazyText: OK (0.38s)
            1.19 ms ±  90 μs, 77% slower than baseline
        decode
          Text:     OK (0.52s)
            1.64 ms ±  90 μs, 54% slower than baseline
          LazyText: OK (1.04s)
            1.78 ms ±  83 μs, 45% slower than baseline
        drop
          Text:     OK (0.77s)
            1.38 ms ±  39 μs, 107% slower than baseline
          LazyText: OK (0.39s)
            1.25 ms ± 105 μs, 80% slower than baseline
        filter
          Text:     OK (0.27s)
            7.49 ms ± 491 μs
          LazyText: OK (0.15s)
            8.06 ms ± 734 μs
        filter.filter
          Text:     OK (0.27s)
            7.60 ms ± 658 μs, 11% faster than baseline
          LazyText: OK (0.14s)
            7.87 ms ± 698 μs
        init
          Text:     OK (0.19s)
            1.26 ms ±  91 μs, 97% slower than baseline
          LazyText: OK (0.22s)
            1.27 ms ± 102 μs, 93% slower than baseline
        intercalate
          Text:     OK (0.17s)
            1.83 ms ± 172 μs, 28% slower than baseline
          LazyText: OK (0.29s)
            1.83 ms ± 160 μs, 21% slower than baseline
        intersperse
          Text:     OK (0.17s)
            1.23 ms ±  94 μs, 96% slower than baseline
          LazyText: OK (0.22s)
            1.26 ms ±  98 μs, 104% slower than baseline
        map
          Text:     OK (0.22s)
            1.23 ms ±  96 μs, 76% slower than baseline
          LazyText: OK (0.22s)
            1.26 ms ±  89 μs, 96% slower than baseline
        map.map
          Text:     OK (0.22s)
            1.23 ms ± 107 μs, 67% slower than baseline
          LazyText: OK (0.22s)
            1.26 ms ±  89 μs, 76% slower than baseline
        replicate char
          Text:     OK (0.36s)
            20.6 ns ± 1.8 ns, 20% faster than baseline
          LazyText: OK (0.34s)
            16.9 ns ± 1.5 ns, 17% faster than baseline
        replicate string
          Text:     OK (0.41s)
            24.4 ns ± 1.3 ns, 17% faster than baseline
          LazyText: OK (0.88s)
            19.3 ns ± 324 ps, 21% faster than baseline
        take
          Text:     OK (0.28s)
            845  μs ±  68 μs, 76% slower than baseline
          LazyText: OK (0.28s)
            851  μs ±  50 μs, 81% slower than baseline
        tail
          Text:     OK (0.22s)
            1.29 ms ±  91 μs, 84% slower than baseline
          LazyText: OK (0.22s)
            1.28 ms ± 110 μs, 81% slower than baseline
        toLower
          Text:     OK (0.21s)
            67.1 ms ± 5.8 ms, 15% faster than baseline
          LazyText: OK (0.29s)
            95.0 ms ± 3.0 ms, 11% faster than baseline
        toUpper
          Text:     OK (6.26s)
            92.7 ms ± 8.5 ms, 12% faster than baseline
          LazyText: OK (0.37s)
            117  ms ± 3.6 ms, 11% faster than baseline
        words
          Text:     OK (0.25s)
            15.5 ms ± 671 μs, 15% faster than baseline
          LazyText: OK (0.26s)
            32.7 ms ± 2.6 ms, 15% faster than baseline
        zipWith
          Text:     OK (0.70s)
            1.23 ms ±  36 μs, 72% slower than baseline
          LazyText: OK (0.39s)
            1.25 ms ±  95 μs, 74% slower than baseline
    russian
      length
        cons
          Text:     OK (0.14s)
            3.64 μs ± 344 ns, 74% slower than baseline
          LazyText: OK (0.14s)
            3.59 μs ± 348 ns, 69% slower than baseline
        decode
          Text:     OK (0.77s)
            5.45 μs ± 267 ns, 27% slower than baseline
          LazyText: OK (0.77s)
            5.38 μs ± 281 ns, 22% slower than baseline
        drop
          Text:     OK (0.27s)
            3.56 μs ± 207 ns, 83% slower than baseline
          LazyText: OK (0.27s)
            3.52 μs ± 196 ns, 79% slower than baseline
        filter
          Text:     OK (0.20s)
            21.4 μs ± 1.6 μs,  8% faster than baseline
          LazyText: OK (0.24s)
            25.2 μs ± 1.7 μs,  9% faster than baseline
        filter.filter
          Text:     OK (0.37s)
            20.3 μs ± 921 ns,  9% faster than baseline
          LazyText: OK (0.23s)
            24.8 μs ± 1.7 μs, 10% faster than baseline
        init
          Text:     OK (0.25s)
            3.47 μs ± 303 ns, 83% slower than baseline
          LazyText: OK (0.26s)
            3.34 μs ± 213 ns, 71% slower than baseline
        intercalate
          Text:     OK (0.31s)
            4.22 μs ± 233 ns, 28% slower than baseline
          LazyText: OK (0.31s)
            4.32 μs ± 350 ns
        intersperse
          Text:     OK (0.25s)
            3.43 μs ± 256 ns, 29% slower than baseline
          LazyText: OK (0.15s)
            3.40 μs ± 339 ns, 78% slower than baseline
        map
          Text:     OK (0.15s)
            3.47 μs ± 346 ns, 81% slower than baseline
          LazyText: OK (0.26s)
            3.45 μs ± 300 ns, 81% slower than baseline
        map.map
          Text:     OK (0.26s)
            3.45 μs ± 317 ns, 81% slower than baseline
          LazyText: OK (0.26s)
            3.38 μs ± 260 ns, 77% slower than baseline
        replicate char
          Text:     OK (0.22s)
            19.9 ns ± 1.3 ns, 13% faster than baseline
          LazyText: OK (0.62s)
            16.8 ns ± 902 ps, 11% faster than baseline
        replicate string
          Text:     OK (0.23s)
            21.5 ns ± 1.4 ns, 20% faster than baseline
          LazyText: OK (0.21s)
            18.1 ns ± 1.6 ns, 19% faster than baseline
        take
          Text:     OK (0.19s)
            2.28 μs ± 183 ns, 73% slower than baseline
          LazyText: OK (0.70s)
            2.57 μs ±  82 ns, 92% slower than baseline
        tail
          Text:     OK (0.17s)
            3.88 μs ± 335 ns, 102% slower than baseline
          LazyText: OK (0.30s)
            4.00 μs ± 244 ns, 112% slower than baseline
        toLower
          Text:     OK (0.27s)
            116  μs ± 7.0 μs
          LazyText: OK (0.19s)
            159  μs ±  13 μs
        toUpper
          Text:     OK (0.19s)
            160  μs ±  13 μs
          LazyText: OK (0.24s)
            203  μs ±  13 μs
        words
          Text:     OK (0.61s)
            34.5 μs ± 780 ns,  9% faster than baseline
          LazyText: OK (0.31s)
            72.6 μs ± 2.7 μs
        zipWith
          Text:     OK (0.26s)
            3.42 μs ± 314 ns, 74% slower than baseline
          LazyText: OK (0.25s)
            3.42 μs ± 175 ns, 73% slower than baseline
    japanese
      length
        cons
          Text:     OK (0.28s)
            3.66 μs ± 226 ns, 78% slower than baseline
          LazyText: OK (0.28s)
            3.62 μs ± 188 ns, 77% slower than baseline
        decode
          Text:     OK (0.81s)
            5.54 μs ± 182 ns, 31% slower than baseline
          LazyText: OK (1.67s)
            6.09 μs ± 109 ns, 34% slower than baseline
        drop
          Text:     OK (0.32s)
            4.19 μs ± 271 ns, 96% slower than baseline
          LazyText: OK (0.18s)
            4.27 μs ± 329 ns, 103% slower than baseline
        filter
          Text:     OK (0.25s)
            12.6 μs ± 664 ns
          LazyText: OK (0.15s)
            16.0 μs ± 1.5 μs
        filter.filter
          Text:     OK (0.25s)
            12.9 μs ± 791 ns
          LazyText: OK (0.14s)
            14.7 μs ± 1.4 μs, 15% faster than baseline
        init
          Text:     OK (0.16s)
            3.64 μs ± 353 ns, 73% slower than baseline
          LazyText: OK (0.53s)
            3.63 μs ± 274 ns, 75% slower than baseline
        intercalate
          Text:     OK (0.21s)
            5.36 μs ± 518 ns, 23% slower than baseline
          LazyText: OK (0.24s)
            6.02 μs ± 564 ns, 24% slower than baseline
        intersperse
          Text:     OK (0.28s)
            3.64 μs ± 232 ns, 77% slower than baseline
          LazyText: OK (0.16s)
            3.64 μs ± 360 ns, 78% slower than baseline
        map
          Text:     OK (0.29s)
            3.74 μs ± 321 ns, 83% slower than baseline
          LazyText: OK (0.28s)
            3.63 μs ± 237 ns, 79% slower than baseline
        map.map
          Text:     OK (0.16s)
            3.63 μs ± 350 ns, 84% slower than baseline
          LazyText: OK (0.28s)
            3.62 μs ± 175 ns, 86% slower than baseline
        replicate char
          Text:     OK (0.22s)
            19.4 ns ± 1.5 ns, 10% faster than baseline
          LazyText: OK (0.19s)
            15.7 ns ± 1.3 ns, 11% faster than baseline
        replicate string
          Text:     OK (0.24s)
            21.8 ns ± 2.0 ns, 17% faster than baseline
          LazyText: OK (0.21s)
            18.6 ns ± 1.5 ns, 16% faster than baseline
        take
          Text:     OK (0.20s)
            2.50 μs ± 187 ns, 71% slower than baseline
          LazyText: OK (0.20s)
            2.51 μs ± 181 ns, 74% slower than baseline
        tail
          Text:     OK (0.16s)
            3.76 μs ± 340 ns, 96% slower than baseline
          LazyText: OK (0.16s)
            3.81 μs ± 380 ns, 78% slower than baseline
        toLower
          Text:     OK (0.60s)
            68.0 μs ± 2.8 μs, 13% faster than baseline
          LazyText: OK (0.44s)
            98.9 μs ± 8.1 μs
        toUpper
          Text:     OK (0.17s)
            65.7 μs ± 5.4 μs, 13% faster than baseline
          LazyText: OK (0.22s)
            93.5 μs ± 6.9 μs, 19% faster than baseline
        words
          Text:     OK (0.20s)
            47.2 μs ± 3.8 μs, 22% faster than baseline
          LazyText: OK (0.37s)
            40.4 μs ± 2.2 μs, 16% faster than baseline
        zipWith
          Text:     OK (0.28s)
            3.69 μs ± 204 ns, 74% slower than baseline
          LazyText: OK (0.27s)
            3.65 μs ± 221 ns, 59% slower than baseline

All 216 tests passed (108.34s)
Benchmark text-benchmarks: FINISH

wz1000 force-pushed the wip/no-popcount branch from c8ec853 to c28d8e5 on July 18, 2022 12:49

wz1000 commented Jul 18, 2022

Tried to improve the implementation of popcount16. Results at https://gist.github.com/wz1000/7604ab1663d8612bbd8485b13ae22f86#file-after-improved-csv.

Baseline is before.csv, which is tip text-2.0 master (971051b).

All
  Pure
    tiny
      length
        cons
          Text:     OK (0.22s)
            21.4 ns ± 1.3 ns
          LazyText: OK (0.23s)
            23.5 ns ± 1.5 ns,  9% faster than baseline
        decode
          Text:     OK (0.21s)
            40.7 ns ± 3.2 ns
          LazyText: OK (0.29s)
            118  ns ± 5.4 ns, 10% faster than baseline
        drop
          Text:     OK (0.29s)
            29.6 ns ± 1.8 ns, 12% faster than baseline
          LazyText: OK (0.31s)
            31.5 ns ± 2.1 ns,  9% faster than baseline
        filter
          Text:     OK (0.17s)
            15.3 ns ± 1.4 ns,  8% faster than baseline
          LazyText: OK (0.27s)
            26.7 ns ± 1.7 ns,  6% faster than baseline
        filter.filter
          Text:     OK (0.29s)
            15.2 ns ± 692 ps,  8% faster than baseline
          LazyText: OK (0.50s)
            27.3 ns ± 2.2 ns
        init
          Text:     OK (0.36s)
            18.8 ns ± 1.2 ns, 11% faster than baseline
          LazyText: OK (0.25s)
            24.6 ns ± 2.0 ns,  8% faster than baseline
        intercalate
          Text:     OK (0.26s)
            25.2 ns ± 1.6 ns, 14% faster than baseline
          LazyText: OK (0.26s)
            26.0 ns ± 1.5 ns, 10% faster than baseline
        intersperse
          Text:     OK (0.26s)
            25.4 ns ± 2.0 ns, 10% faster than baseline
          LazyText: OK (0.27s)
            26.0 ns ± 2.3 ns, 15% faster than baseline
        map
          Text:     OK (0.23s)
            22.2 ns ± 2.2 ns, 13% faster than baseline
          LazyText: OK (0.25s)
            24.5 ns ± 2.1 ns, 18% faster than baseline
        map.map
          Text:     OK (0.43s)
            22.7 ns ± 1.0 ns, 11% faster than baseline
          LazyText: OK (0.25s)
            24.5 ns ± 2.2 ns, 10% faster than baseline
        replicate char
          Text:     OK (0.21s)
            19.7 ns ± 1.7 ns, 20% faster than baseline
          LazyText: OK (0.18s)
            15.7 ns ± 1.4 ns, 18% faster than baseline
        replicate string
          Text:     OK (0.24s)
            23.1 ns ± 1.7 ns, 20% faster than baseline
          LazyText: OK (0.20s)
            18.6 ns ± 1.3 ns, 23% faster than baseline
        take
          Text:     OK (0.23s)
            21.5 ns ± 2.1 ns, 20% faster than baseline
          LazyText: OK (0.26s)
            26.1 ns ± 1.8 ns, 20% faster than baseline
        tail
          Text:     OK (0.23s)
            21.9 ns ± 1.8 ns, 15% faster than baseline
          LazyText: OK (0.27s)
            26.3 ns ± 1.4 ns,  8% faster than baseline
        toLower
          Text:     OK (0.26s)
            103  ns ± 5.9 ns, 15% faster than baseline
          LazyText: OK (0.21s)
            183  ns ±  13 ns, 16% faster than baseline
        toUpper
          Text:     OK (0.99s)
            113  ns ± 3.0 ns, 14% faster than baseline
          LazyText: OK (0.27s)
            214  ns ±  11 ns, 12% faster than baseline
        words
          Text:     OK (0.40s)
            20.8 ns ± 1.0 ns, 11% faster than baseline
          LazyText: OK (0.18s)
            35.1 ns ± 3.0 ns, 12% faster than baseline
        zipWith
          Text:     OK (0.46s)
            25.4 ns ± 1.8 ns, 13% faster than baseline
          LazyText: OK (0.18s)
            32.2 ns ± 3.0 ns
    ascii-small
      length
        cons
          Text:     OK (0.32s)
            9.26 μs ± 716 ns, 23% faster than baseline
          LazyText: OK (0.18s)
            9.52 μs ± 920 ns, 22% faster than baseline
        decode
          Text:     OK (0.58s)
            15.0 μs ± 1.4 μs, 16% faster than baseline
          LazyText: OK (0.60s)
            15.5 μs ± 906 ns, 22% faster than baseline
        drop
          Text:     OK (0.33s)
            9.61 μs ± 772 ns, 13% faster than baseline
          LazyText: OK (0.19s)
            9.59 μs ± 668 ns, 19% faster than baseline
        filter
          Text:     OK (0.16s)
            133  μs ±  12 μs
          LazyText: OK (0.30s)
            136  μs ±  13 μs
        filter.filter
          Text:     OK (0.16s)
            131  μs ±  11 μs
          LazyText: OK (0.16s)
            137  μs ±  11 μs
        init
          Text:     OK (0.17s)
            9.22 μs ± 687 ns,  9% faster than baseline
          LazyText: OK (0.19s)
            9.56 μs ± 797 ns
        intercalate
          Text:     OK (0.19s)
            19.7 μs ± 1.4 μs, 10% faster than baseline
          LazyText: OK (0.40s)
            22.2 μs ± 1.0 μs, 13% faster than baseline
        intersperse
          Text:     OK (0.64s)
            9.15 μs ± 306 ns,  9% faster than baseline
          LazyText: OK (0.19s)
            9.41 μs ± 688 ns
        map
          Text:     OK (0.18s)
            9.16 μs ± 719 ns, 21% faster than baseline
          LazyText: OK (0.19s)
            9.37 μs ± 809 ns, 10% faster than baseline
        map.map
          Text:     OK (0.18s)
            9.09 μs ± 775 ns, 11% faster than baseline
          LazyText: OK (0.18s)
            8.96 μs ± 781 ns, 14% faster than baseline
        replicate char
          Text:     OK (0.22s)
            19.6 ns ± 1.9 ns, 10% faster than baseline
          LazyText: OK (0.32s)
            16.0 ns ± 862 ps, 19% faster than baseline
        replicate string
          Text:     OK (0.25s)
            23.6 ns ± 1.5 ns
          LazyText: OK (0.22s)
            19.3 ns ± 1.5 ns, 15% faster than baseline
        take
          Text:     OK (0.24s)
            6.17 μs ± 337 ns, 21% faster than baseline
          LazyText: OK (0.24s)
            6.23 μs ± 427 ns, 22% faster than baseline
        tail
          Text:     OK (0.18s)
            9.76 μs ± 844 ns, 17% faster than baseline
          LazyText: OK (0.19s)
            9.88 μs ± 814 ns, 13% faster than baseline
        toLower
          Text:     OK (0.32s)
            1.15 ms ±  82 μs, 14% faster than baseline
          LazyText: OK (0.22s)
            1.57 ms ±  97 μs, 11% faster than baseline
        toUpper
          Text:     OK (0.22s)
            1.63 ms ± 133 μs
          LazyText: OK (0.28s)
            2.04 ms ± 114 μs
        words
          Text:     OK (0.16s)
            271  ΞΌs Β±  27 ΞΌs
          LazyText: OK (0.17s)
            573  ΞΌs Β±  44 ΞΌs
        zipWith
          Text:     OK (0.64s)
            9.21 ΞΌs Β± 374 ns, 16% faster than baseline
          LazyText: OK (0.17s)
            9.52 ΞΌs Β± 697 ns, 16% faster than baseline
    ascii
      length
        cons
          Text:     OK (0.67s)
            7.83 ms ± 535 μs, 23% faster than baseline
          LazyText: OK (0.37s)
            8.05 ms ± 739 μs, 18% faster than baseline
        decode
          Text:     OK (1.54s)
            15.2 ms ± 1.4 ms, 12% faster than baseline
          LazyText: OK (1.49s)
            14.8 ms ± 562 μs, 20% faster than baseline
        drop
          Text:     OK (0.68s)
            8.08 ms ± 416 μs, 10% faster than baseline
          LazyText: OK (0.39s)
            8.21 ms ± 711 μs, 10% faster than baseline
        filter
          Text:     OK (0.99s)
            105  ms ± 2.5 ms,  6% faster than baseline
          LazyText: OK (0.49s)
            108  ms ± 5.7 ms, 18% faster than baseline
        filter.filter
          Text:     OK (0.29s)
            104  ms ± 3.3 ms,  9% faster than baseline
          LazyText: OK (0.51s)
            112  ms ± 7.1 ms
        init
          Text:     OK (0.28s)
            8.06 ms ± 718 μs, 22% faster than baseline
          LazyText: OK (0.39s)
            8.29 ms ± 692 μs, 21% faster than baseline
        intercalate
          Text:     OK (0.38s)
            16.9 ms ± 1.5 ms, 22% faster than baseline
          LazyText: OK (0.53s)
            17.6 ms ± 809 μs, 13% faster than baseline
        intersperse
          Text:     OK (0.47s)
            8.06 ms ± 756 μs, 19% faster than baseline
          LazyText: OK (0.72s)
            8.47 ms ± 363 μs, 19% faster than baseline
        map
          Text:     OK (0.48s)
            8.24 ms ± 769 μs, 15% faster than baseline
          LazyText: OK (0.40s)
            8.43 ms ± 704 μs, 12% faster than baseline
        map.map
          Text:     OK (0.66s)
            7.75 ms ± 493 μs, 15% faster than baseline
          LazyText: OK (0.92s)
            7.93 ms ± 394 μs, 20% faster than baseline
        replicate char
          Text:     OK (2.21s)
            20.0 ns ± 1.3 ns
          LazyText: OK (2.72s)
            16.1 ns ± 688 ps
        replicate string
          Text:     OK (2.56s)
            23.3 ns ± 1.0 ns
          LazyText: OK (3.45s)
            17.9 ns ± 212 ps, 28% faster than baseline
        take
          Text:     OK (0.58s)
            5.07 ms ± 396 μs, 26% faster than baseline
          LazyText: OK (0.50s)
            5.34 ms ± 440 μs, 28% faster than baseline
        tail
          Text:     OK (0.39s)
            8.46 ms ± 841 μs, 14% faster than baseline
          LazyText: OK (0.39s)
            8.27 ms ± 722 μs, 22% faster than baseline
        toLower
          Text:     OK (3.14s)
            1.016 s ± 3.2 ms, 22% faster than baseline
          LazyText: OK (4.11s)
            1.347 s ±  71 ms, 23% faster than baseline
        toUpper
          Text:     OK (4.17s)
            1.329 s ±  32 ms, 17% faster than baseline
          LazyText: OK (5.19s)
            1.700 s ±  42 ms, 14% faster than baseline
        words
          Text:     OK (0.84s)
            225  ms ± 8.9 ms, 11% faster than baseline
          LazyText: OK (1.64s)
            488  ms ±  14 ms, 18% faster than baseline
        zipWith
          Text:     OK (0.60s)
            7.91 ms ± 377 μs, 24% faster than baseline
          LazyText: OK (0.61s)
            8.37 ms ± 757 μs, 19% faster than baseline
    english
      length
        cons
          Text:     OK (0.38s)
            549  μs ±  43 μs, 20% faster than baseline
          LazyText: OK (0.36s)
            556  μs ±  27 μs, 17% faster than baseline
        decode
          Text:     OK (1.77s)
            776  μs ±  23 μs, 26% faster than baseline
          LazyText: OK (1.15s)
            1.01 ms ±  95 μs, 16% faster than baseline
        drop
          Text:     OK (0.21s)
            554  μs ±  54 μs, 16% faster than baseline
          LazyText: OK (0.19s)
            540  μs ±  46 μs, 21% faster than baseline
        filter
          Text:     OK (0.27s)
            7.39 ms ± 371 μs,  9% faster than baseline
          LazyText: OK (0.26s)
            7.72 ms ± 371 μs,  7% faster than baseline
        filter.filter
          Text:     OK (0.27s)
            7.32 ms ± 435 μs, 14% faster than baseline
          LazyText: OK (0.14s)
            7.60 ms ± 752 μs
        init
          Text:     OK (0.30s)
            518  μs ±  23 μs, 18% faster than baseline
          LazyText: OK (0.35s)
            542  μs ±  31 μs, 17% faster than baseline
        intercalate
          Text:     OK (0.34s)
            1.11 ms ±  62 μs, 21% faster than baseline
          LazyText: OK (0.37s)
            1.19 ms ±  84 μs, 20% faster than baseline
        intersperse
          Text:     OK (0.20s)
            532  μs ±  52 μs, 14% faster than baseline
          LazyText: OK (0.21s)
            545  μs ±  51 μs, 11% faster than baseline
        map
          Text:     OK (0.35s)
            527  μs ±  31 μs, 24% faster than baseline
          LazyText: OK (0.64s)
            536  μs ±  13 μs, 16% faster than baseline
        map.map
          Text:     OK (0.34s)
            499  μs ±  31 μs, 32% faster than baseline
          LazyText: OK (4.54s)
            538  μs ±  27 μs, 24% faster than baseline
        replicate char
          Text:     OK (0.36s)
            19.9 ns ± 1.7 ns, 23% faster than baseline
          LazyText: OK (0.46s)
            15.7 ns ± 1.2 ns, 22% faster than baseline
        replicate string
          Text:     OK (0.57s)
            22.6 ns ± 1.2 ns, 23% faster than baseline
          LazyText: OK (0.51s)
            18.4 ns ± 1.6 ns, 24% faster than baseline
        take
          Text:     OK (0.24s)
            332  μs ±  22 μs, 30% faster than baseline
          LazyText: OK (0.25s)
            346  μs ±  27 μs, 26% faster than baseline
        tail
          Text:     OK (0.20s)
            528  μs ±  47 μs, 24% faster than baseline
          LazyText: OK (0.34s)
            506  μs ±  38 μs, 28% faster than baseline
        toLower
          Text:     OK (0.21s)
            65.0 ms ± 3.4 ms, 18% faster than baseline
          LazyText: OK (0.28s)
            88.3 ms ± 5.3 ms, 17% faster than baseline
        toUpper
          Text:     OK (0.28s)
            89.2 ms ± 3.2 ms, 16% faster than baseline
          LazyText: OK (0.35s)
            111  ms ± 9.1 ms, 15% faster than baseline
        words
          Text:     OK (0.51s)
            15.1 ms ± 531 μs, 17% faster than baseline
          LazyText: OK (0.25s)
            31.7 ms ± 1.9 ms, 18% faster than baseline
        zipWith
          Text:     OK (0.33s)
            503  μs ±  30 μs, 29% faster than baseline
          LazyText: OK (0.34s)
            518  μs ±  52 μs, 27% faster than baseline
    russian
      length
        cons
          Text:     OK (0.24s)
            1.48 μs ±  93 ns, 29% faster than baseline
          LazyText: OK (0.24s)
            1.49 μs ± 102 ns, 29% faster than baseline
        decode
          Text:     OK (0.49s)
            3.20 μs ± 291 ns, 25% faster than baseline
          LazyText: OK (0.27s)
            3.53 μs ± 228 ns, 19% faster than baseline
        drop
          Text:     OK (0.26s)
            1.66 μs ± 112 ns, 14% faster than baseline
          LazyText: OK (0.25s)
            1.67 μs ± 100 ns, 14% faster than baseline
        filter
          Text:     OK (0.20s)
            22.3 μs ± 1.6 μs
          LazyText: OK (0.25s)
            26.8 μs ± 1.8 μs
        filter.filter
          Text:     OK (0.20s)
            22.1 μs ± 1.8 μs
          LazyText: OK (0.24s)
            26.8 μs ± 1.8 μs
        init
          Text:     OK (0.23s)
            1.57 μs ±  95 ns, 17% faster than baseline
          LazyText: OK (0.25s)
            1.59 μs ± 149 ns, 18% faster than baseline
        intercalate
          Text:     OK (0.20s)
            2.46 μs ± 232 ns, 25% faster than baseline
          LazyText: OK (0.20s)
            2.59 μs ± 254 ns, 41% faster than baseline
        intersperse
          Text:     OK (0.24s)
            1.61 μs ± 100 ns, 39% faster than baseline
          LazyText: OK (0.25s)
            1.58 μs ±  89 ns, 17% faster than baseline
        map
          Text:     OK (0.25s)
            1.57 μs ±  99 ns, 18% faster than baseline
          LazyText: OK (0.25s)
            1.58 μs ± 107 ns, 17% faster than baseline
        map.map
          Text:     OK (0.25s)
            1.59 μs ±  87 ns, 16% faster than baseline
          LazyText: OK (0.25s)
            1.58 μs ±  91 ns, 17% faster than baseline
        replicate char
          Text:     OK (0.22s)
            18.6 ns ± 1.4 ns, 18% faster than baseline
          LazyText: OK (0.32s)
            16.1 ns ± 780 ps, 14% faster than baseline
        replicate string
          Text:     OK (0.26s)
            23.2 ns ± 1.8 ns, 13% faster than baseline
          LazyText: OK (0.37s)
            18.0 ns ± 716 ps, 20% faster than baseline
        take
          Text:     OK (0.59s)
            1.04 μs ±  71 ns, 20% faster than baseline
          LazyText: OK (0.30s)
            1.04 μs ±  96 ns, 21% faster than baseline
        tail
          Text:     OK (0.25s)
            1.60 μs ± 150 ns, 16% faster than baseline
          LazyText: OK (0.25s)
            1.61 μs ± 147 ns, 14% faster than baseline
        toLower
          Text:     OK (0.26s)
            112  μs ±  11 μs
          LazyText: OK (0.17s)
            158  μs ±  13 μs
        toUpper
          Text:     OK (0.18s)
            150  μs ±  14 μs
          LazyText: OK (0.23s)
            201  μs ±  15 μs
        words
          Text:     OK (0.17s)
            33.2 μs ± 3.0 μs, 12% faster than baseline
          LazyText: OK (0.31s)
            67.7 μs ± 4.0 μs, 11% faster than baseline
        zipWith
          Text:     OK (0.22s)
            1.48 μs ±  82 ns, 24% faster than baseline
          LazyText: OK (0.24s)
            1.47 μs ±  92 ns, 25% faster than baseline
    japanese
      length
        cons
          Text:     OK (0.25s)
            1.61 μs ± 129 ns, 21% faster than baseline
          LazyText: OK (0.47s)
            1.61 μs ± 112 ns, 20% faster than baseline
        decode
          Text:     OK (0.97s)
            3.41 μs ±  54 ns, 19% faster than baseline
          LazyText: OK (0.54s)
            3.53 μs ± 243 ns, 22% faster than baseline
        drop
          Text:     OK (0.26s)
            1.65 μs ±  82 ns, 22% faster than baseline
          LazyText: OK (0.24s)
            1.67 μs ± 160 ns, 20% faster than baseline
        filter
          Text:     OK (0.20s)
            11.6 μs ± 1.0 μs, 11% faster than baseline
          LazyText: OK (0.14s)
            14.7 μs ± 1.4 μs, 14% faster than baseline
        filter.filter
          Text:     OK (0.43s)
            11.9 μs ± 340 ns, 12% faster than baseline
          LazyText: OK (0.53s)
            15.0 μs ± 1.0 μs, 14% faster than baseline
        init
          Text:     OK (0.24s)
            1.62 μs ±  84 ns, 22% faster than baseline
          LazyText: OK (0.51s)
            1.69 μs ± 132 ns, 18% faster than baseline
        intercalate
          Text:     OK (0.26s)
            3.36 μs ± 216 ns, 22% faster than baseline
          LazyText: OK (0.30s)
            3.93 μs ± 166 ns, 18% faster than baseline
        intersperse
          Text:     OK (0.25s)
            1.60 μs ± 103 ns, 22% faster than baseline
          LazyText: OK (0.26s)
            1.61 μs ± 113 ns, 21% faster than baseline
        map
          Text:     OK (0.25s)
            1.58 μs ± 145 ns, 22% faster than baseline
          LazyText: OK (0.25s)
            1.59 μs ± 145 ns, 21% faster than baseline
        map.map
          Text:     OK (0.25s)
            1.63 μs ± 123 ns, 17% faster than baseline
          LazyText: OK (0.26s)
            1.61 μs ± 121 ns, 16% faster than baseline
        replicate char
          Text:     OK (0.22s)
            19.4 ns ± 1.6 ns, 10% faster than baseline
          LazyText: OK (0.32s)
            15.1 ns ± 1.2 ns, 14% faster than baseline
        replicate string
          Text:     OK (0.25s)
            22.9 ns ± 1.7 ns, 13% faster than baseline
          LazyText: OK (0.22s)
            18.8 ns ± 1.3 ns, 15% faster than baseline
        take
          Text:     OK (0.19s)
            1.14 μs ± 102 ns, 21% faster than baseline
          LazyText: OK (0.18s)
            1.13 μs ±  93 ns, 21% faster than baseline
        tail
          Text:     OK (0.26s)
            1.72 μs ± 117 ns, 10% faster than baseline
          LazyText: OK (0.28s)
            1.81 μs ± 130 ns, 15% faster than baseline
        toLower
          Text:     OK (0.19s)
            74.9 μs ± 5.9 μs
          LazyText: OK (0.24s)
            105  μs ± 7.2 μs
        toUpper
          Text:     OK (0.17s)
            70.0 μs ± 5.4 μs,  7% faster than baseline
          LazyText: OK (0.22s)
            98.4 μs ± 6.4 μs, 15% faster than baseline
        words
          Text:     OK (0.21s)
            47.7 μs ± 4.4 μs, 21% faster than baseline
          LazyText: OK (0.21s)
            43.0 μs ± 3.9 μs, 10% faster than baseline
        zipWith
          Text:     OK (0.27s)
            1.69 μs ±  92 ns, 19% faster than baseline
          LazyText: OK (0.27s)
            1.73 μs ± 134 ns, 24% faster than baseline

All 216 tests passed (108.18s)
Benchmark text-benchmarks: FINISH

@wz1000
Contributor Author

wz1000 commented Jul 18, 2022

I've successfully validated the implementation of popcount16 using the following program:

#include <stdint.h>
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

static inline size_t popcount16(uint16_t x) {

  // Taken from https://en.wikipedia.org/wiki/Hamming_weight
  const uint16_t m1  = 0x5555; //binary: 0101...
  const uint16_t m2  = 0x3333; //binary: 00110011..
  const uint16_t m4  = 0x0f0f; //binary:  4 zeros,  4 ones ...
  x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
  x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
  x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
  return (x >> 8) + (x & 0x00FF);
}

int main() {
  for(int i = 0; i <= 0xFFFF; i++) {
    size_t a,b;
    a = __builtin_popcount((uint16_t) i);
    b = popcount16(i);
    if (a != b) {
      printf("No match %zu %zu \n", a, b);
      exit(1);
    }
  }
  printf("All values validated\n");
}
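
As a hedged supplement to the check above (not part of the PR), the same routine can also be compared against a naive bit-by-bit loop, which gives a reference count that does not depend on any compiler builtin. The names `popcount16_naive` and `popcount16_matches_naive` are hypothetical and introduced only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bit-parallel routine as in the program above. */
static inline size_t popcount16(uint16_t x) {
  const uint16_t m1 = 0x5555; /* binary: 0101... */
  const uint16_t m2 = 0x3333; /* binary: 00110011... */
  const uint16_t m4 = 0x0f0f; /* binary: 4 zeros, 4 ones ... */
  x -= (x >> 1) & m1;             /* count of each 2 bits */
  x = (x & m2) + ((x >> 2) & m2); /* count of each 4 bits */
  x = (x + (x >> 4)) & m4;        /* count of each 8 bits */
  return (x >> 8) + (x & 0x00FF); /* add the two byte counts */
}

/* Hypothetical reference: test each of the 16 bits in turn. */
static size_t popcount16_naive(uint16_t x) {
  size_t n = 0;
  for (int bit = 0; bit < 16; bit++) {
    n += (x >> bit) & 1u;
  }
  return n;
}

/* Returns 1 if both versions agree on every 16-bit input, 0 otherwise. */
static int popcount16_matches_naive(void) {
  for (uint32_t i = 0; i <= 0xFFFF; i++) {
    if (popcount16((uint16_t)i) != popcount16_naive((uint16_t)i)) {
      return 0;
    }
  }
  return 1;
}
```

Calling `popcount16_matches_naive()` from a `main` walks all 65536 inputs, so the check is exhaustive rather than sampled.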

@Bodigrim
Contributor

It is very surprising that software emulation of __builtin_popcount is actually faster, but hard to argue against benchmarks :)

@wz1000 I assume you want to take care of all __builtin_popcount in measure_off.c.

@wz1000
Contributor Author

wz1000 commented Jul 18, 2022

> It is very surprising that software emulation of __builtin_popcount is actually faster, but hard to argue against benchmarks :)

The problem is only triggered if GCC ends up using its own software implementation of popcount, rather than emitting the actual instruction. It is a known issue that the GCC emulation is suboptimal. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041

> @wz1000 I assume you want to take care of all __builtin_popcount in measure_off.c.

I'm reasonably confident all the other usages of the symbol are OK because they are guarded by sufficient feature flags to guarantee that GCC emits the popcount instruction for those usages. The RTS linker bug is only triggered if GCC decides to use its software emulation for popcount.
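
To illustrate the distinction, here is a minimal sketch; it is an assumption for illustration, not the code in measure_off.c. GCC and Clang predefine `__POPCNT__` when the compilation target is known to have the instruction (for example with `-mpopcnt`), and in that case `__builtin_popcount` compiles to the instruction itself. Without it, GCC may emit a call to a libgcc helper such as `__popcountdi2`, which is the kind of external symbol that a linker without libgcc cannot resolve. The wrapper name `popcount16_portable` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper, for illustration only. */
static inline size_t popcount16_portable(uint16_t x) {
#if defined(__POPCNT__)
  /* The target is known to have POPCNT, so the builtin
     becomes the instruction and needs no library call. */
  return (size_t)__builtin_popcount(x);
#else
  /* No guarantee of hardware support: count the bits inline
     so that no libgcc helper call is emitted. */
  x -= (x >> 1) & 0x5555;
  x = (x & 0x3333) + ((x >> 2) & 0x3333);
  x = (x + (x >> 4)) & 0x0f0f;
  return (x >> 8) + (x & 0x00FF);
#endif
}
```

Both branches return the same value for every input; only the generated code and its link-time dependencies differ.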

@@ -0,0 +1,12 @@
-- Simple test script for #450
Contributor

Could you please inline this into simdutf-flag-alpine.yml?

Contributor Author

I've done this.

@wz1000 wz1000 force-pushed the wip/no-popcount branch from acd3899 to 01de497 Compare July 19, 2022 09:34
@wz1000 wz1000 force-pushed the wip/no-popcount branch from 01de497 to cef9848 Compare July 19, 2022 09:38
@Bodigrim Bodigrim merged commit 9412f44 into haskell:master Jul 19, 2022
@Bodigrim
Contributor

Thanks @wz1000!

@Bodigrim Bodigrim linked an issue Jul 19, 2022 that may be closed by this pull request
@wz1000
Contributor Author

wz1000 commented Jul 20, 2022

@Bodigrim Could we have a release for inclusion into 9.4?

@Bodigrim
Contributor

@wz1000 sure, I'm waiting for #448 to land into master before releasing.

@Bodigrim
Contributor

@wz1000 @bgamari actually, what is the timeline for GHC 9.4.1? Do we have any time left to finish #448? If not, could you please confirm that the master branch works for GHC purposes as is?

@wz1000
Contributor Author

wz1000 commented Jul 21, 2022

The RC will be out this week, but we can use the master branch for that without waiting for a release. The final release will be made in about 2 weeks (early August) and we do need a release by then, ideally before.

@wz1000
Contributor Author

wz1000 commented Jul 21, 2022

I will test the master branch one more time to be sure, but I think all the patches we need have been merged.

@bgamari
Contributor

bgamari commented Jul 21, 2022

@Bodigrim, I'm afraid 9.4 is essentially done. rc1 should be released by the end of today and will ship with 9412f44, which appears to be working well. It would be great if we could produce a text release from it or something closely related.

@Bodigrim
Copy link
Contributor

@wz1000 @bgamari Released as text-2.0.1, fdb06ff.

Development

Successfully merging this pull request may close these issues.

Linking against gcc_s is problematic on windows
text-2.* -simdutf8 is broken with statically linked GHC