openlibm and libm: mutation testing

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

openlibm and libm: mutation testing

Stanislav Pankevich
Hello,

I am addressing developers and maintainers of both libm and openlibm
(copy: https://github.com/JuliaLang/openlibm/issues/172).

I will start with tl;dr part first.

I am a developer of mutation testing system called
[Mull](https://github.com/mull-project/mull).

Our development involves analysis of open source projects. This
analysis helps us to develop Mull further and also we learn more about
how various projects are tested in the wild.

I have recorded a small demo from my sessions. It is about libm which
I was able to compile on Ubuntu 64 bit Docker image and it is very
similar to what I am also doing with openlibm on macOS.

[Analyzing sqrt function from libm C library with
Mull](https://www.youtube.com/watch?v=_ipprdCdVHc&t=9s&list=PLmo9Xl8PKe5sfM4gTYSUVaxJCJdQ06hDP&index=1)

---

Some time ago I started to analyze
[openlibm](https://github.com/JuliaLang/openlibm/issues/new) and
[libm](https://sourceware.org/newlib/libm.html). I was specifically
looking for something with lots of computational code, so libm and
then openlibm turned out to be very good targets for my testing of
so-called "math mutation operators", "scalar value mutation
operators", "replace function call mutation operators" etc.

Based on my analysis **both openlibm and libm have a lack of mutation
testing coverage**: you can modify chunks of the functions like `acos`
or `sqrt` and still see all of the tests passing.

These are my observations about the code:

- The code in both libraries seems use `__ieee754_*` functions that have
`Copyright (C) 1993 by Sun Microsystems` on it.
- `libm/test/**` has much more tests compared to `openlibm`. For
example `acos` is tested against ~280 test cases. They are passing on
my machine with some tweaks even though the test suite is not
maintained since 2002.
- `openlibm/test` suite works out of the box on macOS but has very few tests.
- Our `Mull` gives roughly `50-60%` mutation coverage on functions
like `acos` or `sqrt` which means that both `libm` and `openlibm` are
actually not tested well enough.

There are a few things I would like to ask:

1) Looks like both libraries have the same origin but the code in both
is very different these days. What is the reason of this difference in
implementations and what is the criteria of deciding which one
performs better after the years?

2) libm uses a struct to convert double to a pair of two uint32_t's and back.

```c
typedef union __ieee_double_shape_type {
  double value;
  ...
  struct {
    uint32_t lsw;
    uint32_t msw;
  } parts;
  ...
} __ieee_double_shape_type;
```

And it uses these pairs to write the test cases like:

```
{64,13, 37,__LINE__, 0x40500000, 0x00000000, 0xbff33333, 0x33333333},
/* 64.0000=f(-1.20000)*/
```

Where `0x40500000, 0x00000000` is expected result and `0xbff33333,
0x33333333` is the argument.

Why `libm` does this? Why not work with `double` directly? And if this
is a good practice, why then `openlibm` does not follow this approach?

3) Are libm and openlibm known to be used in critical software? I am
wondering how do people prove the correctness of how both libraries
work given:
- `libm`'s tests are not maintained since 2002. Its test coverage of
~280 cases for each function seems good but even a first shot with
Mull reveals that there is lot of weird code that does not contribute
to the implementation **if we judge only using that test suite**.
- `openlibm` compiles and runs its tests nicely right from master but
has so few of them so it is quite hard to consider its test suite as
one that gives a confidence.

4) Our long-term plans for Mull include making an open-source service
for mutation testing of OSS projects like
[google/oss-fuzz](https://github.com/google/oss-fuzz). Does anybody
see the potential in libm/openlibm being tested against mutation
coverage metric like [oss-fuzz
projects](https://github.com/google/oss-fuzz/tree/master/projects) are
tested using fuzzers?

Thanks for attention.
Reply | Threaded
Open this post in threaded view
|

Re: openlibm and libm: mutation testing

Joseph Myers
I would suggest looking at glibc's libm tests (60 MB of expectations
generated using MPFR and MPC, plus many tests with manually maintained
expectations for special cases of particular functions).  In principle it
should be possible to test other libm implementations using them, although
certainly at present there are lots of dependencies on glibc-specific
interfaces and glibc's rules for error handling and accuracy requirements.  
Where the glibc tests do not provide sufficient code coverage, whether for
glibc's implementations (NB many functions have several different
implementations used on different architectures, so several architectures
would need testing to assess coverage well) or for other libms'
implementations of the same functions, additional test inputs to improve
coverage would make sense.

I would point out that there are many cases in Sun fdlibm, and thus in
other implementations derived from it, where there is a lot of
arbitraryness in the particular value or operation used to e.g. force an
exception; if the point of doing C * C is to generate "overflow", it's
expected that you can change the (large) value of C with no change to the
results of the function; if the point of doing C + x is to generate
"inexact" (not in glibc's goals for most functions, but various such code
has yet to be cleaned up), you can change C, or change the operation + to
-, without any change to the results of the function being expected; if a
function uses a polynomial approximation, many changes to low-order
coefficients may result in only small changes to the result of the
function, within its accuracy goals.  So the fact that mutating the code
does not affect test results does not necessarily indicate any problem
with test coverage.

--
Joseph S. Myers
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: openlibm and libm: mutation testing

Stanislav Pankevich
Hi Joseph,

Thanks for the hint about glibc's libm tests: I will see how easy it is to run
Mull on them and then if we find anything.

The answer to the second paragraph: I don't know how and why the functions like
sqrt and acos are really implemented, my goal was to see if a test vector
of 284 inputs would produce a 100% mutation testing coverage on newlib/libm's
test suite of 2002 year. The result of ~60% means that there is a gray area of
subtle things which I as an outsider cannot see through: if these are factors
with small contribution weight like you mentioned or some more or less serious
bugs hiding around.

In the Julia/openlibm thread I share what I found with other tools like KLEE
and libFuzzer: https://github.com/JuliaLang/openlibm/issues/172#issuecomment-348649289.

I cannot say with confidence that those are real bugs because I am
running newlib
on a docker's Ubuntu 64bit image with some hacks to make the library compile
and run the test suite.

But I am very sure that newlib/libm would benefit a lot from being tested
continuously with a modern test suite with Clang and memory and undefined
behaviour sanitizers enabled and also being part of
https://github.com/google/oss-fuzz/tree/master/projects.

Thanks for attention.


On Wed, Nov 22, 2017 at 12:55 AM, Joseph Myers <[hidden email]> wrote:

> I would suggest looking at glibc's libm tests (60 MB of expectations
> generated using MPFR and MPC, plus many tests with manually maintained
> expectations for special cases of particular functions).  In principle it
> should be possible to test other libm implementations using them, although
> certainly at present there are lots of dependencies on glibc-specific
> interfaces and glibc's rules for error handling and accuracy requirements.
> Where the glibc tests do not provide sufficient code coverage, whether for
> glibc's implementations (NB many functions have several different
> implementations used on different architectures, so several architectures
> would need testing to assess coverage well) or for other libms'
> implementations of the same functions, additional test inputs to improve
> coverage would make sense.
>
> I would point out that there are many cases in Sun fdlibm, and thus in
> other implementations derived from it, where there is a lot of
> arbitraryness in the particular value or operation used to e.g. force an
> exception; if the point of doing C * C is to generate "overflow", it's
> expected that you can change the (large) value of C with no change to the
> results of the function; if the point of doing C + x is to generate
> "inexact" (not in glibc's goals for most functions, but various such code
> has yet to be cleaned up), you can change C, or change the operation + to
> -, without any change to the results of the function being expected; if a
> function uses a polynomial approximation, many changes to low-order
> coefficients may result in only small changes to the result of the
> function, within its accuracy goals.  So the fact that mutating the code
> does not affect test results does not necessarily indicate any problem
> with test coverage.
>
> --
> Joseph S. Myers
> [hidden email]