RFC: *scanf vs. overflow

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

RFC: *scanf vs. overflow

Sourceware - libc-alpha mailing list
It has long been known that the C specification of *scanf() leaves
behavior undefined for things like
int i;
sscanf("9999999999999999", "%i", &i);

C11 7.21.6.2 P12
"Matches an optionally signed integer, whose format is the same as
expected for the subject sequence of the strtol function with the value
0 for the base argument."
C11 7.21.6.2 P10
"If this object does not have an appropriate type, or if the result of
the conversion cannot be represented in the object, the behavior is
undefined."

as there is an overflow when consuming the input which matches the
strtol subject sequence but does not fit in the width of an int.  On my
Linux system, 'man sscanf' mentions that ERANGE might be set in such a
case, but neither C nor POSIX actually requires this behavior; other
likely behaviors is storing the value mod 2^32 into i, or storing
INT_MAX into i, or ...

This is annoying - the only safe way to parse integers from
untrustworthy sources, where overflow MUST be detected, is to manually
open-code strtol() calls, which can get quite lengthy in comparison to
the concise representations possible with *scanf.

Would glibc be willing to consider a GNU extension to add an optional
flag character between '%' and the various numeric conversion specifiers
(both integral based on strto*l, and floating point based on strtod),
where we could force *scanf to treat numeric overflow as a matching
failure, rather than undefined behavior?  Or even a second flag to
request that printf stop consuming characters if the next character in
input would cause overflow in the current specifier, leaving that
character to instead be matched to the remainder of the format string?

Let's suppose for arguments that we add '^' as a request to force
overflow to be a matching error.  Then sscanf("9999999999999999", "%^i",
&i) would be well-specified to return 0, rather than returning 1 with an
unknown value assigned into i or any other behavior that other libc do
with the undefined behavior when the ^ is not present.

And if glibc likes the idea of such an extension, and we see an uptick
in applications actually using it, I'd also be happy to champion the
addition of such an extension in POSIX (but the POSIX folks will
definitely want to see existing practice first - both an implementation
and applications that use that implementation).  The libguestfs suite of
programs is willing to be an early adopter, if glibc is willing to
pursue adding such a safety valve.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Rich Felker-2
On Fri, May 22, 2020 at 03:59:14PM -0500, Eric Blake via Libc-alpha wrote:

> It has long been known that the C specification of *scanf() leaves
> behavior undefined for things like
> int i;
> sscanf("9999999999999999", "%i", &i);
>
> C11 7.21.6.2 P12
> "Matches an optionally signed integer, whose format is the same as
> expected for the subject sequence of the strtol function with the
> value 0 for the base argument."
> C11 7.21.6.2 P10
> "If this object does not have an appropriate type, or if the result
> of the conversion cannot be represented in the object, the behavior
> is undefined."
>
> as there is an overflow when consuming the input which matches the
> strtol subject sequence but does not fit in the width of an int.  On
> my Linux system, 'man sscanf' mentions that ERANGE might be set in
> such a case, but neither C nor POSIX actually requires this
> behavior; other likely behaviors is storing the value mod 2^32 into
> i, or storing INT_MAX into i, or ...
>
> This is annoying - the only safe way to parse integers from
> untrustworthy sources, where overflow MUST be detected, is to
> manually open-code strtol() calls, which can get quite lengthy in
> comparison to the concise representations possible with *scanf.
>
> Would glibc be willing to consider a GNU extension to add an
> optional flag character between '%' and the various numeric
> conversion specifiers (both integral based on strto*l, and floating
> point based on strtod), where we could force *scanf to treat numeric
> overflow as a matching failure, rather than undefined behavior?  Or
> even a second flag to request that printf stop consuming characters
> if the next character in input would cause overflow in the current
> specifier, leaving that character to instead be matched to the
> remainder of the format string?

Since conversion specifier forms outside the standard *also* have
undefined behavior, I see no advantage to defining that particular
undefined case vs just defining the result of the overflowing
conversion, unless you're worried the standard might later define a
conflicting definition. Neither way is amenable to configure detection
(without breaking cross compiling) without also adopting something
like my proposal on libc-coord:
https://www.openwall.com/lists/libc-coord/2020/04/22/1

BTW there is a portable only-somewhat-hideous way to do this with
sscanf: using assignment suppression combined with %n, then strtol,
etc. with the offsets sproduced by %n.

> Let's suppose for arguments that we add '^' as a request to force
> overflow to be a matching error.  Then sscanf("9999999999999999",
> "%^i", &i) would be well-specified to return 0, rather than
> returning 1 with an unknown value assigned into i or any other
> behavior that other libc do with the undefined behavior when the ^
> is not present.
>
> And if glibc likes the idea of such an extension, and we see an
> uptick in applications actually using it, I'd also be happy to
> champion the addition of such an extension in POSIX (but the POSIX
> folks will definitely want to see existing practice first - both an
> implementation and applications that use that implementation).  The
> libguestfs suite of programs is willing to be an early adopter, if
> glibc is willing to pursue adding such a safety valve.

I think it would be more useful to look for existing practice where
the UB blows up in horrible ways, and if there is none (if all
implementations behave somewhat reasonably) define the intersection of
their behaviors as standard and get rid of the UB here. A new feature
will not reliably be usable for decades in portable software, but new
documentation of existing universal practice would be immediately
usable.

Rich
Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Paul Eggert
On 5/22/20 6:16 PM, Rich Felker wrote:
> A new feature
> will not reliably be usable for decades in portable software, but new
> documentation of existing universal practice would be immediately
> usable.

We could do both.

Also, we could change glibc's behavior in a simpler way, by not adding a new
flag; but if an integer is out of range, then scan only the initial prefix that
fits, leaving the trailing digits for the rest of the format to scan. This also
conforms to POSIX and is more likely to cause C programs to do the right thing
(i.e., report a failure) than the current behavior does. And with luck perhaps
we could eventually get POSIX to standardize this behavior.
Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Sourceware - libc-alpha mailing list
In reply to this post by Sourceware - libc-alpha mailing list
The context to this is that nbdkit uses sscanf to parse simple file
formats in various places, eg:

https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172
https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/filters/ddrescue/ddrescue.c#L98

We can only do this safely where we can prove that overflow does not
matter.  In other cases we've had to change sscanf uses to strto* etc
which is much more difficult to use correctly.  Just look at how much
code is required to wrap strto* functions to use them safely:

https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/server/public.c#L113-L296

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Paul Eggert
On 5/23/20 12:06 AM, Richard W.M. Jones via Libc-alpha wrote:

> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172

> We can only do this safely where we can prove that overflow does not
> matter.

Yes, this is exactly the sort of usage that I had in mind. In the following
example, which is the first use of *scanf I saw, if scanf never allowed integer
overflow (that is, it scanned only as much of a number that would fit), this
code would output an error message instead of blithely going on with an
overflowed number, and this would be safer than the code's current behavior.

>       if (sscanf (&value[i], "*%" SCNi64 "%n", &k, &n) == 1) {
>         if (k < 0) {
>           nbdkit_error ("data parameter *N must be >= 0");
>           return -1;
>         }
>         ...
>       } else {
>         nbdkit_error ("')' in data string not followed by '*'");
>         return -1;
>       }
Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Rich Felker-2
In reply to this post by Paul Eggert
On Fri, May 22, 2020 at 08:06:34PM -0700, Paul Eggert wrote:

> On 5/22/20 6:16 PM, Rich Felker wrote:
> > A new feature
> > will not reliably be usable for decades in portable software, but new
> > documentation of existing universal practice would be immediately
> > usable.
>
> We could do both.
>
> Also, we could change glibc's behavior in a simpler way, by not adding a new
> flag; but if an integer is out of range, then scan only the initial prefix that
> fits, leaving the trailing digits for the rest of the format to scan. This also
> conforms to POSIX and is more likely to cause C programs to do the right thing
> (i.e., report a failure) than the current behavior does. And with luck perhaps
> we could eventually get POSIX to standardize this behavior.

I'm not really a fan of stopping on an initial prefix. While UB allows
anything, that's contrary to the abstract behavior defined for scanf
(matching fields syntactically then value conversion) and does not
admit easily sharing a backend with strto*. It's also even *more
likely* to break programs that don't expect the behavior than just
storing a wrapped or clamped value, since all the remaining fields
will misalign with the conversion specifier string.

FILE-based (as opposed to string-based) scanf forms inherently do not
admit any kind of "recovery" after mismatch without the caller seeking
backwards (requiring a seekable stream); many of them are lossy on
error. This is mainly a reaon not to use them, not a justification for
a weird definition for one special case.

I'm pretty sure the real answer here is just "don't use *scanf for
that."

Rich
Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Rich Felker-2
In reply to this post by Sourceware - libc-alpha mailing list
On Sat, May 23, 2020 at 08:06:54AM +0100, Richard W.M. Jones via Libc-alpha wrote:
> The context to this is that nbdkit uses sscanf to parse simple file
> formats in various places, eg:
>
> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172
> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/filters/ddrescue/ddrescue.c#L98
>
> We can only do this safely where we can prove that overflow does not
> matter.

Being that it's specified as UB, it can never "not matter";
arbitrarily bad side effects are permitted. So it's only safe to use
scanf where the input is *trusted not to contain overflowing values*.

What would be really nice to fix here is getting the standard to
specify that overflow has behavior like strto* or at least
"unspecified value" rather than "undefined behavior" so that it's safe
to let it overflow in cases where you don't care (e.g. you'll be
consistency-checking the value afterwards anyway).

> In other cases we've had to change sscanf uses to strto* etc
> which is much more difficult to use correctly.  Just look at how much
> code is required to wrap strto* functions to use them safely:
>
> https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/server/public.c#L113-L296

Really that much code is just for the sake of verbose error messages,
and they're not even accurate. "errno!=0" does not mean "could not
parse"; it can also be overflow of a perfectly parseable value. And if
you've already caught errno!=0 then end==str is impossible (dead
code). The last case, not hitting null, is also likely spurious/wrong;
you usually *want* to pick up where strto* stopped, and the next thing
the parser does will catch whether the characters after the number are
valid there or not.

strto* do have some annoying design flaws in error reporting, but
they're not really that hard to use right, and much easier than scanf
which just *lacks the reporting channels* for the kind of fine-grained
error reporting you're insisting on doing here.

FWIW this code would also be a lot cleaner as a static inline function
rather than a many-line macro.

Rich
Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Paul Eggert
In reply to this post by Rich Felker-2
On 5/23/20 9:11 AM, Rich Felker wrote:

> stopping on an initial prefix ... does not admit easily sharing a backend with strto*.

I don't see why. If the backend has a "stop scanning on integer overflow" flag
(which it would need to have anyway, to support the proposed behavior), then
*scanf can use the flag and strto* can not use it.

Anyway, this is not an issue for glibc, which has no such backend.

> that's contrary to the abstract behavior defined for scanf
> (matching fields syntactically then value conversion)

That's not really a problem. The abstract behavior already provides for matching
that is not purely syntactic. For example, string conversion specifiers can
impose length limits on the match, which means the matching does not rely purely
on the syntax of the input. It would be easy to say that integer conversion
specifiers can also impose limits related to integer overflow.

> It's also even *more
> likely* to break programs that don't expect the behavior than just
> storing a wrapped or clamped value

That's not true of the code that I looked at (see the URLs earlier in this
thread). That code was pretty carefully written and yet still vulnerable to the
integer-overflow issue.

> I'm pretty sure the real answer here is just "don't use *scanf for
> that."

Absolutely true right now. We are merely talking about (a) what sort of
implementation behavior is more useful for programs that are currently relying
on undefined behavior, and (b) what might be the cleanest addition to POSIX
later, to help improve this mess so that future programmers can use *scanf
safely in more situations.
Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Rich Felker-2
On Sat, May 23, 2020 at 09:28:26AM -0700, Paul Eggert wrote:
> On 5/23/20 9:11 AM, Rich Felker wrote:
>
> > stopping on an initial prefix ... does not admit easily sharing a backend with strto*.
>
> I don't see why. If the backend has a "stop scanning on integer overflow" flag
> (which it would need to have anyway, to support the proposed behavior), then
> *scanf can use the flag and strto* can not use it.
>
> Anyway, this is not an issue for glibc, which has no such backend.

It's relevant because you want to propose this for standardization.

> > that's contrary to the abstract behavior defined for scanf
> > (matching fields syntactically then value conversion)
>
> That's not really a problem. The abstract behavior already provides for matching
> that is not purely syntactic. For example, string conversion specifiers can
> impose length limits on the match, which means the matching does not rely purely
> on the syntax of the input. It would be easy to say that integer conversion
> specifiers can also impose limits related to integer overflow.

Sure that's syntax. It's /[^ ]{1,n}"/.

Of course for integers you can define a syntax that matches every
non-overflowing value (this is always true for finite matching sets),
but that's nothing like how the function is specified and I don't
think anyone reasonable would classify non-overflow as a syntactic
property.

> > It's also even *more
> > likely* to break programs that don't expect the behavior than just
> > storing a wrapped or clamped value
>
> That's not true of the code that I looked at (see the URLs earlier in this
> thread). That code was pretty carefully written and yet still vulnerable to the
> integer-overflow issue.

I don't follow. *Any* use of scanf on untrusted input is "vulnerable
to the integer-overflow issue" in the sense that overflow is UB. This
is not something subtle.

If you mean actually using overflowed values in an unsafe way
(assuming no ballooning effects of UB, just wrong values), I don't see
how it's subtle either. Any value that could be produced via overflow
could also be produced via non-overflowing input, and you have to
validate data either way.

> > I'm pretty sure the real answer here is just "don't use *scanf for
> > that."
>
> Absolutely true right now. We are merely talking about (a) what sort of
> implementation behavior is more useful for programs that are currently relying
> on undefined behavior, and (b) what might be the cleanest addition to POSIX
> later, to help improve this mess so that future programmers can use *scanf
> safely in more situations.

This is absolutely not "clean" and I am opposed to it.

Rich
Reply | Threaded
Open this post in threaded view
|

Re: RFC: *scanf vs. overflow

Paul Eggert
On 5/23/20 9:45 AM, Rich Felker wrote:

> It's relevant because you want to propose this for standardization.

I don't think it's ready for standardization now. I'm merely proposing it for
glibc. If it works well there, great; if not, then glibc should do something
more ambitious, such as Eric's proposal.

Doing nothing is not a good option; this is a real problem that affects many
real programs.

> that's syntax. It's /[^ ]{1,n}"/.

I'll concede that. That being said, the difference between syntax and semantics
is always somewhat arbitrary, and there's little point to slicing and dicing the
fuzzy boundary here. Regardless of whether the change is to "syntax" or
"semantics" it would be easy to change POSIX to allow the proposed behavior;
it's not a fundamental change to the spec.

> *Any* use of scanf on untrusted input is "vulnerable
> to the integer-overflow issue" in the sense that overflow is UB.

Absolutely, but that was not the point. The issue is what's the best thing to do
for these programs. Many carefully-written but imperfect programs have these
bugs, and although glibc is within its rights to dump core or worse when these
programs run, it's better if glibc's behavior improves their overall
reliability. Letting these errors run through silently causing further
corruption later is not a good strategy for improving overall reliability.
Pointing fingers at developers and telling them not to use scanf is not likely
to be a good strategy either.
> Any value that could be produced via overflow
> could also be produced via non-overflowing input, and you have to
> validate data either way.

Sure, but silently treating 2**32 as if it were zero is more likely to cause
real problems later. We need a better way for *scanf to reflect these errors
back to the calling code.
Reply | Threaded
Open this post in threaded view
|

Re: [Libguestfs] RFC: *scanf vs. overflow

Sourceware - libc-alpha mailing list
In reply to this post by Rich Felker-2
On Sat, May 23, 2020 at 12:45:01PM -0400, Rich Felker wrote:
> I don't follow. *Any* use of scanf on untrusted input is "vulnerable
> to the integer-overflow issue" in the sense that overflow is UB. This
> is not something subtle.

{,s}scanf is a useful, natural way to parse strings, and strto* is a
horrible interface with many bear traps.  It seems to me scanf could
be changed to make it safe for overflow, simply by stopping parsing at
the point where the overflow occurs and returning a short count (or
the various other ideas suggested already in this thread).

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top