[RFC] Possible new execveat(2) Linux syscall


David Drysdale
Hi,

Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
and it would be good to hear a glibc perspective about it (and whether there
are any interface changes that would make it easier to use from userspace).

The syscall prototype is:
  int execveat(int fd, const char *pathname,
                      char *const argv[],  char *const envp[],
                      int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
and it works similarly to execve(2) except:
 - the executable to run is identified by the combination of fd+pathname, like
   other *at(2) syscalls
 - there's an extra flags field to control behaviour.
(I've attached a text version of the suggested man page below)

One particular benefit of this is that it allows an fexecve(3) implementation
that doesn't rely on /proc being accessible, which is useful for sandboxed
applications.  (However, that does only work for non-interpreted programs:
the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
access to load the script file).

How does this sound from a glibc perspective?

Thanks,
David

[1] https://lkml.org/lkml/2014/11/7/512, with earlier discussions at
https://lkml.org/lkml/2014/11/6/469, https://lkml.org/lkml/2014/10/22/275
and https://lkml.org/lkml/2014/10/17/428

----

EXECVEAT(2)              Linux Programmer's Manual             EXECVEAT(2)

NAME
       execveat - execute program relative to a directory file descriptor

SYNOPSIS
       #include <unistd.h>

       int execveat(int fd, const char *pathname,
                    char *const argv[],  char *const envp[],
                    int flags);

DESCRIPTION
       The  execveat()  system call executes the program pointed to by the
       combination of fd and pathname.  The execveat() system  call  oper‐
       ates  in  exactly the same way as execve(2), except for the differ‐
       ences described in this manual page.

       If the pathname given in pathname is relative, then  it  is  inter‐
       preted relative to the directory referred to by the file descriptor
       fd (rather than relative to the current working  directory  of  the
       calling process, as is done by execve(2) for a relative pathname).

       If  pathname is relative and fd is the special value AT_FDCWD, then
       pathname is interpreted relative to the current  working  directory
       of the calling process (like execve(2)).

       If pathname is absolute, then fd is ignored.

       If pathname is an empty string and the AT_EMPTY_PATH flag is speci‐
       fied, then the file descriptor fd specifies the  file  to  be  exe‐
       cuted.

       flags can either be 0, or include the following flags:

       AT_EMPTY_PATH
              If pathname is an empty string, operate on the file referred
              to by fd (which may have been  obtained  using  the  open(2)
              O_PATH flag).

       AT_SYMLINK_NOFOLLOW
              If  the  file  identified by fd and a non-NULL pathname is a
              symbolic link, then the call fails with the error EINVAL.

RETURN VALUE
       On success, execveat() does not return. On error  -1  is  returned,
       and errno is set appropriately.

ERRORS
       The  same  errors  that  occur  for  execve(2)  can  also occur for
       execveat().   The  following  additional  errors  can   occur   for
       execveat():

       EBADF  fd is not a valid file descriptor.

       ENOENT The  program  identified by fd and pathname requires the use
              of an interpreter program (such as a  script  starting  with
              "#!")  but  the  file  descriptor  fd  was  opened  with the
              O_CLOEXEC flag and so the program file  is  inaccessible  to
              the launched interpreter.

       EINVAL Invalid flag specified in flags.

       ENOTDIR
              pathname  is  relative and fd is a file descriptor referring
              to a file other than a directory.

VERSIONS
       execveat() was added to Linux in kernel 3.???.

NOTES
       In addition to the reasons explained in openat(2),  the  execveat()
       system call is also needed to allow fexecve(3) to be implemented on
       systems that do not have the /proc filesystem mounted.

SEE ALSO
       execve(2), fexecve(3)

Linux                           2014-04-02                     EXECVEAT(2)
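
A minimal sketch of the fexecve(3) use case described above, assuming a kernel and headers that provide the proposed syscall as SYS_execveat with the prototype given in this message. The helper name sketch_fexecve is hypothetical and is not part of any libc. It passes an empty pathname together with AT_EMPTY_PATH, so no /proc/self/fd/N path has to be built:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_EMPTY_PATH */
#include <sys/syscall.h>  /* SYS_execveat, if the installed headers define it */
#include <unistd.h>       /* syscall() */

/* Hypothetical helper: execute the file that fd refers to, fexecve(3)-style.
 * Returns only on failure, with -1 and errno set. */
int sketch_fexecve(int fd, char *const argv[], char *const envp[])
{
#ifdef SYS_execveat
    /* Empty pathname + AT_EMPTY_PATH: operate on fd itself. */
    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
#else
    /* Assumption: headers without the syscall number are treated as "not supported". */
    errno = ENOSYS;
#endif
    return -1;
}
```

As the message notes, this only helps for non-interpreted programs; a script interpreter still receives a /dev/fd/<fd> style name and has to open it itself.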

Re: [RFC] Possible new execveat(2) Linux syscall

Rich Felker
On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:

> Hi,
>
> Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> and it would be good to hear a glibc perspective about it (and whether there
> are any interface changes that would make it easier to use from userspace).
>
> The syscall prototype is:
>   int execveat(int fd, const char *pathname,
>                       char *const argv[],  char *const envp[],
>                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> and it works similarly to execve(2) except:
>  - the executable to run is identified by the combination of fd+pathname, like
>    other *at(2) syscalls
>  - there's an extra flags field to control behaviour.
> (I've attached a text version of the suggested man page below)
>
> One particular benefit of this is that it allows an fexecve(3) implementation
> that doesn't rely on /proc being accessible, which is useful for sandboxed
> applications.  (However, that does only work for non-interpreted programs:
> the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> access to load the script file).
>
> How does this sound from a glibc perspective?

I've been following the discussions so far and everything looks mostly
okay. There are still issues to be resolved with the different
semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
save the permissions at the time of open and cause them to be used in
place of the current file permissions at the time of execveat

One major issue however is FD_CLOEXEC with scripts. Last I checked,
this didn't work because the file is already closed by the time the
interpreter runs. The intended usage of fexecve is almost certainly to
call it with the file descriptor set close-on-exec; otherwise, there
would be no clean way to close it, since the program being executed
doesn't know that it's being executed via fexecve. So this is a
serious problem that needs to be solved if it hasn't already. I have
some ideas I could offer, but I'm not an expert on the kernel side
things so I'm not sure they'd be correct.

Rich
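
The usage pattern described above (mark the descriptor close-on-exec before handing it to fexecve) can be sketched with plain fcntl(2) calls. The helper names are hypothetical; the sketch only shows how the flag is set and queried, not how the kernel should treat it for scripts:

```c
#include <fcntl.h>

/* Hypothetical helper: returns 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
int sketch_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Hypothetical helper: sets FD_CLOEXEC on fd, preserving other descriptor flags.
 * Returns 0 on success, -1 on error. */
int sketch_set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}
```

With the flag set, a directly executed binary never sees the descriptor, which is the clean behaviour wanted here. The difficulty raised in this message is that an interpreter started for a script is handed a name that refers to that same, already closed, descriptor.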


Re: [RFC] Possible new execveat(2) Linux syscall

Andy Lutomirski
On Nov 16, 2014 11:53 AM, "Rich Felker" <[hidden email]> wrote:

>
> On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> > Hi,
> >
> > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> > and it would be good to hear a glibc perspective about it (and whether there
> > are any interface changes that would make it easier to use from userspace).
> >
> > The syscall prototype is:
> >   int execveat(int fd, const char *pathname,
> >                       char *const argv[],  char *const envp[],
> >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> > and it works similarly to execve(2) except:
> >  - the executable to run is identified by the combination of fd+pathname, like
> >    other *at(2) syscalls
> >  - there's an extra flags field to control behaviour.
> > (I've attached a text version of the suggested man page below)
> >
> > One particular benefit of this is that it allows an fexecve(3) implementation
> > that doesn't rely on /proc being accessible, which is useful for sandboxed
> > applications.  (However, that does only work for non-interpreted programs:
> > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> > access to load the script file).
> >
> > How does this sound from a glibc perspective?
>
> I've been following the discussions so far and everything looks mostly
> okay. There are still issues to be resolved with the different
> semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> save the permissions at the time of open and cause them to be used in
> place of the current file permissions at the time of execveat

Is something missing here?

FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
help would be appreciated.

>
> One major issue however is FD_CLOEXEC with scripts. Last I checked,
> this didn't work because the file is already closed by the time the
> interpreter runs. The intended usage of fexecve is almost certainly to
> call it with the file descriptor set close-on-exec; otherwise, there
> would be no clean way to close it, since the program being executed
> doesn't know that it's being executed via fexecve. So this is a
> serious problem that needs to be solved if it hasn't already. I have
> some ideas I could offer, but I'm not an expert on the kernel side
> things so I'm not sure they'd be correct.

Bring on the ideas.

FWIW, I've often thought that interpreter binaries should mark
themselves as such to enable better interactions with the kernel.

--Andy


Re: [RFC] Possible new execveat(2) Linux syscall

Rich Felker
On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:

> On Nov 16, 2014 11:53 AM, "Rich Felker" <[hidden email]> wrote:
> >
> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> > > Hi,
> > >
> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> > > and it would be good to hear a glibc perspective about it (and whether there
> > > are any interface changes that would make it easier to use from userspace).
> > >
> > > The syscall prototype is:
> > >   int execveat(int fd, const char *pathname,
> > >                       char *const argv[],  char *const envp[],
> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> > > and it works similarly to execve(2) except:
> > >  - the executable to run is identified by the combination of fd+pathname, like
> > >    other *at(2) syscalls
> > >  - there's an extra flags field to control behaviour.
> > > (I've attached a text version of the suggested man page below)
> > >
> > > One particular benefit of this is that it allows an fexecve(3) implementation
> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
> > > applications.  (However, that does only work for non-interpreted programs:
> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> > > access to load the script file).
> > >
> > > How does this sound from a glibc perspective?
> >
> > I've been following the discussions so far and everything looks mostly
> > okay. There are still issues to be resolved with the different
> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> > save the permissions at the time of open and cause them to be used in
> > place of the current file permissions at the time of execveat
>
> Is something missing here?
>
> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
> help would be appreciated.

Yes. POSIX requires that permission checks for execution (fexecve with
O_EXEC file descriptors) and directory-search (*at functions with
O_SEARCH file descriptors) succeed if the open operation succeeded --
the permissions check is required to take place at open time rather
than at exec/search time. There's a separate discussion about how to
make this work on the kernel side.

> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
> > this didn't work because the file is already closed by the time the
> > interpreter runs. The intended usage of fexecve is almost certainly to
> > call it with the file descriptor set close-on-exec; otherwise, there
> > would be no clean way to close it, since the program being executed
> > doesn't know that it's being executed via fexecve. So this is a
> > serious problem that needs to be solved if it hasn't already. I have
> > some ideas I could offer, but I'm not an expert on the kernel side
> > things so I'm not sure they'd be correct.
>
> Bring on the ideas.

My thought is that when the kernel opens the binary and sees that it's
a script that needs an interpreter, the kernel should not pass
/proc/self/fd/%d to the interpreter, but instead should pass the name
of a new magic symlink in /proc/self that's connected to the inode for
the script to be executed but that ceases to exist as soon as it's
opened. In theory this could also be used for suid scripts to make
them secure.

> FWIW, I've often thought that interpreter binaries should mark
> themselves as such to enable better interactions with the kernel.

That's hard since users expect to be able to use arbitrary
interpreters (and sometimes even pass through multiple ones, e.g.
#!/usr/bin/env perl).

Rich

Re: [RFC] Possible new execveat(2) Linux syscall

Andy Lutomirski
On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <[hidden email]> wrote:

> On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
>> On Nov 16, 2014 11:53 AM, "Rich Felker" <[hidden email]> wrote:
>> >
>> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
>> > > Hi,
>> > >
>> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
>> > > and it would be good to hear a glibc perspective about it (and whether there
>> > > are any interface changes that would make it easier to use from userspace).
>> > >
>> > > The syscall prototype is:
>> > >   int execveat(int fd, const char *pathname,
>> > >                       char *const argv[],  char *const envp[],
>> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
>> > > and it works similarly to execve(2) except:
>> > >  - the executable to run is identified by the combination of fd+pathname, like
>> > >    other *at(2) syscalls
>> > >  - there's an extra flags field to control behaviour.
>> > > (I've attached a text version of the suggested man page below)
>> > >
>> > > One particular benefit of this is that it allows an fexecve(3) implementation
>> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
>> > > applications.  (However, that does only work for non-interpreted programs:
>> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
>> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
>> > > access to load the script file).
>> > >
>> > > How does this sound from a glibc perspective?
>> >
>> > I've been following the discussions so far and everything looks mostly
>> > okay. There are still issues to be resolved with the different
>> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> > save the permissions at the time of open and cause them to be used in
>> > place of the current file permissions at the time of execveat
>>
>> Is something missing here?
>>
>> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
>> help would be appreciated.
>
> Yes. POSIX requires that permission checks for execution (fexecve with
> O_EXEC file descriptors) and directory-search (*at functions with
> O_SEARCH file descriptors) succeed if the open operation succeeded --
> the permissions check is required to take place at open time rather
> than at exec/search time. There's a separate discussion about how to
> make this work on the kernel side.

It may be worth making this work as part of adding execveat to the
kernel.  Does the kernel even have O_EXEC right now?

>
>> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
>> > this didn't work because the file is already closed by the time the
>> > interpreter runs. The intended usage of fexecve is almost certainly to
>> > call it with the file descriptor set close-on-exec; otherwise, there
>> > would be no clean way to close it, since the program being executed
>> > doesn't know that it's being executed via fexecve. So this is a
>> > serious problem that needs to be solved if it hasn't already. I have
>> > some ideas I could offer, but I'm not an expert on the kernel side
>> > things so I'm not sure they'd be correct.
>>
>> Bring on the ideas.
>
> My thought is that when the kernel opens the binary and sees that it's
> a script that needs an interpreter, the kernel should not pass
> /proc/self/fd/%d to the interpreter, but instead should pass the name
> of a new magic symlink in /proc/self that's connected to the inode for
> the script to be executed but that ceases to exist as soon as it's
> opened. In theory this could also be used for suid scripts to make
> them secure.

This doesn't help if /proc is not mounted, which is an important use case.

>
>> FWIW, I've often thought that interpreter binaries should mark
>> themselves as such to enable better interactions with the kernel.
>
> That's hard since users expect to be able to use arbitrary
> interpreters (and sometimes even pass through multiple ones, e.g.
> #!/usr/bin/env perl).
>

Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.

I guess that #!/some/interpreted/script isn't allowed, but maybe
#!/usr/bin/env some-interpreted-script should work.

It could be that all that's really needed is some convention to tell
an interpreter that it should use fd N as a script *and close it*.
Something like /dev/fd_and_close/N could work, but that has all kinds
of problems.

Alternatively, if we could have a way to mark an fd so that it's
close-on-exec after exec, that would solve the nesting problem, as
long as every interpreter in the chain does it.  And the kernel could
certainly implement execve on a close-on-exec fd by passing /dev/fd/N
where N is a close-on-exec fd, at least in the non-nested case.

--Andy

> Rich



--
Andy Lutomirski
AMA Capital Management, LLC

Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

Rich Felker
On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:

> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <[hidden email]> wrote:
> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <[hidden email]> wrote:
> >> >
> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> >> > > Hi,
> >> > >
> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> >> > > and it would be good to hear a glibc perspective about it (and whether there
> >> > > are any interface changes that would make it easier to use from userspace).
> >> > >
> >> > > The syscall prototype is:
> >> > >   int execveat(int fd, const char *pathname,
> >> > >                       char *const argv[],  char *const envp[],
> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> >> > > and it works similarly to execve(2) except:
> >> > >  - the executable to run is identified by the combination of fd+pathname, like
> >> > >    other *at(2) syscalls
> >> > >  - there's an extra flags field to control behaviour.
> >> > > (I've attached a text version of the suggested man page below)
> >> > >
> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
> >> > > applications.  (However, that does only work for non-interpreted programs:
> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> >> > > access to load the script file).
> >> > >
> >> > > How does this sound from a glibc perspective?
> >> >
> >> > I've been following the discussions so far and everything looks mostly
> >> > okay. There are still issues to be resolved with the different
> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> >> > save the permissions at the time of open and cause them to be used in
> >> > place of the current file permissions at the time of execveat
> >>
> >> Is something missing here?
> >>
> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
> >> help would be appreciated.
> >
> > Yes. POSIX requires that permission checks for execution (fexecve with
> > O_EXEC file descriptors) and directory-search (*at functions with
> > O_SEARCH file descriptors) succeed if the open operation succeeded --
> > the permissions check is required to take place at open time rather
> > than at exec/search time. There's a separate discussion about how to
> > make this work on the kernel side.
>
> It may be worth making this work as part of adding execveat to the
> kernel.  Does the kernel even have O_EXEC right now?

No. The proposal is that O_EXEC and O_SEARCH would both be equal to
O_PATH|3 (3 being the rarely-used O_ACCMODE value for "neither read nor
write, but some weird ioctls are accepted"), which gracefully falls
back both for current kernels with O_PATH (in which case the 3 is
ignored and the discrepancy from POSIX is just the time at which
permissions are checked) and for pre-O_PATH kernels (in which case the
access mode used is 3, and read/write ops fail on the fd, but it's
still usable for fexecve and *at functions with /proc-based fallback
implementations).

I would be happy to see this work get done at the same time.
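
A minimal sketch of the flag encoding proposed above. The SKETCH_ macro names and the helper are hypothetical; they are not the O_EXEC/O_SEARCH definitions of any libc, and the sketch assumes Linux headers where O_PATH is available under _GNU_SOURCE:

```c
#define _GNU_SOURCE
#include <fcntl.h>   /* O_PATH, O_ACCMODE */

/* Hypothetical stand-ins for the proposed values: O_PATH plus access mode 3. */
#define SKETCH_O_EXEC   (O_PATH | 3)
#define SKETCH_O_SEARCH (O_PATH | 3)

/* Returns the access-mode bits of an open(2) flags word. A kernel that does
 * not know O_PATH would act on these bits alone (mode 3: no read, no write). */
int sketch_accmode(int flags)
{
    return flags & O_ACCMODE;
}
```

The point of the encoding is that one flags value degrades sensibly in both directions: a kernel with O_PATH ignores the low bits, and a kernel without it still gets a descriptor that cannot be read or written.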

> >> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
> >> > this didn't work because the file is already closed by the time the
> >> > interpreter runs. The intended usage of fexecve is almost certainly to
> >> > call it with the file descriptor set close-on-exec; otherwise, there
> >> > would be no clean way to close it, since the program being executed
> >> > doesn't know that it's being executed via fexecve. So this is a
> >> > serious problem that needs to be solved if it hasn't already. I have
> >> > some ideas I could offer, but I'm not an expert on the kernel side
> >> > things so I'm not sure they'd be correct.
> >>
> >> Bring on the ideas.
> >
> > My thought is that when the kernel opens the binary and sees that it's
> > a script that needs an interpreter, the kernel should not pass
> > /proc/self/fd/%d to the interpreter, but instead should pass the name
> > of a new magic symlink in /proc/self that's connected to the inode for
> > the script to be executed but that ceases to exist as soon as it's
> > opened. In theory this could also be used for suid scripts to make
> > them secure.
>
> This doesn't help if /proc is not mounted, which is an important use case.

I don't know what can be done in this case short of some really ugly
hacks, like giving open() special behavior when the pathname points to
a magic address in the argv region, or having the kernel create temp
files in some magic path.

> >> FWIW, I've often thought that interpreter binaries should mark
> >> themselves as such to enable better interactions with the kernel.
> >
> > That's hard since users expect to be able to use arbitrary
> > interpreters (and sometimes even pass through multiple ones, e.g.
> > #!/usr/bin/env perl).
>
> Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.
>
> I guess that #!/some/interpreted/script isn't allowed, but maybe
> #!/usr/bin/env some-interpreted-script should work.
>
> It could be that all that's really needed is some convention to tell
> an interpreter that it should use fd N as a script *and close it*.
> Something like /dev/fd_and_close/N could work, but that has all kinds
> of problems.
>
> Alternatively, if we could have a way to mark an fd so that it's
> close-on-exec after exec, that would solve the nesting problem, as
> long as every interpreter in the chain does it.  And the kernel could
> certainly implement execve on a close-on-exec fd by passing /dev/fd/N
> where N is a close-on-exec fd, at least in the non-nested case.

This doesn't solve the problem of needing /proc though (/dev/fd is
just a link to /proc/self/fd).

Rich

Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

Andy Lutomirski
On Sun, Nov 16, 2014 at 3:32 PM, Rich Felker <[hidden email]> wrote:

> On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
>> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <[hidden email]> wrote:
>> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
>> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <[hidden email]> wrote:
>> >> >
>> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
>> >> > > Hi,
>> >> > >
>> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
>> >> > > and it would be good to hear a glibc perspective about it (and whether there
>> >> > > are any interface changes that would make it easier to use from userspace).
>> >> > >
>> >> > > The syscall prototype is:
>> >> > >   int execveat(int fd, const char *pathname,
>> >> > >                       char *const argv[],  char *const envp[],
>> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
>> >> > > and it works similarly to execve(2) except:
>> >> > >  - the executable to run is identified by the combination of fd+pathname, like
>> >> > >    other *at(2) syscalls
>> >> > >  - there's an extra flags field to control behaviour.
>> >> > > (I've attached a text version of the suggested man page below)
>> >> > >
>> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
>> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
>> >> > > applications.  (However, that does only work for non-interpreted programs:
>> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
>> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
>> >> > > access to load the script file).
>> >> > >
>> >> > > How does this sound from a glibc perspective?
>> >> >
>> >> > I've been following the discussions so far and everything looks mostly
>> >> > okay. There are still issues to be resolved with the different
>> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> >> > save the permissions at the time of open and cause them to be used in
>> >> > place of the current file permissions at the time of execveat
>> >>
>> >> Is something missing here?
>> >>
>> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
>> >> help would be appreciated.
>> >
>> > Yes. POSIX requires that permission checks for execution (fexecve with
>> > O_EXEC file descriptors) and directory-search (*at functions with
>> > O_SEARCH file descriptors) succeed if the open operation succeeded --
>> > the permissions check is required to take place at open time rather
>> > than at exec/search time. There's a separate discussion about how to
>> > make this work on the kernel side.
>>
>> It may be worth making this work as part of adding execveat to the
>> kernel.  Does the kernel even have O_EXEC right now?
>
> No. The proposal is that O_EXEC and O_SEARCH would both be equal to
> O_PATH|3 (3 being the rarely-used O_ACCMODE for "neither read nor
> write, but some weird ioctls are accepted") which gracefully falls
> back for both current kernels with O_PATH (in which case the 3 is
> ignored and the discrepancy from POSIX is just the time at which
> permissions are checked) and for pre-O_PATH kernels (in which case the
> access mode used is 3, and read/write ops fail on the fd, but it's
> still usable for fexecve and *at functions with /proc-based fallback
> implementations).
>
> I would be happy to see this work get done at the same time.
>
>> >> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
>> >> > this didn't work because the file is already closed by the time the
>> >> > interpreter runs. The intended usage of fexecve is almost certainly to
>> >> > call it with the file descriptor set close-on-exec; otherwise, there
>> >> > would be no clean way to close it, since the program being executed
>> >> > doesn't know that it's being executed via fexecve. So this is a
>> >> > serious problem that needs to be solved if it hasn't already. I have
>> >> > some ideas I could offer, but I'm not an expert on the kernel side
>> >> > things so I'm not sure they'd be correct.
>> >>
>> >> Bring on the ideas.
>> >
>> > My thought is that when the kernel opens the binary and sees that it's
>> > a script that needs an interpreter, the kernel should not pass
>> > /proc/self/fd/%d to the interpreter, but instead should pass the name
>> > of a new magic symlink in /proc/self that's connected to the inode for
>> > the script to be executed but that ceases to exist as soon as it's
>> > opened. In theory this could also be used for suid scripts to make
>> > them secure.
>>
>> This doesn't help if /proc is not mounted, which is an important use case.
>
> I don't know what can be done in this case short of some really ugly
> hacks, like giving open() special behavior when the pathname points to
> a magic address in the argv region, or having the kernel create temp
> files in some magic path.
>
>> >> FWIW, I've often thought that interpreter binaries should mark
>> >> themselves as such to enable better interactions with the kernel.
>> >
>> > That's hard since users expect to be able to use arbitrary
>> > interpreters (and sometimes even pass through multiple ones, e.g.
>> > #!/usr/bin/env perl).
>>
>> Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.
>>
>> I guess that #!/some/interpreted/script isn't allowed, but maybe
>> #!/usr/bin/env some-interpreted-script should work.
>>
>> It could be that all that's really needed is some convention to tell
>> an interpreter that it should use fd N as a script *and close it*.
>> Something like /dev/fd_and_close/N could work, but that has all kinds
>> of problems.
>>
>> Alternatively, if we could have a way to mark an fd so that it's
>> close-on-exec after exec, that would solve the nesting problem, as
>> long as every interpreter in the chain does it.  And the kernel could
>> certainly implement execve on a close-on-exec fd by passing /dev/fd/N
>> where N is a close-on-exec fd, at least in the non-nested case.
>
> This doesn't solve the problem of needing /proc though (/dev/fd is
> just a link to /proc/self/fd).
>

Al Viro was talking about having a special fs just for /dev/fd.  And
interpreters could special-case path names of a certain form.

--Andy

Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

David Drysdale
In reply to this post by Rich Felker
On Sun, Nov 16, 2014 at 11:32 PM, Rich Felker <[hidden email]> wrote:

> On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
>> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <[hidden email]> wrote:
>> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
>> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <[hidden email]> wrote:
>> >> >
>> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
>> >> > > Hi,
>> >> > >
>> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
>> >> > > and it would be good to hear a glibc perspective about it (and whether there
>> >> > > are any interface changes that would make it easier to use from userspace).
>> >> > >
>> >> > > The syscall prototype is:
>> >> > >   int execveat(int fd, const char *pathname,
>> >> > >                       char *const argv[],  char *const envp[],
>> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
>> >> > > and it works similarly to execve(2) except:
>> >> > >  - the executable to run is identified by the combination of fd+pathname, like
>> >> > >    other *at(2) syscalls
>> >> > >  - there's an extra flags field to control behaviour.
>> >> > > (I've attached a text version of the suggested man page below)
>> >> > >
>> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
>> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
>> >> > > applications.  (However, that does only work for non-interpreted programs:
>> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
>> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
>> >> > > access to load the script file).
>> >> > >
>> >> > > How does this sound from a glibc perspective?
>> >> >
>> >> > I've been following the discussions so far and everything looks mostly
>> >> > okay. There are still issues to be resolved with the different
>> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> >> > save the permissions at the time of open and cause them to be used in
>> >> > place of the current file permissions at the time of execveat
>> >>
>> >> Is something missing here?
>> >>
>> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
>> >> help would be appreciated.
>> >
>> > Yes. POSIX requires that permission checks for execution (fexecve with
>> > O_EXEC file descriptors) and directory-search (*at functions with
>> > O_SEARCH file descriptors) succeed if the open operation succeeded --
>> > the permissions check is required to take place at open time rather
>> > than at exec/search time. There's a separate discussion about how to
>> > make this work on the kernel side.

I'm not familiar with O_EXEC either, I'm afraid, so to be clear -- does
O_EXEC mean the permission check is explicitly skipped later, at execute
time?  In other words, if you open(O_EXEC) an executable then remove the
execute bit from the file, does a subsequent fexecve() still work?

If it does, then from an implementation perspective that presumably implies
the need for a record of the permission check in the struct file (and that
this property would be inherited by any dup()ed file descriptors).  From a
security perspective, having a gap between time-of-check and time-of-use
always sounds worrying...

>>
>> It may be worth making this work as part of adding execveat to the
>> kernel.  Does the kernel even have O_EXEC right now?
>
> No. The proposal is that O_EXEC and O_SEARCH would both be equal to
> O_PATH|3 (3 being the rarely-used O_ACCMODE for "neither read nor
> write, but some weird ioctls are accepted") which gracefully falls
> back for both current kernels with O_PATH (in which case the 3 is
> ignored and the discrepancy from POSIX is just the time at which
> permissions are checked) and for pre-O_PATH kernels (in which case the
> access mode used is 3, and read/write ops fail on the fd, but it's
> still usable for fexecve and *at functions with /proc-based fallback
> implementations).
>
> I would be happy to see this work get done at the same time.

Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

Rich Felker
On Mon, Nov 17, 2014 at 03:42:15PM +0000, David Drysdale wrote:
> I'm not familiar with O_EXEC either, I'm afraid, so to be clear -- does
> O_EXEC mean the permission check is explicitly skipped later, at execute
> time?  In other words, if you open(O_EXEC) an executable then remove the
> execute bit from the file, does a subsequent fexecve() still work?

Yes. It's just like how read and write permissions work. If you open a
file for read then remove read permissions, or open it for write then
remove write permissions, the existing permissions to the open file
are not lost. Of course open with O_EXEC/O_SEARCH needs to fail if the
caller does not have +x access to the file/directory at the time of
open.
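
A minimal sketch of the read-permission analogy, assuming a Linux
system with a writable /tmp. The function name read_after_chmod is
made up for illustration; it only shows that access granted at open
time survives a later chmod, which is the behavior being asked for
with O_EXEC.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Opens a scratch file for reading, strips every permission bit from
 * the file, then reads through the descriptor that is already open.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_after_chmod(void)
{
    char path[] = "/tmp/openperm-XXXXXX";
    char buf[8];
    ssize_t n = -1;
    int rfd = -1;

    int wfd = mkstemp(path);            /* created with mode 0600 */
    if (wfd < 0)
        return -1;

    if (write(wfd, "hello", 5) == 5 &&
        (rfd = open(path, O_RDONLY)) >= 0 &&  /* permission checked here */
        chmod(path, 0) == 0)                  /* file is now mode 0000 */
        n = read(rfd, buf, sizeof buf);       /* not re-checked here */

    if (rfd >= 0)
        close(rfd);
    close(wfd);
    unlink(path);
    return n;
}
```

Calling read_after_chmod() should return 5: the read succeeds even
though the file no longer grants read permission to anyone.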

> If it does, then from an implementation perspective that presumably implies
> the need for a record of the permission check in the struct file (and that
> this property would be inherited by any dup()ed file descriptors).  From a
> security perspective, having a gap between time-of-check and time-of-use
> always sounds worrying...

This record already exists for read and write. All that's needed is
for an extra bit to be added to record exec/search permission.

Rich

Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

Christoph Hellwig
On Mon, Nov 17, 2014 at 01:30:10PM -0500, Rich Felker wrote:

> On Mon, Nov 17, 2014 at 03:42:15PM +0000, David Drysdale wrote:
> > I'm not familiar with O_EXEC either, I'm afraid, so to be clear -- does
> > O_EXEC mean the permission check is explicitly skipped later, at execute
> > time?  In other words, if you open(O_EXEC) an executable then remove the
> > execute bit from the file, does a subsequent fexecve() still work?
>
> Yes. It's just like how read and write permissions work. If you open a
> file for read then remove read permissions, or open it for write then
> remove write permissions, the existing permissions to the open file
> are not lost. Of course open with O_EXEC/O_SEARCH needs to fail if the
> caller does not have +x access to the file/directory at the time of
> open.

Adding a FMODE_EXEC similar to FMODE_READ/WRITE would be trivial.


Re: [RFC] Possible new execveat(2) Linux syscall

Christoph Hellwig
In reply to this post by Rich Felker
On Sun, Nov 16, 2014 at 02:52:46PM -0500, Rich Felker wrote:
> I've been following the discussions so far and everything looks mostly
> okay. There are still issues to be resolved with the different
> semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> save the permissions at the time of open and cause them to be used in
> place of the current file permissions at the time of execveat

As far as I can tell we only need the little patch below to make Linux
O_PATH a valid O_SEARCH implementation.  Rich, you said you wanted to
look over it?

For O_EXEC my interpretation is that we basically just need this new
execveat syscall + a patch to add FMODE_EXEC and enforce it.  So we
wouldn't even need the O_PATH|3 hack.  But unless someone more familiar
with the arcane details of the POSIX language verifies it I'm tempted to
give up trying to help to implement these flags :(

diff --git a/fs/open.c b/fs/open.c
index d6fd3ac..ee24720 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -512,7 +512,7 @@ out_unlock:
 
 SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
 {
- struct fd f = fdget(fd);
+ struct fd f = fdget_raw(fd);
  int err = -EBADF;
 
  if (f.file) {
@@ -633,7 +633,7 @@ SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group
 
 SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
 {
- struct fd f = fdget(fd);
+ struct fd f = fdget_raw(fd);
  int error = -EBADF;
 
  if (!f.file)

Re: [RFC] Possible new execveat(2) Linux syscall

David Drysdale
On Fri, Nov 21, 2014 at 10:13 AM, Christoph Hellwig <[hidden email]> wrote:

> On Sun, Nov 16, 2014 at 02:52:46PM -0500, Rich Felker wrote:
>> I've been following the discussions so far and everything looks mostly
>> okay. There are still issues to be resolved with the different
>> semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> save the permissions at the time of open and cause them to be used in
>> place of the current file permissions at the time of execveat
>
> As far as I can tell we only need the little patch below to make Linux
> O_PATH a valid O_SEARCH implementation.  Rich, you said you wanted to
> look over it?
>
> For O_EXEC my interpretation is that we basically just need this new
> execveat syscall + a patch to add FMODE_EXEC and enforce it.  So we
> wouldn't even need the O_PATH|3 hack.  But unless someone more familiar
> with the arcane details of the POSIX language verifies it I'm tempted to
> give up trying to help to implement these flags :(

I'm not particularly familiar with POSIX details either, but I thought the
O_PATH|3 hack would be needed for the interaction with O_ACCMODE -- just
using FMODE_EXEC as O_EXEC would confuse existing code that examines
(flags & O_ACCMODE).

From [1]:
  "Applications shall specify exactly one of the ...five ... file access
  modes ... O_EXEC / O_RDONLY / O_RDWR / O_SEARCH / O_WRONLY"
(and O_EXEC and O_SEARCH are allowed to be the same value,
as one only applies to files and the other only applies to directories).

As O_ACCMODE is 3, there are only 4 possible access modes that work
with any existing code that checks (flags & O_ACCMODE), and 3 of the
values are taken (0=O_RDONLY, 1=O_WRONLY, 2=O_RDWR).  So I
guess that's where the idea for the |3 hack comes from.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html

Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

Rich Felker
In reply to this post by Christoph Hellwig
On Fri, Nov 21, 2014 at 02:13:18AM -0800, Christoph Hellwig wrote:

> On Sun, Nov 16, 2014 at 02:52:46PM -0500, Rich Felker wrote:
> > I've been following the discussions so far and everything looks mostly
> > okay. There are still issues to be resolved with the different
> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> > save the permissions at the time of open and cause them to be used in
> > place of the current file permissions at the time of execveat
>
> As far as I can tell we only need the little patch below to make Linux
> O_PATH a valid O_SEARCH implementation.  Rich, you said you wanted to
> look over it?

I think the below looks correct, but it's not complete. The *at
functions also need to use FMODE_EXEC rather than rechecking +x
permissions at the time of the operation.

> For O_EXEC my interpretation is that we basically just need this new
> execveat syscall + a patch to add FMODE_EXEC and enforce it.  So we
> wouldn't even need the O_PATH|3 hack.  But unless someone more familiar
> with the arcane details of the POSIX language verifies it I'm tempted to
> give up trying to help to implement these flags :(

O_EXEC/O_SEARCH cannot be equal to O_PATH, because of differing
semantics on open. With O_NOFOLLOW, O_PATH yields a file descriptor
referring to the symlink itself. With O_EXEC or O_SEARCH, O_NOFOLLOW
is required to make open fail if the target is a symlink. It would be
a serious regression to eliminate the ability of O_PATH to open
symlinks like this.

Note that enforcing O_NOFOLLOW failure on symlinks can be implemented
in userspace instead of (or in addition to, for better behavior with
old kernels) kernelspace, but it still requires a different value from
O_PATH or userspace would be eliminating access to an important O_PATH
feature.
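
The difference can be observed with a sketch like the following,
assuming Linux with O_PATH support and fstat() working on O_PATH
descriptors. The function name probe_nofollow is invented for this
example, and the fallback O_PATH value is the one used on most
architectures, not all.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000   /* assumption: asm-generic value */
#endif

/* Creates a symlink in a scratch directory and opens it two ways.
 * On success returns 0 and reports whether the O_PATH descriptor
 * refers to the symlink itself, plus the errno from a plain
 * O_RDONLY|O_NOFOLLOW open (0 if that open succeeded). */
int probe_nofollow(int *path_fd_is_symlink, int *plain_open_errno)
{
    char dir[] = "/tmp/nofollow-XXXXXX";
    char linkpath[64];
    struct stat st;
    int rc = -1;

    if (!mkdtemp(dir))
        return -1;
    snprintf(linkpath, sizeof linkpath, "%s/link", dir);

    if (symlink("/", linkpath) == 0) {
        /* O_PATH|O_NOFOLLOW: the open succeeds on the link itself. */
        int fd = open(linkpath, O_PATH | O_NOFOLLOW);
        if (fd >= 0 && fstat(fd, &st) == 0) {
            *path_fd_is_symlink = S_ISLNK(st.st_mode) != 0;
            rc = 0;
        }
        if (fd >= 0)
            close(fd);

        /* Without O_PATH, O_NOFOLLOW makes the open fail instead. */
        errno = 0;
        fd = open(linkpath, O_RDONLY | O_NOFOLLOW);
        *plain_open_errno = fd < 0 ? errno : 0;
        if (fd >= 0)
            close(fd);
        unlink(linkpath);
    }
    rmdir(dir);
    return rc;
}
```

The first open is the O_PATH feature that would be lost if O_EXEC
and O_SEARCH shared its value; the second shows the failure mode
that O_EXEC/O_SEARCH with O_NOFOLLOW are required to have.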

Further, O_PATH|3 was the best value I could find to yield nearly
reasonable fallback behavior on most old kernels. Simply using 3 fails
to open directories and files to which the caller does not have write
permission (mode 3 is a nearly-undocumented hack for opening devices
for ioctl-only read-write access, it seems). On pre-O_PATH kernels,
using O_PATH|3 would fall back to this failing case, yielding spurious
failure-to-open for all O_SEARCH and some O_EXEC operations, but those
kernels are old enough to be irrelevant to most users anyway. On
kernels that do have O_PATH, using O_PATH|3 ignores the 3 and yields
the current O_PATH semantics, which are nearly correct.

Of course O_PATH|1 or O_PATH|2 would also work in principle, as would
adding a completely new bit in addition to O_PATH, but these all seem
less desirable.
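
As a rough userspace sketch of the proposal, assuming a kernel that
has O_PATH: the SKETCH_O_SEARCH and SKETCH_O_EXEC macros and the
search_fd_works helper are hypothetical names for illustration, not
definitions taken from any libc header, and the fallback O_PATH
value is the one used on most architectures.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000   /* assumption: asm-generic value */
#endif

/* Hypothetical definitions following the O_PATH|3 idea. */
#define SKETCH_O_SEARCH (O_PATH | 3)
#define SKETCH_O_EXEC   (O_PATH | 3)

/* Opens dir with the sketched O_SEARCH value and uses the result as
 * the directory descriptor of an *at() call. Returns 1 if name can
 * be resolved relative to it, 0 otherwise. */
int search_fd_works(const char *dir, const char *name)
{
    int dfd = open(dir, SKETCH_O_SEARCH | O_DIRECTORY | O_CLOEXEC);
    if (dfd < 0)
        return 0;

    /* With O_PATH set, the kernel ignores the access-mode bits, so
     * the descriptor behaves like a plain O_PATH one. */
    int ok = faccessat(dfd, name, F_OK, 0) == 0;
    close(dfd);
    return ok;
}
```

For example, search_fd_works("/", ".") should return 1 on such a
kernel; what this does not give is the POSIX guarantee that the +x
check happens only once, at open time.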

Rich

Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall

Rich Felker
In reply to this post by David Drysdale
On Fri, Nov 21, 2014 at 01:49:35PM +0000, David Drysdale wrote:

> On Fri, Nov 21, 2014 at 10:13 AM, Christoph Hellwig <[hidden email]> wrote:
> > On Sun, Nov 16, 2014 at 02:52:46PM -0500, Rich Felker wrote:
> >> I've been following the discussions so far and everything looks mostly
> >> okay. There are still issues to be resolved with the different
> >> semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> >> O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> >> save the permissions at the time of open and cause them to be used in
> >> place of the current file permissions at the time of execveat
> >
> > As far as I can tell we only need the little patch below to make Linux
> > O_PATH a valid O_SEARCH implementation.  Rich, you said you wanted to
> > look over it?
> >
> > For O_EXEC my interpretation is that we basically just need this new
> > execveat syscall + a patch to add FMODE_EXEC and enforce it.  So we
> > wouldn't even need the O_PATH|3 hack.  But unless someone more familiar
> > with the arcane details of the POSIX language verifies it I'm tempted to
> > give up trying to help to implement these flags :(
>
> I'm not particularly familiar with POSIX details either, but I thought the
> O_PATH|3 hack would be needed for the interaction with O_ACCMODE -- just
> using FMODE_EXEC as O_EXEC would confuse existing code that examines
> (flags & O_ACCMODE).

To conform to POSIX, O_ACCMODE needs to contain all the bits of
O_RDONLY|O_WRONLY|O_RDWR|O_SEARCH|O_EXEC. Certainly it's possible that
code compiled with an old definition of O_ACCMODE as 3 could inherit
(or otherwise obtain) a file descriptor in O_SEARCH/O_EXEC mode, so
it's preferable to have the low 2 bits be distinct from the existing
access modes, but O_ACCMODE's definition (at least in userspace)
really does need to be updated to equal O_PATH|3.

> From [1]:
>   "Applications shall specify exactly one of the ...five ... file access
>   modes ... O_EXEC / O_RDONLY / O_RDWR / O_SEARCH / O_WRONLY"
> (and O_EXEC and O_SEARCH are allowed to be the same value,
> as one only applies to files and the other only applies to directories).
>
> As O_ACCMODE is 3, there are only 4 possible access modes that work
> with any existing code that checks (flags & O_ACCMODE), and 3 of the
> values are taken (0=O_RDONLY, 1=O_WRONLY, 2=O_RDWR).  So I
> guess that's where the idea for the |3 hack comes from.

3 is "taken" too, but it's a mostly-undocumented hack.

Rich