Fallout from dlopen() blocking SIGSYS

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Fallout from dlopen() blocking SIGSYS

Gian-Carlo Pascutto
(reposting here per request from Florian Weimer)

This glibc patch:

Block signals during the initial part of dlopen
(a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)

is going to break every Firefox release of the last few years. We use a
seccomp-bpf filter to sandbox various processes. In some of these
processes we don't want to do a dlopen() of untrusted code while we're
not sandboxed yet, for example in the process we use to isolate Google's
Widevine DRM modules from any private data on the system.

seccomp-bpf will intercept various filesystem related syscalls and raise
SIGSYS, at which moment our code will contact a broker in the parent
process that checks if the file that's being want to read is acceptable
to us, and then passes down the file handle.

This obviously only works if the code that we load doesn't block the
SIGSYS signal, so we interpose the signal handling functions to stop
that from happening:
https://searchfox.org/mozilla-central/rev/04d8e7629354bab9e6a285183e763410860c5006/security/sandbox/linux/SandboxHooks.cpp#42

But, ld.so being the linker itself, this technique doesn't work for it.
This means that it will succeed in blocking SIGSYS, try open() (or
similar), and fail because seccomp-bpf will block it but nobody will
handle the raised signal. Now, we hit a Linux seccomp-bpf design issue:
SECCOMP_RET_TRAP will, if the signal is either ignored or blocked,
unblock it and reset it to SIG_DFL (effectively).
At this point, Firefox crashes.

We are tracking the problem here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1600574

This is a particularly nasty problem for us - the only
solution/workaround so far is to disable sandboxing, and if affects old
releases such as ESR.

I'm not clear if it affects other seccomp-bpf users like Chromium - they
use almost exactly the same way of dealing with filesystem access, but
(all AFAIK) only for the sandboxed GPU Process which might not need to
dlopen() things (and I'm not sure that particular sandbox is enabled on
regular Linux to begin with).

We might need to intercept the system calls that set up the signal
handling and basically undo the above glibc change, while trying to
ensure the conditions that caused you to make the above change don't
hold. Or something. We're still thinking about how to cope with this.

--
GCP
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Adhemerval Zanella-2


On 03/12/2019 11:31, Gian-Carlo Pascutto wrote:

> (reposting here per request from Florian Weimer)
>
> This glibc patch:
>
> Block signals during the initial part of dlopen
> (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
>
> is going to break every Firefox release of the last few years. We use a
> seccomp-bpf filter to sandbox various processes. In some of these
> processes we don't want to do a dlopen() of untrusted code while we're
> not sandboxed yet, for example in the process we use to isolate Google's
> Widevine DRM modules from any private data on the system.
>
> seccomp-bpf will intercept various filesystem related syscalls and raise
> SIGSYS, at which moment our code will contact a broker in the parent
> process that checks if the file that's being want to read is acceptable
> to us, and then passes down the file handle.
>
> This obviously only works if the code that we load doesn't block the
> SIGSYS signal, so we interpose the signal handling functions to stop
> that from happening:
> https://searchfox.org/mozilla-central/rev/04d8e7629354bab9e6a285183e763410860c5006/security/sandbox/linux/SandboxHooks.cpp#42
>
> But, ld.so being the linker itself, this technique doesn't work for it.
> This means that it will succeed in blocking SIGSYS, try open() (or
> similar), and fail because seccomp-bpf will block it but nobody will
> handle the raised signal. Now, we hit a Linux seccomp-bpf design issue:
> SECCOMP_RET_TRAP will, if the signal is either ignored or blocked,
> unblock it and reset it to SIG_DFL (effectively).
> At this point, Firefox crashes.
>
> We are tracking the problem here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1600574
>
> This is a particularly nasty problem for us - the only
> solution/workaround so far is to disable sandboxing, and if affects old
> releases such as ESR.
>
> I'm not clear if it affects other seccomp-bpf users like Chromium - they
> use almost exactly the same way of dealing with filesystem access, but
> (all AFAIK) only for the sandboxed GPU Process which might not need to
> dlopen() things (and I'm not sure that particular sandbox is enabled on
> regular Linux to begin with).
>
> We might need to intercept the system calls that set up the signal
> handling and basically undo the above glibc change, while trying to
> ensure the conditions that caused you to make the above change don't
> hold. Or something. We're still thinking about how to cope with this.
>

Block and unblocking signals during certain parts of libc implementation
are required to avoid consistency issues and Florian can give us a better
idea why it is important on dlopen. Another implementation that glibc also
does it is posix_spawn and you might face this very issue if you use use
posix_spawn with some file action.

IMHO the issue here is seccomp-bpf is crossing some API boundaries and
adding such constraint that make harder to provide a more robust libc
implementation. The SIGSYS seemed to become a de-facto API for seccomp,
where blocking it might break some filter criteria specially if it tries
to work along with libc.

I presume glibc might try to work better with seccomp and don't add SIGSYS
on __libc_signal_block_all, however it is an incomplete solution since
SIGSYS handler can potentially make libc internal state inconsistent.

What I would expect is with current SECCOMP_RET_TRAP semantic that once
a filter is installed, the kernel would either fail or silent ignore
trying to block SIGSYS (as for SIGKILL or SIGSTOP).  However it does
not help with current situation.

Would be possible to write a filter to intercept rt_sigprocmask and
exclude SIGSYS for this specific case?
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Rich Felker-2
In reply to this post by Gian-Carlo Pascutto
On Tue, Dec 03, 2019 at 03:31:34PM +0100, Gian-Carlo Pascutto wrote:

> (reposting here per request from Florian Weimer)
>
> This glibc patch:
>
> Block signals during the initial part of dlopen
> (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
>
> is going to break every Firefox release of the last few years. We use a
> seccomp-bpf filter to sandbox various processes. In some of these
> processes we don't want to do a dlopen() of untrusted code while we're
> not sandboxed yet, for example in the process we use to isolate Google's
> Widevine DRM modules from any private data on the system.
>
> seccomp-bpf will intercept various filesystem related syscalls and raise
> SIGSYS, at which moment our code will contact a broker in the parent
> process that checks if the file that's being want to read is acceptable
> to us, and then passes down the file handle.
>
> This obviously only works if the code that we load doesn't block the
> SIGSYS signal, so we interpose the signal handling functions to stop
> that from happening:
> https://searchfox.org/mozilla-central/rev/04d8e7629354bab9e6a285183e763410860c5006/security/sandbox/linux/SandboxHooks.cpp#42
>
> But, ld.so being the linker itself, this technique doesn't work for it.
> This means that it will succeed in blocking SIGSYS, try open() (or
> similar), and fail because seccomp-bpf will block it but nobody will
> handle the raised signal. Now, we hit a Linux seccomp-bpf design issue:
> SECCOMP_RET_TRAP will, if the signal is either ignored or blocked,
> unblock it and reset it to SIG_DFL (effectively).
> At this point, Firefox crashes.
>
> We are tracking the problem here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1600574
>
> This is a particularly nasty problem for us - the only
> solution/workaround so far is to disable sandboxing, and if affects old
> releases such as ESR.
>
> I'm not clear if it affects other seccomp-bpf users like Chromium - they
> use almost exactly the same way of dealing with filesystem access, but
> (all AFAIK) only for the sandboxed GPU Process which might not need to
> dlopen() things (and I'm not sure that particular sandbox is enabled on
> regular Linux to begin with).
>
> We might need to intercept the system calls that set up the signal
> handling and basically undo the above glibc change, while trying to
> ensure the conditions that caused you to make the above change don't
> hold. Or something. We're still thinking about how to cope with this.

This entire design has a lot of things that can go wrong, and really
should be changed. Unblocking signals behind the application's (or
worse, libc/ldso's) back is a really nasty thing to do, and
intentionally breaks the "critical section" properties the blocking
very likely was intended to ensure. In a worst case, the stack pointer
might not even be valid while signals are blocked.

I don't understand the whole architecture, so it's hard for me to
offer constructive advice on how it *should* be done, but the current
way is surely wrong. The fact that it broke under a completely valid
and completely reasonable change to glibc is evidence of that; it's
silently depending on old-glibc internals. The right solution probably
involves using seccomp with a tracer process, or if the dlopen calls
are the only thing you need to filter, interposing dlopen to prevent
the nasty library from calling it. Or, just patching the nasty library
not to do nasty things you don't want it to do.

The issue was brought to my attention in relation to musl libc, and at
present it probably doesn't affect us, but it's very likely that it
could at some point in the future. Blocking signals is an idiom we use
heavily for ensuring safety of operations that might otherwise have
race conditions, making async-signal-safe locks, etc.

Rich
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Rich Felker-2
In reply to this post by Adhemerval Zanella-2
On Wed, Dec 04, 2019 at 05:46:54PM -0300, Adhemerval Zanella wrote:

>
>
> On 03/12/2019 11:31, Gian-Carlo Pascutto wrote:
> > (reposting here per request from Florian Weimer)
> >
> > This glibc patch:
> >
> > Block signals during the initial part of dlopen
> > (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
> >
> > is going to break every Firefox release of the last few years. We use a
> > seccomp-bpf filter to sandbox various processes. In some of these
> > processes we don't want to do a dlopen() of untrusted code while we're
> > not sandboxed yet, for example in the process we use to isolate Google's
> > Widevine DRM modules from any private data on the system.
> >
> > seccomp-bpf will intercept various filesystem related syscalls and raise
> > SIGSYS, at which moment our code will contact a broker in the parent
> > process that checks if the file that's being want to read is acceptable
> > to us, and then passes down the file handle.
> >
> > This obviously only works if the code that we load doesn't block the
> > SIGSYS signal, so we interpose the signal handling functions to stop
> > that from happening:
> > https://searchfox.org/mozilla-central/rev/04d8e7629354bab9e6a285183e763410860c5006/security/sandbox/linux/SandboxHooks.cpp#42
> >
> > But, ld.so being the linker itself, this technique doesn't work for it.
> > This means that it will succeed in blocking SIGSYS, try open() (or
> > similar), and fail because seccomp-bpf will block it but nobody will
> > handle the raised signal. Now, we hit a Linux seccomp-bpf design issue:
> > SECCOMP_RET_TRAP will, if the signal is either ignored or blocked,
> > unblock it and reset it to SIG_DFL (effectively).
> > At this point, Firefox crashes.
> >
> > We are tracking the problem here:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1600574
> >
> > This is a particularly nasty problem for us - the only
> > solution/workaround so far is to disable sandboxing, and if affects old
> > releases such as ESR.
> >
> > I'm not clear if it affects other seccomp-bpf users like Chromium - they
> > use almost exactly the same way of dealing with filesystem access, but
> > (all AFAIK) only for the sandboxed GPU Process which might not need to
> > dlopen() things (and I'm not sure that particular sandbox is enabled on
> > regular Linux to begin with).
> >
> > We might need to intercept the system calls that set up the signal
> > handling and basically undo the above glibc change, while trying to
> > ensure the conditions that caused you to make the above change don't
> > hold. Or something. We're still thinking about how to cope with this.
> >
>
> Block and unblocking signals during certain parts of libc implementation
> are required to avoid consistency issues and Florian can give us a better
> idea why it is important on dlopen. Another implementation that glibc also
> does it is posix_spawn and you might face this very issue if you use use
> posix_spawn with some file action.
>
> IMHO the issue here is seccomp-bpf is crossing some API boundaries and
> adding such constraint that make harder to provide a more robust libc
> implementation. The SIGSYS seemed to become a de-facto API for seccomp,
> where blocking it might break some filter criteria specially if it tries
> to work along with libc.
>
> I presume glibc might try to work better with seccomp and don't add SIGSYS
> on __libc_signal_block_all, however it is an incomplete solution since
> SIGSYS handler can potentially make libc internal state inconsistent.
>
> What I would expect is with current SECCOMP_RET_TRAP semantic that once
> a filter is installed, the kernel would either fail or silent ignore
> trying to block SIGSYS (as for SIGKILL or SIGSTOP).  However it does
> not help with current situation.
>
> Would be possible to write a filter to intercept rt_sigprocmask and
> exclude SIGSYS for this specific case?

This is probably possible but even more dangerous. It risks leaving a
signal unblocked in places where it's unsafe for it to be unblocked,
and hiding that fact from libc. That's goind deeper in the rabbit hole
of hacks that poke at implementation internals, not getting out of it.

I think the problem here is that SECCOMP_RET_TRAP is currently a
broken interface, and it's part of a larger problem of
"self-sandboxing" not being a valid sandboxing model. A non-broken
version of it would deliver the signal to a separate process;
SECCOMP_RET_TRACE already does this, but I imagine it's problematic
for other reasons like interfering with use of debugger, etc.

Rich
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Christian Brauner-2
On Wed, Dec 04, 2019 at 05:13:19PM -0500, Rich Felker wrote:

> On Wed, Dec 04, 2019 at 05:46:54PM -0300, Adhemerval Zanella wrote:
> >
> >
> > On 03/12/2019 11:31, Gian-Carlo Pascutto wrote:
> > > (reposting here per request from Florian Weimer)
> > >
> > > This glibc patch:
> > >
> > > Block signals during the initial part of dlopen
> > > (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
> > >
> > > is going to break every Firefox release of the last few years. We use a
> > > seccomp-bpf filter to sandbox various processes. In some of these
> > > processes we don't want to do a dlopen() of untrusted code while we're
> > > not sandboxed yet, for example in the process we use to isolate Google's
> > > Widevine DRM modules from any private data on the system.
> > >
> > > seccomp-bpf will intercept various filesystem related syscalls and raise
> > > SIGSYS, at which moment our code will contact a broker in the parent
> > > process that checks if the file that's being want to read is acceptable
> > > to us, and then passes down the file handle.

Hey everyone,

I saw this fly by the libc-alpha mailing list late at night yesterday. I
think with work we've recently done in the upstream kernel the
SECCOMP_RET_TRAP approach trapping and signaling is not needed anymore
at least on newer kernels. That won't help you with legacy but it's
still worth considering switching over.

Let me say, I'm sorry if this will be a bit longer and sorry if I
should've misunderstood your use-case.

Afaict, you use SECCOMP_RET_TRAP to intercept syscalls and emulate them
in userspace, i.e. the actual syscall is never actually performed. An
alternative to SECCOMP_RET_TRAP is to use the SECCOMP_RET_USER_NOTIF
feature that we released with Linux v5.0 which doesn not rely on signals
nor ptrace.

Here's the gist:
SECCOMP_RET_USER_NOTIF enables a process (supervisee) to retrieve an fd
for its seccomp filter. This fd can then be handed to another (usually
more privileged) process (supervisor). The supervisor will then be able
to receive seccomp messages about the syscalls having been performed by
the supervisee on the fd.

We have integrated this feature into userspace and currently make heavy
use of this to intercept mknod(), mount(), and other syscalls in user
namespaces aka in containers.
For example, if the mknod() syscall matches a device in a pre-determined
whitelist the privileged supervisor will perform the mknod syscall in
lieu of the unprivileged supervisee and report back to the supervisee on
the success or failure of its attempt. If the syscall does not match a
device in a whitelist we simply report an error.

Here's a quick asciinema demo

https://asciinema.org/a/285491
Be a little patient during the mount() interception I'm demonstrating :).

Here are more technical details:
A task can registers a seccomp filter for a syscall with the
SECCOMP_RET_USER_NOTIF flag set in the filter. When it loads the filter
the caller can specify SECCOMP_FILTER_FLAG_NEW_LISTENER. This flag will
instruct seccomp to return a so-called "notifier fd" which is an
anonymous-inode-based, close-on-exec per default file descriptor. A
dummy code-example from our kernel testing is:

struct sock_filter filter[] = {
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),

        /* Let's listen for mknod() syscalls. */
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_mknod, 0, 1),

        /* If this is a mknod(), notify the listener. */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),

        /* If this wasn't a mknod() syscall just let it through. */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
};

struct sock_fprog prog = {
        .len = (unsigned short)ARRAY_SIZE(filter),
        .filter = filter,
};

int notify_fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);

The task now has a notify_fd to it's own seccomp filter. This notify_fd
can now be handed of to your broker process.

The broker can now add that notify_fd into an epoll() loop. When a
mknod() syscall (per the above example) is issued the kernel will set
EPOLLIN on the notify_fd and the broker will be notified that a syscall
has been performed and the task performing the syscall will now be
blocked.
The broker can then use the following ioctl:

struct seccomp_notif req = {};
ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);

to receive information about the performed syscall. The information
includes the pid of the calling task, an unique cookie identifying the
request, and the seccomp data.

struct seccomp_notif {
        __u64 id;
        __u32 pid;
        __u32 flags;
        struct seccomp_data data;
};

Non-pointer based arguments can be directly inspected by the broker
withour risking a TOCTOU via seccomp_data. Pointer-based arguments can
be safely inspected if the caller reads _all_ the data it is interested
in via /proc/<pid>/mem before it makes a decision whether or not to
emulate the syscall and checks that the request is still valid
afterwards to avoid pid recycling:

char buf[PATH_MAX];
int fd = open(/proc/<pid>/mem);

/* Read the path argument from the mknod() syscall. /*
pread(C.int(fd, buf, sizeof(buf), req.data.args[0]);

/*
 * Verify that the task has not exited in the meantime and been recycled
 * and we've read the wrong memory.
 */
ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0);

If that succeeds the broker knows, that the task is still alive and
hasn't been recycled.
Now the broker can inspect the path and the device number in the mknod()
syscall and decide whether or not to emulate the syscall.

After the broker emulated the syscall it can instruct the kernel to
report back an errno value to the task it listenes to via:

struct seccomp_notif_resp {
        __u64 id;
        __s64 val;
        __s32 error;
        __u32 flags;
};

struct seccomp_notif_resp resp = {
        .id = req.id,
        /* Let's use a really dumb errno value in this example. */
        .error = ENOANO,
};

ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp);

At this point the calling task will see a very strange error value
ENOANO for it's mknod() syscall.

In kernel 5.5 I've extended this feature to also allow continuing
syscalls by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag when
sending a response, i.e.

struct seccomp_notif_resp resp = {
        .id = req.id,
        .flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE,
};

ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp);

will cause the syscall to be performed. This is needed for syscalls that
can't be emulated in userspace, e.g. in contrast to mknod(), a lot of
other syscall we intercept (e.g.  setxattr()) cannot be easily filtered
like mknod() because they have pointer arguments. Additionally, some of
them might actually succeed in user namespaces (e.g. setxattr() for all
"user.*" xattrs). Since we currently cannot tell seccomp to continue
from a user notifier we are stuck with performing all of the syscalls in
lieu of the container. This is a huge security liability since it is
extremely difficult to correctly assume all of the necessary privileges
of the calling task such that the syscall can be successfully emulated
without escaping other additional security restrictions (think missing
CAP_MKNOD for mknod(), or MS_NODEV on a filesystem etc.). This can be
solved by telling seccomp to resume the syscall.

The continue feature obviously requires _massive amounts of caution_.
Here's my comment from the kernel header:

"Note, the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution!
If set by the process supervising the syscalls of another process the
syscall will continue. This is problematic because of an inherent TOCTOU.
An attacker can exploit the time while the supervised process is waiting on
a response from the supervising process to rewrite syscall arguments which
are passed as pointers of the intercepted syscall.
It should be absolutely clear that this means that the seccomp notifier
_cannot_ be used to implement a security policy! It should only ever be used
in scenarios where a more privileged process supervises the syscalls of a
lesser privileged process to get around kernel-enforced security
restrictions when the privileged process deems this safe. In other words,
in order to continue a syscall the supervising process should be sure that
another security mechanism or the kernel itself will sufficiently block
syscalls if arguments are rewritten to something unsafe.

Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
same syscall, the most recently added filter takes precedence. This means
that the new SECCOMP_RET_USER_NOTIF filter can override any
SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
such filtered syscalls to be executed by sending the response
SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE."

In any case, maybe this is is a viable alternative to you on new kernels
that allows you to avoid signals and such.

I'm happy to help review this or advice since we've had quite some
experience implementing and making use of this.
I've spoken on this feature a few times before and will be talking about
it during FOSDEM too.
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Rich Felker-2
On Thu, Dec 05, 2019 at 10:47:25AM +0100, Christian Brauner wrote:

> On Wed, Dec 04, 2019 at 05:13:19PM -0500, Rich Felker wrote:
> > On Wed, Dec 04, 2019 at 05:46:54PM -0300, Adhemerval Zanella wrote:
> > >
> > >
> > > On 03/12/2019 11:31, Gian-Carlo Pascutto wrote:
> > > > (reposting here per request from Florian Weimer)
> > > >
> > > > This glibc patch:
> > > >
> > > > Block signals during the initial part of dlopen
> > > > (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
> > > >
> > > > is going to break every Firefox release of the last few years. We use a
> > > > seccomp-bpf filter to sandbox various processes. In some of these
> > > > processes we don't want to do a dlopen() of untrusted code while we're
> > > > not sandboxed yet, for example in the process we use to isolate Google's
> > > > Widevine DRM modules from any private data on the system.
> > > >
> > > > seccomp-bpf will intercept various filesystem related syscalls and raise
> > > > SIGSYS, at which moment our code will contact a broker in the parent
> > > > process that checks if the file that's being want to read is acceptable
> > > > to us, and then passes down the file handle.
>
> Hey everyone,
>
> I saw this fly by the libc-alpha mailing list late at night yesterday. I
> think with work we've recently done in the upstream kernel the
> SECCOMP_RET_TRAP approach trapping and signaling is not needed anymore
> at least on newer kernels. That won't help you with legacy but it's
> still worth considering switching over.
>
> Let me say, I'm sorry if this will be a bit longer and sorry if I
> should've misunderstood your use-case.
>
> Afaict, you use SECCOMP_RET_TRAP to intercept syscalls and emulate them
> in userspace, i.e. the actual syscall is never actually performed. An
> alternative to SECCOMP_RET_TRAP is to use the SECCOMP_RET_USER_NOTIF
> feature that we released with Linux v5.0 which doesn not rely on signals
> nor ptrace.
>
> Here's the gist:
> SECCOMP_RET_USER_NOTIF enables a process (supervisee) to retrieve an fd
> for its seccomp filter. This fd can then be handed to another (usually
> more privileged) process (supervisor). The supervisor will then be able
> to receive seccomp messages about the syscalls having been performed by
> the supervisee on the fd.
>
> We have integrated this feature into userspace and currently make heavy
> use of this to intercept mknod(), mount(), and other syscalls in user
> namespaces aka in containers.
> For example, if the mknod() syscall matches a device in a pre-determined
> whitelist the privileged supervisor will perform the mknod syscall in
> lieu of the unprivileged supervisee and report back to the supervisee on
> the success or failure of its attempt. If the syscall does not match a
> device in a whitelist we simply report an error.
>
> Here's a quick asciinema demo
>
> https://asciinema.org/a/285491
> Be a little patient during the mount() interception I'm demonstrating :).
>
> Here are more technical details:
> A task can registers a seccomp filter for a syscall with the
> SECCOMP_RET_USER_NOTIF flag set in the filter. When it loads the filter
> the caller can specify SECCOMP_FILTER_FLAG_NEW_LISTENER. This flag will
> instruct seccomp to return a so-called "notifier fd" which is an
> anonymous-inode-based, close-on-exec per default file descriptor. A
> dummy code-example from our kernel testing is:
>
> struct sock_filter filter[] = {
> BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
>
> /* Let's listen for mknod() syscalls. */
> BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_mknod, 0, 1),
>
> /* If this is a mknod(), notify the listener. */
> BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
>
> /* If this wasn't a mknod() syscall just let it through. */
> BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> };
>
> struct sock_fprog prog = {
> .len = (unsigned short)ARRAY_SIZE(filter),
> .filter = filter,
> };
>
> int notify_fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
>
> The task now has a notify_fd to it's own seccomp filter. This notify_fd
> can now be handed of to your broker process.
>
> The broker can now add that notify_fd into an epoll() loop. When a
> mknod() syscall (per the above example) is issued the kernel will set
> EPOLLIN on the notify_fd and the broker will be notified that a syscall
> has been performed and the task performing the syscall will now be
> blocked.
> The broker can then use the following ioctl:
>
> struct seccomp_notif req = {};
> ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
>
> to receive information about the performed syscall. The information
> includes the pid of the calling task, an unique cookie identifying the
> request, and the seccomp data.
>
> struct seccomp_notif {
> __u64 id;
> __u32 pid;
> __u32 flags;
> struct seccomp_data data;
> };
>
> Non-pointer based arguments can be directly inspected by the broker
> withour risking a TOCTOU via seccomp_data. Pointer-based arguments can
> be safely inspected if the caller reads _all_ the data it is interested
> in via /proc/<pid>/mem before it makes a decision whether or not to
> emulate the syscall and checks that the request is still valid
> afterwards to avoid pid recycling:
>
> char buf[PATH_MAX];
> int fd = open(/proc/<pid>/mem);
>
> /* Read the path argument from the mknod() syscall. /*
> pread(C.int(fd, buf, sizeof(buf), req.data.args[0]);
>
> /*
>  * Verify that the task has not exited in the meantime and been recycled
>  * and we've read the wrong memory.
>  */
> ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0);
>
> If that succeeds the broker knows, that the task is still alive and
> hasn't been recycled.
> Now the broker can inspect the path and the device number in the mknod()
> syscall and decide whether or not to emulate the syscall.
>
> After the broker emulated the syscall it can instruct the kernel to
> report back an errno value to the task it listenes to via:
>
> struct seccomp_notif_resp {
> __u64 id;
> __s64 val;
> __s32 error;
> __u32 flags;
> };
>
> struct seccomp_notif_resp resp = {
> .id = req.id,
> /* Let's use a really dumb errno value in this example. */
> .error = ENOANO,
> };
>
> ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>
> At this point the calling task will see a very strange error value
> ENOANO for it's mknod() syscall.
>
> In kernel 5.5 I've extended this feature to also allow continuing
> syscalls by setting the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag when
> sending a response, i.e.
>
> struct seccomp_notif_resp resp = {
> .id = req.id,
> .flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE,
> };
>
> ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>
> will cause the syscall to be performed. This is needed for syscalls that
> can't be emulated in userspace, e.g. in contrast to mknod(), a lot of
> other syscall we intercept (e.g.  setxattr()) cannot be easily filtered
> like mknod() because they have pointer arguments. Additionally, some of
> them might actually succeed in user namespaces (e.g. setxattr() for all
> "user.*" xattrs). Since we currently cannot tell seccomp to continue
> from a user notifier we are stuck with performing all of the syscalls in
> lieu of the container. This is a huge security liability since it is
> extremely difficult to correctly assume all of the necessary privileges
> of the calling task such that the syscall can be successfully emulated
> without escaping other additional security restrictions (think missing
> CAP_MKNOD for mknod(), or MS_NODEV on a filesystem etc.). This can be
> solved by telling seccomp to resume the syscall.
>
> The continue feature obviously requires _massive amounts of caution_.
> Here's my comment from the kernel header:
>
> "Note, the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution!
> If set by the process supervising the syscalls of another process the
> syscall will continue. This is problematic because of an inherent TOCTOU.
> An attacker can exploit the time while the supervised process is waiting on
> a response from the supervising process to rewrite syscall arguments which
> are passed as pointers of the intercepted syscall.
> It should be absolutely clear that this means that the seccomp notifier
> _cannot_ be used to implement a security policy! It should only ever be used
> in scenarios where a more privileged process supervises the syscalls of a
> lesser privileged process to get around kernel-enforced security
> restrictions when the privileged process deems this safe. In other words,
> in order to continue a syscall the supervising process should be sure that
> another security mechanism or the kernel itself will sufficiently block
> syscalls if arguments are rewritten to something unsafe.
>
> Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> same syscall, the most recently added filter takes precedence. This means
> that the new SECCOMP_RET_USER_NOTIF filter can override any
> SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> such filtered syscalls to be executed by sending the response
> SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE."
>
> In any case, maybe this is is a viable alternative to you on new kernels
> that allows you to avoid signals and such.
>
> I'm happy to help review this or advice since we've had quite some
> experience implementing and making use of this.
> I've spoken on this feature a few times before and will be talking about
> it during FOSDEM too.

This is great news, and sounds pretty much exactly like what I
expected a replacement should look like. Thanks for the detailed
reply!

Rich
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Florian Weimer-5
In reply to this post by Gian-Carlo Pascutto
* Gian-Carlo Pascutto:

> Block signals during the initial part of dlopen
> (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
>
> is going to break every Firefox release of the last few years. We use a
> seccomp-bpf filter to sandbox various processes. In some of these
> processes we don't want to do a dlopen() of untrusted code while we're
> not sandboxed yet, for example in the process we use to isolate Google's
> Widevine DRM modules from any private data on the system.
>
> seccomp-bpf will intercept various filesystem related syscalls and raise
> SIGSYS, at which moment our code will contact a broker in the parent
> process that checks if the file that's being want to read is acceptable
> to us, and then passes down the file handle.

I have re-reviewed the referenced patch and posted:

  <https://sourceware.org/ml/libc-alpha/2019-12/msg00175.html>
  <https://sourceware.org/ml/libc-alpha/2019-12/msg00176.html>
  <https://sourceware.org/ml/libc-alpha/2019-12/msg00177.html>

Lazy binding is buggy and has races, but with the new patches, the
NODELETE changes should not make matters worse.

But I think we do need something better for seccomp sandboxing in the
medium term, so I'm happy to have a larger conversation now.

Is there actually a signal handler for SIGSYS in the monitored process?
Based on some discussion I've seen, I think the kernel pushes a signal
context on the thread stack (otherwise there wouldn't be a signal mask
to patch), handler or not.  This alone as compatibility implications.

There are cases where we absolutely have to block all signals for
correctness purposes.  Some reasons are:

(a) Implementing async-signal-safe functions on top of something that is
not async-signal-safe.

(b) Avoid running user code with the wrong TCB or an uninitialized TCB.

(c) Prevent the kernel from pushing the signal context onto a stack that
is too small.

(d) Avoid running user code on a stack that is too small.

(e) Enable reuse of the stack pointer register for something else.

Particularly for (a), I expect to see more cases in the future.  I don't
know which system calls we would run in such critical sections.  The
usage in dlopen falls into that category, but it's a very incomplete fix
and not very useful overall.

Unfortunately, (b) is generally necessary around clone system calls.
It's essential for correct use of vfork-like clone in posix_spawn.  We
don't do it for pthread_create today, but this results in a bug we want
to fix (see bug 25098).

(c) is relevant to the current use of clone in vfork because we have a
small stack there.  I think this impacts seccomp monitoring even if
there is no actual signal handler because of the signal context data
written by the kernel.  I want to add a clone_samestack system call
wrapper that avoids this issue, but I haven't done that yet.

(d) is a more problematic variant of (c).  That's a secondary issue with
the vfork as wellin addition to (b).  I don't think (d) is something we
do a lot in glibc, but applications may do it.  Perhaps they use
sigaltstack instead.

glibc currently does not do (e) as far as I know, but there are some
applications which use %esp on i386 as a general-purpose register.  I
doubt this use case is relevant to Firefox anyway.

Please do not underestimate the stack usage for the signal context.  If
I recall correctly, on current x86, it is more than 5 KiB, and on POWER,
it's more than 10 KiB.  Stack usage grows with newer kernel releases
which bring support for larger register files.  Some of this overhead
comes from the red zone (the stack region below the stack pointer that
signal handlers cannot touch), and that's part of the ABI definition.
But the variable-sized part of the signal context is not exported from
the kernel, so it's hard for applications to size stacks appropriately.
(That's why I'm interested in clone_samestack for glibc's internal use.)

For (a), we really need a list of system calls which are safe to perform
in such critical sections.  Can we call your interposed malloc, or will
that try to open files in /proc in some cases?

When we fix bug 25098 and adopt clone3, you might be a bit of a problem
because of the in-memory flags argument for clone3, and you can't
intercept the system call due to the blocked signals.

Thanks,
Florian

Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Christian Brauner-2
On Thu, Dec 05, 2019 at 05:03:00PM +0100, Florian Weimer wrote:

> * Gian-Carlo Pascutto:
>
> > Block signals during the initial part of dlopen
> > (a2e8aa0d9ea648068d8be52dd7b15f1b6a008e23)
> >
> > is going to break every Firefox release of the last few years. We use a
> > seccomp-bpf filter to sandbox various processes. In some of these
> > processes we don't want to do a dlopen() of untrusted code while we're
> > not sandboxed yet, for example in the process we use to isolate Google's
> > Widevine DRM modules from any private data on the system.
> >
> > seccomp-bpf will intercept various filesystem related syscalls and raise
> > SIGSYS, at which moment our code will contact a broker in the parent
> > process that checks if the file that's being want to read is acceptable
> > to us, and then passes down the file handle.
>
> I have re-reviewed the referenced patch and posted:
>
>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00175.html>
>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00176.html>
>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00177.html>
>
> Lazy binding is buggy and has races, but with the new patches, the
> NODELETE changes should not make matters worse.
>
> But I think we do need something better for seccomp sandboxing in the
> medium term, so I'm happy to have a larger conversation now.
>
> Is there actually a signal handler for SIGSYS in the monitored process?
> Based on some discussion I've seen, I think the kernel pushes a signal
> context on the thread stack (otherwise there wouldn't be a signal mask
> to patch), handler or not.  This alone as compatibility implications.
>
> There are cases where we absolutely have to block all signals for
> correctness purposes.  Some reasons are:
>
> (a) Implementing async-signal-safe functions on top of something that is
> not async-signal-safe.
>
> (b) Avoid running user code with the wrong TCB or an uninitialized TCB.
>
> (c) Prevent the kernel from pushing the signal context onto a stack that
> is too small.
>
> (d) Avoid running user code on a stack that is too small.
>
> (e) Enable reuse of the stack pointer register for something else.
>
> Particularly for (a), I expect to see more cases in the future.  I don't
> know which system calls we would run in such critical sections.  The
> usage in dlopen falls into that category, but it's a very incomplete fix
> and not very useful overall.
>
> Unfortunately, (b) is generally necessary around clone system calls.
> It's essential for correct use of vfork-like clone in posix_spawn.  We
> don't do it for pthread_create today, but this results in a bug we want
> to fix (see bug 25098).
>
> (c) is relevant to the current use of clone in vfork because we have a
> small stack there.  I think this impacts seccomp monitoring even if
> there is no actual signal handler because of the signal context data
> written by the kernel.  I want to add a clone_samestack system call
> wrapper that avoids this issue, but I haven't done that yet.
>
> (d) is a more problematic variant of (c).  That's a secondary issue with
> the vfork as wellin addition to (b).  I don't think (d) is something we
> do a lot in glibc, but applications may do it.  Perhaps they use
> sigaltstack instead.
>
> glibc currently does not do (e) as far as I know, but there are some
> applications which use %esp on i386 as a general-purpose register.  I
> doubt this use case is relevant to Firefox anyway.
>
> Please do not underestimate the stack usage for the signal context.  If
> I recall correctly, on current x86, it is more than 5 KiB, and on POWER,
> it's more than 10 KiB.  Stack usage grows with newer kernel releases
> which bring support for larger register files.  Some of this overhead
> comes from the red zone (the stack region below the stack pointer that
> signal handlers cannot touch), and that's part of the ABI definition.
> But the variable-sized part of the signal context is not exported from
> the kernel, so it's hard for applications to size stacks appropriately.
> (That's why I'm interested in clone_samestack for glibc's internal use.)
>
> For (a), we really need a list of system calls which are safe to perform
> in such critical sections.  Can we call your interposed malloc, or will
> that try to open files in /proc in some cases?
>
> When we fix bug 25098 and adopt clone3, you might be a bit of a problem
> because of the in-memory flags argument for clone3, and you can't

Fwiw, I have this on my agenda, i.e. making it possible for seccomp to
filter a certain __subset__ of system calls with pointer arguments. I've
started a discussion in August right before Kernel Summit:
https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2019-July/006699.html
and Kees Cook and I gave a session at Kernel Summit in Lisbon:
https://www.youtube.com/watch?v=PnOSPsRzVYM

It's planned I just need to find time to work on this :/

Christian
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Gian-Carlo Pascutto
In reply to this post by Florian Weimer-5
On 5/12/2019 17:03, Florian Weimer wrote:
> I have re-reviewed the referenced patch and posted:
>
>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00175.html>
>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00176.html>
>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00177.html>
>
> Lazy binding is buggy and has races, but with the new patches, the
> NODELETE changes should not make matters worse.

Thanks, it seems like that should put out the immediate fire at least.

> But I think we do need something better for seccomp sandboxing in the
> medium term, so I'm happy to have a larger conversation now.

We'd probably need to include the Chromium people in this. The
implementation we use is to a large extent based on theirs, and, IIRC,
it's also that team that did parts of the initial implementation of
seccomp-bpf in the kernel. The (only) reason this now affected Firefox
first is that dlopen() was the first to block signals, and we have a use
case where we need to do that inside the sandbox, and (apparently)
Chromium doesn't (yet).

But if the signal blocking is going to be required for other libc calls
to work, I would assume it is going to risk breaking Chromium too, as
they also use the SECCOMP_RET_TRAP mechanism in several places:

https://chromium.googlesource.com/chromium/src/+/refs/heads/master/services/service_manager/sandbox/linux/bpf_network_policy_linux.cc#38
https://chromium.googlesource.com/chromium/src/+/refs/heads/master/services/service_manager/sandbox/linux/bpf_gpu_policy_linux.cc#75
https://chromium.googlesource.com/chromium/src/+/refs/heads/master/services/service_manager/sandbox/linux/bpf_audio_policy_linux.cc#132

We could try to answer some of your questions as we're also familiar
with the code but it'd be like getting second hand information to some
extent.

> For (a), we really need a list of system calls which are safe to perform
> in such critical sections.  Can we call your interposed malloc, or will
> that try to open files in /proc in some cases?

It should do an anonymous mmap, AFAIK. We TRAP on

https://searchfox.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp

Anything trying to hit the filesystem:
open, openat, access, accessat, stat, lstat, statat, chmod, link,
symlink, rename, mkdir, rmdir, unlink, readlink, readlinkat, faccessat,
statx (in the future)

Anything affecting other processes or leaking too much data about the
system:
tkill, prctl (sometimes), getppid, connect, socketpair, socketcall,
sched, uname, fcntl, sched_getparam, sched_getscheduler,
sched_setscheduler (sometimes)

> When we fix bug 25098 and adopt clone3, you might be a bit of a problem
> because of the in-memory flags argument for clone3, and you can't
> intercept the system call due to the blocked signals.

Yes, that looks like it would be a serious problem, for both browsers:

https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/security/sandbox/linux/SandboxFilter.cpp#325

https://chromium.googlesource.com/chromium/src/+/refs/heads/master/sandbox/linux/seccomp-bpf-helpers/syscall_parameters_restrictions.cc#140

It sounds like what Christian Brauner is proposing would solve this, but
we can't rely on it being present - and likely won't be able to due to
backwards compatibility for many more years.

Some implementations in glibc try to use a new syscall, and if it
doesn't exist on the current kernel (ENOSYS), fall back to older
interfaces. If that's possible for clone3 usage, then we'd simply return
ENOSYS on that and force the fallback to regular clone, unless the
kernel is new enough that we can filter. That would decouple glibc's
ability to use the new syscall from the state of the seccomp filtering
implementation in the kernel. Could that work here?

We currently do this for statx, which just gets an ENOSYS instead of a
TRAP - glibc (and rust's stdlib!) will happily use their fallback paths
until we write a broker implementation for it.

--
GCP
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Florian Weimer-5
* Gian-Carlo Pascutto:

> On 5/12/2019 17:03, Florian Weimer wrote:
>> I have re-reviewed the referenced patch and posted:
>>
>>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00175.html>
>>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00176.html>
>>   <https://sourceware.org/ml/libc-alpha/2019-12/msg00177.html>
>>
>> Lazy binding is buggy and has races, but with the new patches, the
>> NODELETE changes should not make matters worse.
>
> Thanks, it seems like that should put out the immediate fire at least.
>
>> But I think we do need something better for seccomp sandboxing in the
>> medium term, so I'm happy to have a larger conversation now.
>
> We'd probably need to include the Chromium people in this.

Would you be able to do this?  I don't know anyone there.

> The implementation we use is to a large extent based on theirs, and,
> IIRC, it's also that team that did parts of the initial implementation
> of seccomp-bpf in the kernel. The (only) reason this now affected
> Firefox first is that dlopen() was the first to block signals, and we
> have a use case where we need to do that inside the sandbox, and
> (apparently) Chromium doesn't (yet).
>
> But if the signal blocking is going to be required for other libc calls
> to work, I would assume it is going to risk breaking Chromium too, as
> they also use the SECCOMP_RET_TRAP mechanism in several places:

Chromium was definitely affected by the recent changes in the glibc
syscall profile, adjusting to the current generation of system calls
recommended by the kernel.  (Only those system calls will receive Y2038
companions on existing 32-bit architectures, and we'll consolidate on
those system calls.)

>> For (a), we really need a list of system calls which are safe to perform
>> in such critical sections.  Can we call your interposed malloc, or will
>> that try to open files in /proc in some cases?
>
> It should do an anonymous mmap, AFAIK.

Doesn't it also create background threads?

> We TRAP on
>
> https://searchfox.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp
>
> Anything trying to hit the filesystem:
> open, openat, access, accessat, stat, lstat, statat, chmod, link,
> symlink, rename, mkdir, rmdir, unlink, readlink, readlinkat, faccessat,
> statx (in the future)
>
> Anything affecting other processes or leaking too much data about the
> system:
> tkill, prctl (sometimes), getppid, connect, socketpair, socketcall,
> sched, uname, fcntl, sched_getparam, sched_getscheduler,
> sched_setscheduler (sometimes)

And for other system calls, you may have BPF-based filters which inspect
arguments?

>> When we fix bug 25098 and adopt clone3, you might be a bit of a problem
>> because of the in-memory flags argument for clone3, and you can't
>> intercept the system call due to the blocked signals.
>
> Yes, that looks like it would be a serious problem, for both browsers:
>
> https://searchfox.org/mozilla-central/rev/ea63a0888d406fae720cf24f4727d87569a8cab5/security/sandbox/linux/SandboxFilter.cpp#325
>
> https://chromium.googlesource.com/chromium/src/+/refs/heads/master/sandbox/linux/seccomp-bpf-helpers/syscall_parameters_restrictions.cc#140

> Some implementations in glibc try to use a new syscall, and if it
> doesn't exist on the current kernel (ENOSYS), fall back to older
> interfaces. If that's possible for clone3 usage, then we'd simply return
> ENOSYS on that and force the fallback to regular clone, unless the
> kernel is new enough that we can filter. That would decouple glibc's
> ability to use the new syscall from the state of the seccomp filtering
> implementation in the kernel. Could that work here?

For default builds and the forseeable future, probably yes.

> We currently do this for statx, which just gets an ENOSYS instead of a
> TRAP - glibc (and rust's stdlib!) will happily use their fallback paths
> until we write a broker implementation for it.

Some distributions may opt to build glibc against a more recent minimum
kernel version instead of the default (Linux 3.2).  In that case, the
fallback code may be gone.  Some libraries also choose not to support
old kernels and make direct system calls.

This is also what happens if we switch to more recent system calls from
historic ones.  There won't be any ENOSYS fallback code if they are
available in Linux 3.2.

Thanks,
Florian

Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Gian-Carlo Pascutto
In reply to this post by Rich Felker-2
On 4/12/2019 23:08, Rich Felker wrote:
> The right solution probably involves using seccomp with a tracer
> process, or if the dlopen calls are the only thing you need to
> filter, interposing dlopen to prevent the nasty library from calling
>  it.

In the case of the EME/DRM modules, the use case is that we don't
(necessarily) trust the (closed source, proprietary) module. It's
shipped as a .so (currently, it really is libwidevinecdm.so), so we want
to be sure that we only call dlopen() on it *when already inside* the
sandbox. This in order to prevent some craftily written header (or
STT_IFUNC shenanigans) from owning us.

I singled this one out because it's the place where we intentionally do
the dlopen() inside.

There's a few other places where we end up using dlopen() inside the
sandbox where we perhaps don't necessarily need to. Most of those are
fixable *in theory*, but more problematic in practice as the libraries
tend to be those shared with other applications...and this is how I
found out that at least on Ubuntu Chromium ends up installing its
private, duplicate copies of those...sigh.

> Or, just patching the nasty library not to do nasty things you don't
> want it to do.

I doubt this is something we can realistically do here.

--
GCP
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Jed Davis
In reply to this post by Christian Brauner-2
On Thu, Dec 5, 2019 at 2:47 AM Christian Brauner
<[hidden email]> wrote:
> I saw this fly by the libc-alpha mailing list late at night yesterday. I
> think with work we've recently done in the upstream kernel the
> SECCOMP_RET_TRAP approach trapping and signaling is not needed anymore
> at least on newer kernels. That won't help you with legacy but it's
> still worth considering switching over.

Thank you for that detailed writeup.  SECCOMP_RET_USER_NOTIF does seem
to be capable of doing everything we're currently doing with
SECCOMP_RET_TRAP, and it should be feasible to use SECCOMP_RET_TRAP to
implement similar functionality for the case of old kernel +
SIGSYS-tolerant userland to avoid excessive code duplication.  It is
going to cause some extra context switches (anything like open() that
needs to pass back fds, and anything that's currently handled
in-process), but this was never meant to be on a critical path.  There
may also be problems with external security policies, like the Snap
package manager's confinement, that try to restrict ptrace-like
actions.  And, of course, rewriting everything will be a nontrivial
project.

Also, if Firefox needs to be SIGSYS-free before 2020-09-22, when the
68.x extended support branch ends, then we need to somehow find a
“low-risk” way to implement this major change.

As for legacy support: normally we can't rely on new kernel features
like that (even SECCOMP_FILTER_FLAG_TSYNC is still too new to use
without a fallback), but in this case there would have to be an older
kernel combined with a significantly newer glibc (whichever version
eventually unrecoverably breaks the use of SIGSYS) to have problems,
so this is hopefully an exception.

--Jed
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Jed Davis
In reply to this post by Gian-Carlo Pascutto
On Fri, Dec 6, 2019 at 3:40 AM Gian-Carlo Pascutto
<[hidden email]> wrote:

>
> On 4/12/2019 23:08, Rich Felker wrote:
> > The right solution probably involves using seccomp with a tracer
> > process, or if the dlopen calls are the only thing you need to
> > filter, interposing dlopen to prevent the nasty library from calling
> >  it.
>
> In the case of the EME/DRM modules, the use case is that we don't
> (necessarily) trust the (closed source, proprietary) module. It's
> shipped as a .so (currently, it really is libwidevinecdm.so), so we want
> to be sure that we only call dlopen() on it *when already inside* the
> sandbox. This in order to prevent some craftily written header (or
> STT_IFUNC shenanigans) from owning us.

To add to this: in the EME case, we have exactly one .so file, with
any dependencies pre-loaded; in this case the open/openat trap is
actually handled in-process by returning a pre-opened fd rather than
messaging a broker (for annoying reasons involving socketcall(2) and
SOCK_DGRAM, but also to minimize attack surface).

Which is to say that if we had something like FreeBSD's fdlopen(3), we
could just use it directly instead of doing seccomp tricks.

But even with fdlopen we'd still have the problem of content processes
using various dynamically/lazily loaded libraries (e.g., libavcodec
and its 38 transitive dependencies), and as previously discussed the
space of potential problems here is much larger than dlopen calling
open.  So that definitely wouldn't be a general-purpose fix, but I
thought it was worth mentioning.

--Jed
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Christian Brauner-2
In reply to this post by Jed Davis
On Fri, Dec 06, 2019 at 05:09:17PM -0700, Jed Davis wrote:

> On Thu, Dec 5, 2019 at 2:47 AM Christian Brauner
> <[hidden email]> wrote:
> > I saw this fly by the libc-alpha mailing list late at night yesterday. I
> > think with work we've recently done in the upstream kernel the
> > SECCOMP_RET_TRAP approach trapping and signaling is not needed anymore
> > at least on newer kernels. That won't help you with legacy but it's
> > still worth considering switching over.
>
> Thank you for that detailed writeup.  SECCOMP_RET_USER_NOTIF does seem
> to be capable of doing everything we're currently doing with
> SECCOMP_RET_TRAP, and it should be feasible to use SECCOMP_RET_TRAP to
> implement similar functionality for the case of old kernel +
> SIGSYS-tolerant userland to avoid excessive code duplication.  It is
> going to cause some extra context switches (anything like open() that
> needs to pass back fds, and anything that's currently handled
> in-process), but this was never meant to be on a critical path.  There
> may also be problems with external security policies, like the Snap
> package manager's confinement, that try to restrict ptrace-like

Hm, I'd need more details about this specific worry. The notifier itself
doesn't need any ptrace-like capabilities. Only if you want to
get/inject/replace file descriptors. But if that's a feature you need we
can implement this directly in the notifier.

> actions.  And, of course, rewriting everything will be a nontrivial
> project.

Yeah. As I said, we'd be happy to give help/reviews/guidance since we've
done the kernel and userspace part.

>
> Also, if Firefox needs to be SIGSYS-free before 2020-09-22, when the
> 68.x extended support branch ends, then we need to somehow find a
> “low-risk” way to implement this major change.
>
> As for legacy support: normally we can't rely on new kernel features
> like that (even SECCOMP_FILTER_FLAG_TSYNC is still too new to use
> without a fallback), but in this case there would have to be an older
> kernel combined with a significantly newer glibc (whichever version
> eventually unrecoverably breaks the use of SIGSYS) to have problems,
> so this is hopefully an exception.

Yeah, that sounds like a very unlikely combination to be honest.

Christian
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Florian Weimer-5
In reply to this post by Jed Davis
* Jed Davis:

> Also, if Firefox needs to be SIGSYS-free before 2020-09-22, when the
> 68.x extended support branch ends, then we need to somehow find a
> “low-risk” way to implement this major change.

You could perhaps probe whether the sandbox works with only minimal
local content and disable it automatically if it does not.

> As for legacy support: normally we can't rely on new kernel features
> like that (even SECCOMP_FILTER_FLAG_TSYNC is still too new to use
> without a fallback), but in this case there would have to be an older
> kernel combined with a significantly newer glibc (whichever version
> eventually unrecoverably breaks the use of SIGSYS) to have problems,
> so this is hopefully an exception.

I hope that these new seccomp interfaces are available on those systems
which benefit of them most (where applications or glibc would use
clone3, for example).

Thanks,
Florian

Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Jed Davis
On Wed, Dec 18, 2019 at 9:26 AM Florian Weimer <[hidden email]> wrote:
> I hope that these new seccomp interfaces are available on those systems
> which benefit of them most (where applications or glibc would use
> clone3, for example).

clone3 is going to be a problem with seccomp regardless, and as far as
I know there's nothing concretely planned for the kernel that will
help.  If the caller can fall back to regular clone(), the seccomp
policy can make it fail with ENOSYS via SECCOMP_RET_ERRNO, which
doesn't care about signal state.  If signals aren't blocked *and* the
clone3 call is in the subset supported by clone(), then a SIGSYS trap
could theoretically rewrite it, but we've already established that
signals may be blocked so that doesn't matter.  SECCOMP_RET_USER_NOTIF
doesn't appear to allow anything useful here, since there's no “inject
thread into other process” operation as far as I know.  (Maybe if the
supervisor could attach with ptrace and force the supervisee into a
clone() syscall… but that still doesn't help if it's a call that
actually needs clone3.)

As far as reasons to filter clone3: requiring CLONE_THREAD is slightly
important, because our seccomp-bpf policy allows tgkill (and
rt_tgsigqueueinfo) with tgid == the process's pid, so if the process
could fork and wait for its old pid to be recycled… this is relatively
minor but it is something that can be prevented when using the old
clone.  (Containing the process in its own pid namespace would prevent
that, but we don't currently require access to namespace creation as
Chrome does.)  There's also the general problem of attack surface
reduction; CLONE_NEWUSER is probably the biggest risk, but in our use
cases if the distribution allows it for unprivileged processes then
we'll use it to call chroot, which prevents any further use of
CLONE_NEWUSER (and, by extension, gaining capabilities to use other
namespace-related flags like CLONE_NEWNET).  So that's not the end of
the world, but it's not ideal, and I don't think the Chromium security
people would like it either.

To summarize, if in the future pthread_create could use clone3 with no
fallback, then there are going to be minor but fundamental problems
with sandboxing and we'd need the kernel to provide some first-class
way to handle memory parameters.  There was a high-level proposal at
[1], but there were concerns about the design raised in followup
messages, and as far as I know nothing at the level of a prototype or
a concrete plan.

--Jed

[1]: https://lore.kernel.org/ksummit-discuss/CALCETrVpbSDraiwJRmOj28wepTjEPiSDQz=DUuSig_P1rSGZ6Q@.../
Reply | Threaded
Open this post in threaded view
|

Re: Fallout from dlopen() blocking SIGSYS

Christian Brauner-2
On Wed, Dec 18, 2019 at 07:14:45PM -0700, Jed Davis wrote:

> On Wed, Dec 18, 2019 at 9:26 AM Florian Weimer <[hidden email]> wrote:
> > I hope that these new seccomp interfaces are available on those systems
> > which benefit of them most (where applications or glibc would use
> > clone3, for example).
>
> clone3 is going to be a problem with seccomp regardless, and as far as
> I know there's nothing concretely planned for the kernel that will
> help.  If the caller can fall back to regular clone(), the seccomp
> policy can make it fail with ENOSYS via SECCOMP_RET_ERRNO, which
> doesn't care about signal state.  If signals aren't blocked *and* the
> clone3 call is in the subset supported by clone(), then a SIGSYS trap
> could theoretically rewrite it, but we've already established that
> signals may be blocked so that doesn't matter.  SECCOMP_RET_USER_NOTIF

It does to some extent. As I mentioned, you can continue syscalls from
the notifier. So you can - beware of races with other threads - rewrite
the syscall arguments to what you want. You'd intercept clone3(), look
at the memory arguments and either deny, or rewrite the syscall args and
continue the task.

> doesn't appear to allow anything useful here, since there's no “inject
> thread into other process” operation as far as I know.  (Maybe if the
> supervisor could attach with ptrace and force the supervisee into a
> clone() syscall… but that still doesn't help if it's a call that
> actually needs clone3.)
>
> As far as reasons to filter clone3: requiring CLONE_THREAD is slightly
> important, because our seccomp-bpf policy allows tgkill (and
> rt_tgsigqueueinfo) with tgid == the process's pid, so if the process
> could fork and wait for its old pid to be recycled… this is relatively
> minor but it is something that can be prevented when using the old
> clone.  (Containing the process in its own pid namespace would prevent
> that, but we don't currently require access to namespace creation as
> Chrome does.)  There's also the general problem of attack surface
> reduction; CLONE_NEWUSER is probably the biggest risk, but in our use

Apart from the generic "it used to have bugs" what is your main worry
here and why are you not using namespaces to _limit_ attack surface?
There's quite a few things you can do just in terms of creating user
namespaces, dropping capabilities, coming up with very limited and
targeted id mappings (340 mappings per userns can be specified
nowadays) etc. pp.

> cases if the distribution allows it for unprivileged processes then
> we'll use it to call chroot, which prevents any further use of
> CLONE_NEWUSER (and, by extension, gaining capabilities to use other

You can limit the number of user namespaces by writing
echo 10 > /proc/sys/user/max_user_namespaces since v4.9.
Setting it to 10 will mean you can only create 10 user namespaces
globally iirc.
It's similar for all other namespaces.

> namespace-related flags like CLONE_NEWNET).  So that's not the end of
> the world, but it's not ideal, and I don't think the Chromium security
> people would like it either.
>
> To summarize, if in the future pthread_create could use clone3 with no
> fallback, then there are going to be minor but fundamental problems
> with sandboxing and we'd need the kernel to provide some first-class
> way to handle memory parameters.  There was a high-level proposal at
> [1], but there were concerns about the design raised in followup
> messages, and as far as I know nothing at the level of a prototype or
> a concrete plan.

I've mentioned in the previous message that I intend to work on this.
How urgent is this aka when would you want something for this? The
approach Kees and I settled one was to do this in a per-system call
basis for now.

Christian