Possible sockets/fd race condition

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Possible sockets/fd race condition

Ross Wightman
Myself and another engineer I am working with have recently
encountered what appears to be some sort of sockets (BSD Stack) or fd
related race condition.

We have two threads relevant to the problem. One of them is sending
out small (~64byte) UDP multicast packets at a configurable rate of
1-100 per second. The other thread is running a basic single threaded
http server that serves one connection at a time (but serves many
small requests quite rapidly).

Running by themselves either thread can send out hundreds of thousands
of UDP packets or serve millions of http requests with no problems.
However, as soon as the two threads are both running they last for
anywhere from a few hundred to a few thousand served requests before
the UDP thread dies a horrible death trying to close its socket.

The exact crash occurs when tcp_usr_detach in tcp_usrreq.c tries to
dereference a NULL pointer (which should be pointing to a tcp control
block). This is very bizarre because we are supposed to be closing a
UDP socket. If we look at the socket itself a few stack frames down it
is a non active TCP socket with the SS_NOFDREF and SS_INCOMP flags set
and no TCP control block. It is certainly not an active UDP socket as
it should be.

Every time the crash has occurred, the http server is waiting in
unlock_inner from an splx call via various tcp calls.

The Socket/File Descriptor # for the UDP socket and the TCP socket at
the time of the crash is 4 and 5 respectively.

We have yet to pinpoint when exactly things start to go wrong. We have
however looked for possible race conditions and we noticed that the
socreate function does not have any splnet protection like most of the
other functions at that level. So we added that and after half a day
running the problem has not occurred again. However it's not clear if
we fixed the problem in a correct manner or smashed it with a blunt
hammer and masked the real cause.

Any thoughts would be much appreciated. More details follow below.

Regards,

Ross



This is the pseudo code sequence of the UDP thread.

{
socket s;

loop
{
wait on flag()

s=socket(...);

setsocketopt(s, ...);

ioctl(s,.... ) //for address change check

sendto(s, ....)

close(s)

}
}

The http server thread is running a typical TCP serving loop and
shares nothing with the UDP thread. The only interaction between the
two threads is via any locking or data sharing that occurs in the
tcp/ip stack.

The relevant portion of the UDP thread stack is as follows:

#0  0x001f01b0 in tcp_disconnect (tp=0x0)
    at /cygdrive/d/ecos/ecos-cvs
/packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:1070
 #1  0x001eedb4 in tcp_usr_detach (so=0x2cc010)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:181
#2  0x001c33ec in cyg_soclose (so=0x2cc010)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_socket.c:321
#3  0x0016f244 in bsd_close (fp=0x2b99ec)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:730
#4  0x0019c17c in fp_ucount_dec (fp=0x2b99ec)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:160
#5  0x0019c284 in fd_close (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:189
#6  0x0019c55c in cyg_fd_free (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:260
#7  0x0019f8ac in close (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/io.cxx:216
#8  0x000524b0 in Packet::PacketSend() (this=0x32d5a0)
    at src/App/Packet.cpp:141

The print out of socket (so) is: This should be a UDP socket but
instead it's an inactive TCP socket.

(gdb) print/x *(struct socket*)(0x2cc010)
$2 = {so_zone = 0x2cc180, so_type = 0x1, so_options = 0x20, so_linger = 0x0,
  so_state = 0x801, so_pcb = 0x2cb478, so_proto = 0x2a51a0,
  so_head = 0x2cc0c0, so_incomp = {tqh_first = 0x0, tqh_last = 0x0},
  so_comp = {tqh_first = 0x0, tqh_last = 0x0}, so_list = {tqe_next = 0x0,
    tqe_prev = 0x2cc0d8}, so_qlen = 0x0, so_incqlen = 0x0, so_qlimit = 0x0,
  so_timeo = 0x0, so_error = 0x0, so_sigio = 0x0, so_oobmark = 0x0,
  so_aiojobq = {tqh_first = 0x0, tqh_last = 0x2cc054}, so_rcv = {sb_cc = 0x0,
    sb_hiwat = 0x4470, sb_mbcnt = 0x0, sb_mbmax = 0x22380, sb_lowat = 0x1,
    sb_mb = 0x0, sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
    sb_timeo = 0x0}, so_snd = {sb_cc = 0x40, sb_hiwat = 0x4470,
    sb_mbcnt = 0x80, sb_mbmax = 0x22380, sb_lowat = 0x800, sb_mb = 0x2ccf80,
    sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
    sb_timeo = 0x0}, so_upcall = 0x0, so_upcallarg = 0x0, so_cred = 0x0,
  cyg_so_gencnt = 0x8646, so_emuldata = 0x0, so_accf = 0x0}

The print out of inpcb (so_pcb) is:

(gdb) print/x *(struct inpcb*)(0x2cb478)
$3 = {inp_hash = {le_next = 0x0, le_prev = 0x0}, inp_fport = 0x0,
  inp_lport = 0x0, inp_list = {le_next = 0x0, le_prev = 0x2a6e18},
  inp_flow = 0x0, inp_dependfaddr = {inp46_foreign = {ia46_pad32 = {0x0, 0x0,
        0x0}, ia46_addr4 = {s_addr = 0x0}}, inp6_foreign = {__u6_addr = {
        __u6_addr8 = {0x0 <repeats 16 times>}, __u6_addr16 = {0x0, 0x0, 0x0,
          0x0, 0x0, 0x0, 0x0, 0x0}, __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
  inp_dependladdr = {inp46_local = {ia46_pad32 = {0x0, 0x0, 0x0},
      ia46_addr4 = {s_addr = 0x0}}, inp6_local = {__u6_addr = {__u6_addr8 = {
          0x0 <repeats 16 times>}, __u6_addr16 = {0x0, 0x0, 0x0, 0x0, 0x0,
          0x0, 0x0, 0x0}, __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
  inp_ppcb = 0x0, inp_pcbinfo = 0x308310, inp_socket = 0x2cc010,
  inp_flags = 0x0, inp_dependroute = {inp4_route = {ro_rt = 0x0, ro_u = {
        rou_dst_ss = {ss_len = 0x0, ss_family = 0x0, __ss_pad1 = {0x0, 0x0,
            0x0, 0x0, 0x0, 0x0}, __ss_align = 0x0, __ss_pad2 = {
            0x0 <repeats 112 times>}}, ro_dst_sa = {sa_len = 0x0,
          sa_family = 0x0, sa_data = {0x0 <repeats 30 times>}}}},
     inp6_route = {ro_rt = 0x0, ro_u = {rou_dst_ss = {ss_len = 0x0,
          ss_family = 0x0, __ss_pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
          __ss_align = 0x0, __ss_pad2 = {0x0 <repeats 112 times>}},
         ro_dst_sa = {sa_len = 0x0, sa_family = 0x0, sa_data = {
            0x0 <repeats 30 times>}}}}}, inp_sp = 0x0, inp_vflag = 0x1,
  inp_ip_ttl = 0x40, inp_ip_p = 0x0, inp_depend4 = {inp4_ip_tos = 0x0,
     inp4_options = 0x0, inp4_moptions = 0x0}, inp_depend6 = {
    inp6_options = 0x0, inp6_inputopts = 0x0, inp6_outputopts = 0x0,
    inp6_moptions = 0x0, inp6_icmp6filt = 0x0, inp6_cksum = 0x0,
    inp6_ifindex = 0x0, inp6_hops = 0x0}, inp_portlist = {le_next = 0x0,
    le_prev = 0x0}, inp_phd = 0x0, inp_gencnt = 0x8d9d}


The relevant portion of the HTTP serving thread is:

#0  Cyg_Scheduler::unlock_inner(unsigned) (new_lock=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sched/sched.cxx:220
#1  0x00266858 in Cyg_Scheduler::unlock() ()
    at /cygdrive/d/work/firmware/micro/ecos/conf/avppc1/ecos/install/include/cyg/kernel/sched.inl:100
#2  0x001868cc in Cyg_Mutex::unlock() (this=0x2ff580)
    at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sync/mutex.cxx:445
#3  0x0017e470 in cyg_mutex_unlock (mutex=0x2ff580)
    at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/common/kapi.cxx:1081
#4  0x001a6048 in cyg_splx (old_state=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/ecos/synch.c:209
#5  0x001c43d8 in cyg_sosend (so=0x2cbeb0, addr=0x0, uio=0x2b8310,
    top=0x2cce80, control=0x0, flags=0, p=0x0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_socket.c:622
 #6  0x0016f9dc in bsd_sendit (fp=0x2b99cc, mp=0x2b83b0, flags=0,
    retsize=0x2b83a0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:1039
#7  0x0016eb64 in bsd_sendmsg (fp=0x2b99cc, m=0x2b83b0, flags=0,
    retsize=0x2b83a0)
    at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:572
#8  0x001a2450 in sendto (s=4, buf=0x32d7fc, len=543, flags=0, to=0x0, tolen=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:612
#9  0x001a251c in send (s=4, buf=0x32d7fc, len=543, flags=0)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:629
#10 0x00143694 in fsend (pData=0x32d5e8, len=543) at src/Interface/serv.cpp:518

--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

Reply | Threaded
Open this post in threaded view
|

RE: Possible sockets/fd race condition

Jay Foster-4
Do you have ASSERTions turned on.  This could help identify a problem
earlier.

I have seen other similar kinds of problems (with assertions turned on), but
have not spent any time trying to resolve them yet.  Assertions such as:

eCos panic: cyg_fp_free() will close fd
eCos panic: sbdrop
eCos panic: soaccept: !NOFDREF
eCos panic: sofree: not queued
eCos panic: soclose: NOFDREF
eCos panic: Locking mutex I already own
eCos panic: sbflush: locked

I did a search of the eCos archives for these panics, and came accross this
(http://sourceware.org/ml/ecos-discuss/2000-04/msg00153.html)

I could find no evidence that this issue was ever addressed.

Jay

-----Original Message-----
From: Ross Wightman [mailto:[hidden email]]
Sent: Wednesday, November 16, 2005 12:45 AM
To: [hidden email]
Subject: [ECOS] Possible sockets/fd race condition


Myself and another engineer I am working with have recently
encountered what appears to be some sort of sockets (BSD Stack) or fd
related race condition.

We have two threads relevant to the problem. One of them is sending
out small (~64byte) UDP multicast packets at a configurable rate of
1-100 per second. The other thread is running a basic single threaded
http server that serves one connection at a time (but serves many
small requests quite rapidly).

Running by themselves either thread can send out hundreds of thousands
of UDP packets or serve millions of http requests with no problems.
However, as soon as the two threads are both running they last for
anywhere from a few hundred to a few thousand served requests before
the UDP thread dies a horrible death trying to close its socket.

The exact crash occurs when tcp_usr_detach in tcp_usrreq.c tries to
dereference a NULL pointer (which should be pointing to a tcp control
block). This is very bizarre because we are supposed to be closing a
UDP socket. If we look at the socket itself a few stack frames down it
is a non active TCP socket with the SS_NOFDREF and SS_INCOMP flags set
and no TCP control block. It is certainly not an active UDP socket as
it should be.

Every time the crash has occurred, the http server is waiting in
unlock_inner from an splx call via various tcp calls.

The Socket/File Descriptor # for the UDP socket and the TCP socket at
the time of the crash is 4 and 5 respectively.

We have yet to pinpoint when exactly things start to go wrong. We have
however looked for possible race conditions and we noticed that the
socreate function does not have any splnet protection like most of the
other functions at that level. So we added that and after half a day
running the problem has not occurred again. However it's not clear if
we fixed the problem in a correct manner or smashed it with a blunt
hammer and masked the real cause.

Any thoughts would be much appreciated. More details follow below.

Regards,

Ross



This is the pseudo code sequence of the UDP thread.

{
socket s;

loop
{
wait on flag()

s=socket(...);

setsocketopt(s, ...);

ioctl(s,.... ) //for address change check

sendto(s, ....)

close(s)

}
}

The http server thread is running a typical TCP serving loop and
shares nothing with the UDP thread. The only interaction between the
two threads is via any locking or data sharing that occurs in the
tcp/ip stack.

The relevant portion of the UDP thread stack is as follows:

#0  0x001f01b0 in tcp_disconnect (tp=0x0)
    at /cygdrive/d/ecos/ecos-cvs
/packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:1070
 #1  0x001eedb4 in tcp_usr_detach (so=0x2cc010)
    at
/cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/netinet/tcp
_usrreq.c:181
#2  0x001c33ec in cyg_soclose (so=0x2cc010)
    at
/cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_s
ocket.c:321
#3  0x0016f244 in bsd_close (fp=0x2b99ec)
    at
/cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio
.c:730
#4  0x0019c17c in fp_ucount_dec (fp=0x2b99ec)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:160
#5  0x0019c284 in fd_close (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:189
#6  0x0019c55c in cyg_fd_free (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:260
#7  0x0019f8ac in close (fd=5)
    at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/io.cxx:216
#8  0x000524b0 in Packet::PacketSend() (this=0x32d5a0)
    at src/App/Packet.cpp:141

The print out of socket (so) is: This should be a UDP socket but
instead it's an inactive TCP socket.

(gdb) print/x *(struct socket*)(0x2cc010)
$2 = {so_zone = 0x2cc180, so_type = 0x1, so_options = 0x20, so_linger = 0x0,
  so_state = 0x801, so_pcb = 0x2cb478, so_proto = 0x2a51a0,
  so_head = 0x2cc0c0, so_incomp = {tqh_first = 0x0, tqh_last = 0x0},
  so_comp = {tqh_first = 0x0, tqh_last = 0x0}, so_list = {tqe_next = 0x0,
    tqe_prev = 0x2cc0d8}, so_qlen = 0x0, so_incqlen = 0x0, so_qlimit = 0x0,
  so_timeo = 0x0, so_error = 0x0, so_sigio = 0x0, so_oobmark = 0x0,
  so_aiojobq = {tqh_first = 0x0, tqh_last = 0x2cc054}, so_rcv = {sb_cc =
0x0,
    sb_hiwat = 0x4470, sb_mbcnt = 0x0, sb_mbmax = 0x22380, sb_lowat = 0x1,
    sb_mb = 0x0, sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags =
0x0,
    sb_timeo = 0x0}, so_snd = {sb_cc = 0x40, sb_hiwat = 0x4470,
    sb_mbcnt = 0x80, sb_mbmax = 0x22380, sb_lowat = 0x800, sb_mb = 0x2ccf80,
    sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
    sb_timeo = 0x0}, so_upcall = 0x0, so_upcallarg = 0x0, so_cred = 0x0,
  cyg_so_gencnt = 0x8646, so_emuldata = 0x0, so_accf = 0x0}

The print out of inpcb (so_pcb) is:

(gdb) print/x *(struct inpcb*)(0x2cb478)
$3 = {inp_hash = {le_next = 0x0, le_prev = 0x0}, inp_fport = 0x0,
  inp_lport = 0x0, inp_list = {le_next = 0x0, le_prev = 0x2a6e18},
  inp_flow = 0x0, inp_dependfaddr = {inp46_foreign = {ia46_pad32 = {0x0,
0x0,
        0x0}, ia46_addr4 = {s_addr = 0x0}}, inp6_foreign = {__u6_addr = {
        __u6_addr8 = {0x0 <repeats 16 times>}, __u6_addr16 = {0x0, 0x0, 0x0,
          0x0, 0x0, 0x0, 0x0, 0x0}, __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
  inp_dependladdr = {inp46_local = {ia46_pad32 = {0x0, 0x0, 0x0},
      ia46_addr4 = {s_addr = 0x0}}, inp6_local = {__u6_addr = {__u6_addr8 =
{
          0x0 <repeats 16 times>}, __u6_addr16 = {0x0, 0x0, 0x0, 0x0, 0x0,
          0x0, 0x0, 0x0}, __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
  inp_ppcb = 0x0, inp_pcbinfo = 0x308310, inp_socket = 0x2cc010,
  inp_flags = 0x0, inp_dependroute = {inp4_route = {ro_rt = 0x0, ro_u = {
        rou_dst_ss = {ss_len = 0x0, ss_family = 0x0, __ss_pad1 = {0x0, 0x0,
            0x0, 0x0, 0x0, 0x0}, __ss_align = 0x0, __ss_pad2 = {
            0x0 <repeats 112 times>}}, ro_dst_sa = {sa_len = 0x0,
          sa_family = 0x0, sa_data = {0x0 <repeats 30 times>}}}},
     inp6_route = {ro_rt = 0x0, ro_u = {rou_dst_ss = {ss_len = 0x0,
          ss_family = 0x0, __ss_pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
          __ss_align = 0x0, __ss_pad2 = {0x0 <repeats 112 times>}},
         ro_dst_sa = {sa_len = 0x0, sa_family = 0x0, sa_data = {
            0x0 <repeats 30 times>}}}}}, inp_sp = 0x0, inp_vflag = 0x1,
  inp_ip_ttl = 0x40, inp_ip_p = 0x0, inp_depend4 = {inp4_ip_tos = 0x0,
     inp4_options = 0x0, inp4_moptions = 0x0}, inp_depend6 = {
    inp6_options = 0x0, inp6_inputopts = 0x0, inp6_outputopts = 0x0,
    inp6_moptions = 0x0, inp6_icmp6filt = 0x0, inp6_cksum = 0x0,
    inp6_ifindex = 0x0, inp6_hops = 0x0}, inp_portlist = {le_next = 0x0,
    le_prev = 0x0}, inp_phd = 0x0, inp_gencnt = 0x8d9d}


The relevant portion of the HTTP serving thread is:

#0  Cyg_Scheduler::unlock_inner(unsigned) (new_lock=0)
    at
/cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sched/sched.cxx:220
#1  0x00266858 in Cyg_Scheduler::unlock() ()
    at
/cygdrive/d/work/firmware/micro/ecos/conf/avppc1/ecos/install/include/cyg/ke
rnel/sched.inl:100
#2  0x001868cc in Cyg_Mutex::unlock() (this=0x2ff580)
    at
/cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sync/mutex.cxx:445
#3  0x0017e470 in cyg_mutex_unlock (mutex=0x2ff580)
    at
/cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/common/kapi.cxx:1081
#4  0x001a6048 in cyg_splx (old_state=0)
    at
/cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/ecos/synch.c:20
9
#5  0x001c43d8 in cyg_sosend (so=0x2cbeb0, addr=0x0, uio=0x2b8310,
    top=0x2cce80, control=0x0, flags=0, p=0x0)
    at
/cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_s
ocket.c:622
 #6  0x0016f9dc in bsd_sendit (fp=0x2b99cc, mp=0x2b83b0, flags=0,
    retsize=0x2b83a0)
    at
/cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio
.c:1039
#7  0x0016eb64 in bsd_sendmsg (fp=0x2b99cc, m=0x2b83b0, flags=0,
    retsize=0x2b83a0)
    at
/cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio
.c:572
#8  0x001a2450 in sendto (s=4, buf=0x32d7fc, len=543, flags=0, to=0x0,
tolen=0)
    at
/cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:612
#9  0x001a251c in send (s=4, buf=0x32d7fc, len=543, flags=0)
    at
/cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:629
#10 0x00143694 in fsend (pData=0x32d5e8, len=543) at
src/Interface/serv.cpp:518

--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

Reply | Threaded
Open this post in threaded view
|

RE: Possible sockets/fd race condition

Gary Thomas
On Wed, 2005-11-16 at 08:16 -0800, Jay Foster wrote:

> Do you have ASSERTions turned on.  This could help identify a problem
> earlier.
>
> I have seen other similar kinds of problems (with assertions turned on), but
> have not spent any time trying to resolve them yet.  Assertions such as:
>
> eCos panic: cyg_fp_free() will close fd
> eCos panic: sbdrop
> eCos panic: soaccept: !NOFDREF
> eCos panic: sofree: not queued
> eCos panic: soclose: NOFDREF
> eCos panic: Locking mutex I already own
> eCos panic: sbflush: locked
>
> I did a search of the eCos archives for these panics, and came accross this
> (http://sourceware.org/ml/ecos-discuss/2000-04/msg00153.html)
>
> I could find no evidence that this issue was ever addressed.
>

While there may not be any explicit mention of addressing these issues,
let me assure you that they were - at great length.  After that message
was written, we imported a whole new stack based on FreeBSD and also
made considerable changes in how the whole thing fit into eCos.  The
end result is that [at least hopefully] the issues Grant was
experiencing were handled.

The race mentioned below may indeed exist.  If protecting the basic call
with splnet() works, I don't see a problem.

Note: this particular problem may only show up by way of the way Ross'
program works.  It is unusual to create and destroy sockets for every
packet (which is what the pseudo-code implies) and this would then
exercise any failures which might exist in the socreate() path.

> Jay
>
> -----Original Message-----
> From: Ross Wightman [mailto:[hidden email]]
> Sent: Wednesday, November 16, 2005 12:45 AM
> To: [hidden email]
> Subject: [ECOS] Possible sockets/fd race condition
>
>
> Myself and another engineer I am working with have recently
> encountered what appears to be some sort of sockets (BSD Stack) or fd
> related race condition.
>
> We have two threads relevant to the problem. One of them is sending
> out small (~64byte) UDP multicast packets at a configurable rate of
> 1-100 per second. The other thread is running a basic single threaded
> http server that serves one connection at a time (but serves many
> small requests quite rapidly).
>
> Running by themselves either thread can send out hundreds of thousands
> of UDP packets or serve millions of http requests with no problems.
> However, as soon as the two threads are both running they last for
> anywhere from a few hundred to a few thousand served requests before
> the UDP thread dies a horrible death trying to close its socket.
>
> The exact crash occurs when tcp_usr_detach in tcp_usrreq.c tries to
> dereference a NULL pointer (which should be pointing to a tcp control
> block). This is very bizarre because we are supposed to be closing a
> UDP socket. If we look at the socket itself a few stack frames down it
> is a non active TCP socket with the SS_NOFDREF and SS_INCOMP flags set
> and no TCP control block. It is certainly not an active UDP socket as
> it should be.
>
> Every time the crash has occurred, the http server is waiting in
> unlock_inner from an splx call via various tcp calls.
>
> The Socket/File Descriptor # for the UDP socket and the TCP socket at
> the time of the crash is 4 and 5 respectively.
>
> We have yet to pinpoint when exactly things start to go wrong. We have
> however looked for possible race conditions and we noticed that the
> socreate function does not have any splnet protection like most of the
> other functions at that level. So we added that and after half a day
> running the problem has not occurred again. However it's not clear if
> we fixed the problem in a correct manner or smashed it with a blunt
> hammer and masked the real cause.
>
> Any thoughts would be much appreciated. More details follow below.
>
> Regards,
>
> Ross
>
>
>
> This is the pseudo code sequence of the UDP thread.
>
> {
> socket s;
>
> loop
> {
> wait on flag()
>
> s=socket(...);
>
> setsocketopt(s, ...);
>
> ioctl(s,.... ) //for address change check
>
> sendto(s, ....)
>
> close(s)
>
> }
> }
>
> The http server thread is running a typical TCP serving loop and
> shares nothing with the UDP thread. The only interaction between the
> two threads is via any locking or data sharing that occurs in the
> tcp/ip stack.
>
> The relevant portion of the UDP thread stack is as follows:
>
> #0  0x001f01b0 in tcp_disconnect (tp=0x0)
>     at /cygdrive/d/ecos/ecos-cvs
> /packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:1070
>  #1  0x001eedb4 in tcp_usr_detach (so=0x2cc010)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/netinet/tcp
> _usrreq.c:181
> #2  0x001c33ec in cyg_soclose (so=0x2cc010)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_s
> ocket.c:321
> #3  0x0016f244 in bsd_close (fp=0x2b99ec)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio
> .c:730
> #4  0x0019c17c in fp_ucount_dec (fp=0x2b99ec)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:160
> #5  0x0019c284 in fd_close (fd=5)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:189
> #6  0x0019c55c in cyg_fd_free (fd=5)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:260
> #7  0x0019f8ac in close (fd=5)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/io.cxx:216
> #8  0x000524b0 in Packet::PacketSend() (this=0x32d5a0)
>     at src/App/Packet.cpp:141
>
> The print out of socket (so) is: This should be a UDP socket but
> instead it's an inactive TCP socket.
>
> (gdb) print/x *(struct socket*)(0x2cc010)
> $2 = {so_zone = 0x2cc180, so_type = 0x1, so_options = 0x20, so_linger = 0x0,
>   so_state = 0x801, so_pcb = 0x2cb478, so_proto = 0x2a51a0,
>   so_head = 0x2cc0c0, so_incomp = {tqh_first = 0x0, tqh_last = 0x0},
>   so_comp = {tqh_first = 0x0, tqh_last = 0x0}, so_list = {tqe_next = 0x0,
>     tqe_prev = 0x2cc0d8}, so_qlen = 0x0, so_incqlen = 0x0, so_qlimit = 0x0,
>   so_timeo = 0x0, so_error = 0x0, so_sigio = 0x0, so_oobmark = 0x0,
>   so_aiojobq = {tqh_first = 0x0, tqh_last = 0x2cc054}, so_rcv = {sb_cc =
> 0x0,
>     sb_hiwat = 0x4470, sb_mbcnt = 0x0, sb_mbmax = 0x22380, sb_lowat = 0x1,
>     sb_mb = 0x0, sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags =
> 0x0,
>     sb_timeo = 0x0}, so_snd = {sb_cc = 0x40, sb_hiwat = 0x4470,
>     sb_mbcnt = 0x80, sb_mbmax = 0x22380, sb_lowat = 0x800, sb_mb = 0x2ccf80,
>     sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
>     sb_timeo = 0x0}, so_upcall = 0x0, so_upcallarg = 0x0, so_cred = 0x0,
>   cyg_so_gencnt = 0x8646, so_emuldata = 0x0, so_accf = 0x0}
>
> The print out of inpcb (so_pcb) is:
>
> (gdb) print/x *(struct inpcb*)(0x2cb478)
> $3 = {inp_hash = {le_next = 0x0, le_prev = 0x0}, inp_fport = 0x0,
>   inp_lport = 0x0, inp_list = {le_next = 0x0, le_prev = 0x2a6e18},
>   inp_flow = 0x0, inp_dependfaddr = {inp46_foreign = {ia46_pad32 = {0x0,
> 0x0,
>         0x0}, ia46_addr4 = {s_addr = 0x0}}, inp6_foreign = {__u6_addr = {
>         __u6_addr8 = {0x0 <repeats 16 times>}, __u6_addr16 = {0x0, 0x0, 0x0,
>           0x0, 0x0, 0x0, 0x0, 0x0}, __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
>   inp_dependladdr = {inp46_local = {ia46_pad32 = {0x0, 0x0, 0x0},
>       ia46_addr4 = {s_addr = 0x0}}, inp6_local = {__u6_addr = {__u6_addr8 =
> {
>           0x0 <repeats 16 times>}, __u6_addr16 = {0x0, 0x0, 0x0, 0x0, 0x0,
>           0x0, 0x0, 0x0}, __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
>   inp_ppcb = 0x0, inp_pcbinfo = 0x308310, inp_socket = 0x2cc010,
>   inp_flags = 0x0, inp_dependroute = {inp4_route = {ro_rt = 0x0, ro_u = {
>         rou_dst_ss = {ss_len = 0x0, ss_family = 0x0, __ss_pad1 = {0x0, 0x0,
>             0x0, 0x0, 0x0, 0x0}, __ss_align = 0x0, __ss_pad2 = {
>             0x0 <repeats 112 times>}}, ro_dst_sa = {sa_len = 0x0,
>           sa_family = 0x0, sa_data = {0x0 <repeats 30 times>}}}},
>      inp6_route = {ro_rt = 0x0, ro_u = {rou_dst_ss = {ss_len = 0x0,
>           ss_family = 0x0, __ss_pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>           __ss_align = 0x0, __ss_pad2 = {0x0 <repeats 112 times>}},
>          ro_dst_sa = {sa_len = 0x0, sa_family = 0x0, sa_data = {
>             0x0 <repeats 30 times>}}}}}, inp_sp = 0x0, inp_vflag = 0x1,
>   inp_ip_ttl = 0x40, inp_ip_p = 0x0, inp_depend4 = {inp4_ip_tos = 0x0,
>      inp4_options = 0x0, inp4_moptions = 0x0}, inp_depend6 = {
>     inp6_options = 0x0, inp6_inputopts = 0x0, inp6_outputopts = 0x0,
>     inp6_moptions = 0x0, inp6_icmp6filt = 0x0, inp6_cksum = 0x0,
>     inp6_ifindex = 0x0, inp6_hops = 0x0}, inp_portlist = {le_next = 0x0,
>     le_prev = 0x0}, inp_phd = 0x0, inp_gencnt = 0x8d9d}
>
>
> The relevant portion of the HTTP serving thread is:
>
> #0  Cyg_Scheduler::unlock_inner(unsigned) (new_lock=0)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sched/sched.cxx:220
> #1  0x00266858 in Cyg_Scheduler::unlock() ()
>     at
> /cygdrive/d/work/firmware/micro/ecos/conf/avppc1/ecos/install/include/cyg/ke
> rnel/sched.inl:100
> #2  0x001868cc in Cyg_Mutex::unlock() (this=0x2ff580)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sync/mutex.cxx:445
> #3  0x0017e470 in cyg_mutex_unlock (mutex=0x2ff580)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/common/kapi.cxx:1081
> #4  0x001a6048 in cyg_splx (old_state=0)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/ecos/synch.c:20
> 9
> #5  0x001c43d8 in cyg_sosend (so=0x2cbeb0, addr=0x0, uio=0x2b8310,
>     top=0x2cce80, control=0x0, flags=0, p=0x0)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_s
> ocket.c:622
>  #6  0x0016f9dc in bsd_sendit (fp=0x2b99cc, mp=0x2b83b0, flags=0,
>     retsize=0x2b83a0)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio
> .c:1039
> #7  0x0016eb64 in bsd_sendmsg (fp=0x2b99cc, m=0x2b83b0, flags=0,
>     retsize=0x2b83a0)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio
> .c:572
> #8  0x001a2450 in sendto (s=4, buf=0x32d7fc, len=543, flags=0, to=0x0,
> tolen=0)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:612
> #9  0x001a251c in send (s=4, buf=0x32d7fc, len=543, flags=0)
>     at
> /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:629
> #10 0x00143694 in fsend (pData=0x32d5e8, len=543) at
> src/Interface/serv.cpp:518
>
> --
> Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
> and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

--
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

Reply | Threaded
Open this post in threaded view
|

Re: Possible sockets/fd race condition

Grant Edwards
In reply to this post by Jay Foster-4
In gmane.os.ecos.general, you wrote:

> I did a search of the eCos archives for these panics, and came accross this
> (http://sourceware.org/ml/ecos-discuss/2000-04/msg00153.html)
>
> I could find no evidence that this issue was ever addressed.

That problem (race conditions accessing skbuf structures) was
fixed within a day or two of my reporting it. That problem was
with the "old" network stack, and I suppose it could have
resurfaced in the "new" one.

--
Grant Edwards                   grante             Yow!  My mind is making
                                  at               ashtrays in Dayton....
                               visi.com            

--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

Reply | Threaded
Open this post in threaded view
|

Re: Possible sockets/fd race condition

Ross Wightman
In reply to this post by Gary Thomas
On 11/16/05, Gary Thomas <[hidden email]> wrote:

> On Wed, 2005-11-16 at 08:16 -0800, Jay Foster wrote:
> > Do you have ASSERTions turned on.  This could help identify a problem
> > earlier.
> >
> > I have seen other similar kinds of problems (with assertions turned on), but
> > have not spent any time trying to resolve them yet.  Assertions such as:
> >
> > eCos panic: cyg_fp_free() will close fd
> > eCos panic: sbdrop
> > eCos panic: soaccept: !NOFDREF
> > eCos panic: sofree: not queued
> > eCos panic: soclose: NOFDREF
> > eCos panic: Locking mutex I already own
> > eCos panic: sbflush: locked

Asserts are enabled. I have yet to see any asserts or panics fire.
Just an exception when the NULL pointer is dereferenced.

>
> The race mentioned below may indeed exist.  If protecting the basic call
> with splnet() works, I don't see a problem.

So far so good with the splnet in socreate. We've been hammering it
for 12+ hours now.

> Note: this particular problem may only show up by way of the way Ross'
> program works.  It is unusual to create and destroy sockets for every
> packet (which is what the pseudo-code implies) and this would then
> exercise any failures which might exist in the socreate() path.
>
> > Jay

The current state of our program does indeed stress such issues. We
also cranked up the rate of UDP sends to increase the probability of
the crash happening after we first saw it. Normally it should be
sending only one per second so the create/destroy of the socket
wouldn't be a performance issue. However the crash occurs in either
case, it's just a difference of waiting half a day or a few min for it
to occur.

--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss