Subtle bug not calling DSRs.

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Subtle bug not calling DSRs.

Sergei Organov-3
Hello,

I believe I've found bug when eCos doesn't call pended DSR(s) until
_next_ time the scheduler lock is released. The scenario I see is like
this:

1. Start single thread that is run as soon as scheduler is enabled.
2. The thread suspends itself that way or another (wait on a primitive
   or just sleep, -- it doesn't matter).
3. The (2) results in context switch from this thread to the idle thread
   from within Cyg_Scheduler::unlock_inner(0).
4. During context switch interrupt happens that is requesting
   corresponding DSR to be run. However, as scheduler is locked during
   context switch, the DSR is not invoked on exit from the interrupt.
5. Context switch continues and transfers control to the idle thread
   entry address being Cyg_HardwareThread::thread_entry(). This is
   because the idle thread had no chance to run before this point.
6. Cyg_HardwareThread::thread_entry() has no code to check for pended
   DSRs, so the DSR posted in (4) is not run.

In my case system just hangs (loops forever in idle thread) at this
point as my thread (1) in fact waits for the interrupt, but
corresponding DSR that is meant to wakeup the thread is never called.

The quick-and-dirty fix to the problem is:

--- thread.cxx 14 Dec 2005 13:48:31 -0000
+++ thread.cxx 22 Dec 2005 18:24:03 -0000
@@ -102,6 +102,9 @@
     Cyg_Scheduler::zero_sched_lock();     // the assignment into the code above.
     HAL_REORDER_BARRIER();
 
+    Cyg_Scheduler::lock();
+    Cyg_Scheduler::unlock();
+    
     // Call entry point in a loop.
 
     for(;;)

But while we are at it, it seems that entire prologue of the
Cyg_HardwareThread::thread_entry() should be moved to a new method of
the Cyg_Scheduler class as the prologue consists entirely of the
intimate details of how scheduler works. Moreover, this prologue is in
fact a slightly modified copy of the code found in the
Cyg_Scheduler::unlock_inner() executing after the line

                HAL_THREAD_SWITCH_CONTEXT( &current->stack_ptr,
                                           &next->stack_ptr );

Even if this common part can't be easily factored out, it is much better
to at least keep two slightly different copies in one place close to
each other I think.

Anyway, somebody more experienced in the eCos internal architecture
(Nick?) may wish to take a look at this and could probably come up with
a better patch.

--
Sergei.

Reply | Threaded
Open this post in threaded view
|

Re: Subtle bug not calling DSRs.

Nick Garnett
Sergei Organov <[hidden email]> writes:

> Hello,
>
> I believe I've found bug when eCos doesn't call pended DSR(s) until
> _next_ time the scheduler lock is released. The scenario I see is like
> this:
>
> 1. Start single thread that is run as soon as scheduler is enabled.
> 2. The thread suspends itself that way or another (wait on a primitive
>    or just sleep, -- it doesn't matter).
> 3. The (2) results in context switch from this thread to the idle thread
>    from within Cyg_Scheduler::unlock_inner(0).
> 4. During context switch interrupt happens that is requesting
>    corresponding DSR to be run. However, as scheduler is locked during
>    context switch, the DSR is not invoked on exit from the interrupt.
> 5. Context switch continues and transfers control to the idle thread
>    entry address being Cyg_HardwareThread::thread_entry(). This is
>    because the idle thread had no chance to run before this point.
> 6. Cyg_HardwareThread::thread_entry() has no code to check for pended
>    DSRs, so the DSR posted in (4) is not run.

Good catch.

This is a somewhat obscure bug since the number of applications that
start the scheduler and then do nothing at all is very small. However,
failing to handle DSRs across the context switch to start a thread is
clearly wrong.

>
> Even if this common part can't be easily factored out, it is much better
> to at least keep two slightly different copies in one place close to
> each other I think.

Yep, that makes sense. The contents of Cyg_Thread::thread_entry() has
become more elaborate since it was originally written. A little
refactoring now won't go amiss.

>
> Anyway, somebody more experienced in the eCos internal architecture
> (Nick?) may wish to take a look at this and could probably come up with
> a better patch.

I think the following patch fixes the bug and reorganizes things to be
a little tidier. If this proves to be the case I'll generate a
ChangeLog entry and check it in in a couple of days.


Index: kernel/current/include/sched.hxx
===================================================================
RCS file: /cvs/ecos/ecos/packages/kernel/current/include/sched.hxx,v
retrieving revision 1.12
diff -u -5 -r1.12 sched.hxx
--- kernel/current/include/sched.hxx 23 May 2002 23:06:48 -0000 1.12
+++ kernel/current/include/sched.hxx 3 Jan 2006 10:23:36 -0000
@@ -174,10 +174,13 @@
     // decrement the lock but also look for a reschedule opportunity
     static void             unlock_reschedule();
 
     // release the preemption lock without rescheduling
     static void             unlock_simple();
+
+    // perform thread startup housekeeping
+    void Cyg_Scheduler::thread_entry( Cyg_Thread *thread );
     
     // Start execution of the scheduler
     static void start() CYGBLD_ATTRIB_NORET;
 
     // Start execution of the scheduler on the current CPU
Index: kernel/current/src/common/thread.cxx
===================================================================
RCS file: /cvs/ecos/ecos/packages/kernel/current/src/common/thread.cxx,v
retrieving revision 1.20
diff -u -5 -r1.20 thread.cxx
--- kernel/current/src/common/thread.cxx 30 Jan 2003 07:02:53 -0000 1.20
+++ kernel/current/src/common/thread.cxx 3 Jan 2006 10:23:37 -0000
@@ -84,28 +84,14 @@
 void
 Cyg_HardwareThread::thread_entry( Cyg_Thread *thread )
 {
     CYG_REPORT_FUNCTION();
 
-    Cyg_Scheduler::scheduler.clear_need_reschedule(); // finished rescheduling
-    Cyg_Scheduler::scheduler.set_current_thread(thread); // restore current thread pointer
-
-    CYG_INSTRUMENT_THREAD(ENTER,thread,0);
-    
-#ifdef CYGSEM_KERNEL_SCHED_TIMESLICE
-    // Reset the timeslice counter so that this thread gets a full
-    // quantum.
-    Cyg_Scheduler::reset_timeslice_count();
-#endif
+    // Call the scheduler to do any housekeeping
+    Cyg_Scheduler::scheduler.thread_entry( thread );
     
-    // Zero the lock
-    HAL_REORDER_BARRIER ();            // Prevent the compiler from moving
-    Cyg_Scheduler::zero_sched_lock();     // the assignment into the code above.
-    HAL_REORDER_BARRIER();
-
     // Call entry point in a loop.
-
     for(;;)
     {
         thread->entry_point(thread->entry_data);
         thread->exit();
     }
@@ -1223,15 +1209,15 @@
 
 #  endif // CYGNUM_KERNEL_THREADS_IDLE_STACK_SIZE < CYGNUM_HAL_STACK_SIZE_MINIMUM
 # endif // CYGNUM_KERNEL_THREADS_IDLE_STACK_SIZE
 #endif // CYGNUM_HAL_STACK_SIZE_MINIMUM
 
-static char idle_thread_stack[CYGNUM_KERNEL_CPU_MAX][CYGNUM_KERNEL_THREADS_IDLE_STACK_SIZE];
-
 // Loop counter for debugging/housekeeping
 cyg_uint32 idle_thread_loops[CYGNUM_KERNEL_CPU_MAX];
 
+static char idle_thread_stack[CYGNUM_KERNEL_CPU_MAX][CYGNUM_KERNEL_THREADS_IDLE_STACK_SIZE];
+
 // -------------------------------------------------------------------------
 // Idle thread code.
 
 void
 idle_thread_main( CYG_ADDRESS data )
Index: kernel/current/src/sched/sched.cxx
===================================================================
RCS file: /cvs/ecos/ecos/packages/kernel/current/src/sched/sched.cxx,v
retrieving revision 1.16
diff -u -5 -r1.16 sched.cxx
--- kernel/current/src/sched/sched.cxx 9 Aug 2002 17:13:28 -0000 1.16
+++ kernel/current/src/sched/sched.cxx 3 Jan 2006 10:23:38 -0000
@@ -307,10 +307,32 @@
 
     CYG_FAIL( "Should not be executed" );
 }
 
 // -------------------------------------------------------------------------
+// Thread startup. This is called from Cyg_Thread::thread_entry() and
+// performs some housekeeping for a newly started thread.
+
+void Cyg_Scheduler::thread_entry( Cyg_Thread *thread )
+{
+    clear_need_reschedule();            // finished rescheduling
+    set_current_thread(thread);         // restore current thread pointer
+
+    CYG_INSTRUMENT_THREAD(ENTER,thread,0);
+    
+#ifdef CYGSEM_KERNEL_SCHED_TIMESLICE
+    // Reset the timeslice counter so that this thread gets a full
+    // quantum.
+    reset_timeslice_count();
+#endif
+    
+    // Finally unlock the scheduler. As well as clearing the scheduler
+    // lock this allows any pending DSRs to execute.
+    unlock();    
+}
+
+// -------------------------------------------------------------------------
 // Start the scheduler. This is called after the initial threads have been
 // created to start scheduling. It gets any other CPUs running, and then
 // enters the scheduler.
 
 void Cyg_Scheduler::start()




--
Nick Garnett                                     eCos Kernel Architect
http://www.ecoscentric.com                The eCos and RedBoot experts


Reply | Threaded
Open this post in threaded view
|

Re: Subtle bug not calling DSRs.

Sergei Organov-3
Nick Garnett <[hidden email]> writes:
[...]
> I think the following patch fixes the bug and reorganizes things to be
> a little tidier. If this proves to be the case I'll generate a
> ChangeLog entry and check it in in a couple of days.

Looks fine to me, -- I've applied the patch and it does fix the problem.

--
Sergei.