Scheduler startup question

Michael Jones
I have a question about the proper startup behavior of scheduler locking.

For context, I am cleaning up my iMX6 HAL and trying to make it work without a couple of kernel hacks I added earlier to get it running.

The question has to do with sched_lock. By default this has a value of 1, so during startup the scheduler is locked.

When there is an interrupt, sched_lock is incremented in Vectors.S, and decremented in interrupt_end.

However, I am hitting an assertion in sync.h, which is part of the BSD stack. The assertion fires because it expects the lock count to be zero.

The question is, during the startup process, how does the lock get set to zero after initialization? Is it supposed to stay 1 while hardware is initialized and through all the constructors, etc? Is it cleared by the scheduler somehow? Is the HAL supposed to zero it at some point during startup?
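A toy model may help frame the question. The sketch below is an illustration under stated assumptions, not eCos code: the struct and member names are hypothetical, and it only models the counting described above (initial value 1, increment on interrupt entry, decrement in interrupt_end). It suggests that balanced interrupt entry and exit leave the count at 1, so some separate step has to release the startup lock before code that asserts a zero count can run.

```cpp
#include <cassert>

// Toy model of the lock counter described in the post; NOT the eCos implementation.
// All names (SchedLockModel, irq_enter, irq_end, ...) are hypothetical.
struct SchedLockModel {
    int sched_lock = 1;                  // starts at 1: scheduler locked during startup

    void irq_enter() { ++sched_lock; }   // models the increment done in the vector code

    void irq_end() {                     // models the decrement done in interrupt_end
        assert(sched_lock > 0);
        --sched_lock;
    }

    void unlock() {                      // models an explicit scheduler unlock
        assert(sched_lock > 0);
        --sched_lock;
    }

    // The kind of precondition the network stack is described as asserting.
    bool is_unlocked() const { return sched_lock == 0; }
};
```

In this model, any number of interrupts during startup returns the counter to 1, and only an explicit unlock brings it to 0.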

My HAL is part of the ARM HAL, so if this is device-specific, it is the ARM HAL I am working with.

Mike
--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

Reply | Threaded
Open this post in threaded view
|

Re: Scheduler startup question

Lambrecht Jürgen
As far as I know, the scheduler is started after cyg_user_start(), which your application uses to initialize everything. Do you use cyg_user_start()?


Re: Scheduler startup question

Michael Jones
By tracing through a Cortex-M application that works, I found that when the first thread is run, there is a loop at the bottom of the thread entry call that calls unlock until sched_lock is 0. Every thread entry does this.

This seems a bit dangerous to me, as the unlocking occurs whenever a new thread is first entered. I have to assume that a thread cannot be entered while the system is in a critical section between scheduler lock and unlock calls.
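The behavior described above can be sketched as follows. This is a minimal model with hypothetical names (ToyScheduler, toy_thread_entry), not the kernel's actual thread entry code: it only shows a loop that keeps unlocking until the count reaches zero before the thread body runs, whatever the count was on entry.

```cpp
#include <cassert>

// Hypothetical stand-in for the scheduler lock; not the real eCos classes.
struct ToyScheduler {
    int sched_lock = 1;                  // locked while the system starts up

    void lock() { ++sched_lock; }

    void unlock() {
        assert(sched_lock > 0);
        --sched_lock;
    }
};

// Models the loop seen at the bottom of the thread entry path:
// keep unlocking until the count is zero, then run the thread body.
template <typename Body>
void toy_thread_entry(ToyScheduler& sched, Body body) {
    while (sched.sched_lock != 0) {
        sched.unlock();
    }
    body();                              // the body always starts with a count of zero
}
```

The sketch makes the concern visible: the loop discards any nesting depth, so it is only safe if a new thread can never be entered while some other code still relies on the lock being held.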

I'll look into that behavior and see if it is related to my BSD assertion.



In reply to Lambrecht Jürgen's message of Feb 26, 2014, 11:40 PM.

Re: Scheduler startup question

Michael Jones
In reply to this post by Lambrecht Jürgen
Jurgen,

I think I now fully understand how scheduler locking works during interrupts. Vectors.S takes the lock, and interrupt_end releases it. However, the normal technique of incrementing the lock count does not work with SMP. The problem is that another CPU may hold the lock. Incrementing anyway leads to assertions. Spinning on the spinlock until it is free can lead to deadlocks or an unresponsive network application.

So I changed Vectors.S so that, during an interrupt, it only attempts to take the lock. This means trying a spinlock acquisition that might fail. If the lock is acquired, interrupt_end is called. If the attempt fails, interrupt_end is not called.

This means that a DSR may not be posted on that interrupt. This can add latency that depends on the real-time clock interrupt rate or on the time until the next thread switch. However, it is stable and assertion-free. Also, a HAL could implement a timeout on the try-lock, which might reduce latency.
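A rough sketch of the try-lock idea follows. It is not the code from the wiki page: the names are hypothetical, std::atomic<bool> stands in for the HAL spinlock, and the pending counter is an assumption added here to make the deferred work visible.

```cpp
#include <atomic>
#include <cassert>

// Sketch only: hypothetical names, std::atomic<bool> stands in for the HAL spinlock.
struct ToyIrqPath {
    std::atomic<bool> held{false};   // true while some core holds the scheduler lock
    int pending = 0;                 // interrupts whose end-of-interrupt work is still owed
    int dsrs_posted = 0;             // work actually carried out

    // Non-blocking acquire: returns true only if the lock was free.
    bool try_lock() { return !held.exchange(true, std::memory_order_acquire); }

    void unlock() { held.store(false, std::memory_order_release); }

    // Called once per interrupt. Returns true if the end-of-interrupt work ran.
    bool on_interrupt() {
        ++pending;
        if (!try_lock()) {
            return false;            // lock held elsewhere: skip, catch up on a later interrupt
        }
        dsrs_posted += pending;      // models interrupt_end posting the deferred work
        pending = 0;
        unlock();
        return true;
    }
};
```

The interrupt path never spins, so it cannot deadlock on the lock, at the cost of delaying the skipped work until a later interrupt succeeds.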

To support the try-lock and to test whether the lock was taken, I had to add some functions to the kernel. The following wiki page has been updated to reflect the kernel changes.

https://sourceforge.net/p/ecosfreescale/wiki/SMP%20Kernel/

Anyone with SMP knowledge might want to take a look. There may be better solutions to some of these problems. But at least for now, the iMX6 SMP HAL seems stable and I can run I/O-intensive Lua scripts over telnet reliably, even when the client aborts.

A client abort means telnet has to kill a thread. This was quite a challenge. Telnet creates a separate heap for Lua so that it can kill the thread and reclaim the memory. The remaining problem is closing file handles. I still occasionally get assertions when a handle is closed by a thread that does not own it. I don't think that can be solved without adding some new functions dedicated to cleanup of file handles by an outside thread.
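One way such a separate heap could look is sketched below. This is an assumption for illustration, not the telnet code from the post: SessionArena is a hypothetical bump allocator whose callback has the same parameter shape as Lua's lua_Alloc type (user data, old pointer, old size, new size), so it could be handed to lua_newstate. When the thread is killed, the whole arena is reset in one step instead of freeing objects individually.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Sketch of a per-session arena; hypothetical names, not the code from the post.
class SessionArena {
public:
    explicit SessionArena(std::size_t bytes) : buf_(bytes), used_(0) {}

    // Same parameter shape as a lua_Alloc callback:
    // nsize == 0 means free, otherwise allocate or resize to nsize bytes.
    static void* alloc(void* ud, void* ptr, std::size_t osize, std::size_t nsize) {
        SessionArena* self = static_cast<SessionArena*>(ud);
        if (nsize == 0) return nullptr;              // individual frees are no-ops
        if (ptr != nullptr && nsize <= osize) {
            return ptr;                              // shrinking never fails
        }
        void* fresh = self->bump(nsize);
        if (fresh != nullptr && ptr != nullptr) {
            std::memcpy(fresh, ptr, osize);          // growing: copy the old contents
        }
        return fresh;                                // nullptr if the arena is exhausted
    }

    void reset() { used_ = 0; }                      // reclaim everything at once
    std::size_t used() const { return used_; }

private:
    void* bump(std::size_t n) {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start > buf_.size() || n > buf_.size() - start) return nullptr;
        used_ = start + n;
        return buf_.data() + start;
    }

    std::vector<unsigned char> buf_;
    std::size_t used_;
};
```

A bump arena wastes memory on every resize, so a real implementation would more likely wrap a proper heap. The point of the sketch is only that memory owned by the killed thread can be reclaimed without that thread's cooperation, which is exactly what file handles lack.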

Mike



Re: Scheduler startup question

Christophe Coutand
Hi Michael,

I may be misremembering, but I think that for an SMP target the lock is
not taken in Vectors.S but directly after entering interrupt_end. Of
course, this is spinlock-based, so it might delay posting/scheduling of
the DSR.

Christophe

In reply to Michael Jones's message of 3/2/2014, 9:19 PM.

Re: Scheduler startup question

Michael Jones
Christophe,

When I first got SMP to work, I added some code in interrupt_end to take the lock, but I moved it back to Vectors.S because I was trying to reduce changes to the kernel. Functionally, the only difference is whether the lock is taken before or after the ISR executes.

My bigger concern is how the lock is taken. When I increase the lock count, the core doing so (core 0) may not be the holder of the lock, which leads to assertions. And if it spins while taking the lock, it deadlocks. I have not tracked down the deadlock, but I think the problem is in the scheduler, where some secondary CPU is waiting.

My current solution is to use a trylock in Vectors.S and live with the fact that when it fails, it will take another real-time clock interrupt to try again. So interrupt_end is not guaranteed to be called on each interrupt. This keeps things simple. All interrupts go to core 0 except inter-CPU interrupts. Some latency is added because taking the lock is not guaranteed.

Other ways to handle this are to send interrupts to all cores, use inter-core interrupts, etc., in an effort to guarantee that the lock is incremented by the core that holds it.

I was not able to figure out how i386 handled this. Does anyone know how the i386 SMP port incremented the lock if the core that got the interrupt did not hold the lock?

Mike


In reply to Christophe's message of Mar 4, 2014, 8:37 AM.

Re: Scheduler startup question

Christophe Coutand
Michael,

I am not sure what you mean by adding code in interrupt_end to take the
lock. The locking mechanism is already present for SMP targets; no change is required:

externC void
interrupt_end(
     cyg_uint32          isr_ret,
     Cyg_Interrupt       *intr,
     HAL_SavedRegisters  *regs
     )
{
//    CYG_REPORT_FUNCTION();

#ifdef CYGPKG_KERNEL_SMP_SUPPORT
     Cyg_Scheduler::lock();
#endif

The macro for incrementing the lock in SMP looks at the current owner of
the lock and spins when required.
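The owner check described above can be sketched roughly as follows. This is an illustrative model with hypothetical names, not the eCos SMP macros: the lock records which CPU owns it, a nested lock() on the owning CPU only increments the count, and any other CPU spins until the owner has fully released it. A non-blocking try_lock() is included to contrast with the spinning variant.

```cpp
#include <atomic>
#include <cassert>

// Sketch of an owner-aware, counted scheduler lock. Hypothetical names;
// the real eCos SMP macros are not reproduced here.
class OwnedSchedLock {
public:
    static constexpr int kNoOwner = -1;

    // Count up if this CPU already owns the lock, otherwise spin until it is free.
    void lock(int cpu) {
        if (owner_.load(std::memory_order_acquire) != cpu) {
            int expected = kNoOwner;
            while (!owner_.compare_exchange_weak(expected, cpu,
                                                 std::memory_order_acquire)) {
                expected = kNoOwner;     // another CPU owns it: keep spinning
            }
        }
        ++count_;
    }

    // Non-blocking variant: fails instead of spinning when another CPU owns the lock.
    bool try_lock(int cpu) {
        if (owner_.load(std::memory_order_acquire) != cpu) {
            int expected = kNoOwner;
            if (!owner_.compare_exchange_strong(expected, cpu,
                                                std::memory_order_acquire)) {
                return false;
            }
        }
        ++count_;
        return true;
    }

    // Only the owning CPU may unlock; ownership is dropped when the count hits zero.
    void unlock(int cpu) {
        assert(owner_.load() == cpu && count_ > 0);
        if (--count_ == 0) {
            owner_.store(kNoOwner, std::memory_order_release);
        }
    }

    int count() const { return count_; }
    int owner() const { return owner_.load(); }

private:
    std::atomic<int> owner_{kNoOwner};
    int count_ = 0;                      // only touched by the owning CPU
};
```

In this model, blindly incrementing the count from a CPU that is not the owner would break the invariant that unlock() asserts, which matches the assertions reported earlier in the thread.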

I found the kernel instrumentation option very useful for debugging
deadlocks. I was using the CodeConfidence plugin in Eclipse to analyze the
trace, which makes debugging pretty efficient.

Christophe

In reply to Michael Jones's message of 3/4/2014, 4:58 PM.

Re: Scheduler startup question

Michael Jones
Christophe,

What I mean is that the lock call shown in the code you posted is not in the eCos code base. So when I said I added code, I meant that I added the code you posted.

I removed that code and moved it to Vectors.S, where it is now a trylock rather than the main lock call. (My latest code on SourceForge does not have this lock call.)

When the lock was called in interrupt_end, it did not deadlock. When it was called in Vectors.S, it deadlocked.

The functional difference is that when the lock was called in Vectors.S, it was called before the ISR was called.

But as I said, I have not tried to find the root cause of the deadlock.

Perhaps I can try the kernel instrumentation when I have some time this weekend.

Mike

In reply to Christophe's message of Mar 4, 2014, 9:16 AM.

> Michael,
>
> I am not sure what you mean by adding code in interrupt_end to take the lock. The locking mechanism is present for SMP target, no change required:
>
> externC void
> interrupt_end(
>    cyg_uint32          isr_ret,
>    Cyg_Interrupt       *intr,
>    HAL_SavedRegisters  *regs
>    )
> {
> //    CYG_REPORT_FUNCTION();
>
> #ifdef CYGPKG_KERNEL_SMP_SUPPORT
>    Cyg_Scheduler::lock();
> #endif
>
> The macro for incrementing the lock in SMP looks at the current owner of the lock and spin when required.
>
> I found the kernel instrumentation option very useful for debugging deadlocks. I was using CodeConfidence plugin in Eclipse to analyze the trace which makes it pretty efficient debugging.
>
> Christophe
>
> On 3/4/2014 4:58 PM, Michael Jones wrote:
>> Christophe,
>>
>> When I first got SMP to work I added some code in interrupt_end to take the lock, but I moved it back to Vectors.S because I was trying to reduce changes to the kernel. Functionally, the only difference is getting the lock before the ISR is executed or not.
>>
>> My bigger concern is how the lock is taken. When I increase the lock count, the core doing so (core 0) may not be the holder of the lock, which leads to assertions. And if it spins while taking the lock, it deadlocks. I have not traced down the deadlock, but I think the problem is in the scheduler, where some secondary CPU is waiting.
>>
>> My current solution is to use a trylock in Vectors.S and living with the fact that when it fails, it will take another real time clock interrupt to try again. So interrupt_end is not guaranteed to called on each interrupt. This keeps things simple. All interrupts go to core 0 except inter cpu interrupts. Some latency is added because taking the lock is not guaranteed.
>>
>> Other ways to handle this is to send interrupts to all cores, use inter core interrupts, etc, in an effort to guarantee a lock is incremented by the core that holds the lock.
>>
>> I was not able to figure our how i386 handled this. Does anyone know how the i386 SMP incremented the lock if the core that got the interrupt did not hold the lock?
>>
>> Mike
>>
>>
>> On Mar 4, 2014, at 8:37 AM, christophe <[hidden email]> wrote:
>>
>>> Hi Michael,
>>>
>>> I might remember wrong but I think in case of SMP target, the lock is not taken in Vector.S but directly after entering interrupt_end. Of course this is spinlock based so it might delay posting/scheduling of the DSR.
>>>
>>> Christophe
>>>
>>> On 3/2/2014 9:19 PM, Michael Jones wrote:
>>>> Jurgen,
>>>>
>>>> I think I fully understand how the scheduler locking works during interrupt now. Vectors.S takes the lock, and interrupt_end clears it. However, the normal technique of incrementing the lock count does not work with SMP. The problem is that another CPU may have the lock. Incrementing anyway leads to assertions. Attempting to take the lock with the spinlock can lead to deadlocks or an unresponsive network application.
>>>>
>>>> So I changed things so that in Vectors.S, during an interrupt, an attempt at locking is made. This means trying to take a spinlock that might fail. If the lock is taken, interrupt_end is called. If the lock fails, interrupt_end is not called.
>>>>
>>>> This means that a DSR may not be posted on that interrupt. This can cause some latency based on the real time clock interrupt rate, or time until a thread switch. However, it is stable and assertion free. Also, a HAL could implement a timeout on the try spinlock which might reduce latency.
>>>>
>>>> To support the try and testing if the lock was taken, I had to add some functions to the kernel. The following wiki page has been updated to reflect the kernel changes.
>>>>
>>>> https://sourceforge.net/p/ecosfreescale/wiki/SMP%20Kernel/
>>>>
>>>> Anyone with SMP knowledge might want to take a look. There may be better solutions to some of these problems. But at least for now, the IMX6 SMP HAL seems stable and I can run IO intensive Lua scripts over telnet reliably, even when the client aborts.
>>>>
>>>> The client abort means telnet has to kill a thread. This was quite a challenge. Telnet is creating a separate heap for Lua so it can kill the thread and reclaim memory. The remaining problem is closing file handles. I still get some assertions when a handle is sometimes killed by a thread that does not own it. I don't think that can be solved without adding some new functions dedicated to clean up of file handles by an outside thread.
>>>>
>>>> Mike

Re: Scheduler startup question

Michael Jones
Christophe,

If you are looking at my SourceForge code, the integration branch is my latest code, with the trylock in Vectors.S.

Mike
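
For readers following along, the trylock flow in the interrupt path can be sketched roughly as below. This is a single-threaded model, not the actual Vectors.S or kernel code: InterruptPathModel and all of its members are hypothetical names, and the model assumes a DSR request that is skipped stays outstanding until a later interrupt gets the lock (the thread does not spell out how the HAL achieves that).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-threaded model of the trylock interrupt path.
// It is NOT the eCos Vectors.S code; names are invented for illustration.
struct InterruptPathModel {
    std::atomic<bool> sched_spin{false};  // stands in for the scheduler spinlock
    int pending_dsrs = 0;                 // DSRs requested but not yet run (assumed to persist)
    int dsrs_run = 0;                     // DSRs that interrupt-end processing has run

    // Model one interrupt on the receiving core.
    // Returns true if the interrupt-end step ran, false if it was skipped.
    bool handle_interrupt() {
        // Try to take the lock before the ISR runs; never spin.
        bool locked = !sched_spin.exchange(true, std::memory_order_acquire);

        ++pending_dsrs;  // the ISR asks for its DSR to be run

        if (!locked) {
            return false;  // lock held elsewhere: skip interrupt-end, DSR is delayed
        }

        // Interrupt-end step: run everything outstanding, then release the lock.
        dsrs_run += pending_dsrs;
        pending_dsrs = 0;
        sched_spin.store(false, std::memory_order_release);
        return true;
    }
};
```

The cost of this design is the latency mentioned earlier in the thread: when the try fails, the DSR waits for the next interrupt that does get the lock.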

On Mar 4, 2014, at 9:51 AM, Michael Jones <[hidden email]> wrote:

> Christophe,
>
> What I mean is that the lock shown in the code you put below is not in the eCos code base. So when I said I added code, I meant that I added the code you put below.
>
> I removed that code and moved it to Vectors.S, where it is now a trylock rather than the main lock call. (My latest code on SourceForge does not have the lock call shown below.)
>
> When the lock was called in interrupt_end, it did not deadlock. When it was called in Vectors.S, it deadlocked.
>
> The functional difference is that when the lock was called in Vectors.S, it was called before the ISR was called.
>
> But as I said, I have not tried to find the root cause of the deadlock.
>
> Perhaps I can try the kernel instrumentation when I have some time this weekend.
>
> Mike
>
> On Mar 4, 2014, at 9:16 AM, christophe <[hidden email]> wrote:
>
>> Michael,
>>
>> I am not sure what you mean by adding code in interrupt_end to take the lock. The locking mechanism is already present for SMP targets; no change is required:
>>
>> externC void
>> interrupt_end(
>>   cyg_uint32          isr_ret,
>>   Cyg_Interrupt       *intr,
>>   HAL_SavedRegisters  *regs
>>   )
>> {
>> //    CYG_REPORT_FUNCTION();
>>
>> #ifdef CYGPKG_KERNEL_SMP_SUPPORT
>>   Cyg_Scheduler::lock();
>> #endif
>>
>> The macro for incrementing the lock in SMP looks at the current owner of the lock and spins when required.
>>
>> I found the kernel instrumentation option very useful for debugging deadlocks. I was using the CodeConfidence plugin in Eclipse to analyze the trace, which makes debugging pretty efficient.
>>
>> Christophe
>>
>> On 3/4/2014 4:58 PM, Michael Jones wrote:
>>> Christophe,
>>>
>>> When I first got SMP to work I added some code in interrupt_end to take the lock, but I moved it back to Vectors.S because I was trying to reduce changes to the kernel. Functionally, the only difference is getting the lock before the ISR is executed or not.
>>>
>>> My bigger concern is how the lock is taken. When I increase the lock count, the core doing so (core 0) may not be the holder of the lock, which leads to assertions. And if it spins while taking the lock, it deadlocks. I have not traced down the deadlock, but I think the problem is in the scheduler, where some secondary CPU is waiting.
>>>
>>> My current solution is to use a trylock in Vectors.S and live with the fact that when it fails, it will take another real-time clock interrupt to try again. So interrupt_end is not guaranteed to be called on each interrupt. This keeps things simple. All interrupts go to core 0 except inter-CPU interrupts. Some latency is added because taking the lock is not guaranteed.
>>>
>>> Other ways to handle this are to send interrupts to all cores, use inter-core interrupts, etc., in an effort to guarantee that the lock is incremented by the core that holds it.
>>>
>>> I was not able to figure out how i386 handled this. Does anyone know how the i386 SMP incremented the lock if the core that got the interrupt did not hold the lock?
>>>
>>> Mike
>>>
>>>
>>> On Mar 4, 2014, at 8:37 AM, christophe <[hidden email]> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> I might be remembering wrong, but I think that for SMP targets the lock is not taken in Vectors.S but directly after entering interrupt_end. Of course, this is spinlock based, so it might delay posting/scheduling of the DSR.
>>>>
>>>> Christophe
>>>>