ISR not causing an DSR in some rare conditions

ISR not causing an DSR in some rare conditions

Stefan Sommerfeld
Hi,

I'm using an XScale PXA270 processor with the latest eCos and I think I have
found a problem with DSRs. I'm 100% sure that there must be a condition where
eCos does not call the DSR of an interrupt. I have an IRQ which fires 45 times
a second on a system running at high load. After more than 10 hours one DSR
is missing. This is a bad situation which makes the system unstable.

Is this problem known? Maybe it depends only on the architecture (ARM).

Would it be a good idea to contact eCosCentric for a solution?

Bye...


--
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

Re: ISR not causing an DSR in some rare conditions

Andrew Lunn-2
On Thu, Jan 12, 2006 at 01:47:43PM +0100, Stefan Sommerfeld wrote:
> Hi,
>
> i'm using XScale PXA270 processor with latest eCos and I think I found a
> problem with DSR's. I'm 100% sure that there must be a condition where eCos
> not calls the DSR of an interrupt. I have a IRQ which comes 45 times a
> second on a system running at high load. After more then 10 hours one DSR
> is missing. This is a bad situation which makes the system unstable.
>
> Is this problem known? Maybe it depends only on the architecture (ARM).

Does the ISR reenable the interrupt? It could be the next interrupt
arrives before the DSR is called. In that case the DSR will be called
with the count value of 2.

        Andrew

Re: ISR not causing an DSR in some rare conditions

Stefan Sommerfeld
Hi,

>>
>> i'm using XScale PXA270 processor with latest eCos and I think I found a
>> problem with DSR's. I'm 100% sure that there must be a condition where
>> eCos
>> not calls the DSR of an interrupt. I have a IRQ which comes 45 times a
>> second on a system running at high load. After more then 10 hours one
>> DSR
>> is missing. This is a bad situation which makes the system unstable.
>>
>> Is this problem known? Maybe it depends only on the architecture (ARM).
>
> Does the ISR reenable the interrupt? It could be the next interrupt
> arrives before the DSR is called. In that case the DSR will be called
> with the count value of 2.

No. Interrupts are never disabled. The interrupt from this source is only
acknowledged in the ISR function.

The functionality is simple. A hardware unit is started and reports
completion with an interrupt. I added a counter at hardware unit start, in
the ISR and in the DSR. After this long-run test, the DSR counter is one less
than the other counters.

Bye...


Re: ISR not causing an DSR in some rare conditions

Gary Thomas
On Thu, 2006-01-12 at 14:42 +0100, Stefan Sommerfeld wrote:

> Hi,
> >>
> >> i'm using XScale PXA270 processor with latest eCos and I think I found a
> >> problem with DSR's. I'm 100% sure that there must be a condition where
> >> eCos
> >> not calls the DSR of an interrupt. I have a IRQ which comes 45 times a
> >> second on a system running at high load. After more then 10 hours one
> >> DSR
> >> is missing. This is a bad situation which makes the system unstable.
> >>
> >> Is this problem known? Maybe it depends only on the architecture (ARM).
> >
> > Does the ISR reenable the interrupt? It could be the next interrupt
> > arrives before the DSR is called. In that case the DSR will be called
> > with the count value of 2.
>
> No. The interrupts will not be disabled at any time. The interrupt from
> this source will only be acknowledged in the isr function.
>
> The functionality is simple. A hardware unit will be started and reports
> the finish with an interrupt. I have made a counter on hardware unit start,
> isr and dsr. After this long-run test, the dsr counter is one less than the
> other counters.

So, what else is going on that creates your "high load?"  Possibly
there is some side effect [you may not even be aware of] that could
cause the DSR loss.

Have you checked to see if the DSR is ever called with a count other
than 1?

--
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


Re: ISR not causing an DSR in some rare conditions

Sergei Organov-3
In reply to this post by Stefan Sommerfeld
"Stefan Sommerfeld" <[hidden email]> writes:

> Hi,
>>>
>>> i'm using XScale PXA270 processor with latest eCos and I think I found a
>>> problem with DSR's. I'm 100% sure that there must be a condition
>>> where eCos
>>> not calls the DSR of an interrupt. I have a IRQ which comes 45 times a
>>> second on a system running at high load. After more then 10 hours
>>> one DSR
>>> is missing. This is a bad situation which makes the system unstable.
>>>
>>> Is this problem known? Maybe it depends only on the architecture (ARM).
>>
>> Does the ISR reenable the interrupt? It could be the next interrupt
>> arrives before the DSR is called. In that case the DSR will be called
>> with the count value of 2.
>
> No. The interrupts will not be disabled at any time. The interrupt from
> this source will only be acknowledged in the isr function.
>
> The functionality is simple. A hardware unit will be started and reports
> the finish with an interrupt. I have made a counter on hardware unit start,
> isr and dsr. After this long-run test, the dsr counter is one less than the
> other counters.

Did you bother to read the manual? Here is the citation:

"""
void
dsr_function(cyg_vector_t vector,
             cyg_ucount32 count,
             cyg_addrword_t data)
{
}

[...] The second argument indicates the number of these interrupts
that have occurred and for which the ISR requested a DSR.  Usually this
will be 1, unless the system is suffering from a very heavy load.
"""

Thus, you need to increment your test counter by the value of the 'count'
argument in the DSR handler. Do you?

-- Sergei.
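
The point about the 'count' argument can be illustrated with a small
off-target sketch. It assumes stand-in typedefs for the eCos types (on a real
target they come from <cyg/kernel/kapi.h>) and a hypothetical dsr_stats
structure passed through the 'data' argument; neither the structure nor the
function name comes from the thread. The DSR adds 'count' to its interrupt
total instead of adding 1 per call, so a coalesced DSR is still accounted for.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in typedefs so the sketch compiles off-target (assumption);
   on eCos the real definitions come from <cyg/kernel/kapi.h>. */
typedef uint32_t  cyg_vector_t;
typedef uint32_t  cyg_ucount32;
typedef uintptr_t cyg_addrword_t;

/* Hypothetical diagnostic counters, not part of eCos. */
struct dsr_stats {
    uint32_t calls;       /* number of times the DSR function ran     */
    uint32_t interrupts;  /* sum of 'count' over all DSR calls        */
    uint32_t max_count;   /* largest 'count' seen in a single call    */
};

/* DSR with the signature quoted from the manual. */
void stats_dsr(cyg_vector_t vector, cyg_ucount32 count, cyg_addrword_t data)
{
    struct dsr_stats *s = (struct dsr_stats *)data;

    (void)vector;            /* unused in this sketch */
    assert(count >= 1);      /* a DSR is only run for at least one request */

    s->calls += 1;
    s->interrupts += count;  /* account for coalesced requests */
    if (count > s->max_count)
        s->max_count = count;
}
```

If 'interrupts' matches the ISR counter while 'calls' is lower, the DSR was
coalesced rather than lost. If both are lower than the ISR counter, a DSR
request really went missing.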


Re: ISR not causing an DSR in some rare conditions

Stefan Sommerfeld
In reply to this post by Gary Thomas
Hi,

>> >>
>> >> i'm using XScale PXA270 processor with latest eCos and I think I
>> >> found a
>> >> problem with DSR's. I'm 100% sure that there must be a condition
>> >> where
>> >> eCos
>> >> not calls the DSR of an interrupt. I have a IRQ which comes 45 times
>> >> a
>> >> second on a system running at high load. After more then 10 hours one
>> >> DSR
>> >> is missing. This is a bad situation which makes the system unstable.
>> >>
>> >> Is this problem known? Maybe it depends only on the architecture
>> >> (ARM).
>> >
>> > Does the ISR reenable the interrupt? It could be the next interrupt
>> > arrives before the DSR is called. In that case the DSR will be called
>> > with the count value of 2.
>>
>> No. The interrupts will not be disabled at any time. The interrupt from
>> this source will only be acknowledged in the isr function.
>>
>> The functionality is simple. A hardware unit will be started and reports
>> the finish with an interrupt. I have made a counter on hardware unit
>> start,
>> isr and dsr. After this long-run test, the dsr counter is one less than
>> the
>> other counters.
>
> So, what else is going on that creates your "high load?"  Possibly
> there is some side effect [you may not even be aware of] that could
> cause the DSR loss.

The system is decoding video and audio, including output timing control, so
there are a lot of IRQs coming from hardware and DMA channels.

> Have you checked to see if the DSR is ever called with a count other
> than 1?

I have not checked this because of the way the unit is used. The cycle is
start->isr->dsr->start..., so the DSR count cannot be more than 1. After the
missing DSR the hardware unit is not started again, so I had time to dump
counter values and status info. I did check the DSR count on the DMA
ISR/DSR. It was 2 (but never more) from time to time.

Another thing is the calling order of the DSRs. I would guess the best
would be first come, first served, so the DSR of the oldest IRQ is served
first. But in "linked list" mode it's the opposite (last come, first
served). Would this be different if I used the array version of the DSRs?

Bye...
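
The ordering question can be made concrete with a small simulation. This is
not the eCos implementation: the structures and function names below are
hypothetical, and the ring buffer merely stands in for an array-style queue
that is drained in posting order, which is an assumption about the array
variant. The list version pushes each posted DSR onto the head, which gives
the last-come, first-served order described above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8   /* capacity of the simulated table (power of two) */

/* Hypothetical pending-DSR record for the list variant. */
struct pending {
    int id;
    struct pending *next;
};

/* List mode: each post pushes onto the head, so draining is LIFO. */
struct pending *list_post(struct pending *head, struct pending *p)
{
    p->next = head;
    return p;
}

/* Walk the list from the head and record the order of the calls. */
size_t list_drain(struct pending *head, int *out)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        out[n++] = head->id;
    return n;
}

/* Array mode (assumed behaviour): a ring buffer drained FIFO. */
struct ring {
    int slot[MAX_PENDING];
    size_t head;   /* next entry to run  */
    size_t tail;   /* next free position */
};

void ring_post(struct ring *r, int id)
{
    assert(r->tail - r->head < MAX_PENDING);   /* table must not overflow */
    r->slot[r->tail++ % MAX_PENDING] = id;
}

size_t ring_drain(struct ring *r, int *out)
{
    size_t n = 0;

    while (r->head != r->tail)
        out[n++] = r->slot[r->head++ % MAX_PENDING];
    return n;
}
```

Posting ids 1, 2, 3 and draining yields 3, 2, 1 from the list and 1, 2, 3
from the ring, so under these assumptions only the array-style queue serves
the oldest request first.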


Re: Re: ISR not causing an DSR in some rare conditions

Stefan Sommerfeld
In reply to this post by Sergei Organov-3
Hi,

>>>>
>>>> i'm using XScale PXA270 processor with latest eCos and I think I found
>>>> a
>>>> problem with DSR's. I'm 100% sure that there must be a condition
>>>> where eCos
>>>> not calls the DSR of an interrupt. I have a IRQ which comes 45 times a
>>>> second on a system running at high load. After more then 10 hours
>>>> one DSR
>>>> is missing. This is a bad situation which makes the system unstable.
>>>>
>>>> Is this problem known? Maybe it depends only on the architecture
>>>> (ARM).
>>>
>>> Does the ISR reenable the interrupt? It could be the next interrupt
>>> arrives before the DSR is called. In that case the DSR will be called
>>> with the count value of 2.
>>
>> No. The interrupts will not be disabled at any time. The interrupt from
>> this source will only be acknowledged in the isr function.
>>
>> The functionality is simple. A hardware unit will be started and reports
>> the finish with an interrupt. I have made a counter on hardware unit
>> start,
>> isr and dsr. After this long-run test, the dsr counter is one less than
>> the
>> other counters.
>
> Did you bother to read the manual? Here is citation:
>
> """
> void
> dsr_function(cyg_vector_t vector,
>             cyg_ucount32 count,
>             cyg_addrword_t data)
> {
> }
>
> [...] The second argument indicates the number of these interrupts
> that have occurred and for which the ISR requested a DSR.  Usually this
> will be 1, unless the system is suffering from a very heavy load.
> """
>
> Thus, you need to increment your test counter by the value of 'count'
> argument in the DSR handler, do you?

No... you don't understand. The DSR count of this particular IRQ cannot be
other than 1. The hardware unit raises only one IRQ some time after it was
started. If eCos does not call the DSR routine, the hardware unit is not
started again and so no more IRQs are raised.

Bye...


Re: ISR not causing an DSR in some rare conditions

Sergei Organov-3
"Stefan Sommerfeld" <[hidden email]> writes:
[...]
>>> No. The interrupts will not be disabled at any time. The interrupt from
>>> this source will only be acknowledged in the isr function.
>>>
>>> The functionality is simple. A hardware unit will be started and
>>> reports the finish with an interrupt. I have made a counter on
>>> hardware unit start, isr and dsr. After this long-run test, the dsr
>>> counter is one less than the other counters.
>>
[...]
>> Thus, you need to increment your test counter by the value of 'count'
>> argument in the DSR handler, do you?
>
> No... you don't understand.

Well, maybe, but that's how I read your:

>>> No. The interrupts will not be disabled at any time. The interrupt from
>>> this source will only be acknowledged in the isr function.

If interrupts are not disabled, and the interrupt is acked in the ISR
function as you wrote above, then more than one could happen before any DSR
has run.

> The dsr count of this particular irq cannot be other than 1.

Well, seems I indeed misunderstand, but did you *actually* check it
isn't? Ah, well, from your other reply I see you didn't, though I'd
check anyway as "cannot" and "doesn't indeed happen" are surprisingly
different, at least in my experience.

> The hardware unit causes only one irq some time after it was
> started. If ecos does not call the dsr routine, the hardware unit will
> not be started again and so no more irq's will be caused.

Are you saying that it is the DSR that restarts the hardware unit? It seems
so, though you didn't tell us that before.

After the DSR is missed, is the entire system operational? I mean, is the
rest of the system running OK (including ISRs/DSRs from other sources,
e.g., timer) after the DSR goes missing? If so, then it looks like some
race somewhere, either in your code or in eCos, or an entirely
independent bug that just happens to break this thing.

BTW, if it runs on ARM, do you use FIQ? I think I know at least 2 bugs
in the ARM HAL, one of which is with FIQ handling (and another one being
in the context switch), but chances are very low they show themselves
the way you see.

-- Sergei.


Re: Re: ISR not causing an DSR in some rare conditions

Gary Thomas
On Thu, 2006-01-12 at 19:11 +0300, Sergei Organov wrote:
 <...snip>
>
> BTW, if it runs on ARM, do you use FIQ? I think I know at least 2 bugs
> in the ARM HAL, one of which is with FIQ handling (and another one being
> in the context switch), but chances are very low they show themselves
> the way you see.

What bugs are you speaking of?
Do you have patches that fix them?

--
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


Re: Re: ISR not causing an DSR in some rare conditions

Stefan Sommerfeld
In reply to this post by Sergei Organov-3
Hi,

>>>> No. The interrupts will not be disabled at any time. The interrupt
>>>> from
>>>> this source will only be acknowledged in the isr function.
>>>>
>>>> The functionality is simple. A hardware unit will be started and
>>>> reports the finish with an interrupt. I have made a counter on
>>>> hardware unit start, isr and dsr. After this long-run test, the dsr
>>>> counter is one less than the other counters.
>>>
> [...]
>>> Thus, you need to increment your test counter by the value of 'count'
>>> argument in the DSR handler, do you?
>>
>> No... you don't understand.
>
> Well, maybe, but that's how I read your:
>
>>>> No. The interrupts will not be disabled at any time. The interrupt
>>>> from
>>>> this source will only be acknowledged in the isr function.
>
> If interrupts are not disabled, and interrupt is acked in isr function
> as you wrote above, then more than one could happen while no dsr is run
> yet.
>
>> The dsr count of this particular irq cannot be other than 1.
>
> Well, seems I indeed misunderstand, but did you *actually* check it
> isn't? Ah, well, from your other reply I see you didn't, though I'd
> check anyway as "cannot" and "doesn't indeed happen" are surprisingly
> different, at least in my experience.

Maybe I was not clear in my descriptions. I'll try to go into more detail.
The XScale is using a picture scaler in an FPGA (so outside of the
processor). It sets up the scaler and starts it. The IRQ tells the system
that the scaling has finished. Then, in the DSR function, a new scaler setup
is loaded and the scaler is restarted. That's why only one IRQ and one DSR
can occur per start. You're right that I did not check the DSR count, but
there's no possibility for multiple DSRs. What I see is that my scaling stops
after a long time and, from what I found out, it's the missing DSR, because
the expected ISR was there.

>> The hardware unit causes only one irq some time after it was
>> started. If ecos does not call the dsr routine, the hardware unit will
>> not be started again and so no more irq's will be caused.
>
> Do you say that it's DSR that restarts the hardware unit? Seems so
> though you didn't tell it to us before.

Sorry... I hope it's clearer now.
>
> After DSR is missed is the entire system operational? I mean is the rest
> of the system running OK (including ISRs/DSRs from other sources,
> e.g., timer) after the DSR is missing? If so, then it looks like some
> race somewhere, either in your code or in the eCos, or entirely
> independent bug that just happens to break this thing.

The entire system is working well. Other IRQs/DSRs are working, and
multithreading too. To me it looks like a race condition, maybe not in
eCos itself, but in a platform/variant implementation. I previously had the
"feeling" that the system sometimes "loses" DSRs with the DMA channels, but
currently I'm using multiple DMA channels at the same time, so multiple
IRQs happen and if a DSR is missing, another IRQ will retry the DSR.
I plan to set up a test with a bit of hardware support (the FPGA generating
multiple IRQs in a short time).

> BTW, if it runs on ARM, do you use FIQ? I think I know at least 2 bugs
> in the ARM HAL, one of which is with FIQ handling (and another one being
> in the context switch), but chances are very low they show themselves
> the way you see.

No. As far as I know, the current pxa2x0 variant does not support FIQs, so
I don't use them. As I understand it, the IRQs are okay; only the
IRQ->DSR functionality isn't working.

For now I've implemented a watchdog to check for the missing DSR, which
"fixes" the problem, but my company is starting a new project and we are
not sure if we should continue using eCos. I like it, but the DSR problems
are critical. I also had problems with very long delays between ISR and
DSR, which may be caused by the execution order.

Bye...
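
A watchdog of the kind mentioned above could look roughly like the
following. This is a sketch under assumptions, not the poster's actual code:
the scaler_watch structure, the function names and the tick source are all
hypothetical, and on a real target the counters would be updated from DSR
and thread context, so access would need to be protected (for example with
the scheduler lock).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one start->isr->dsr cycle. */
struct scaler_watch {
    uint32_t starts;        /* incremented each time the unit is started */
    uint32_t dsrs;          /* incremented each time the DSR has run     */
    uint32_t started_tick;  /* tick value at the most recent start       */
};

/* Call when the hardware unit is (re)started. */
void watch_on_start(struct scaler_watch *w, uint32_t now)
{
    w->starts++;
    w->started_tick = now;
}

/* Call from the DSR once it has handled the completion. */
void watch_on_dsr(struct scaler_watch *w)
{
    w->dsrs++;
}

/* True when a start has been waiting for its DSR longer than 'timeout'
   ticks. Unsigned subtraction keeps the check valid across wrap-around. */
bool watch_dsr_overdue(const struct scaler_watch *w, uint32_t now,
                       uint32_t timeout)
{
    assert(timeout > 0);

    bool outstanding = (uint32_t)(w->starts - w->dsrs) != 0;
    return outstanding && (uint32_t)(now - w->started_tick) > timeout;
}
```

A periodic thread or alarm could poll watch_dsr_overdue() and restart the
unit when it returns true. That hides the symptom, as noted above, but it
also records exactly when the DSR failed to run.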


RE: Re: ISR not causing an DSR in some rare conditions

Doyle, Patrick
In reply to this post by Stefan Sommerfeld
There was a discussion on the developers list last week or the week before
about a subtle race condition that arose when an interrupt occurred at the
exact instant that a single thread called a blocking primitive and the
scheduler was busy transitioning to the idle thread.  See
http://ecos.sourceware.org/ml/ecos-devel/2006-01/msg00000.html for more
details.  Nick posted a patch, which I presume was applied to the CVS tree.
You could try applying his patch yourself or updating to the latest CVS tree
(and seeing if the tree does, in fact, include his patch).

hth...

--wpd


Patrick Doyle
Manager, Digital Systems Group
(603) 546-2179

 



Re: Re: ISR not causing an DSR in some rare conditions

Sergei Organov-3
In reply to this post by Gary Thomas
Gary Thomas <[hidden email]> writes:

> On Thu, 2006-01-12 at 19:11 +0300, Sergei Organov wrote:
>  <...snip>
>>
>> BTW, if it runs on ARM, do you use FIQ? I think I know at least 2 bugs
>> in the ARM HAL, one of which is with FIQ handling (and another one being
>> in the context switch), but chances are very low they show themselves
>> the way you see.
>
> What bugs are you speaking of?
> Do you have patches that fix them?

Well, one of them is the first item here:

<http://article.gmane.org/gmane.os.ecos.general/16715/match=arm+hal+issues>

Another one, that is FIQ related, hasn't yet been reported, but I
can prepare a problem description and a patch to fix it, though I'd
appreciate some response to my 2 month old message mentioned above
first.

-- Sergei.

Re: ISR not causing an DSR in some rare conditions

Sergei Organov-3
In reply to this post by Doyle, Patrick
"Doyle, Patrick" <[hidden email]> writes:
> There was a discussion on the developers list last week or the week before
> about a subtle race condition that arose when an interrupt occurred at the
> exact instant that a single thread called a blocking primitive and the
> scheduler was busy transitioning to the idle thread.  See
> http://ecos.sourceware.org/ml/ecos-devel/2006-01/msg00000.html for more
> details.  Nick posted a patch, which I presume was applied to the CVS tree.
> You could try applying his patch yourself or updating to the latest CVS tree
> (and seeing if the tree does, in fact, include his patch).

It's unlikely to be the source of the problem. I was the OP of that issue,
and the bug could only result in a delayed DSR, not in a missing DSR,
provided (other) ISRs/DSRs continue to run (the latter being the case
for the OP of this thread).

-- Sergei.


Re: Re: ISR not causing an DSR in some rare conditions

Nick Garnett
In reply to this post by Doyle, Patrick
"Doyle, Patrick" <[hidden email]> writes:

> There was a discussion on the developers list last week or the week before
> about a subtle race condition that arose when an interrupt occurred at the
> exact instant that a single thread called a blocking primitive and the
> scheduler was busy transitioning to the idle thread.  See
> http://ecos.sourceware.org/ml/ecos-devel/2006-01/msg00000.html for more
> details.  Nick posted a patch, which I presume was applied to the CVS tree.
> You could try applying his patch yourself or updating to the latest CVS tree
> (and seeing if the tree does, in fact, include his patch).

It is unlikely that this is the problem. The bug I fixed was a failure
to call DSRs during the initial context switch to a newly created
thread. In any case that only delayed the DSR until the next scheduler
unlock, it didn't lose the DSR entirely. The program that showed the
problem was somewhat unusual in that it had nothing else to do until
the DSR ran.

As for the reported problem, I cannot think of anything that might be
causing a DSR to be lost entirely. The code dealing with all of this
has been thoroughly exercised over many years and has been the subject
of much scrutiny. I'm as certain as anyone can be that it is
correct. If there were a race condition anywhere in here then I would
expect it to have manifested itself elsewhere before now.


Actually I can think of one reason why races may be introduced
unexpectedly. This is if the compiler is reordering instructions
incorrectly and moving things across barriers that it should not. In
particular if it is not honouring the volatile nature of the asm
inlines that enable and disable interrupts.

I don't know what version of the compiler you are using, but it might
be instructive to see if a different version exhibits different
behaviour. However, we have never seen any problems like this, so I am
really clutching at straws here.


--
Nick Garnett                                     eCos Kernel Architect
http://www.ecoscentric.com                The eCos and RedBoot experts
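
The reordering concern can be illustrated with the general GCC idiom for a
compiler barrier. The sketch below does not reproduce the eCos HAL macros:
fake_irq_disable() and fake_irq_restore() are hypothetical stand-ins that
only set a flag, and the "memory" clobber on the empty asm statement is what
tells GCC not to move memory accesses across it.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang extension): emits no instructions, but
   the "memory" clobber forbids moving loads and stores across it. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-ins for real interrupt disable/restore macros. */
int irq_masked;

void fake_irq_disable(void)
{
    irq_masked = 1;
    COMPILER_BARRIER();   /* nothing below may be hoisted above this */
}

void fake_irq_restore(void)
{
    COMPILER_BARRIER();   /* nothing above may sink below this */
    irq_masked = 0;
}

/* Shared between "interrupt" and "thread" context in this sketch. */
uint32_t pending_dsrs;

void fake_post_dsr(void)
{
    pending_dsrs++;
}

/* Read and clear the pending count inside the critical section. */
uint32_t take_pending_dsrs(void)
{
    uint32_t n;

    fake_irq_disable();
    assert(irq_masked == 1);   /* sanity check: inside the section */
    n = pending_dsrs;
    pending_dsrs = 0;
    fake_irq_restore();
    return n;
}
```

Without the barriers, the compiler would be free to reorder the plain stores
to irq_masked and pending_dsrs, which is the kind of window in which a
posted DSR could be overwritten. Comparing the generated assembly from two
compiler versions around such sections is one way to test this theory.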


Re: Re: ISR not causing an DSR in some rare conditions

Stefan Sommerfeld
Hi,

>> There was a discussion on the developers list last week or the week
>> before
>> about a subtle race condition that arose when an interrupt occurred at
>> the
>> exact instant that a single thread called a blocking primitive and the
>> scheduler was busy transitioning to the idle thread.  See
>> http://ecos.sourceware.org/ml/ecos-devel/2006-01/msg00000.html for more
>> details.  Nick posted a patch, which I presume was applied to the CVS
>> tree.
>> You could try applying his patch yourself or updating to the latest CVS
>> tree
>> (and seeing if the tree does, in fact, include his patch).
>
> It is unlikely that this is the problem. The bug I fixed was a failure
> to call DSRs during the initial context switch to a newly created
> thread. In any case that only delayed the DSR until the next scheduler
> unlock, it didn't lose the DSR entirely. The program that showed the
> problem was somewhat unusual in that it had nothing else to do until
> the DSR ran.
>
> As for the reported problem. I cannot think of anything that might be
> causing a DSR to be lost entirely. The code dealing with all of this
> has been thoroughly exercised over many years and has been the subject
> of much scrutiny. I'm as certain as anyone can be that it is
> correct. If there were a race condition anywhere in here then I would
> expect it to have manifested itself elsewhere before now.
>
>
> Actually I can think of one reason why races may be introduced
> unexpectedly. This is if the compiler is reordering instructions
> incorrectly and moving things across barriers that it should not. In
> particular if it is not honouring the volatile nature of the asm
> inlines that enable and disable interrupts.
>
> I don't know what version of the compiler you are using, but it might
> be instructive to see if a different version exhibits different
> behaviour. However, we have never seen any problems like this, so I am
> really clutching at straws here.

I'm using a self-compiled gcc 3.4.3 for XScale. I'll try to compile a
different version to check if this helps. I'll also try to set up a test
system which should trigger this problem faster (not 24 hours) to do some
more investigation.

I also noticed, while searching for the ISR-to-DSR delay problem, that the
scheduler lock count sometimes rises quite high (up to 10), but I don't
have nested interrupts enabled.

Bye...


Re: Re: ISR not causing an DSR in some rare conditions

Nick Garnett
"Stefan Sommerfeld" <[hidden email]> writes:

> I'm using a self-compiled gcc 3.4.3 for xscale. I'll try to compile a
> different version to check if this helps. I'll also try to setup a
> test system which should trigger this problem faster (not 24 hours) to
> do some more investigation.

You could try using one of the precompiled toolchains from the
website. We are reasonably sure that these have no unexpected
problems.

>
> I also noticed while searching for the isr to dsr delay problem that
> the scheduler lock count sometimes raises quite high (up to 10), but i
> don't have nested interrupts enabled.

That just doesn't seem right. I wouldn't expect the lock count to rise
much beyond 3 or 4 at the most. It sounds like interrupt nesting may
be happening even when you don't want it. If this is the case, then
that might also explain the lost DSRs. It almost sounds as if
interrupt disable is not working. I cannot imagine why that would be
the case, though.


--
Nick Garnett                                     eCos Kernel Architect
http://www.ecoscentric.com                The eCos and RedBoot experts
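
To check how often the lock count really climbs, the observed depth can be
recorded in a small histogram. The sketch below is hypothetical
instrumentation, not eCos code: lock_hist and lock_hist_sample() are
invented names, and the depth value is assumed to come from whatever
lock-count accessor the kernel provides on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_HIST_BINS 16   /* depths >= 15 share the last bin */

/* Hypothetical record of observed scheduler lock depths. */
struct lock_hist {
    uint32_t bins[LOCK_HIST_BINS];  /* samples seen per depth */
    uint32_t max_seen;              /* highest depth sampled  */
};

/* Record one sampled lock depth. */
void lock_hist_sample(struct lock_hist *h, uint32_t depth)
{
    assert(h != NULL);

    uint32_t bin = depth < LOCK_HIST_BINS ? depth : LOCK_HIST_BINS - 1;

    h->bins[bin]++;
    if (depth > h->max_seen)
        h->max_seen = depth;
}
```

Sampling at ISR entry would show whether depths above 3 or 4 are isolated
spikes or a regular occurrence, which helps decide whether unwanted
interrupt nesting is really taking place.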

