Latest revision as of 15:18, 16 January 2023

Introduction

OCEOSmp is a real-time, pre-emptive, fixed-priority operating system that can be used in applications that must meet European Cooperation for Space Standardization (ECSS) Category B or ISO 26262 standards. It has a small memory footprint (<20 kBytes), requires only one system stack per CPU rather than a stack for each task, and provides support for precisely timed data outputs independent of task scheduling. OCEOS supports applications running on RISC-V, ARM, and SPARC based hardware.

OCEOSmp is designed to be suitable for high reliability embedded systems. It is not started on power up but is started by the main application code, to which it may return if a fatal error is detected. It is designed to be transparent, with information stored at statically allocated locations that are accessible both while OCEOSmp is running and after it exits. In addition to deterministic scheduling, the qualities considered most important in designing OCEOSmp were (i) robustness, (ii) efficiency, and (iii) compactness.


Robustness

Robustness is a primary objective in the design of OCEOSmp. For a system design to be robust, it must be possible to identify the range of problems it should deal with and how it deals with them. Such problems fall into three main categories: those excluded by the design, those that provide an advance warning that something is going wrong, and those that can occur without warning.

Excluded by design

The design of OCEOSmp is based on the stack resource policy (https://www.math.unipd.it/~tullio/RTS/2009/Baker-1991.pdf). This has a number of advantages and excludes a number of potential problems.

  • Unbounded priority inversion cannot occur in OCEOSmp.
  • Deadlocks cannot occur if there is only one CPU.
  • (With more than one CPU, deadlocks cannot occur if tasks always follow the same order in acquiring mutexes. OCEOSmp detects when this is not the case and automatically makes a system log entry, updates the state variable, and optionally calls a user defined problem handling function.)
  • When a task is created, an allowance of 1 to 15 execution instances is specified. This can be set so that unexpected task start requests do not cause a problem. (It also allows the work of a task to be split and done concurrently on multiple CPUs, or the same processing to be done on multiple CPUs and the results compared to check CPU behaviour.)
  • With OCEOSmp no CPU in a multi-core system is a single point of failure once scheduling has begun; use of the CPUs is then symmetric.
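
The mutex ordering check mentioned in the list above can be illustrated with a minimal sketch. It assumes each mutex is given a numeric rank and that a job must acquire mutexes in strictly increasing rank; the names `job_locks_t`, `lock_order_acquire` and `lock_order_release` are hypothetical and are not OCEOSmp directives. How OCEOSmp itself detects an ordering violation is not described in this text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HELD 8u /* hypothetical limit on nested mutexes per job */

/* Hypothetical per-job record of the ranks of the mutexes currently held,
   most recently acquired last. */
typedef struct {
    uint8_t held_rank[MAX_HELD];
    uint8_t depth;
} job_locks_t;

/* Returns true and records the mutex as held if acquiring it keeps the
   strictly increasing rank order. Returns false on an ordering violation
   (or if the nesting limit is reached); a real system would then log the
   event and update its state flags. */
bool lock_order_acquire(job_locks_t *job, uint8_t rank)
{
    assert(job != NULL);
    if (job->depth >= MAX_HELD) {
        return false;
    }
    if (job->depth > 0u && rank <= job->held_rank[job->depth - 1u]) {
        return false;
    }
    job->held_rank[job->depth] = rank;
    job->depth++;
    return true;
}

/* Releases the most recently acquired mutex (LIFO release is assumed). */
void lock_order_release(job_locks_t *job)
{
    assert(job != NULL);
    if (job->depth > 0u) {
        job->depth--;
    }
}
```

If every job follows the same rank order, no cycle of jobs waiting on each other's mutexes can form, which is why a violation of the order is worth reporting even when no deadlock has yet occurred.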

Advance warning – problem anticipation

Some problems give advance warning that things are going wrong, allowing preventive action to be taken before the problem becomes critical. All software design involves assumptions, and the ability to check at run time how close these are to being violated allows preventive action, or at least notification of what is happening, before a critical situation develops.

OCEOSmp is designed to make such advance warnings easy to notice. All memory is statically allocated, with information on all aspects of operations held at predetermined locations that can be checked at any time.

For software this information includes:

For each CPU:

  • Maximum stack usage (OCEOSmp does not require a separate stack for each task)
  • Currently executing and pending tasks

For each task:

  • Minimum time between task start requests.
  • Maximum time to completion.
  • Maximum number of concurrent execution instances (jobs).
  • Whether deadline (if specified) has been missed

For the system:

  • System log
  • Context switch log (if enabled)
  • System state variable (reset on restart)
  • System state variable backup (not reset on restart)
  • System state variable action mask (selects automatic call of application defined problem handler)

Hardware problem warnings include:

For CPU problems:

  • OCEOSmp allows simultaneous running of the same task on up to 15 different CPUs, with results stored on a per CPU basis and available for comparison.
  • This can provide protection against CPU malfunction in carrying out critical tasks.
  • It also allows background checks that can warn that a CPU has developed a problem.
  • If a CPU is found to be faulty it can be taken out of use and the termination functions of its currently executing and pending tasks called from another CPU.
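
Comparing the per-CPU results of a redundantly executed task can be sketched as a majority vote. This is an illustration only: the function names and the use of a single 32-bit result per copy are assumptions, and the text does not say how the application is expected to compare results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COPIES 15u /* the text allows up to 15 concurrent instances */

/* Boyer-Moore majority vote over the per-CPU results. Returns 1 and
   writes the agreed value to *winner if a strict majority exists,
   otherwise returns 0. */
int majority_result(const uint32_t *results, size_t n, uint32_t *winner)
{
    assert(results != NULL && winner != NULL);
    uint32_t candidate = 0u;
    size_t count = 0u;
    for (size_t i = 0u; i < n; i++) {
        if (count == 0u) {
            candidate = results[i];
            count = 1u;
        } else if (results[i] == candidate) {
            count++;
        } else {
            count--;
        }
    }
    /* Second pass confirms the candidate really has a strict majority. */
    size_t votes = 0u;
    for (size_t i = 0u; i < n; i++) {
        if (results[i] == candidate) {
            votes++;
        }
    }
    if (votes * 2u > n) {
        *winner = candidate;
        return 1;
    }
    return 0;
}

/* Returns a bit mask with bit i set when copy i disagrees with the
   majority value, marking that copy's CPU as suspect. */
uint16_t suspect_copies(const uint32_t *results, size_t n, uint32_t winner)
{
    assert(results != NULL && n <= MAX_COPIES);
    uint16_t mask = 0u;
    for (size_t i = 0u; i < n; i++) {
        if (results[i] != winner) {
            mask |= (uint16_t)(1u << i);
        }
    }
    return mask;
}
```

With three copies, one faulty CPU can be out-voted and identified; with two copies a disagreement can be detected but not attributed.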

For memory problems:

  • A memory scrubber task can be scheduled to start periodically to ensure single bit errors do not accumulate (assuming EDAC).

For peripheral hardware:

  • The minimum time between task start requests can warn that a peripheral is malfunctioning.
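
As a rough illustration of how a minimum time between start requests can act as a warning, the sketch below tracks the smallest interval seen and reports when a request arrives sooner than expected. The structure, the tick-based timestamps and the function names are assumptions made for the example, not OCEOSmp data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task start request statistics. */
typedef struct {
    uint32_t last_request;  /* time of the previous start request, in ticks */
    uint32_t min_interval;  /* smallest interval observed so far */
    uint8_t  seen_first;    /* 0 until the first request has been recorded */
} start_stats_t;

void start_stats_init(start_stats_t *s)
{
    assert(s != NULL);
    s->last_request = 0u;
    s->min_interval = UINT32_MAX;
    s->seen_first = 0u;
}

/* Records a start request at time `now`. Returns 1 if it arrived less
   than `expected_min` ticks after the previous request, which may point
   to a malfunctioning peripheral; otherwise returns 0. */
int start_stats_record(start_stats_t *s, uint32_t now, uint32_t expected_min)
{
    assert(s != NULL);
    int too_soon = 0;
    if (s->seen_first != 0u) {
        /* Unsigned subtraction stays correct across timer wrap-around. */
        uint32_t interval = now - s->last_request;
        if (interval < s->min_interval) {
            s->min_interval = interval;
        }
        too_soon = (interval < expected_min) ? 1 : 0;
    }
    s->last_request = now;
    s->seen_first = 1u;
    return too_soon;
}
```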

No warning

Software errors can arise suddenly due to unforeseen circumstances. The phrase ‘a failure mode we did not think of’ is not unknown among software developers. The resulting error typically occurs without warning. Hardware errors can also occur without warning due to radiation or other factors.

Some protection against hardware CPU failures can be obtained by exploiting redundancy and running the same task on multiple CPUs, and OCEOSmp supports this.

Protection against memory errors is usually provided by EDAC. The application is expected to put an appropriate trap handler in place to handle uncorrectable errors signalled by the EDAC system. The timed actions feature of OCEOSmp simplifies restarting a memory scrubber task at set time intervals, and allows the time interval to be changed at run time depending on the error rate. Memory protection may also be provided by memory protection units, which restrict defined memory address ranges to execution only, or to reading but not writing. The fixed address ranges used by OCEOSmp simplify configuring these units.

OCEOSmp provides further protection against software and hardware errors by protecting data areas with fixed sentinels and automatically checking these, and checking other factors for consistency, whenever an OCEOSmp directive is used. Similar checks can also be done by the application software, which can also check the fixed area checksum (OCEOSmp does not do this automatically, for reasons of efficiency). As the fixed data area does not change once scheduling begins, it can be created in advance and stored in ROM for use if scheduling has to be restarted.
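
The sentinel and checksum checks can be sketched as follows, assuming the layout described under Implementation (a sentinel at each end and the size immediately after the first sentinel). The sentinel value, the additive checksum and all function names are assumptions for illustration; the actual values and checksum algorithm used by OCEOSmp are not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AREA_SENTINEL 0xA5A5C3C3u /* hypothetical sentinel value */

/* Assumed layout: area[0] = sentinel, area[1] = total size in words,
   area[2 .. size-2] = payload, area[size-1] = sentinel. */
void area_init(uint32_t *area, uint32_t words)
{
    assert(area != NULL && words >= 3u);
    area[0] = AREA_SENTINEL;
    area[1] = words;
    for (uint32_t i = 2u; i < words - 1u; i++) {
        area[i] = 0u;
    }
    area[words - 1u] = AREA_SENTINEL;
}

/* Returns 1 if both sentinels are intact and the stored size is
   plausible for a buffer of at most max_words words, otherwise 0. */
int area_sentinels_ok(const uint32_t *area, uint32_t max_words)
{
    assert(area != NULL);
    if (area[0] != AREA_SENTINEL) {
        return 0;
    }
    uint32_t words = area[1];
    if (words < 3u || words > max_words) {
        return 0;
    }
    return (area[words - 1u] == AREA_SENTINEL) ? 1 : 0;
}

/* Simple additive checksum over the payload words (illustrative only).
   Call only after area_sentinels_ok() has validated the stored size. */
uint32_t area_checksum(const uint32_t *area)
{
    assert(area != NULL);
    uint32_t sum = 0u;
    for (uint32_t i = 2u; i < area[1] - 1u; i++) {
        sum += area[i];
    }
    return sum;
}
```

Checking the stored size before using it to locate the trailing sentinel avoids reading outside the buffer when the size word itself has been corrupted.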

Fault isolation

When OCEOSmp detects a fault it can isolate it in a number of ways.

For software problems it can:

  • terminate the current execution instance of a task
  • disable a task so that it will not be put into execution again
  • terminate all current execution instances of the task

If a CPU is found to be faulty:

  • OCEOSmp can take the CPU out of use by OCEOSmp
  • OCEOSmp can put the CPU in sleep mode
  • OCEOSmp can terminate all jobs pending or in execution on that CPU

The response usually involves:

  • Updating the system state flags
  • Updating the system log
  • Calling the application defined problem handling function, if the action mask matches the system state flags

In extreme cases, where corruption of its internal structure is detected, OCEOSmp will return to the main application with an appropriate status code. The system log, task information, and all other data are still available for analysis, and the main application code may restart OCEOSmp.
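
The interaction between the state flags, the backup and the action mask can be sketched as below. The flag values, structure and function names are hypothetical and chosen only for the example; they are not the OCEOSmp definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the real state flags are not given in the text. */
#define FLAG_DEADLINE_MISSED 0x01u
#define FLAG_LOG_NEARLY_FULL 0x02u

typedef void (*problem_handler_t)(uint32_t state);

typedef struct {
    uint32_t state;            /* cleared on restart */
    uint32_t state_backup;     /* kept across restarts */
    uint32_t action_mask;      /* flags that trigger the handler */
    problem_handler_t handler; /* application defined, may be NULL */
} sys_status_t;

/* Sets the flag in the state variable and its backup. Calls the handler
   with the current state if the flag is selected by the action mask.
   Returns 1 if the handler was called, otherwise 0. */
int report_problem(sys_status_t *st, uint32_t flag)
{
    assert(st != NULL);
    st->state |= flag;
    st->state_backup |= flag;
    if ((flag & st->action_mask) != 0u && st->handler != NULL) {
        st->handler(st->state);
        return 1;
    }
    return 0;
}

/* Models a restart: the state is cleared, the backup is left untouched. */
void status_restart(sys_status_t *st)
{
    assert(st != NULL);
    st->state = 0u;
}

/* Example handler that only remembers the state it was called with. */
uint32_t last_handled_state = 0u;

void record_problem(uint32_t state)
{
    last_handled_state = state;
}
```

Keeping the mask separate from the state lets the application decide which problems warrant an immediate callback and which are only recorded for later inspection.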

Fault reporting

A system state variable contains flags that indicate what problems have been detected. A backup of this variable that is not reset on start-up allows a record of past events to be kept across start cycles. An action mask variable allows an application defined function to be called to handle certain problems.

The system log is automatically updated when OCEOSmp detects certain problems, and may be updated by the application if it detects problems. A range of system log codes is reserved for OCEOSmp; the others are available for application use. The system log size is set in the initial configuration. The log is structured as a circular buffer, and an optional function can be defined that is called when the system log becomes ¾ full.

A context switch log can be enabled and its size set in the initial configuration. Structured as a circular buffer, it can be read by an external debugger such as DMON, or its content can be output by a task.
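
A circular log with a three-quarters-full callback can be sketched as follows. The entry layout, the fixed capacity and the names are assumptions for the example; the sketch overwrites the oldest entry once full, which may or may not match OCEOSmp's behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAPACITY 8u /* hypothetical size; set by configuration in practice */

typedef struct {
    uint32_t code; /* log code */
    uint32_t info; /* extra information for the entry */
} log_entry_t;

typedef void (*log_warn_fn)(void);

typedef struct {
    log_entry_t entries[LOG_CAPACITY];
    uint32_t head;              /* index of the next entry to write */
    uint32_t count;             /* entries stored, at most LOG_CAPACITY */
    log_warn_fn on_nearly_full; /* optional, may be NULL */
} sys_log_t;

/* Appends an entry, overwriting the oldest one once the buffer is full.
   The callback fires once, when the count first reaches 3/4 of capacity. */
void log_add(sys_log_t *lg, uint32_t code, uint32_t info)
{
    assert(lg != NULL);
    lg->entries[lg->head] = (log_entry_t){ code, info };
    lg->head = (lg->head + 1u) % LOG_CAPACITY;
    if (lg->count < LOG_CAPACITY) {
        lg->count++;
        if (lg->count == (3u * LOG_CAPACITY) / 4u && lg->on_nearly_full != NULL) {
            lg->on_nearly_full();
        }
    }
}

/* Copies the oldest stored entry to *out. Returns 0 if the log is empty. */
int log_oldest(const sys_log_t *lg, log_entry_t *out)
{
    assert(lg != NULL && out != NULL);
    if (lg->count == 0u) {
        return 0;
    }
    uint32_t idx = (lg->head + LOG_CAPACITY - lg->count) % LOG_CAPACITY;
    *out = lg->entries[idx];
    return 1;
}

/* Example callback that counts how often the warning fired. */
int log_warnings = 0;

void count_log_warning(void)
{
    log_warnings++;
}
```

A callback at three quarters full gives the application time to output or save the log before older entries start to be overwritten.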

Fault recovery

As mentioned above under ‘Fault isolation’, OCEOSmp can enable and disable tasks, terminate execution instances of tasks, take a CPU out of use by OCEOSmp, and put a CPU into sleep mode.

In most cases these actions will be taken by an application defined problem handling function. This is automatically called when a change in the system state corresponds to one of the action flags set by the application. It can then access the system log, the system state variables and other information, and use OCEOSmp directives to take the appropriate actions.

When an OCEOSmp task is created, a termination function for the task is usually defined. This allows tasks that have already started to be terminated in an orderly way if the CPU on which they are executing is found to be faulty.

An OCEOSmp directive causes an exit from OCEOSmp and a return, with an appropriate status code, to the main application code that started OCEOSmp. This directive is used automatically by OCEOSmp if it detects a fatal error, and can also be used by the application code. As well as the returned status code, the main application code can readily access the system state variable, system logs, and all task and other information, and can use an OCEOSmp directive to restart scheduling.
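
The start, exit and restart cycle seen from the main application can be sketched as a simple loop. The start routine is passed in as a function pointer precisely because the real directive names and status codes are not given in this text; `run_with_restarts`, `STATUS_NORMAL_EXIT` and `fake_start` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STATUS_NORMAL_EXIT 0 /* hypothetical status code */

/* Stands in for the directive that starts scheduling; it returns only
   when scheduling has exited, with a status code. */
typedef int (*scheduler_start_fn)(void);

/* Starts scheduling and restarts it after an abnormal exit, up to
   max_restarts times. Writes the final status to *last_status and
   returns the number of starts performed. */
int run_with_restarts(scheduler_start_fn start, int max_restarts, int *last_status)
{
    assert(start != NULL && last_status != NULL);
    int starts = 0;
    int status;
    do {
        status = start();
        starts++;
        /* Here the application would inspect the system state variable
           and logs before deciding whether a restart is appropriate. */
    } while (status != STATUS_NORMAL_EXIT && starts <= max_restarts);
    *last_status = status;
    return starts;
}

/* Test double: exits abnormally twice, then exits normally. */
int fake_start_calls = 0;

int fake_start(void)
{
    fake_start_calls++;
    return (fake_start_calls < 3) ? 7 : STATUS_NORMAL_EXIT;
}
```

Bounding the number of restarts keeps a persistent fault from turning into an endless restart loop.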

Implementation

The core aspects of OCEOSmp are written in C. This includes the code for initialising OCEOSmp data structures, for scheduling, and for mutexes, read-write mutexes, counting semaphores, data queues and timed actions. CPU specific code is written in C and in assembly language for that CPU. This includes spin-lock mechanisms and interrupt control.

All OCEOSmp code is re-entrant and can be run in parallel on multiple CPUs. Spin locks and interrupt disabling are used to allow updating of shared data. Interrupt-disabled times are kept to a minimum.

Use of the stack resource policy means that only a single system stack is needed for each CPU rather than a stack for each task. Context switch operations are basically function calls and are done quickly.

A central priority queue of CPUs is used, where CPU priority is based on having sufficient remaining stack space to allow pre-emption and on the pre-emption threshold of the currently executing task. If the highest priority pending task has a higher priority than the threshold and the CPU has stack space, then context switch processing is started on that CPU and its current task is pre-empted. CPUs can be reserved for use only by tasks above a certain priority.

Behaviour is deterministic. The highest priority waiting task will always be run if sufficient stack space has been allocated.

OCEOSmp uses a log area, a fixed data area, a dynamic data area and a stack area. The stack area is divided equally between the CPUs and initialised with a filler, allowing the maximum stack use by each CPU to be readily determined. The other areas are typically set up as static arrays of 32-bit words with sentinels at each end and a size immediately after the first sentinel. The fixed data area contains pointers to the main components of all areas to simplify access, and a checksum set when initialisation is complete.

The MISRA C standard is automatically applied to all C code and any exceptions are justified.
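
The CPU selection rule described above can be illustrated with a linear scan in place of the central priority queue. This is a sketch under stated assumptions: a larger number means a higher priority, an idle CPU has threshold 0, and among eligible CPUs the one with the lowest threshold is preferred. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU scheduling state. */
typedef struct {
    uint8_t  threshold;        /* pre-emption threshold of the running task, 0 if idle */
    uint32_t free_stack_words; /* stack space still available on this CPU */
    uint8_t  min_priority;     /* CPU reserved for tasks at or above this priority */
} cpu_state_t;

/* Returns the index of the CPU on which the pending task should pre-empt,
   or -1 if no CPU is eligible. A CPU is eligible when the task priority
   exceeds its current threshold, it has enough free stack, and the task
   meets its reservation level. */
int select_cpu(const cpu_state_t *cpus, size_t n,
               uint8_t task_priority, uint32_t task_stack_words)
{
    assert(cpus != NULL);
    int best = -1;
    for (size_t i = 0u; i < n; i++) {
        const cpu_state_t *c = &cpus[i];
        if (task_priority <= c->threshold) {
            continue; /* running task may not be pre-empted */
        }
        if (task_stack_words > c->free_stack_words) {
            continue; /* not enough stack left for pre-emption */
        }
        if (task_priority < c->min_priority) {
            continue; /* CPU reserved for higher priority work */
        }
        if (best < 0 || c->threshold < cpus[best].threshold) {
            best = (int)i;
        }
    }
    return best;
}
```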
The Eclipse IDE is used for development in conjunction with a Subversion code repository. OCE’s DMON tool and other tools are used for debugging.
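
Determining maximum stack use from a filler pattern, as described for the stack area, can be sketched as follows. The filler value, the function names and the assumption of a stack that grows downwards from the highest index are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILLER 0xDEADBEEFu /* hypothetical filler value */

/* Fills the whole stack region with the filler before scheduling starts. */
void stack_fill(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0u; i < words; i++) {
        stack[i] = STACK_FILLER;
    }
}

/* Assumes a descending stack that starts at the highest index. Words from
   index 0 upwards that still hold the filler have never been used, so the
   high-water mark is the region size minus that untouched run. */
size_t stack_max_used_words(const uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t untouched = 0u;
    while (untouched < words && stack[untouched] == STACK_FILLER) {
        untouched++;
    }
    return words - untouched;
}
```

The measurement is conservative in one direction only: a stored value that happens to equal the filler at the deepest point of use would make the reported maximum slightly too small.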

Efficiency

The use of the stack resource policy greatly simplifies context switching, allowing it to be treated as essentially a function call to the start function of a task. Context switch times are therefore expected to be significantly shorter than in RTOSs that save and restore a separate stack for each task.
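
The idea of a context switch as a plain function call can be sketched with a single-CPU dispatcher that runs pending jobs on the caller's stack. The task structure and names are hypothetical, a larger number is assumed to mean a higher priority, and the real scheduler also handles interrupts, thresholds and multiple CPUs, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*task_fn)(void *arg);

/* Hypothetical task record. */
typedef struct {
    task_fn start;    /* start function of the task */
    void   *arg;      /* argument passed to the start function */
    uint8_t priority; /* larger value means higher priority */
    uint8_t pending;  /* number of execution instances waiting to run */
} task_t;

/* Runs every pending job whose priority is above `ceiling`, highest
   priority first. Each job is started by an ordinary function call on
   the current stack and runs to completion. Returns the number of jobs run. */
unsigned dispatch_pending(task_t *tasks, size_t n, uint8_t ceiling)
{
    assert(tasks != NULL);
    unsigned run = 0u;
    for (;;) {
        task_t *best = NULL;
        for (size_t i = 0u; i < n; i++) {
            if (tasks[i].pending > 0u && tasks[i].priority > ceiling &&
                (best == NULL || tasks[i].priority > best->priority)) {
                best = &tasks[i];
            }
        }
        if (best == NULL) {
            return run;
        }
        best->pending--;
        best->start(best->arg); /* the "context switch" is just this call */
        run++;
    }
}

/* Example task body that records the order in which jobs ran. */
char run_order[8];
size_t run_order_len = 0u;

void record_char(void *arg)
{
    if (run_order_len < sizeof run_order) {
        run_order[run_order_len] = *(const char *)arg;
        run_order_len++;
    }
}
```

Because a job never resumes below a job that pre-empted it, the stack unwinds in strict last-in, first-out order and no per-task register save area is needed beyond the normal call frame.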

Compactness

In a traditional RTOS each task must be allocated its own stack, and this must be done relatively generously. With many tasks, considerable RAM can be required. With just one stack required per CPU, the stack requirements in OCEOSmp are much lower, perhaps making it possible to reduce the number of memory chips in a system. OCEOSmp does not use virtual addressing, allowing further savings in memory as page tables are not required. The code for OCEOSmp core components is small, about 20 KiB.
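
The stack saving can be illustrated with a small calculation for one CPU. It assumes that jobs run to completion and that a job is only pre-empted by a job of strictly higher priority, so at most one job per priority level is on the shared stack at any time. Interrupt frames and operating system overhead are ignored, and the task figures used below are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a task's worst-case stack need. */
typedef struct {
    uint8_t  priority;
    uint32_t stack_words;
} task_need_t;

/* One stack per task: the total is the sum of all worst-case needs. */
uint32_t per_task_stack_total(const task_need_t *t, size_t n)
{
    assert(t != NULL);
    uint32_t total = 0u;
    for (size_t i = 0u; i < n; i++) {
        total += t[i].stack_words;
    }
    return total;
}

/* One shared stack: only the largest need at each priority level can
   contribute, because tasks at the same level never pre-empt each other. */
uint32_t shared_stack_bound(const task_need_t *t, size_t n)
{
    assert(t != NULL);
    uint32_t total = 0u;
    for (unsigned level = 0u; level <= 255u; level++) {
        uint32_t worst = 0u;
        for (size_t i = 0u; i < n; i++) {
            if (t[i].priority == level && t[i].stack_words > worst) {
                worst = t[i].stack_words;
            }
        }
        total += worst;
    }
    return total;
}
```

For five tasks needing 100, 60, 80, 40 and 120 words at priorities 1, 1, 2, 3 and 3, separate stacks need 400 words while the shared stack bound is 100 + 80 + 120 = 300 words. The saving grows as more tasks share a priority level.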