OCEOSmp/introduction

Revision as of 13:05, 13 January 2023 by Bkavanagh (talk | contribs)

OCEOSmp is designed for high-reliability embedded systems. It does not start on power-up; instead it is started by the main application code, to which it may return if a fatal error is detected. It is designed to be transparent, with information stored at statically allocated locations that are accessible both while OCEOSmp is running and after it exits. In addition to deterministic scheduling, the qualities most important in designing OCEOSmp were (i) robustness, (ii) efficiency, and (iii) compactness.

[Image: Wikioceosmp.png]

Robustness

Robustness is a primary objective in the design of OCEOSmp. For a system design to be robust, it must be possible to identify the range of problems it should deal with and how it deals with them. Such problems fall into three main categories: those excluded by the design, those that give advance warning that something is going wrong, and those that can occur without warning.

Excluded by design

The design of OCEOSmp is based on the stack resource policy (https://www.math.unipd.it/~tullio/RTS/2009/Baker-1991.pdf). This has a number of advantages and excludes a number of potential problems.

  • Unbounded priority inversion cannot occur in OCEOSmp.
  • Deadlocks cannot occur if there is only one CPU.
  • (With more than one CPU, deadlocks cannot occur if tasks always follow the same order in acquiring mutexes. OCEOSmp detects when this is not the case and automatically makes a system log entry, updates the state variable, and optionally calls a user defined problem handling function.)
  • When a task is created, an allowance of 1 to 15 execution instances is specified. This can be set so that unexpected task start requests do not cause a problem. (It also allows the work of a task to be split and done concurrently on multiple CPUs, or the same processing to be done on multiple CPUs and the results compared to check CPU behaviour.)
  • With OCEOSmp, no CPU in a multi-core system is a single point of failure once scheduling has begun; use of CPUs is then symmetric.
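
The mutex ordering rule above can be illustrated with a small sketch. It assumes each mutex is given a numeric rank and that a job must acquire mutexes in strictly increasing rank and release them in reverse order. The names `held_mutexes_t`, `order_check_acquire` and `order_check_release`, and the limit `MAX_HELD`, are hypothetical and are not OCEOSmp directives; how OCEOSmp itself detects an order violation is not described here.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_HELD 8u  /* hypothetical limit on nested mutexes per job */

/* Hypothetical per-job record of the ranks of the mutexes currently held. */
typedef struct {
    uint8_t ranks[MAX_HELD];
    uint8_t count;
} held_mutexes_t;

/* Returns true and records the mutex if acquiring it keeps the ranks
 * strictly increasing. Returns false on an order violation, or if the
 * nesting limit would be exceeded; nothing is recorded in that case. */
bool order_check_acquire(held_mutexes_t *h, uint8_t rank)
{
    if (h->count >= MAX_HELD) {
        return false;
    }
    if (h->count > 0u && rank <= h->ranks[h->count - 1u]) {
        return false;
    }
    h->ranks[h->count] = rank;
    h->count++;
    return true;
}

/* Forgets the most recently acquired mutex (LIFO release is assumed). */
void order_check_release(held_mutexes_t *h)
{
    if (h->count > 0u) {
        h->count--;
    }
}
```

In a real system a `false` result would be the point at which a log entry is made and the state variable updated, as the bullet above describes.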

Advance warning – problem anticipation

Some problems give advance warning that things are going wrong, allowing preventive action to be taken before the problem becomes critical. All software design involves assumptions. Run-time checks of how close these assumptions are to being violated allow preventive action, or at least notification of what is happening, before a critical situation develops.

OCEOSmp is designed to make such advance warnings easy to notice. All memory is statically allocated, and information on all aspects of operation is held at predetermined locations that can be checked at any time.

For software this information includes:

For each CPU:

  • Maximum stack usage (OCEOSmp does not require a separate stack for each task)
  • Currently executing and pending tasks

For each task:

  • Minimum time between task start requests.
  • Maximum time to completion.
  • Maximum number of concurrent execution instances (jobs).
  • Whether deadline (if specified) has been missed

For the system:

  • System log
  • Context switch log (if enabled)
  • System state variable (reset on restart)
  • System state variable backup (not reset on restart)
  • System state variable action mask (selects automatic call of application defined problem handler)

Hardware problem warnings include:

For CPU problems:

  • OCEOSmp allows simultaneous running of the same task on up to 15 different CPUs, with results stored on a per CPU basis and available for comparison.
  • This can provide protection against CPU malfunction in carrying out critical tasks.
  • It also allows background checks that can warn that a CPU has developed a problem.
  • If a CPU is found to be faulty it can be taken out of use and the termination functions of its currently executing and pending tasks called from another CPU.
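
As an illustration of comparing per-CPU results, the sketch below takes the results produced by the same task on several CPUs and looks for a value held by a strict majority. It is application-level code under stated assumptions: results are single 32-bit words, at most 15 CPUs take part (the limit given above), and the function name `vote_results` is hypothetical rather than an OCEOSmp directive.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Looks for a value held by a strict majority of the n per-CPU results
 * (Boyer-Moore majority vote followed by a verification pass).
 * On success, stores that value in *agreed, sets bit i of *suspects for
 * every CPU i whose result differs, and returns true.
 * Returns false if no strict majority exists; *agreed is then left
 * unchanged and *suspects is not meaningful. Assumes n <= 32. */
bool vote_results(const uint32_t *results, size_t n,
                  uint32_t *agreed, uint32_t *suspects)
{
    uint32_t candidate = 0u;
    size_t count = 0u;
    size_t i;

    /* Pass 1: find the only value that could be a majority. */
    for (i = 0u; i < n; i++) {
        if (count == 0u) {
            candidate = results[i];
            count = 1u;
        } else if (results[i] == candidate) {
            count++;
        } else {
            count--;
        }
    }

    /* Pass 2: confirm the majority and mark the CPUs that disagree. */
    count = 0u;
    *suspects = 0u;
    for (i = 0u; i < n; i++) {
        if (results[i] == candidate) {
            count++;
        } else {
            *suspects |= (uint32_t)1u << i;
        }
    }

    if (count * 2u <= n) {
        return false;
    }
    *agreed = candidate;
    return true;
}
```

A CPU that repeatedly appears in the suspects mask would be a candidate for being taken out of use.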

For memory problems:

  • A memory scrubber task can be scheduled to start periodically to ensure single bit errors do not accumulate (assuming EDAC).

For peripheral hardware:

  • The minimum time between task start requests can warn that a peripheral is malfunctioning.
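
A minimal sketch of how a minimum time between start requests can be tracked and used as a warning follows. It assumes a free-running 32-bit tick counter; the type `start_monitor_t` and the functions `monitor_init` and `monitor_request` are hypothetical names, not part of OCEOSmp, which records this information itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of start-request timing for one task. */
typedef struct {
    uint32_t last_request;  /* tick count of the previous start request */
    uint32_t min_interval;  /* smallest interval seen so far, in ticks */
    bool     seen_first;    /* false until the first request is recorded */
} start_monitor_t;

void monitor_init(start_monitor_t *m)
{
    m->last_request = 0u;
    m->min_interval = UINT32_MAX;
    m->seen_first = false;
}

/* Records a start request at tick `now`. Returns true if the smallest
 * interval observed so far is shorter than `expected_min`, for example
 * because a peripheral is raising interrupts faster than it should. */
bool monitor_request(start_monitor_t *m, uint32_t now, uint32_t expected_min)
{
    if (m->seen_first) {
        /* Unsigned subtraction stays correct across counter wrap-around. */
        uint32_t interval = now - m->last_request;
        if (interval < m->min_interval) {
            m->min_interval = interval;
        }
    }
    m->seen_first = true;
    m->last_request = now;
    return m->min_interval < expected_min;
}
```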

No warning

Software errors can arise suddenly due to unforeseen circumstances. The phrase ‘a failure mode we did not think of’ is not unknown among software developers. The resulting error typically occurs without warning. Hardware errors can also occur without warning due to radiation or other factors.

Some protection against CPU hardware failures can be obtained by exploiting redundancy and running the same task on multiple CPUs, and OCEOSmp supports this.

Protection against memory errors is usually provided by EDAC. The application is expected to put in place an appropriate trap handler for uncorrectable errors signalled by the EDAC system. The timed actions feature of OCEOSmp simplifies restarting a memory scrubber task at set time intervals, and allows the interval to be changed at run time depending on the error rate. Memory protection may also be provided by memory protection units, which restrict defined address ranges to execution only, or to reading but not writing. The fixed address ranges used by OCEOSmp simplify configuring these units.

OCEOSmp provides further protection against software and hardware errors by protecting data areas with fixed sentinels and automatically checking these, along with other consistency checks, whenever an OCEOSmp directive is used. Similar checks can also be done by the application software, which can additionally check the fixed data area checksum (OCEOSmp does not do this automatically, for reasons of efficiency). As the fixed data area does not change once scheduling begins, it can be created in advance and stored in ROM for use if scheduling has to be restarted.
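
The kind of sentinel and checksum check an application might run can be sketched as follows. The layout assumed is the one described under Implementation: an array of 32-bit words with a sentinel at each end and the size immediately after the first sentinel. The sentinel value, the function names, and the use of a simple XOR as the checksum are all assumptions made for illustration; the actual values and checksum algorithm used by OCEOSmp are not given here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AREA_SENTINEL 0xA5A5A5A5u  /* hypothetical sentinel value */

/* Assumed layout: word 0 is a sentinel, word 1 holds the size of the
 * whole area in words, and the last word is a second sentinel. */
bool area_sentinels_ok(const uint32_t *area, size_t expected_words)
{
    if (expected_words < 3u) {
        return false;
    }
    return (area[0] == AREA_SENTINEL)
        && (area[1] == (uint32_t)expected_words)
        && (area[expected_words - 1u] == AREA_SENTINEL);
}

/* XOR of all words: a stand-in for the unspecified checksum. */
uint32_t area_checksum(const uint32_t *area, size_t words)
{
    uint32_t sum = 0u;
    size_t i;

    for (i = 0u; i < words; i++) {
        sum ^= area[i];
    }
    return sum;
}
```

An application could store the checksum of the fixed data area once initialisation is complete and compare it periodically from a low priority task.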

Fault isolation

When OCEOSmp detects a fault it can isolate it in a number of ways. For software problems it can

  • terminate the current execution instance of a task
  • disable a task so that it will not be put into execution again
  • terminate all current execution instances of the task

If a CPU is found to be faulty

  • OCEOSmp can take the CPU out of use by OCEOSmp
  • OCEOSmp can put the CPU in sleep mode
  • OCEOSmp can terminate all jobs pending or in execution on that CPU

The response usually involves

  • Updating the system state flags
  • Updating the system log
  • If the action mask matches the system state flags, calling the application defined problem handling function
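
The response steps above can be sketched as follows, leaving out the system log update. This is a sketch under assumptions, not the OCEOSmp implementation: the structure `system_status_t`, the function `report_problem` and the example handler are hypothetical names, and the state is assumed to be a 32-bit word of flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Application defined problem handler; receives the current state flags. */
typedef void (*problem_handler_t)(uint32_t state);

/* Hypothetical grouping of the status variables described in the text. */
typedef struct {
    uint32_t state;             /* reset on restart */
    uint32_t state_backup;      /* not reset on restart */
    uint32_t action_mask;       /* flags that trigger the handler */
    problem_handler_t handler;  /* may be NULL if no handler is defined */
} system_status_t;

/* Example handler that only counts how often it has been called. */
static uint32_t handled_count = 0u;

static void example_handler(uint32_t state)
{
    (void)state;
    handled_count++;
}

/* Sets the flag in both state variables, then calls the handler if the
 * newly reported flag is selected by the action mask. */
void report_problem(system_status_t *s, uint32_t flag)
{
    s->state |= flag;
    s->state_backup |= flag;
    if (((flag & s->action_mask) != 0u) && (s->handler != NULL)) {
        s->handler(s->state);
    }
}
```

With an action mask of 0x2, reporting flag 0x1 only updates the state variables, while reporting flag 0x2 also calls the handler.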

In extreme cases where corruption of its internal structure is detected OCEOSmp will return to the main application with an appropriate system code. The system log, task information, and all other data are still available for analysis, and the main application code may restart OCEOSmp.

Fault reporting

A system state variable contains flags that indicate what problems have been detected. A backup of this variable that is not reset on start-up allows a record of past events to be kept across start cycles. An action mask variable allows an application defined function to be called to handle certain problems.

The system log is automatically updated when OCEOSmp detects certain problems, and may be updated by the application if it detects problems. A range of system log codes is reserved for OCEOSmp; others are available for application use. The system log size is set in the initial configuration. The log is structured as a circular buffer, and an optional function can be defined that is called when the system log becomes ¾ full.

A context switch log can be enabled and its size set in the initial configuration. Structured as a circular buffer, it can be read by an external debugger such as DMON, or its content output by a task.
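
A circular log with a ¾ full notification can be sketched as below. The entry format (one 32-bit code), the fixed size `LOG_ENTRIES`, and all names are assumptions for illustration; in OCEOSmp the size comes from the initial configuration and the real entry format is not described here.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_ENTRIES 8u  /* hypothetical size for this sketch */

/* Hypothetical log: one 32-bit code per entry. */
typedef struct {
    uint32_t entries[LOG_ENTRIES];
    uint32_t head;                     /* index of the next entry to write */
    uint32_t count;                    /* entries held, at most LOG_ENTRIES */
    void (*on_three_quarters)(void);   /* optional, may be NULL */
} sys_log_t;

/* Example notification function that only counts its calls. */
static uint32_t warning_calls = 0u;

static void example_log_warning(void)
{
    warning_calls++;
}

/* Adds an entry, overwriting the oldest one once the buffer is full.
 * The optional function is called once, on the write that brings the
 * log to three quarters full (rounded up). */
void log_add(sys_log_t *log, uint32_t code)
{
    log->entries[log->head] = code;
    log->head = (log->head + 1u) % LOG_ENTRIES;

    if (log->count < LOG_ENTRIES) {
        log->count++;
        if ((log->count == ((LOG_ENTRIES * 3u) + 3u) / 4u)
            && (log->on_three_quarters != NULL)) {
            log->on_three_quarters();
        }
    }
}
```

A fuller version would also reset `count` when the log is read out, so that the notification can fire again.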

Fault recovery

As mentioned above under ‘Fault isolation’, OCEOSmp can enable and disable tasks, terminate execution instances of tasks, take a CPU out of use by OCEOSmp, and put a CPU into sleep mode. In most cases these actions will be taken by an application defined problem handling function. This function is called automatically when a change in the system state corresponds to one of the action flags set by the application. It can then access the system log, the system state variables and other information, and use OCEOSmp directives to take the appropriate actions.

When an OCEOSmp task is created, a termination function for the task is usually defined. This allows tasks that have already started to be terminated in an orderly way if the CPU on which they are executing is found to be faulty.

An OCEOSmp directive causes an exit from OCEOSmp and a return, with an appropriate status code, to the main application code that started OCEOSmp. This directive is used automatically by OCEOSmp if it detects a fatal error, and can also be used by the application code. As well as the returned status code, the main application code can readily access the system state variable, system logs, and all task and other information, and can use an OCEOSmp directive to restart scheduling.

Implementation

The core of OCEOSmp is written in C. This includes the code for initialising OCEOSmp data structures, for scheduling, and for mutexes, read-write mutexes, counting semaphores, data queues and timed actions. CPU-specific code, including spin-lock mechanisms and interrupt control, is written in C and in assembly language for that CPU.

All OCEOSmp code is re-entrant and can be run in parallel on multiple CPUs. Spin locks and interrupt disabling are used to protect updates of shared data. The time for which interrupts are disabled is kept to a minimum.

Use of the stack resource policy means that only a single system stack is needed for each CPU rather than a stack for each task. Context switch operations are essentially function calls and are done quickly.

A central priority queue of CPUs is used, where CPU priority is based on having sufficient remaining stack space to allow pre-emption and on the pre-emption threshold of the currently executing task. If the highest priority pending task has a higher priority than the threshold and the CPU has stack space, context switch processing is started on that CPU and its current task is pre-empted. CPUs can be reserved for use only by tasks above a certain priority. Behaviour is deterministic. The highest priority waiting task will always be run if sufficient stack space has been allocated.

OCEOSmp uses a log area, a fixed data area, a dynamic data area and a stack area. The stack area is divided equally between the CPUs and initialised with a filler, allowing the maximum stack use by each CPU to be readily determined. The other areas are typically set up as static arrays of 32-bit words with sentinels at each end and a size immediately after the first sentinel. The fixed data area contains pointers to the main components of all areas, to simplify access, and a checksum that is set when initialisation is complete.

MISRA C checking is applied automatically to all C code, and any exceptions are justified. The Eclipse IDE is used for development in conjunction with a Subversion code repository, with OCE’s DMON tool and other tools used for debugging.
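
The filler technique for finding maximum stack use can be sketched as follows. It assumes a stack that grows downward from the highest address, so that untouched filler words remain at the low end of the area. The filler value and the function names are hypothetical, and the sketch works on an ordinary array rather than a real CPU stack.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILLER 0xDEADBEEFu  /* hypothetical filler value */

/* Writes the filler to every word of the stack area before use. */
void stack_fill(uint32_t *stack, size_t words)
{
    size_t i;

    for (i = 0u; i < words; i++) {
        stack[i] = STACK_FILLER;
    }
}

/* Returns the largest number of words that has ever been in use.
 * Assumes a downward growing stack: the scan starts at the low end
 * and stops at the first word that no longer holds the filler. */
size_t stack_max_used_words(const uint32_t *stack, size_t words)
{
    size_t untouched = 0u;

    while ((untouched < words) && (stack[untouched] == STACK_FILLER)) {
        untouched++;
    }
    return words - untouched;
}
```

The result can slightly understate real usage if the deepest word written happens to equal the filler value, which is one reason to choose an unlikely filler pattern.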

Efficiency

The use of the stack resource policy greatly simplifies context switching, allowing it to be treated as essentially a function call to the start function of a task. Context switches are expected to be significantly faster than in other RTOSs.
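
To show what treating a context switch as a function call can look like, the sketch below combines it with the pre-emption test described under Implementation: the pending task must have a priority above the current pre-emption threshold, and the CPU must have enough stack left. All types and names are hypothetical, a larger number is assumed to mean a higher priority, and the new threshold is simply set to the task's priority, which is a simplification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*task_start_fn)(void);

/* Hypothetical task description. */
typedef struct {
    uint8_t       priority;    /* larger number = higher priority (assumed) */
    size_t        stack_need;  /* worst-case stack use in bytes */
    task_start_fn start;       /* start function of the task */
} task_t;

/* Hypothetical per-CPU scheduling state. */
typedef struct {
    uint8_t threshold;   /* pre-emption threshold of the current task */
    size_t  stack_free;  /* bytes left on this CPU's single stack */
} cpu_state_t;

/* Example task body that only counts how often it has run. */
static uint32_t example_runs = 0u;

static void example_task(void)
{
    example_runs++;
}

/* Runs the pending task on top of the current one if it may pre-empt.
 * The switch is an ordinary call on the shared stack; when the task
 * returns, the pre-empted task continues where it left off. */
bool try_preempt(cpu_state_t *cpu, const task_t *pending)
{
    uint8_t saved_threshold;

    if (pending->priority <= cpu->threshold) {
        return false;  /* not above the current threshold */
    }
    if (pending->stack_need > cpu->stack_free) {
        return false;  /* not enough stack left on this CPU */
    }

    saved_threshold = cpu->threshold;
    cpu->threshold = pending->priority;
    cpu->stack_free -= pending->stack_need;

    pending->start();

    cpu->stack_free += pending->stack_need;
    cpu->threshold = saved_threshold;
    return true;
}
```

Because a job that starts always runs to completion before the job it pre-empted resumes, no per-task register set or stack has to be saved and restored beyond what a normal call already does.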

Compactness

In a traditional RTOS each task must be allocated its own stack, and this must be done relatively generously. With many tasks, considerable RAM can be required. With just one stack required per CPU, the stack requirements in OCEOSmp are much lower, perhaps making it possible to reduce the number of memory chips in a system. OCEOSmp does not use virtual addressing, allowing further savings in memory as page tables are not required. The code for the OCEOSmp core components is small, about 20 KiB.