Difference between revisions of "OCEOS/oceos fdir"
Revision as of 08:31, 2 February 2024
OCEOS Fault Detection, Isolation and Recovery (FDIR)
FDIR Introduction
Hardware and software faults can occur that affect the behavior of the system. To detect such faults and anomalies, as scheduling proceeds OCEOS automatically checks that its key data areas have not been corrupted and that uses of resources are within their prescribed bounds. Parameters passed to directives are checked for consistency with the values originally declared in the application configuration.
Once detected, OCEOS provides five levels of responses to faults and anomalies:
Level 1: The status code returned by a directive indicates that there was a problem.
The status code returned by a directive should always be checked.
This code identifies whether the directive succeeded and if not indicates the reason for the failure.
This is the only response made when the anomaly is an invalid parameter to a directive. It is combined with responses from other levels if a directive fails for some other reason.
Level 2: An appropriate log entry is added to the system log.
The log entry contains a code identifying the problem and a 32-bit timestamp.
This response occurs in addition to the Level 1 response if a directive fails due to an internal factor such as the maximum number of jobs for a task being already created or a data queue being full.
This response also occurs when a task misses its deadline or when any other unexpected behavior is detected. It is often accompanied by a Level 3 response.
Level 3: The system state variable is updated and a user defined problem handling function called.
The system state variable consists of 32 flags each indicating a particular problem type. It is reset by oceos_init(). A copy of this variable is also maintained. Both variables may be reset by the ASW. This response occurs also in most circumstances where the Level 2 response occurs.
Level 4: An ASW problem handling function is called by OCEOS.
A problem handling function can be identified in the application configuration structure passed to oceos_init() and, if present, may be called by OCEOS. This function can read the system log and system state variable, read the task timing and other information, enable and disable tasks, reset counting semaphores and data queues, and if necessary exit OCEOS. It can reset the system state variable, but it is recommended that the copy of the system state variable not be reset, so as to provide a longer term record of problems.
This response also occurs in most circumstances where the Level 3 response occurs.
Level 5: OCEOS exits and returns to the ASW with an appropriate status code.
The status code in effect provides a Level 1 response to the oceos_start() directive used by the ASW to start OCEOS. In normal circumstances OCEOS never returns to the ASW.
This response by OCEOS only occurs when it detects that one of its essential internal elements has been corrupted. Depending on the severity of this corruption a Level 2 response may also be provided if possible, and the system state variable may also be updated. The user defined problem handling function is not called.
If OCEOS exits, the ASW can inspect the system log and system status variable, and also the task timings and other information maintained by OCEOS. The ASW can decide what corrective action should be taken and may resume scheduling with a call to oceos_start(). The ASW can also inspect these items during scheduling, allowing it to check the state of the system at any time.
This approach provides a graduated response by OCEOS to problems. If an appropriate problem handling function is provided as part of the ASW, some problems can be isolated and recovered from by, for example, disabling tasks or clearing a data queue. The ASW may also check system information at any time to determine whether things are proceeding as expected.
Note
The ASW may set up a watchdog timer which will cause a system reset if a certain time elapses before it is reset. By using a lowest priority task to reset the watchdog, failure of higher priority tasks to complete within their expected time can be detected. In OCEOS the lowest priority task, if it is the only task at that priority, is allowed to run forever and may be used to reset the watchdog.
OCEOS provides watchdog timer functionality for the GR716 target only; it is not part of OCEOS for other targets.
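The watchdog pattern in the note above can be illustrated off-target with a small software model. This is only a sketch: `wdog_sim_t` and the `wdog_sim_*` functions are hypothetical names invented for illustration and are not part of OCEOS, and the model assumes a simple countdown watchdog where expiry stands in for the system reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a countdown watchdog (not OCEOS code). */
typedef struct {
    uint32_t timeout;    /* reload value in ticks            */
    uint32_t remaining;  /* ticks left before expiry         */
    bool     expired;    /* stands in for the system reset   */
} wdog_sim_t;

static void wdog_sim_init(wdog_sim_t *w, uint32_t timeout)
{
    w->timeout   = timeout;
    w->remaining = timeout;
    w->expired   = false;
}

/* What the lowest priority task does each time it gets the CPU. */
static void wdog_sim_kick(wdog_sim_t *w)
{
    w->remaining = w->timeout;
}

/* Advance time by one tick. */
static void wdog_sim_tick(wdog_sim_t *w)
{
    if (w->remaining > 0U) {
        w->remaining--;
    }
    if (w->remaining == 0U) {
        w->expired = true;
    }
}

/* Run for `ticks` ticks. Higher priority work occupies the CPU during
 * [busy_from, busy_to), so the lowest priority task cannot run then.
 * Returns true if the watchdog expired. */
static bool wdog_sim_run(wdog_sim_t *w, uint32_t ticks,
                         uint32_t busy_from, uint32_t busy_to)
{
    for (uint32_t t = 0U; t < ticks && !w->expired; t++) {
        bool higher_priority_busy = (t >= busy_from && t < busy_to);
        if (!higher_priority_busy) {
            wdog_sim_kick(w);   /* lowest priority task runs */
        }
        wdog_sim_tick(w);
    }
    return w->expired;
}
```

In this model a short burst of higher priority work is tolerated, while a burst that starves the lowest priority task for longer than the timeout causes expiry. This is the detection mechanism the note describes.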
The log area contains the System State Variable, the system log, and the optional context switch log.
The OCEOS system state variable contains flags that indicate a certain problem has occurred. It is automatically updated by OCEOS using ‘OR’ to avoid losing information. Typically a user defined function is called to deal with the problem.
Two sets of flags are kept as part of the system state variable. One accumulates indicators of all problems that have occurred, and typically is reset by the application only after a restart. The other indicates current problems, and typically is reset by the user defined problem handling function. For system state variable location please see HERE.
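The behaviour of the two flag sets can be sketched as follows. This is a simplified model and not the OCEOS data layout: `state_flags_t` and the `state_flags_*` functions are hypothetical names used only to show why OR-ing preserves earlier flags and why clearing the current set leaves the accumulated record intact.

```c
#include <stdint.h>

/* Hypothetical model of the two flag sets (not the OCEOS layout). */
typedef struct {
    uint32_t current;      /* problems not yet handled; cleared by the handler */
    uint32_t accumulated;  /* every problem seen; cleared only after a restart */
} state_flags_t;

/* Record a problem: OR keeps any flags that are already set. */
static void state_flags_report(state_flags_t *s, uint32_t flag)
{
    s->current     |= flag;
    s->accumulated |= flag;
}

/* Handler-side reset: only the current set is cleared. */
static void state_flags_clear_current(state_flags_t *s)
{
    s->current = 0U;
}
```

After two problems are reported, the current set is cleared, and a third problem is reported, `current` holds only the third flag while `accumulated` still holds all three.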
The application developer can provide a system status mask. OCEOS ANDs this mask with the system status variable and, if the result is not zero, calls the user defined error handling function.
/**
*
* SYSTEM STATUS VARIABLE FLAGS
*
* These are stored in the 32-bit system status variable
* and updated by OCEOS when a problem is detected.
* Resetting them is the responsibility of the application.
*/
#define STATUS_NORMAL 0U // no flag set
#define STATUS_MASK_NORMAL 0x0U // no error function can be called
#define STATUS_INVALID 0xffffffffU // system status invalid
/* Task related problems */
#define STATUS_DISABLED_TASK_START 0x1U // An attempt to start a disabled task
#define STATUS_TASK_JOB_LIMIT_OVER 0x2U // An attempt to execute a task when its jobs limit is already reached.
#define STATUS_JOB_OVER_TIME 0x4U // Job time from creation to completion exceeds allowed maximum for task
#define STATUS_JOB_INTERVAL_SHORT 0x8U // Time between job creations is less than the allowed minimum for task
#define STATUS_READYQ_FULL 0x10U // Ready queue unable to accept job as result of being full
#define STATUS_READYQ_NO_REMOVE 0x20U // Remove job from ready queue failed
/* Mutex related problems */
#define STATUS_MUTEX_ALREADY_HELD 0x40U // Mutex wait() when mutex already held
#define STATUS_MUTEX_NOT_HELD 0x80U // Mutex signal() when not already held
#define STATUS_MUTEX_NOT_RETURNED 0x100U // Mutex not returned before job terminates
#define STATUS_MUTEX_NOT_NESTED 0x200U // Use of multiple mutexes not nested
/* Counting semaphore and data queue related problems */
#define STATUS_SEMAPHORE_JOBS_FULL 0x400U // Attempt to add job to semaphore pending list when list full
#define STATUS_DATAQ_JOBS_FULL 0x800U // Attempt to add job to data queue pending list when list full
#define STATUS_DATAQ_FULL 0x1000U // Data queue write when queue already full
/* Timed actions related problems */
#define STATUS_TIMED_JOBS_FULL 0x2000U // Timed actions queue already full for timed task start
#define STATUS_TIMED_OUTPUT_FULL 0x4000U // Timed actions queue already full for timed output
#define STATUS_TIMED_ACTION_LATE 0x8000U // Timed action late
/* Default trap handler should have been replaced
* NOT USED AT THE MOMENT*/
#define STATUS_SYSTEM_ERROR 0x10000U // ERROR was handled by default trap handler
/* Flag to indicate problem with Stack Pointer
* NOT USED AT THE MOMENT*/
#define STATUS_SP_WARNING 0x20000U // SP not in expected range
/* Log system error */
#define STATUS_BAD_LOG 0x40000U // System log problem
/* Not using next few bits */
/* Will have to exit */
#define STATUS_BAD_SENTINEL 0x40000000U // Area sentinel corrupt
#define STATUS_BAD_META_PTR 0x80000000U // Null meta pointer
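A value read from the system status variable can be tested against these flags with simple bit operations. The sketch below copies a few flag values from the listing above; the helper names `status_has_flag` and `status_is_fatal` and the macro `STATUS_FATAL_FLAGS` are hypothetical and not part of OCEOS. It assumes that the flags listed under "Will have to exit" are the ones an application would treat as fatal, and that STATUS_INVALID must be checked before testing individual bits because it has every bit set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the flag listing above. */
#define STATUS_INVALID       0xffffffffU
#define STATUS_DATAQ_FULL    0x1000U
#define STATUS_BAD_SENTINEL  0x40000000U
#define STATUS_BAD_META_PTR  0x80000000U

/* Hypothetical grouping of the "Will have to exit" flags. */
#define STATUS_FATAL_FLAGS   (STATUS_BAD_SENTINEL | STATUS_BAD_META_PTR)

/* True when any bit of `flags` is set in a valid status value. */
static bool status_has_flag(uint32_t status, uint32_t flags)
{
    if (status == STATUS_INVALID) {
        return false;  /* read failed; the bits are not real flags */
    }
    return (status & flags) != 0U;
}

/* True when the status reports a problem that forces an exit. */
static bool status_is_fatal(uint32_t status)
{
    return status_has_flag(status, STATUS_FATAL_FLAGS);
}
```

On target, the `status` argument would typically be the value returned by oceos_system_state_get().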
FDIR Configuration
The user can provide a function that is called in case of a system ERROR.
For this function to be called, it is necessary to set the system_status_mask element in the struct application_configuration. This is normally set in config.c.
/**
 * User to implement in case of system error;
 * Comment it out if not used and set field system_error_function in app_config to NULL
 */
void oceos_on_error() {
    // Application ERROR Handling code
    return;
}
...
/*
 * Create the application configuration structure
 */
struct application_configuration app_config = {0};
app_config.system_error_function = &oceos_on_error; // NULL => ignore
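The interaction between system_status_mask and system_error_function can be modelled off-target as below. This is a sketch under stated assumptions: `struct app_config_sketch` is a stand-in containing only the two fields named in the text (the real struct application_configuration has more members), and `on_error_sketch`, `fdir_config_sketch` and `fdir_dispatch_sketch` are hypothetical names. In OCEOS the dispatch step is performed by the kernel, not by application code.

```c
#include <stddef.h>
#include <stdint.h>

#define STATUS_DATAQ_FULL 0x1000U  /* value from the flag listing */

/* Stand-in with only the two fields named in the text. */
struct app_config_sketch {
    uint32_t system_status_mask;
    void (*system_error_function)(void);
};

static unsigned int error_calls;  /* illustration only: counts handler calls */

/* Hypothetical error handler. */
static void on_error_sketch(void)
{
    error_calls++;  /* application recovery code would go here */
}

/* Fill in the FDIR-related fields: react only to a full data queue. */
static void fdir_config_sketch(struct app_config_sketch *cfg)
{
    cfg->system_status_mask    = STATUS_DATAQ_FULL;
    cfg->system_error_function = &on_error_sketch;
}

/* Model of the documented rule: call the handler when (status AND mask)
 * is non-zero and a handler has been configured. */
static void fdir_dispatch_sketch(const struct app_config_sketch *cfg,
                                 uint32_t status)
{
    if (cfg->system_error_function != NULL &&
        (status & cfg->system_status_mask) != 0U) {
        cfg->system_error_function();
    }
}
```

With a mask of zero, or with a status whose set flags are all outside the mask, the handler is not called in this model. That is why setting only system_error_function is not sufficient.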
API Functions
Directive | Description | main | task | IRQ handler |
---|---|---|---|---|
oceos_system_state_get() | Get the value of the system state variable | * | * | * |
oceos_system_state_set() | Set system state variable | * | * | * |
oceos_system_watchdog_init() | Initialize the watchdog | * | * | * |
oceos_system_watchdog_enable() | Enable the watchdog | * | * | * |
oceos_system_watchdog_disable() | Disable the watchdog | * | * | * |
oceos_system_watchdog_ticks_remaining() | Get the number of ticks to watchdog timeout | * | * | * |
oceos_system_watchdog_reset() | Reset the watchdog | * | * | * |
oceos_system_state_get()
Header File
system_log.h
Description
Returns the value of the system state variable.
Prototype
U32_t oceos_system_state_get(void);
Parameters
This function has no parameters.
Returns
This function returns U32_t.
U32_t | Description |
---|---|
STATUS_INVALID | Failed to read System State Status |
>= 0 | System State Status |
Example Usage
U32_t status;
status = oceos_system_state_get();
if (STATUS_INVALID == status) {
    // Handle ERROR
}
oceos_system_state_set()
Header File
system_log.h
Description
Sets the system state variable and saves the previous setting to the System State variable OLD. Setting it to zero resets the System State status.
Prototype
enum DIRECTIVE_STATUS oceos_system_state_set(
    U32_t new_state // new value of system state variable
);
Parameters
Parameter | Description |
---|---|
new_state | New value of system state variable |
Returns
This function returns enum DIRECTIVE_STATUS.
enum DIRECTIVE_STATUS | Description |
---|---|
INCORRECT_STATE | System Meta pointer is NULL or Log pointer is NULL |
SUCCESSFUL | All OK |
Example Usage
enum DIRECTIVE_STATUS status;
U32_t sys_state = 0;
status = oceos_system_state_set(sys_state);
if (SUCCESSFUL != status) {
    // Handle ERROR
}
oceos_system_watchdog_init()
Header File
FDIR.h
Description
FOR GR716 ONLY
This directive initializes the watchdog for GR716 at 0x80003070. The watchdog window reload value is in the control register. Large window sizes make resets more likely.
Prototype
enum DIRECTIVE_STATUS oceos_system_watchdog_init(
    BOOLE_t enable,    // enabled initially
    U32_t timeout,     // watchdog timeout
    U8_t window_size   // watchdog window
);
Parameters
Parameter | Description |
---|---|
enable | TRUE to enable watchdog at initialization |
timeout | Watchdog timeout |
window_size | Watchdog window size |
Returns
This function returns enum DIRECTIVE_STATUS.
enum DIRECTIVE_STATUS | Description |
---|---|
INVALID_NUMBER | Parameter check failed |
SUCCESSFUL | All OK |
Example Usage
enum DIRECTIVE_STATUS status;
status = oceos_system_watchdog_init(TRUE, 1000, 0xFF);
if (SUCCESSFUL != status) {
    // Handle ERROR
}
oceos_system_watchdog_enable()
Header File
FDIR.h
Description
FOR GR716 ONLY
This directive enables the watchdog for GR716 at 0x80003070, using the timeout and window values set up previously.
Prototype
enum DIRECTIVE_STATUS oceos_system_watchdog_enable();
Parameters
This function has no parameters.
Returns
This function returns enum DIRECTIVE_STATUS.
enum DIRECTIVE_STATUS | Description |
---|---|
SUCCESSFUL | All OK |
Example Usage
enum DIRECTIVE_STATUS status;
status = oceos_system_watchdog_enable();
oceos_system_watchdog_disable()
Header File
FDIR.h
Description
FOR GR716 ONLY
This directive disables the watchdog for GR716 at 0x80003070.
Prototype
enum DIRECTIVE_STATUS oceos_system_watchdog_disable();
Parameters
This function has no parameters.
Returns
This function returns enum DIRECTIVE_STATUS.
enum DIRECTIVE_STATUS | Description |
---|---|
SUCCESSFUL | All OK |
Example Usage
enum DIRECTIVE_STATUS status;
status = oceos_system_watchdog_disable();
oceos_system_watchdog_ticks_remaining()
Header File
FDIR.h
Description
FOR GR716 ONLY
This directive returns the number of ticks remaining in the watchdog timer for GR716 at 0x80003070.
Prototype
unsigned int oceos_system_watchdog_ticks_remaining();
Parameters
This function has no parameters.
Returns
This function returns unsigned int.
unsigned int | Description |
---|---|
>= 0 | Number of watchdog ticks |
Example Usage
unsigned int num_ticks;
num_ticks = oceos_system_watchdog_ticks_remaining();
oceos_system_watchdog_reset()
Header File
FDIR.h
Description
FOR GR716 ONLY
This directive resets the watchdog timer for GR716 at 0x80003070, allowing it to be enabled or disabled after reset.
Prototype
enum DIRECTIVE_STATUS oceos_system_watchdog_reset(
    BOOLE_t enable // whether to enable or disable after reset
);
Parameters
Parameter | Description |
---|---|
enable | Whether to enable or disable after reset |
Returns
This function returns enum DIRECTIVE_STATUS.
enum DIRECTIVE_STATUS | Description |
---|---|
SUCCESSFUL | All OK |
Example Usage
enum DIRECTIVE_STATUS status;
status = oceos_system_watchdog_reset(TRUE);