Autonomous systems are increasingly deployed in environments where safety is of fundamental importance. Accidents involving self-driving vehicles, for instance, demonstrate the need for safe operation of autonomous systems. These systems are required to make decisions under uncertainty while ensuring safety.

From a modeling perspective, the partially observable Markov decision process (POMDP) framework is general enough to model a variety of real-world sequential decision processes and can be used to analyze and design many learning-based autonomous systems. A POMDP is a generalization of a Markov decision process (MDP). A POMDP models an agent’s decision process, in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution (belief) over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.

Nonetheless, POMDPs are often computationally intractable to solve exactly, so computer scientists have developed methods that approximate their solutions. Several promising point-based methods based on finite abstractions of the belief space have been proposed in the literature. Yet, these techniques often do not provide any guarantee of safety or optimality. That is, it is not clear whether the computed probability of satisfying the safety/optimality requirement is an upper bound or a lower bound for the given POMDP. Establishing guaranteed performance is of fundamental importance in safety-critical applications, e.g., aircraft collision avoidance and Mars rover exploration.

I borrow notions from control theory to address some of these challenges in studying POMDP representations of AI and machine learning systems, without the need for finite abstraction.


Treating POMDPs as Hybrid Systems

As mentioned above, since the states are not directly accessible in a POMDP, decision making requires the history of observations. Therefore, we need to define the notion of a belief (or posterior) as a sufficient statistic for the history. Given a POMDP with finite states $Q$, finite actions $A$, and finite observations $Z$, the belief $b_t(q)$ denotes the probability of the system being in state $q$ at time $t$. The evolution of the beliefs over time can then be derived in closed form using Bayesian inference as

$$
b_t(q) = \frac{O(z_t \mid q, a_{t-1}) \sum_{q' \in Q} T(q \mid q', a_{t-1})\, b_{t-1}(q')}{\sum_{q \in Q} O(z_t \mid q, a_{t-1}) \sum_{q' \in Q} T(q \mid q', a_{t-1})\, b_{t-1}(q')},
$$

where $O(z \mid q, a)$ is the given observation probabilities (sensor model), $T(q \mid q', a)$ is the given transition probabilities, and the beliefs belong to the belief unit simplex

$$
\mathcal{B} = \Big\{ b \in \mathbb{R}^{|Q|} \;:\; b(q) \ge 0 \ \ \forall q \in Q, \ \ \sum_{q \in Q} b(q) = 1 \Big\}.
$$

A policy in a POMDP setting is then a mapping $\pi : \mathcal{B} \to A$, i.e., a mapping from the continuous belief space into the discrete and finite action space.

I represent the belief update equation as a discrete-time switched system, where the actions define the switching modes. Formally, the belief dynamics can be described as

$$
b_t = f_a(b_{t-1}, z_t), \quad a \in A,
$$

where $z_t \in Z$ are the observations representing inputs, and $t \in \mathbb{N}$ denotes the discrete time instances. This switched system representation allows me to use tools from the hybrid systems literature to analyze POMDPs.
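For concreteness, the following minimal Python sketch implements this switched-system view of the belief update. The two-state POMDP, its transition and observation matrices, and all variable names are hypothetical placeholders rather than a model from my work.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One step of the Bayesian belief update b_t = f_a(b_{t-1}, z_t).

    b : current belief over states, shape (n_states,)
    a : action index (the switching mode)
    z : observation index (the input to the switched system)
    T : transition probabilities, T[a, q, q'] = P(q' | q, a)
    O : observation probabilities, O[a, q', z] = P(z | q', a)
    """
    predicted = b @ T[a]                      # propagate the belief through mode a
    unnormalized = O[a][:, z] * predicted     # weight by the observation likelihood
    return unnormalized / unnormalized.sum()  # project back onto the belief simplex

# Hypothetical two-state, two-action, two-observation POMDP.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.3, 0.7]]])      # T[a, q, q']
O = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.9, 0.1], [0.4, 0.6]]])      # O[a, q', z]

b = np.array([0.5, 0.5])                      # initial belief
for a, z in [(0, 1), (1, 0), (0, 0)]:         # a sequence of switching modes and inputs
    b = belief_update(b, a, z, T, O)
    print(b)
```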


To illustrate, the figure below shows the belief evolution of a POMDP modeled by a state-dependent switched system induced by the policy $\pi$.

An example of a POMDP with two states and the state-dependent switching modes induced by the policy.

Based on this switched system representation, I verified the safety and/or optimality requirements of a given POMDP by formulating a barrier certificate theorem. I showed that if there exists a barrier certificate satisfying a set of inequalities along the belief update equation of the POMDP, the safety/optimality property is guaranteed to hold. These conditions can be computationally implemented as a set of sum-of-squares programs.

The figure below illustrates this framework: instead of solving the POMDP, I find a barrier certificate whose zero level set separates the evolutions of the belief update equation, starting from a set of initial beliefs $\mathcal{B}_0$, from a pre-specified set $\mathcal{B}_u$ that characterizes safety or optimality. If such a barrier certificate exists, then the beliefs starting in $\mathcal{B}_0$ never enter $\mathcal{B}_u$.

Illustration of my method based on barrier certificates for POMDP verification.
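Concretely, a sketch of the conditions such a barrier certificate $B : \mathcal{B} \to \mathbb{R}$ must satisfy (the exact inequalities in my work include further refinements, e.g., for finite-time and optimality properties) is

$$
\begin{aligned}
&B(b) \le 0 && \forall b \in \mathcal{B}_0,\\
&B(b) > 0 && \forall b \in \mathcal{B}_u,\\
&B\big(f_a(b, z)\big) - B(b) \le 0 && \forall b \in \mathcal{B},\ \forall a \in A,\ \forall z \in Z.
\end{aligned}
$$

Since the belief update $f_a$ is a rational function of the belief and all of the sets involved are semialgebraic, the search for a polynomial $B$ can be relaxed to the sum-of-squares programs mentioned above.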

In fact, many problems in autonomy can be described as verifying (or synthesizing a policy such that) the belief evolutions starting in $\mathcal{B}_0$ never enter $\mathcal{B}_u$. In the following, I will briefly outline a few of them.


Application: Machine Teaching with Guarantees

The demand for machine learning models has skyrocketed in the past few years, as more and more sectors in industry and services rely on them. This has led to a shortage of machine teachers that can build those models. One solution to the increasing demand is to make teaching machines easy, fast, and universally accessible.

Machine teaching is a paradigm shift from machine learning in that it focuses on passing on knowledge rather than discovering it (as is the case in machine learning). Given a learning algorithm and a target model, a machine teaching algorithm finds an optimal (often the smallest) training set. For an illustration, see the example in the figure below.

Given a large set of images, the teacher (randomly) picks a small “teaching set”. For this set, expert labels, as well as candidate features and hypotheses used by the crowd are elicited. The teacher then uses this information to teach the rest of the crowd to label the rest of the data, for which no features or labels are available. The teacher sequentially provides an unlabeled example from the teaching set to the worker, who attempts an answer. Upon receipt of the correct label, the learner may update her hypothesis before the next example is shown. [A. Singla, I. Bogunovic, G. Bartok, A. Karbasi, A. Krause, Near-Optimally Teaching the Crowd to Classify, in Proc. Int. Conf. Mach. Learn., June 2014]
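As a toy, purely hypothetical illustration of the "smallest training set" idea (unrelated to the crowd-teaching setup above), the sketch below teaches a version-space learner of 1-D threshold classifiers: out of a large pool of labeled points, two well-chosen examples are enough to pin the threshold down.

```python
import numpy as np

def learner(examples):
    """Version-space learner: return the candidate thresholds in [0, 1] that are
    consistent with all labeled examples (label is 1 iff x >= threshold)."""
    grid = np.linspace(0.0, 1.0, 101)
    return [t for t in grid
            if all((x >= t) == bool(y) for x, y in examples)]

def teach(target, pool):
    """Hypothetical greedy teacher: from an unlabeled pool, pick the two examples
    that bracket the target threshold most tightly and label them."""
    below = max((x for x in pool if x < target), default=0.0)
    above = min((x for x in pool if x >= target), default=1.0)
    return [(below, 0), (above, 1)]

pool = np.random.default_rng(0).uniform(0.0, 1.0, 50)
teaching_set = teach(target=0.62, pool=pool)
print(teaching_set)                 # just two examples ...
print(len(learner(teaching_set)))   # ... leave only a handful of consistent thresholds
```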

A fundamental drawback of many promising machine teaching techniques in the literature is that their performance is not guaranteed. In control theory, by contrast, an algorithm is guaranteed to work because there exists a certificate, e.g., a Lyapunov function. To propose machine teaching methods with guarantees, I formulated a POMDP representation of machine teaching and used barrier certificates to guarantee teaching performance.


The figure below illustrates the use of barrier certificates to compare two machine teaching algorithms (see Chen et al. for more details on the algorithms). Based on a barrier certificate of degree 2, I was able to certify that the adaptive teaching method is guaranteed to teach the target hypothesis (the location of the target) with probability greater than 0.8 within 13 steps, whereas no certificate for the myopic teaching method could be constructed. Numerical experiments validate this result.

Teaching the target hypothesis (green square) on a grid. The orange square denotes the current hypothesis. The dark grey square is the current example given by the teaching algorithm and light grey denotes the trace (history) of the examples. The guaranteed performance of each algorithm can be verified by solving an optimization problem involving a barrier certificate.

Application: Safe Imitation Learning

Imitation learning methods are concerned with mimicking human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions.

In imitation learning, we need to iteratively roll out a trained model to collect trajectories and update the model based on feedback collected on those trajectories. In safety-critical applications, one would want to impose safety constraints on the roll-outs during training.

Imitation learning algorithms can also be modeled as POMDPs. Therefore, we can use barrier certificates to guarantee safety of imitation learning methods.
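A minimal sketch of how such a certificate could gate roll-outs during training is given below. The policy, belief model, observation function, and barrier function are hypothetical placeholders; in practice the certificate itself comes from the sum-of-squares search described earlier.

```python
def safe_rollout(b0, policy, belief_update_fn, barrier, observe, horizon):
    """Roll out a learned (imitation) policy in belief space, stopping the
    roll-out if the next belief would cross the barrier's zero level set.
    All arguments are problem-specific placeholders."""
    b, trace = b0, [b0]
    for _ in range(horizon):
        a = policy(b)                       # action proposed by the learned model
        z = observe(b, a)                   # observation returned by the environment
        b_next = belief_update_fn(b, a, z)
        if barrier(b_next) > 0:             # next belief would enter the unsafe region
            break                           # abort (or substitute a known safe action)
        b = b_next
        trace.append(b)
    return trace
```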


As an illustration, consider a human giving demonstrations to a robotic arm to close a microwave door. Because the articulation relationships of objects (such as a microwave door) do not remain static over time, a robot in the active learning phase may take actions that change the articulation relationship of an object. This can cause the belief over the object model to diverge. To avoid this, one straightforward approach is to use hard constraints that define the geometric boundaries of these models. We can find a barrier certificate that ensures no transition in the articulation state of the object occurs for a given initial state and a set of actions (see the video below).

Robot "Gemini" learning how to close the microwave door from demonstrations. The imitation learning method used is called shadowing. The left arm is basically following (shadowing) the right arm (to which the human is giving demonstrations). We need to apply a certain amount of force at the end to completely close the microwave door. The motion of the door with respect to the body of the microwave is composed of two parts: a revolute and a rigid (towards the end). [Courtesy of Personal Autonomous Robotics Lab (PeARL), UT Austin]

Application: Privacy in Cyber-Physical Systems

Privacy is a rising concern in many modern engineering systems, which are increasingly connected over shared infrastructures, such as power grids, healthcare systems, smart homes, and transportation systems. Potentially malicious intruders may have access to publicly or privately available information, based on which they attempt to infer some “secret” associated with the system, such as personal activity preferences, health conditions, or bank account details. If privacy is compromised, it could lead to substantial social or economic loss. Therefore, it is of fundamental importance to design cyber-physical systems that are provably safe against privacy breaches.

The privacy concern arises in many engineering contexts.

In recent years, a privacy notion called opacity has received significant attention. Generally speaking, opacity is a confidentiality property that characterizes a system’s capability to conceal its “secret” information from being inferred by outside observers. These observers are assumed to have full knowledge of the system model, often as a finite automaton, and can observe or partially observe the behaviors of the system, such as the actions performed, but not the states of the system directly. Therefore, POMDPs are a natural framework to model such scenarios.

Hence, the notion of opacity I use is defined on the belief space of the intruder. That is, the privacy requirement demands that the intruder is never confident that the system is in the secret state $q_s$ with probability larger than or equal to a constant $\lambda$ at any time $t$:

$$
b_t(q_s) < \lambda, \quad \forall t \ge 0.
$$

To verify whether a given POMDP satisfies this privacy requirement, it suffices to define $\mathcal{B}_u = \{ b \in \mathcal{B} : b(q_s) \ge \lambda \}$. Obviously, if no solution of the belief update equation enters $\mathcal{B}_u$, then $b_t(q_s) < \lambda$ for all $t \ge 0$.

The figure below shows the POMDP model for an inventory management system, with two actions and three states representing different inventory levels of a company. Two of the states correspond to the low and high inventory levels, respectively, and are the secret states. If an intruder, say a competitor or a supplier, has information on whether the current inventory level is high or low, they may manipulate the price of the goods and thus negatively affect the company’s profit. Therefore, it is in the company’s interest to conceal the inventory levels from potential intruders. That is, the company seeks to certify the privacy requirement above for the smallest possible $\lambda$.

A POMDP representation of an inventory management system.

In the table below, I report the upper bounds on $\lambda$ obtained by increasing the degree of the barrier certificates.

Degree of barrier certificate:   2      4      6      8      10
Upper bound on $\lambda$:        0.93   0.88   0.80   0.74   0.69

Application: Multi-Robot Safe Planning

I introduced a safety shield concept that uses the proposed discrete-time barrier functions for POMDPs to ensure the satisfaction of a set of high-level mission specifications given as temporal logic formulae. This was carried out by formulating logical compositions of barrier functions and results for finite-time reachability. Given a nominal policy, e.g., from an expert or some control algorithm, the safety shield, which is implemented online, ensures the satisfaction of the high-level specifications with minimal interference. We demonstrated the efficacy of the proposed method with a team of heterogeneous robots, including a Segway, a quadrotor, and a flipper robot, trying to locate an object while avoiding several obstacles at CAST.

(left) The experimental setup at CAST, Caltech. In this scenario, each autonomous agent has its own sensing and actuation capabilities, and the agents communicate with each other if necessary to locate the target. While the quadrotor can provide aerial sensing of the CAST environment, only the flipper (rover) can inspect objects on the CAST floor. In addition, once an object is identified as a target by the flipper, the Segway can retrieve it. (center) The safety shield overrides the action of the human operator "Down" (blue) with "Right" (orange) to ensure obstacle avoidance with 95 percent probability. (right) The safe trajectory generated using the barrier-function-based method, ensuring 95 percent probabilistic safety against colliding with the obstacles after their discovery by the flipper and a 100 percent chance of target retrieval after its location is identified by the quadrotor.
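The following simplified Python sketch conveys the override logic of such a shield: given a nominal action, it searches for the closest admissible action whose successor beliefs satisfy all barrier conditions. The barrier functions, action set, and interference cost are hypothetical placeholders; the actual shield composes several barrier functions encoding the temporal logic specification.

```python
def shield(b, a_nominal, actions, belief_update_fn, barriers, observations, cost):
    """Return the admissible action closest to the nominal one.

    An action is admissible if, for every possible observation, the next belief
    keeps every barrier function non-positive (i.e., outside the unsafe belief
    region). All arguments are problem-specific placeholders.
    """
    def admissible(a):
        return all(barrier(belief_update_fn(b, a, z)) <= 0
                   for z in observations
                   for barrier in barriers)

    candidates = [a for a in actions if admissible(a)]
    if not candidates:
        raise RuntimeError("no admissible action: the shield cannot enforce the specification")
    # Minimal interference: pick the admissible action with the lowest override cost.
    return min(candidates, key=lambda a: cost(a, a_nominal))
```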

Application: Robust Decision Making

Most formulations assume that the transition and observation probabilities of a POMDP are explicitly given. However, unforeseeable events, such as (unpredictable) structural damage to a system or an imprecise sensor model, necessitate a more robust formalism. This problem is directly related to current challenges in artificial intelligence referred to as robust decision-making and safe exploration. Concretely, the problem is to provide a policy for an (autonomous) agent that ensures certain desired behavior by robustly accounting for any uncertainty and partial observability that may occur in the system. The policy should be optimal with respect to some performance measure and additionally ensure safe navigation through the environment.

Formally, I consider POMDPs with uncertain transition and/or observation probabilities given as intervals, $T(q \mid q', a) \in [\underline{T}(q \mid q', a), \overline{T}(q \mid q', a)]$ for $(q', a, q) \in \mathcal{T}$, where $\mathcal{T}$ denotes the set of triplets corresponding to the uncertain transition probabilities (with a similar definition for the uncertain observation probabilities).

Then, based on the hybrid system model, the robust decision making problem can be cast as a safety verification problem of an uncertain hybrid system.
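As a crude numerical companion to this formulation (not the sum-of-squares verification itself), one can sample transition matrices from the given intervals and empirically check the barrier decrease condition along the resulting belief updates. The interval bounds, sensor model, and candidate barrier below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical interval bounds on the transition probabilities of one action:
# T_low[q, q'] <= T(q' | q, a) <= T_high[q, q'].
T_low  = np.array([[0.80, 0.05], [0.15, 0.75]])
T_high = np.array([[0.95, 0.20], [0.25, 0.85]])
O = np.array([[0.8, 0.2], [0.3, 0.7]])        # known observation probabilities O[q', z]

def sample_T():
    """Draw a transition matrix inside the intervals and renormalize each row
    (renormalization may perturb values slightly; fine for a sanity check)."""
    T = rng.uniform(T_low, T_high)
    return T / T.sum(axis=1, keepdims=True)

def belief_update(b, z, T):
    unnormalized = O[:, z] * (b @ T)
    return unnormalized / unnormalized.sum()

barrier = lambda b: b[1] - 0.9                # hypothetical certificate: unsafe if b(q_1) >= 0.9

b = np.array([0.5, 0.5])
violations = 0
for _ in range(10_000):                       # Monte Carlo check of the decrease condition
    T = sample_T()
    z = rng.integers(2)
    b_next = belief_update(b, z, T)
    violations += barrier(b_next) - barrier(b) > 0
    b = b_next
print("decrease condition violated in", violations, "of 10000 sampled steps")
```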


I demonstrated the applicability of my methods based on barrier certificates on a variant of the RockSample problem defined on a 4-by-4 grid with 2 rocks.

A *RockSample[4, 2]* instance where the initial position of the rover and the two rock positions in the grid are known. To the right is the goal area, and to the lower side of the grid is a sand dune from which the rover may fall over.

In this problem, a Mars rover explores a terrain, where “scientifically valuable” rocks may be hidden. The locations of the rocks are known, but it is unknown whether they have the type “good” or “bad”. Once the rover moves to the immediate location of a rock, it can sample its type. As sampling is expensive, the rover is equipped with a noisy long-range sensor that returns an estimate on the type of the rock. The accuracy of the sensor decreases with the distance to the rock (thus, we have uncertain observation probabilities). Also, the states on the bottom of the grid are slippery sand dunes and should be avoided.

Based on barrier certificates, we find two policies such that the Mars rover discovers the rock types, by maximizing its confidence (belief) about the type of each rock, while also reaching the goal region safely. The figure below shows how these two policies evolve.

Top: positions of the Mars rover at certain time steps under the first policy: the Mars rover stays at its initial position, uses its sensor to collect information about the rocks, and then moves to the goal region. Bottom: positions of the Mars rover at certain time steps under the second policy: the Mars rover first moves closer to the rocks, then checks the type of the rocks using the sensor, and moves to the goal after identifying the rock types.