Planning and Learning in Hybrid Discrete-Continuous Models

Richard Dearden

June 20, 2005
Abstract

Many real-world problems require richer representations than are typically studied in planning and learning. For example, state estimation in complex systems such as vehicles or spacecraft often requires a representation that captures the rich continuous behaviour of these kinds of systems. Similarly, planning for such systems may require a representation of continuous resource usage, particularly when planning under uncertainty. In this talk I will discuss some commonly used representations of these systems as hybrid systems, examine some approaches to planning and state estimation in them, and finally discuss some first steps toward learning a hybrid model, or at least parameters of such a model, directly from data.

These notes are intended as a set of background information for an IJCAI tutorial talk. They are made up of pieces of text taken from a variety of places, without necessarily having any continuity.
1 Introduction

In these notes I discuss classical AI problems such as planning, diagnosis, and learning, but applied in a much richer space of models than is used in traditional AI applications. The motivation for this is applications such as Mars rovers, automated factories, chemical process plants, and spacecraft, for which discrete models such as those traditionally used in planning and learning are inadequate. In particular, we will look at systems represented using hybrid models, in which a mixture of discrete and continuous variables is used. These models use differential equations (or simpler representations where appropriate) to represent the continuous dynamics of the system, but make the equations depend on the discrete state, often referred to as the mode of the system. We will use the term state to refer to the combination of a mode and a value for each continuous variable.

For example, consider a motor. It can be idle or powered, and has a number of fault modes such as having a faulty encoder. These correspond to the discrete part of the model. It also has continuous state, such as the speed it is running at, the current powering it, and so on. In each discrete mode, there is a set of differential equations that describe the relationship between the various continuous values, and the way those values evolve over time. Transitions from mode to mode may occur as a result of actions, or in some models may be induced by the continuous system behaviour; for example, when a very high current through the motor causes it to burn out, its mode changes from 'working' to 'failed'.

In many cases, not all of the hybrid system will be observable. Therefore, we also have an observation function that defines the likelihood of an observation given the mode and the values of the continuous variables. All these processes are inherently noisy, and the representation reflects this by explicitly including noise in the continuous values, and stochastic transitions between system modes. We describe this hybrid model in more detail in Section 2.
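To make the terminology concrete, here is a minimal sketch of a hybrid state for the motor example. It is only an illustration: the mode names, the dynamics coefficients, and the burn-out threshold are placeholders, not values from any real motor model.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MotorMode(Enum):
    """The discrete part of the hybrid state (the mode)."""
    IDLE = auto()
    POWERED = auto()
    FAULTY_ENCODER = auto()
    FAILED = auto()

@dataclass
class HybridState:
    """A hybrid state: one discrete mode plus the continuous variables."""
    mode: MotorMode
    speed: float    # rad/s
    current: float  # A

def step(state: HybridState, dt: float) -> HybridState:
    """Mode-dependent continuous dynamics (illustrative coefficients only)."""
    if state.mode is MotorMode.POWERED:
        # speed is driven toward a value proportional to the current
        speed = state.speed + dt * (2.0 * state.current - 0.5 * state.speed)
    else:
        # idle, faulty encoder or failed: the motor spins down
        speed = state.speed - dt * 0.5 * state.speed
    mode = state.mode
    if state.current > 10.0:
        # an autonomous (uncommanded) transition: a very high current burns the motor out
        mode = MotorMode.FAILED
    return HybridState(mode, speed, state.current)
```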
Since rovers are illustrative of many of the problems we encounter in diagnosing systems of this type, and since the primary testbed we have used in this work is the K-9 rover at NASA Ames, we will use rovers as a running example throughout this talk. From the point of view of planning, this example exhibits hybrid behaviour as it moves between discrete locations and operational modes while also needing estimates of continuous quantities such as the amount of power currently available from the battery. From the point of view of state estimation, there are discrete variables corresponding to commanded modes and the health or otherwise of components, plus continuous variables such as the currents through different subsystems, or the speed the wheels rotate at. The complex dynamics of a system such as a rover, along with its interaction with a potentially complex, poorly modelled and noisy environment (the surface of Mars in the rover's case), make it very difficult to determine the current true state with certainty, or to predict the effects of actions with certainty. The representation we will choose makes these sources of uncertainty explicit, allowing noisy observations of the underlying system state, and probabilistic outcomes for actions. By representing uncertainty explicitly we can reason about it when selecting actions to perform. This is extremely important, as actions that appear good for the most likely state may be catastrophic in another of its possible states, or may have unlikely but catastrophic outcomes.

In Section 2 we describe two hybrid system models in detail. The first is a relatively simple model we use for planning, while the second is a richer model used for diagnosis and state estimation, and which makes an interesting target representation for algorithms trying to learn hybrid models of systems. In Section 4 we look at a classically-based approach to planning in these kinds of models, while in Section 5 we look at a Markov Decision Problem-based approach for relatively simple hybrid models. Finally, Section 6 provides an initial description of how such a model might be learned directly from data.
2 Hybrid Models

In this section we describe two different approaches to modelling hybrid systems. The first is a relatively simple model used for planning. It is based on the assumption that discrete mode changes are only the result of control actions or the previous discrete mode. The second representation has been used for diagnosis and state estimation. It allows the full richness of arbitrary differential equations, the values of continuous parameters influencing discrete mode changes, and partial observability. It is also the kind of model we would like to learn from data.

2.1 Hybrid MDPs

For our simple hybrid representation we adopt a standard MDP model with the addition of continuous state variables: $\{(Z, X), A, T, R\}$, where $Z$ is a set of discrete states (which could equally be represented using a set of variables), and $X$ is a vector of continuous state variables $\langle X_1, \ldots, X_d \rangle$. Without loss of generality, we will assume the values of the variables are all in the range $[0, 1)$, and thus the continuous state space is the unit hypercube $[0, 1)^d$. We use $(z \in Z, x \in [0, 1)^d)$ to refer to a particular state. $A$ is a finite set of actions. $T$ is the transition model: $T_a((z', x'), (z, x))$ is the probability that the system moves from state $(z, x)$ to $(z', x')$ if action $a \in A$ is taken. Finally, $R$ is the reward model: $R^a_z(x)$ is the reward for taking action $a$ in state $(z, x)$.

We are interested in maximizing the expected total reward of a finite-horizon plan. The Bellman equation for this model is:

$$V^{n+1}_z(x) = \max_{a \in A} \Big\{ R^a_z(x) + \sum_{x', z'} T_a((z', x'), (z, x)) \, V^n_{z'}(x') \Big\} \qquad (1)$$

where $V^n_z(x)$ stands for the value function over a horizon of $n$ time-steps. We define the transition model by a marginal probability distribution on the arrival discrete state, $T^d_a(z', (z, x))$, and a conditional distribution over the continuous space given the arrival discrete state, $T^c_a(x', z', (z, x))$. We will frequently assume that one of the continuous variables in the model is time; this allows us to encode semi-MDPs in this framework. The factorization of the transition probability allows the following alternative form for a Bellman backup:

$$V^{n+1}_z(x) = \max_{a \in A} \Big\{ R^a_z(x) + \lambda \sum_{z'} T^d_a(z', (z, x)) \, \sigma_a(z') \Big\} \qquad (2)$$

$$\sigma_a(z') = \int_{x'} T^c_a(x', z', (z, x)) \, V^n_{z'}(x') \, dx' \qquad (3)$$

For reasons that will become clear when we use this model for planning, we will often assume that $T^c_a$ is a discrete distribution, so the integral in Equation 3 can be replaced by a summation.
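Under that discrete-distribution assumption, a single finite-horizon backup of Equation 1 can be sketched as follows. The dictionary-based encoding of the transition and reward models is an illustrative assumption, not a prescribed data structure.

```python
def bellman_backup(states, actions, transition, reward, v_n):
    """One backup of Equation 1, assuming the continuous part of the
    transition model has already been discretized.

    states            -- iterable of hybrid states s = (z, x)
    transition[a][s]  -- list of (s2, prob) pairs (a discrete distribution)
    reward[a][s]      -- immediate reward R_z^a(x)
    v_n               -- dict mapping states to V^n values
    Returns a dict mapping states to V^{n+1} values.
    """
    v_next = {}
    for s in states:
        v_next[s] = max(
            reward[a][s] + sum(p * v_n[s2] for s2, p in transition[a][s])
            for a in actions)
    return v_next
```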
2.2 Probabilistic Hybrid Automata

The hybrid MDP representation described above is restricted in terms of the kinds of hybrid system it can represent. For problems such as diagnosis, where a richer representation is needed, we use the probabilistic hybrid automaton (PHA) model of [24] and [28]. A PHA is a tuple $\langle Z, X, Y, F, G, T, P \rangle$, where:

$Z = z_1, \ldots, z_n$ is the set of discrete modes the system can be in.

$X = x_1, \ldots, x_m$ is the set of continuous state variables which capture the dynamic evolution of the automaton.

$Y = Y_c \cup Y_d$ is the set of observable variables. $Y_c$ is the set of continuous observable variables, while $Y_d$ is the set of discrete observable variables, typically commands sent to the system.

$F = F_1, \ldots, F_n$ is, for each mode $z_i$, the set of discrete-time difference equations $F_i$ that describe the evolution of the continuous variables $X$ in that mode. We write $P(X_t \mid z_{t-1}, x_{t-1})$ for the distribution over $X$ at time $t$ given that the system is in state $(z, x)$ at $t-1$.

$G = G_1, \ldots, G_n$ is, for each mode, the set of equations governing the relationship between the observable variables $Y$ and the state variables $X$ and $Z$. We write $P(Y_t \mid z_t, x_t)$ for the distribution of observations in state $(z_t, x_t)$.

$T$ is a probabilistic transition function over the discrete modes that specifies $P(Z_t \mid z_{t-1}, x_{t-1})$, the conditional probability distribution over modes at time $t$ given that the system is in state $(z, x)$ at $t-1$. In some systems this is independent of the continuous variables: $P(Z_t \mid z_{t-1}, x_{t-1}) = P(Z_t \mid z_{t-1})$.

$P$ is the prior distribution $P(Z_0, X_0)$ over states of the system.

As before, we denote a hybrid state of the system by $s = (z, x)$, which consists of a discrete mode $z$ and an assignment to the state variables $x$.

As an example, consider the model shown in Figure 1. This is part of a model for a single wheel of a Mars rover, with four discrete modes (drive-flat, drive-uphill, drive-downhill, and stuck-wheel), one continuous variable, speed, and one observable variable, obs-speed. The difference equation for the drive-downhill mode is shown, as is the relationship between speed and obs-speed for that mode. The transition function is represented by the arrows connecting the modes.
Figure 1: A small example PHA with four discrete states and one continuous variable.
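A PHA of the kind shown in Figure 1 can be written down very compactly. The sketch below is only an illustration: the drift coefficients, noise levels and transition probabilities are placeholders, not the values used in the figure.

```python
import random

MODES = ["drive-flat", "drive-uphill", "drive-downhill", "stuck-wheel"]

# F: per-mode difference equation for the single continuous variable `speed`.
def next_speed(mode, speed):
    drift = {"drive-flat": 0.0, "drive-uphill": -0.2,
             "drive-downhill": 0.2, "stuck-wheel": -speed}[mode]
    return speed + drift + random.gauss(0.0, 0.05)        # process noise

# G: observation model; obs-speed is a noisy reading of speed.
def observe_speed(mode, speed):
    return speed + random.gauss(0.0, 0.1)                 # observation noise

# T: P(Z_t | z_{t-1}, x_{t-1}); here it depends only on the previous mode.
def next_mode(mode, speed):
    if mode.startswith("drive") and random.random() < 0.01:
        return "stuck-wheel"
    return mode

def simulate(mode="drive-flat", speed=1.0, steps=10):
    """Sample a trajectory of (mode, speed, obs-speed) triples from the PHA."""
    for _ in range(steps):
        mode = next_mode(mode, speed)
        speed = next_speed(mode, speed)
        yield mode, speed, observe_speed(mode, speed)
```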
Note that the PHA model we have described assumes discrete decision/observation epochs and summarises the possible discrete transitions of the system (what could happen in a single epoch) by a simple transition function. Sometimes even this model is not expressive enough and we need a continuous-time model such as that presented in [?].
3 Planning

The key features of our representation which make it challenging from a planning perspective are the continuous variables and the uncertainty in the model. In this section we briefly examine existing approaches to planning under uncertainty, and discuss whether they could be applied in this hybrid system representation. Table 1 classifies much of this work along the following two dimensions:

Representation of uncertainty: whether uncertainty is modeled strictly logically, using disjunctions, or is modeled numerically, with probabilities.

Observability assumptions: whether the uncertain outcomes of actions are not observable, partially observable, or fully observable.

There are a number of difficulties in attempting to apply existing work on planning under uncertainty to hybrid models of systems such as spacecraft or rovers. First of all, the work listed in Table 1 assumes a very simple model of action in which explicit time constraints are not allowed, and actions are considered to be instantaneous. These characteristics are not as much of an obstacle for Partial-Order Planning frameworks such as SENSp [16], PUCCINI [19], WARPLAN-C [45], CNLP [35], Buridan [26], UDTPOP [34], C-Buridan [14], DTPOP [34], Mahinur [33] and Weaver [6]. In theory, these systems could represent plans in these models. The requirements for a rich model of time and action are more problematic for planning techniques that are based on the MDP or POMDP representations, satisfiability encodings, the Graphplan representation, or state-space encodings. These techniques rely heavily on a discrete model of time and action. (See [40] for a more detailed discussion of this issue.) Semi-Markov decision processes (SMDPs) [37] and temporal MDPs (TMDPs) [9] can be used to represent actions with uncertain durations, but as we will see in Section 5 they can only operate on a restricted version of our hybrid representation.
                       Disjunction                        Probability

Non-observable         CGP [41], CMBP [13, 1],            Buridan [26], UDTPOP [34]
                       C-PLAN [12, 17], Fragplan [25]

Partially observable   SENSp [16], Cassandra [36],        C-Buridan [14], DTPOP [34],
                       PUCCINI [19], SGP [46],            C-MAXPLAN [29], ZANDER [29],
                       QBF-Plan [38], MBP [2]             Mahinur [33], GPT [7],
                                                          POMDP [8], JIC [15]

Fully observable       WARPLAN-C [45], CNLP [35],         Plinth [20], Weaver [6],
                       PGP [5]                            MDP [8]

Table 1: A classification of planners that deal with uncertainty. Planners in the top row are often referred to as conformant planners, while those in the other two rows are often referred to as contingency planners.
A second, and equally serious, problem with existing planning techniques is that they all assume that uncertain actions have a small number of discrete outcomes. To characterize where a rover could end up after a move operation, or in fact the effects of an action on any continuous parameter in a hybrid model, we would typically have to list all the different possible discrete locations. For many problems this kind of discrete representation is impractical since most of the uncertainty involves continuous quantities, such as the amount of time and power an activity requires. Action outcomes are distributions over these continuous quantities. There is some recent work using models with continuous states and/or action outcomes in both the MDP [3, 30, 31, 39] and POMDP [43] literature, but this has primarily been applied to reinforcement learning rather than planning problems. We will come back to this point in Section 5.
4 Just In Case Contingency Planning

In this section we outline a planning approach that has successfully been applied to hybrid models for activity and route planning on a Mars rover. The approach is based on contingency planning, in which a branching plan is built so that as uncertainty is resolved at execution time, the best branch for the actual outcome that occurred can be selected.

In the classical approach to contingency planning, each time an action with uncertain outcomes is added to a plan, the planner attempts to establish the goals for each different outcome of the action. Unless there are only a few discrete sources of uncertainty in a domain, this approach is completely impractical. Of the planning systems in Table 1, only Just-In-Case (JIC) contingency scheduling [15] and Mahinur [33] exhibit a principled approach to choosing which contingencies to focus on. The approach we describe here builds upon JIC.

The basic idea in the JIC approach is to take a seed schedule, look for the place where it is most likely to fail, and augment the schedule with a contingent branch at that point. The process is repeated until the resulting contingent schedule is sufficiently robust, or until available time is exhausted. This process is illustrated in Figure 2.
Figure 2: The JIC approach.
Figure 3: Example showing that the place where the plan is most likely to fail may not be the best branch point.
Conceptually, it seems straightforward to apply the JIC approach to planning problems. Using a conventional planner, we first generate a seed plan assuming the expected behavior of each activity; in other words, we reason as if every action uses the expected amount of time and resources. This is the same approach taken in JIC scheduling. As with JIC scheduling, we then choose a place to insert a contingency branch. Once again, using a conventional planner, we generate a plan for the contingency branch and add it to the existing plan.¹

¹ Just as with JIC scheduling, this process is not guaranteed to converge to an optimal contingent plan. However, JIC will always monotonically improve a plan until a local optimum is reached.
4.0.1 The JIC Branch Heuristic

For JIC planning, the tricky part is deciding where to insert contingency branches, and what the branch conditions should be. In Drummond et al.'s original implementation for automatic telescope scheduling, branches are added at the points with the greatest probability of failure. Given the distributions for time and resource usage, this is relatively easy to calculate by statistical simulation of the plan.

Unfortunately, the points most likely to fail are not necessarily good points for contingent branches. Consider the example in Figure 3, where we have a seed plan with two actions, A1 and A2, leading to a goal G that has positive value. Initially we have 20 units of some resource (say power) and each of the actions consumes somewhere between 5 and 15 units of the resource. Clearly, this plan is most likely to fail after (or during) action A2. However, if the plan fails after (or during) action A2, there will not be any resources left. If all the alternative activities require some of this resource, then there is clearly no point in putting a contingent branch after A2.

Fundamentally, the problem is that in order to select the best place to insert a branch, we need to know whether or not it is possible to accomplish anything useful at the points under consideration. More precisely, we need to know how much utility could be gained by inserting a branch at each given point. In order to do this, we need to know the value function of the mainline plan and of each possible branch.
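A quick simulation makes this point concrete. The sketch below estimates per-step failure probabilities for the Figure 3 plan by statistical simulation; the sampling interface and the "fails when the resource is exhausted" criterion are illustrative assumptions, not the original implementation.

```python
import random

def failure_probabilities(plan, initial_resource, n_samples=10000):
    """Estimate, for each step of the seed plan, the probability that the plan
    first fails there because the resource runs out."""
    failures = [0] * len(plan)
    for _ in range(n_samples):
        resource = initial_resource
        for i, sample_consumption in enumerate(plan):
            resource -= sample_consumption()
            if resource < 0:
                failures[i] += 1
                break
    return [f / n_samples for f in failures]

# The Figure 3 example: 20 units available, A1 and A2 each consume 5-15 units.
plan = [lambda: random.uniform(5, 15), lambda: random.uniform(5, 15)]
print(failure_probabilities(plan, initial_resource=20))
# A1 can never exhaust the resource on its own, so essentially all failures
# (about half of the runs) are attributed to A2 -- yet, as argued above,
# branching after A2 is pointless because no resource remains there.
```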
The value function gives the expected future reward (utility) at each step of a plan, as a function of the resource levels. Computing the value function for a completed plan (such as the seed plan) is relatively straightforward. It may be done analytically if the resource consumptions for activities are simple distributions. However, more typically, Monte Carlo simulation is required [3, 42]. Similarly, it is easy to get an estimate of the probability distribution over resources at each step of a plan.

A crucial piece of information is then the value function of the best branch plan that can be added at each point in the existing plan. If we had this information, we could easily determine the optimal branch point in the plan. We would just have to compare the relative gain in utility obtained by considering the best possible branch plan at each point and pick the branch point where this gain is maximal.

Our extended JIC algorithm finds the best branch point using the following approach. For each possible branch point in the current plan:

1. Calculate the value function (as a function of available resources) of the remaining plan, as well as the probability distribution of resource availability at that point (using Monte Carlo simulation).

2. Estimate the value function of the best branch that can be added to the plan at this point, using the procedure described in Section 4.1. This procedure also partitions the value function for a branch according to the set of goals contributing to the expected value; we can thus determine the set of goals responsible for the branch's estimated value.

3. Calculate the net utility gain for adding the best branch plan, as described in Section ??.

The best overall branch point is the one with the maximum net utility gain. The branch condition is the condition that defines the region where the value function of the branch is greater than that of the current plan. We generate the contingency branch using the same planner as for the seed plan, setting the initial conditions equal to the branch condition, and providing the set of goals pursued by the optimal branch. Note that the steps for estimating the value of a branch do not actually construct the branch plan.

Unfortunately, there is no easy way to calculate the exact value function for the best possible branch. The problem is that we are trying to simultaneously optimise the actual branch to be added, the position to insert it into the existing plan, and the branch conditions (the test for whether the branch should be taken or not). Computing all of this would require actually doing the planning for all possible branches. Instead we must approximate this value function.
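Given estimates of these quantities (however they are obtained), the selection of the best branch point reduces to comparing curves. The sketch below is only an illustration of that comparison: the three callables are assumed interfaces standing in for the procedures described above, not code from the original notes.

```python
def select_branch_point(branch_points, mainline_value, branch_estimate,
                        resource_distribution):
    """Pick the branch point with the maximum expected net utility gain.

    mainline_value(bp, r)     -- value of the remaining mainline plan at bp with resources r
    branch_estimate(bp, r)    -- estimated value of the best branch at bp with resources r
    resource_distribution(bp) -- list of (r, prob) pairs from Monte Carlo simulation
    """
    best = (None, 0.0, None)   # (branch point, net gain, branch condition)
    for bp in branch_points:
        gain = 0.0
        condition = []   # resource levels where the branch beats the mainline plan
        for r, prob in resource_distribution(bp):
            delta = branch_estimate(bp, r) - mainline_value(bp, r)
            if delta > 0:
                gain += prob * delta
                condition.append(r)
        if gain > best[1]:
            best = (bp, gain, condition)
    return best
```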
4.1 Estimation of Branch Utility

A critical part of our algorithm is to compute an estimate of the value function of possible branch plans at each candidate branch point of the mainline plan. It is based on a representation of the planning problem as a plan graph [4]. Graphplan is a classical planning algorithm that first performs a reachability analysis by constructing a plan graph, and then performs goal regression within this graph to find a plan. Our approach retains only the first of these stages, the plan graph construction. We then perform back-propagation of utility tables in the graph to produce estimates of utility functions (instead of plans). This section provides an outline of this mechanism.
4.1.1 The Plan Graph

The plan graph is a sequential graph that alternates propositional (fluent) levels and action levels. Each propositional level contains the set of propositions that can be made true at that level, and a set of mutual exclusion (mutex) constraints between pairs of these propositions.
Figure 4: An example of a plan graph (partial). The two numbers below each action represent, first, its expected consumption, and second, the minimum power required to be allowed to start this action.
A mutex between two fluents indicates that these propositions cannot both be true at the same time at this level of the graph. The first propositional level contains all the fluents that are true in the initial state of the problem (the initial conditions). The action levels contain all the actions that can be applied given the previous propositional level. Each action has an arc from each fluent that it consumes and an arc to each fluent it produces.

Figure 4 shows part of the plan graph obtained in a simple example where the only continuous variable is power. In this problem, the mainline plan (shown in bold) consists of two actions: A, which takes the fluent p as precondition and produces q and r, and B, which has q as precondition and g as effect. The fluent g represents a goal that provides a reward (utility) of 5. For actions A and B, the expected consumption is 10 Ah, and they can be started only if the current level of the resource is at least 15 Ah. Three other actions, C, D, and E, are available in the domain, but they are not included in the mainline plan. The fluent g' represents a secondary goal with utility 1. Finally, both p and s are true and all the other fluents are false in the initial conditions.

There are two points of the mainline plan that are candidate branch points: at the beginning of the plan, and between A and B. The latter is characterized by the following set of propositions: p, q, r and s (all other fluents being false). Our goal is to estimate the best utility gain we can get by branching at these points.
4.1.2 Utility Table Back-propagation

The basic principle of our algorithm is to back-propagate utility distribution tables in the plan graph back to the initial state. Each table is attached to a single (action or proposition) node and contains a piecewise-constant function giving utility as a function of the resource level e (e.g. energy). It represents an estimate of the expected reward we can get by performing this action, or by having this fluent true, as a function of the current resource level.

The process is initialized by creating utility tables for the goals. In our example, we start with a table for g encoding an expected return of 5 for positive resource levels (and 0 otherwise), indicating that we obtain a reward of 5 if we can get to g with some power remaining. We then back-propagate this table in the plan graph, until it has reached the initial conditions. First, a table is created for action B, based on the table in g. Its utility function is defined by:

$$V_B(e) = \begin{cases} 0 & \text{if } e < 15; \\ V_g(e - 10) & \text{otherwise,} \end{cases} \qquad (4)$$

where $V_B(e)$ and $V_g(e)$ are the (piecewise-constant) utility estimates encoded by the tables in B and g respectively. The first line expresses the fact that we are not allowed to start B if the current energy is below 15 Ah. The second says that B consumes 10 Ah and leads to g, from where we can get the reward encoded by $V_g$.
Figure 5: Selecting the branch point, branch condition and goals.
Next, the table attached to B is back-propagated to the previous propositional level. Since B has only one fluent, q, as its precondition, the table in B is copied as is into q. Similarly, we back-propagate the table at q to A and to p. The resulting value function is

$$V_p(e) = V_A(e) = \begin{cases} 0 & \text{if } e < 25; \\ V_g(e - 20) & \text{otherwise.} \end{cases} \qquad (5)$$

The process described above (that is, regressing the goal g down to the fluent p) is relatively simple because there is no action with multiple preconditions on the path, and we ignored the fact that action A makes r true as well. The more complex cases of actions with multiple preconditions or multiple effects can be handled by an extension of this approach. The details can be found in [10].
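The back-propagation step of Equations 4 and 5 is easy to mechanise for the single-precondition case. Below is a minimal sketch in which a utility table is stored as a sorted list of (threshold, value) pairs; this representation and the helper names are illustrative assumptions, not the actual implementation of [10].

```python
# A utility table is a piecewise-constant function of the resource level e,
# stored as a sorted list of (threshold, value) pairs: `value` applies from
# `threshold` upward, until the next threshold.
V_g = [(0.0, 5.0)]   # reward 5 whenever we reach g with some power remaining

def backprop_through_action(table, consumption, min_level):
    """Equations 4/5: V(e) = 0 if e < min_level, else V_next(e - consumption)."""
    shifted = [(t + consumption, v) for t, v in table]
    # Below min_level the action cannot be started, so the utility there is 0.
    return [(0.0, 0.0)] + [(max(t, min_level), v) for t, v in shifted]

V_B = backprop_through_action(V_g, consumption=10, min_level=15)   # table at B (and q)
V_p = backprop_through_action(V_B, consumption=10, min_level=15)   # table at A (and p)
# V_p is [(0, 0), (15, 0), (25, 5)]: utility 0 below 25 Ah and 5 from 25 Ah
# upward, exactly as in Equation 5.
```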
4.1.3 Extracting Utility Estimates

Once the utility tables have been back-propagated down to the fluents representing the initial conditions of the problem, we extract the utility estimates for the candidate branch points from the graph. Consider a point between two actions A and B, characterized by the set of fluents {p, q, r, s}. We build a single utility table for this branch point by merging all utility tables attached to p, q, r and s nodes whose condition is included in {p, q, r, s} (that is, whose condition is true when we are at the point between A and B). These are all the tables that represent utility apparently reachable when p, q, r and s are true simultaneously. The tables are merged by taking the max of all the functions. The resulting table is the value function estimate that we need.

Given the utility estimates at the various branch points, we can now use this information to select the branch point, the branch condition and the set of goals to pursue. For a particular branch point, we compute the gain in area of the branch utility estimate over the mainline utility. This represents the net utility gain of the branch. The branch condition is composed of the points where the utility curves cross. The goals for the contingent branch correspond to the portion of the utility estimate that is greater than the utility curve of the mainline plan.

For example, in Figure 5, we show the mainline utility curve and the branch estimate curve for a branch point. The shaded area represents the utility gain for the branch. The branch conditions are shown, and the goal corresponding to the utility gain is G3.
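Merging the tables at a branch point by a pointwise max can be sketched briefly as well, using the same illustrative (threshold, value) representation as above; `eval_table` is a hypothetical helper, not part of the original notes.

```python
def eval_table(table, e):
    """Look up a piecewise-constant utility table (sorted (threshold, value) pairs)."""
    value = 0.0
    for threshold, v in table:
        if e >= threshold:
            value = v
    return value

def merge_tables(tables):
    """Pointwise max of the tables attached to the fluents true at a branch point.
    The max of piecewise-constant functions only changes value at the union of
    their thresholds, so it is enough to evaluate it there."""
    thresholds = sorted({t for table in tables for t, _ in table})
    return [(t, max(eval_table(table, t) for table in tables)) for t in thresholds]
```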
A number of authors working in probabilistic planning have looked at the problem of estimating the utility of a plan. Haddawy et al. [22, 21] have developed the DRIPS planner, which attempts to optimize the utility of the plan it returns. However, DRIPS plans in a very restricted domain represented as an abstraction/decomposition network that constrains the plans that can be created. Comparing abstractions of plans built using this network allows DRIPS to discard plans without having to estimate the utility of a partial plan. Another related system is Blythe's Weaver [6]. Weaver builds contingency plans in an incremental fashion very similar to the approach described here, although it only considers actions with discrete outcomes (and uncontrolled external actions). Weaver is concerned with adding contingencies to increase the probability of success of the plan, rather than its utility. It translates the current plan into a Bayesian network to compute the probability of success of the plan, and identifies points where an action can lead to a failure as candidates for new contingent branches. Unlike our approach, it does not need to estimate the value of the contingent branch before adding it, because the discrete domain means that by necessity the new branch must handle an outcome that was not in the original plan.
5 Hybrid MDP Planning

Markov Decision Processes (MDPs) have been adopted as a framework for much recent research in decision-theoretic planning. Classic dynamic programming algorithms solve MDPs in time polynomial in the size of the state space. However, the size of the state space is usually very large in practice. For systems modeled with a set of propositional state variables, the state space grows exponentially with the number of variables. This problem becomes even more important for MDPs with continuous state spaces. If the continuous space is discretized to find a solution, the discretization causes yet another level of exponential blow-up. This "curse of dimensionality" has limited the use of the MDP framework, and overcoming it has become an important topic of research.

In discrete MDPs, model-minimization [?] techniques have been used with considerable success to avoid this state explosion problem. Algorithms such as Structured Policy Iteration [?] and SPUDD [23] operate by identifying regions of the state space that have the same value under the optimal policy, and treating those regions as a single state for the purposes of dynamic programming. In this paper we propose an approach to extend this state aggregation to continuous problems.

A particular type of structure appears in some continuous domains such as the Mars rover domain [10], which motivated this work. Figure 6 represents the optimal value function from the initial state of a simple Mars rover problem as a function of two continuous variables: the time and energy remaining. The plan for this state was computed using the algorithm described in the previous section. The shape of this value function is characteristic of the rover domain, as well as other domains featuring a finite set of goals with positive utility and resource constraints. Noticeably, it includes vast plateau regions where the expected reward is nearly constant. These represent regions of the state space where the optimal policy is the same, and the probability distribution on future history induced by this optimal policy is nearly constant. The goal of this work is to exploit this structure by grouping together states belonging to the same plateau, while reserving a fine discretization for the regions of the state space where it is the most useful (such as the curved hump where there is more time and energy available).

We will show that for certain subclasses of Semi-MDPs, we can compute the optimal value function exactly, while exploiting the structure in the problem to perform dynamic programming at far fewer points than a naive approach would. The approach we will describe is restricted to MDPs with piecewise-constant or piecewise-linear reward functions, and more significantly, to MDPs with discrete transition functions. This means that for any state and action, a finite set of states can be reached with non-zero probability.
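To see why discrete transition functions combined with piecewise-constant rewards keep the value function piecewise constant, consider a single action over a single continuous variable. The sketch below is only an illustration under those assumptions: it reuses a (threshold, value) table representation, considers one action rather than the full maximization, and the helper names are placeholders.

```python
def eval_pc(table, x):
    """Piecewise-constant lookup: `table` is a sorted list of (threshold, value)
    pairs, each value applying from its threshold up to the next one."""
    value = 0.0
    for t, v in table:
        if x >= t:
            value = v
    return value

def backup_one_action(value_table, reward_table, outcomes):
    """One Bellman backup for a single action over one continuous variable.

    `outcomes` is a discrete transition distribution: (delta, prob) pairs, meaning
    the continuous variable changes by delta with probability prob.  Because the
    distribution is discrete and both inputs are piecewise constant, the result is
    piecewise constant too: its breakpoints are the reward breakpoints plus the
    value breakpoints shifted by each possible delta.
    """
    breakpoints = {t for t, _ in reward_table}
    for delta, _ in outcomes:
        breakpoints |= {t - delta for t, _ in value_table}
    return [(x, eval_pc(reward_table, x)
                + sum(p * eval_pc(value_table, x + delta) for delta, p in outcomes))
            for x in sorted(breakpoints)]
```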
Figure 6: The value function for a rover plan computed using the contingency planning algorithm.
These restrictions ensure that the final optimal value function found by our algorithms belongs to a well-behaved family. In the case of discrete transition functions and a piecewise-constant reward function, the optimal value function is guaranteed to be piecewise constant as well. Similarly, the use of piecewise-linear reward functions ensures a piecewise-linear value function.

The restriction to discrete transition functions is a particularly strong one. However, we can approximate MDPs with continuous transition functions by an appropriately fine discretization of the transition function. This provides an attractive alternative to function approximation approaches [?, ?] in that it approximates the model but then solves the approximate model exactly, rather than finding an approximate value function for the original model. This has the advantage that the effect of the approximation can be much more easily quantified. It also contrasts with the naive approach that consists of discretizing the state space regardless of the relevance of the partition introduced. In our approach, we discretize the action outcomes and deduce from them a partition of the state space. For ease of exposition, we will assume the problems we are solving are piecewise-constant or piecewise-linear, and that any approximation has already been performed. For this reason we will refer to finding optimal policies and value functions below, even when the model has been approximated.

Given these assumptions, the algorithm we describe will produce as its output a partition of the state space in which each element of the partition consists of a region where the optimal value function is constant (for a piecewise-constant problem) or is the maximum of a set of linear functions (for a piecewise-linear problem). On problems that exhibit structure, the algorithm computes the optimal value function substantially faster than a naive discretization of the state space, even when it must discretize more finely than the naive approach in some regions. The reduction in the number of Bellman backups performed to compute the optimal value function more than offsets the additional cost of maintaining the structured representation.

This work is a generalization of Boyan and Littman's model of time-dependent MDPs (TMDPs), which features a single continuous state variable representing time [9], to multiple continuous dimensions. Moving to the multi-dimensional framework raises numerous problems relative to storing and manipulating state