ltfmselector 0.2.1__tar.gz → 0.2.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. ltfmselector-0.2.2/.gitignore +49 -0
  2. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/PKG-INFO +3 -1
  3. ltfmselector-0.2.2/doc/00Introduction.tex +5 -0
  4. ltfmselector-0.2.2/doc/01ReinforcementLearning.tex +40 -0
  5. ltfmselector-0.2.2/doc/02MDP.tex +39 -0
  6. ltfmselector-0.2.2/doc/03DQL.tex +66 -0
  7. ltfmselector-0.2.2/doc/04ExampleDQL.tex +93 -0
  8. ltfmselector-0.2.2/doc/05PatSpecFMS_AgentEnv.tex +1 -0
  9. ltfmselector-0.2.2/doc/06LTFM.tex +58 -0
  10. ltfmselector-0.2.2/doc/07LTFMSMS.tex +7 -0
  11. ltfmselector-0.2.2/doc/08Results.tex +111 -0
  12. ltfmselector-0.2.2/doc/09Discussion.tex +40 -0
  13. ltfmselector-0.2.2/doc/10Conclusion.tex +1 -0
  14. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/abstract.tex +2 -3
  15. ltfmselector-0.2.2/doc/figures/DoctorsvsBoard_TestSplit_and_LTFMJRMCC.pdf +1238 -8
  16. ltfmselector-0.2.2/doc/figures/LTFMvsBoard_JRMCC.pdf +0 -0
  17. ltfmselector-0.2.2/doc/figures/QValuesFluency_and_Pendulum.pdf +2085 -4
  18. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/figures/ReinforcementLearning.eps +0 -0
  19. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/main.tex +10 -8
  20. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/references.bib +33 -0
  21. ltfmselector-0.2.2/dqntutorial/CartPoleStates.npz +0 -0
  22. ltfmselector-0.2.2/dqntutorial/DQNTutorial_byPaszke.ipynb +947 -0
  23. ltfmselector-0.2.2/dqntutorial/DQNTutorial_byPaszke_LoggerTQDM.ipynb +914 -0
  24. ltfmselector-0.2.2/dqntutorial/ReadAnalyzeStates.ipynb +172 -0
  25. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/pyproject.toml +3 -1
  26. ltfmselector-0.2.2/src/ltfmselector/logger.py +39 -0
  27. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/src/ltfmselector/ltfmselector.py +44 -14
  28. ltfmselector-0.2.1/.gitignore +0 -25
  29. ltfmselector-0.2.1/doc/00Introduction.tex +0 -14
  30. ltfmselector-0.2.1/doc/01ReinforcementLearning.tex +0 -37
  31. ltfmselector-0.2.1/doc/02MDP.tex +0 -58
  32. ltfmselector-0.2.1/doc/03DQL.tex +0 -89
  33. ltfmselector-0.2.1/doc/04ExampleDQL.tex +0 -112
  34. ltfmselector-0.2.1/doc/05PatSpecFMS_AgentEnv.tex +0 -0
  35. ltfmselector-0.2.1/doc/06PatSpecFMS_Reconstruction.tex +0 -0
  36. ltfmselector-0.2.1/doc/07Results.tex +0 -15
  37. ltfmselector-0.2.1/doc/08Discussion.tex +0 -0
  38. ltfmselector-0.2.1/doc/09Conclusion.tex +0 -0
  39. ltfmselector-0.2.1/doc/llncsdoc.pdf +0 -0
  40. ltfmselector-0.2.1/doc/samplepaper.pdf +0 -0
  41. ltfmselector-0.2.1/dqntutorial/DQNTutorial_byPaszke.ipynb +0 -1031
  42. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/.github/workflows/release.yaml +0 -0
  43. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/LICENSE +0 -0
  44. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/README.md +0 -0
  45. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/Makefile +0 -0
  46. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/figures/InversedPendulum.eps +0 -0
  47. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/figures/PendulumDuration.eps +0 -0
  48. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/figures/QValuesPendulum.eps +0 -0
  49. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/figures/fig1.eps +0 -0
  50. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/history.txt +0 -0
  51. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/llncs.cls +0 -0
  52. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/readme.txt +0 -0
  53. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/samplepaper.tex +0 -0
  54. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/doc/splncs04.bst +0 -0
  55. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/dqntutorial/EpsDecay.eps +0 -0
  56. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/dqntutorial/PendulumDuration.eps +0 -0
  57. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/dqntutorial/QValuesPendulum.eps +0 -0
  58. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/dqntutorial/Q_TargetValuesPendulum.eps +0 -0
  59. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/examples/00_Classification.ipynb +0 -0
  60. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/examples/01_Classification_wCustomPredictionModels.ipynb +0 -0
  61. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/examples/02_Classification_wGridSearch.ipynb +0 -0
  62. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/examples/03_Regression.ipynb +0 -0
  63. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/icons/icon.png +0 -0
  64. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/icons/icon.svg +0 -0
  65. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/src/ltfmselector/__init__.py +0 -0
  66. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/src/ltfmselector/env.py +0 -0
  67. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/src/ltfmselector/py.typed +0 -0
  68. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/src/ltfmselector/utils.py +0 -0
  69. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/tests/test_regression_tol.py +0 -0
  70. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/tests/test_save_load.py +0 -0
  71. {ltfmselector-0.2.1 → ltfmselector-0.2.2}/tests/utils_fortesting.py +0 -0
@@ -0,0 +1,49 @@
1
+ # Python-generated files
2
+ dqntutorial/video*
3
+ __pycache__/
4
+ *.py[oc]
5
+ build/
6
+ dist/
7
+ wheels/
8
+ *.egg-info
9
+ *.ipynb_checkpoints/
10
+
11
+ # Virtual environments
12
+ .venv
13
+ uv.lock
14
+ .python-version
15
+
16
+ # PyTorch
17
+ runs*
18
+
19
+ # Dev
20
+ predictionModels.py
21
+ stdutils
22
+ examples/train.py
23
+ RBRHX_ModalScores.xlsx
24
+
25
+ # LaTeX
26
+ doc/figures/*converted-to.pdf
27
+ doc/*.toc
28
+ doc/*.dvi
29
+ doc/*.log
30
+ doc/*.aux
31
+ doc/*.ps
32
+ doc/*.pdf
33
+ doc/*~
34
+ doc/*.fls
35
+ doc/*.fdb_latexmk
36
+ doc/*.bbl
37
+ doc/*.blg
38
+ doc/*.glo
39
+ doc/*.ist
40
+ doc/*.acn
41
+ doc/*.bcf*
42
+ doc/*.bbl*
43
+ doc/*.run*
44
+ doc/*.acr
45
+ doc/*.alg
46
+ doc/*.glg
47
+ doc/*.gls
48
+ doc/*.out
49
+ doc/*.glsdefs
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: ltfmselector
3
- Version: 0.2.1
3
+ Version: 0.2.2
4
4
  Summary: Locally-Tailored Feature and Model Selector with Deep Q-Learning
5
5
  Project-URL: GitHub, https://github.com/RenZhen95/ltfmselector/
6
6
  Author-email: RenZhen95 <j-liaw@hotmail.com>
@@ -29,9 +29,11 @@ License-File: LICENSE
29
29
  Requires-Python: >=3.12
30
30
  Requires-Dist: gymnasium>=1.1.1
31
31
  Requires-Dist: matplotlib>=3.10.1
32
+ Requires-Dist: moviepy>=2.2.1
32
33
  Requires-Dist: numpy>=2.2.4
33
34
  Requires-Dist: openpyxl>=3.1.5
34
35
  Requires-Dist: pandas>=2.2.3
36
+ Requires-Dist: pygame>=2.6.1
35
37
  Requires-Dist: scikit-learn<1.6
36
38
  Requires-Dist: seaborn>=0.13.2
37
39
  Requires-Dist: tensorboard>=2.20.0
@@ -0,0 +1,5 @@
1
+ Poststroke gait rehabilitation requires a personalized therapy, usually designed by an interdisciplinary medical team via time-consuming assessments \cite{raab2020,liaw2025}. An automated gait assessment tool based on gait measurements and interdisciplinary knowledge could allow for faster poststroke evaluation, while providing relevant feedback via objective analysis of a patient’s status. One major challenge of using gait data for this purpose is its high dimensionality, which is usually met by carrying out feature selection on a fixed feature set. Owing to the individual uniqueness of each patient in terms of physical and functional statuses \cite{lee2020}, we present a dynamic feature and model selection approach using reinforcement learning (RL).
2
+
3
+ \cite{lee2020} has, for instance, shown that when performing the ``Bring a Hand to Mouth''-exercise during stroke rehabilitation, different stroke patients compensate for the affected motion in different ways. Beyond inter-patient variability, the relevant biomarkers have also been shown to evolve alongside disease severity. \cite{pistacchi2017}, for instance, showed how reduced step lengths appeared to be a specific feature of Parkinson's disease in its early stages. As the disease progresses to its moderate stage, gait asymmetry, double-limb support, and increased cadence become more characteristic, followed by freezing of gait and reduced balance in its advanced stages. Notably, research \cite{huang2016,biase2020} has highlighted the necessity of adapting the analyzed gait parameters in tandem with the disease's condition.
4
+
5
+ In this work, we apply the Deep $Q$-Learning (DQL) algorithm by \cite{mnih2015} to select an optimal set of salient gait features, coupled with a corresponding prediction model to automatically assess gait poststroke, based on the interdisciplinary knowledge of a medical board. This dynamic approach allows for selecting patient-specific key features, which should be more beneficial than classically selecting a fixed subset of informative features \cite{lee2020,lee2021}. It is after all the therapist's goal to design a \emph{personalized} therapy plan. Moreover, simply presenting a multitude of variables can easily overwhelm a therapist and hinder them from obtaining useful insights \cite{lee2021}. This could in turn aid a clinician in saving precious time \cite{lee2021}, especially in light of the current shortage of medical staff \cite{healthcareburden}.
@@ -0,0 +1,40 @@
1
+ RL is a branch of machine learning focused on optimizing control laws or policies through sequential interactions with an environment to achieve a long-term objective \cite{mnih2015,sutton2018,brunton2022}. As shown in Figure \ref{fig:RLSchematic}, an \emph{agent} senses the \emph{state} of its \emph{environment}, and learns to take \emph{actions} that maximize \emph{cumulative future rewards}.
2
+ \begin{figure}[h!]
3
+ \centering
4
+ \begin{overpic}[width=1.0\columnwidth]{ReinforcementLearning.eps}
5
+ % Nouns
6
+ \put(61, 34){\emph{ENVIRONMENT}}
7
+ \put(2, 26){\emph{AGENT}}
8
+ \put(1, 18.5){\emph{STATE}, $\nvec{s}$}
9
+ \put(36, 27){\emph{POLICY}}
10
+ \put(37, 25){$\policy$}
11
+ \put(35, 38){\emph{REWARD}, $R$}
12
+ % Verbs
13
+ \put(40, 0){Observe \emph{STATE}, $\nvec{s}$}
14
+ \put(48, 24.5){Perform}
15
+ \put(48, 21.5){\emph{ACTION}, $a$}
16
+ % Agent
17
+ \put(17.5, 24){$\dot{x}$}
18
+ \put(17.5, 21){$x$}
19
+ \put(17.5, 17){$\varphi$}
20
+ \put(17.5, 13){$\dot{\varphi}$}
21
+ \put(41, 21){$+F$}
22
+ \put(41, 16){$-F$}
23
+ % Environment
24
+ \put(92, 11){$x$}
25
+ \put(66, 30){$y$}
26
+ \put(73, 6.5){$x$}
27
+ \put(84, 7){$\dot{x}$}
28
+ \put(68, 12){$a=F$}
29
+ \put(84, 29){$\varphi$}
30
+ \put(88, 23){$\dot{\varphi}$}
31
+ \put(65, 22){$\nvec{s}=\begin{bmatrix}x \\ \dot{x} \\ \varphi \\ \dot{\varphi}\end{bmatrix}$}
32
+ % Parameters
33
+ \put(90, 30){$m_p$}
34
+ \put(86, 17){massless}
35
+ \put(92, 14.5){rod, $\ell$}
36
+ \put(87, 13){$m_c$}
37
+ \end{overpic}
38
+ \caption{Schematic of RL, where an agent senses its environmental state $\nvec{s}$ and performs an action $a$, according to a policy $\policy$ that is optimized through learning to maximize cumulative future rewards $R$. In recent works, a typical approach to represent the policy $\policy$ is to use a deep neural network. Such a policy is known as a \emph{deep policy network}. Figure adapted from \cite{brunton2022}.}
39
+ \label{fig:RLSchematic}
40
+ \end{figure}
@@ -0,0 +1,39 @@
1
+ The environment is represented by the state $\nvec{s}_t$ at the current time-step $t$. The agent performs an action $a_t$ according to a learned policy $\policy$, which results in the current state $\nvec{s}_t$ evolving to the next state $\nvec{s}_{t+1}$. Consequently, the agent receives an appropriate reward $R_{t+1} \in \mathbb{R}$ one time-step later \cite{sutton2018,brunton2022}. These collectively form an \emph{experience}, usually expressed as a tuple $\boldsymbol{e}_t = \left( \nvec{s}_t \, , a_t \, , \nvec{s}_{t+1} \, , R_{t+1} \right)$; such experiences make up the knowledge an agent amasses from interacting with the environment \cite{almahamid2021}. The agent-environment interaction thereby yields a \emph{trajectory}, as shown in (\ref{eq:Trajectory}) \cite{sutton2018,brunton2022}
2
+ \begin{equation}
3
+ \begin{aligned}
4
+ \nvec{s}_0 \,, a_0 \,, R_1 \,, & & \nvec{s}_1 \,, a_1 \,, R_2 \,, & & \nvec{s}_2 \,, a_2 \,, R_3 \,, \cdots \,, \\
5
+ \boldsymbol{e}_0 \,, & & \boldsymbol{e}_1 \,, & & \boldsymbol{e}_2 \,, \cdots \,.
6
+ \end{aligned}
7
+ \label{eq:Trajectory}
8
+ \end{equation}
9
+ Formally, the environment evolves according to a \emph{Markov decision process}, where the random variables $\nvec{s}_{t} \in \mathcal{S}$ and $R_t \in \mathcal{R}$ each have well-defined discrete probability distributions that depend only on the preceding state and action \cite{sutton2018,brunton2022}. They evolve according to (\ref{eq:MDPDynamics}) and (\ref{eq:RewardFunction}), respectively.
10
+ \begin{equation}
11
+ P(\nvec{s}', r \,|\, \nvec{s}, a) = \mathrm{Pr}\left\{ \nvec{s}_{t+1} = \nvec{s}' , R_{t+1} = r \,|\, \nvec{s}_t = \nvec{s}, a_t = a \right\}
12
+ \label{eq:MDPDynamics}
13
+ \end{equation}
14
+ \begin{equation}
15
+ r(\nvec{s}, a) = \mathbb{E} \left[ R_{t+1} \,|\, \nvec{s}_{t}=\nvec{s}, a_t = a \right] = \sum_{r \in \mathcal{R}} r \sum_{\nvec{s}' \in \mathcal{S}} P(\nvec{s}', r \,|\, \nvec{s}, a)
16
+ \label{eq:RewardFunction}
17
+ \end{equation}
18
+ In short, the goal of RL is to maximize the expected discounted \emph{return} $G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+1+k}$, where $\gamma$ denotes the \emph{discount rate} \cite{sutton2018,brunton2022}. In episodic environments, each episode ends in a special \emph{terminal state}, when the agent encounters a \emph{termination condition} $T_{\text{end}}$ \cite{almahamid2021}. This is followed by a reset to a standard or random starting state, and the agent begins with the next episode. To learn a policy that maximizes the return $G$, the ``desirability'' of being in a given state $\nvec{s}_t$ is quantified via the \emph{value function} (\ref{eq:ValueFunction})
19
+ \begin{equation}
20
+ \begin{split}
21
+ V_{\policy} (\nvec{s}) &= \mathbb{E}_{\policy} \left[ G_{t} \left. \, \right\rvert \, \nvec{s}_t = \nvec{s} \right] \\
22
+ &= \mathbb{E}_{\policy} \left[ R_{t+1} + \gamma G_{t+1} \left. \, \right\rvert \, \nvec{s}_t = \nvec{s} \right] \\
23
+ &= \sum_{a} \policy(a \,|\, \nvec{s}) \sum_{\nvec{s}'} \sum_{r} P(\nvec{s}', r \,|\, \nvec{s}, a) \left[ r + \gamma\mathbb{E}_{\policy} \left[ G_{t+1}|\nvec{s}_{t+1}=\nvec{s}' \right] \right] \\
24
+ &= \sum_{a} \policy(a \,|\, \nvec{s}) \sum_{\nvec{s}', r} P(\nvec{s}', r \,|\, \nvec{s}, a) \left[ r + \gamma V_{\policy}(\nvec{s}') \right] \,,
25
+ \text{ for all } \nvec{s} \in \mathcal{S} \, ,
26
+ \end{split}
27
+ \label{eq:ValueFunction}
28
+ \end{equation}
29
+ which describes the expected discounted return when starting from $\nvec{s}$ and following $\policy$ thereafter \cite{sutton2018}. The value function under the optimal policy $\optpolicy$ can thus be written as
30
+ \begin{equation}
31
+ \begin{split}
32
+ V_{\optpolicy} (\nvec{s}) &= \max_{a} \mathbb{E} \left[R_{t+1} + \gamma V_{\optpolicy}(\nvec{s}') \, \rvert \, \nvec{s}_{t}=\nvec{s} \,, a_{t}=a \right] \\
33
+ &= \max_{a} \sum_{\nvec{s}', r} P(\nvec{s}', r \,|\, \nvec{s}, a) \left[ r + \gamma V_{\optpolicy}(\nvec{s}') \right] \,.
34
+ \end{split}
35
+ \label{eq:BellmanEq}
36
+ \end{equation}
37
+ Equation (\ref{eq:BellmanEq}) is known as the \emph{Bellman equation}, which possesses an important property. It can be recursively broken down for every subsequence of steps, which implies that an optimal control policy for a multi-step procedure must also be locally optimal for every subsequence of steps. This allows for solving a large optimization problem by locally optimizing every subsequence \cite{bellman1966,sutton2018,brunton2022}. The discount factor $\gamma$ guides the agent towards learning a behavior that balances the trade-off between immediate gratification and long-term strategic gains \cite{sutton2018}. Taking chess as an example, the agent is willing to make sacrifices that result in temporarily unfavorable positions, in order to achieve the ultimate goal of checkmating the opponent \cite{huegle2022}.
38
+
39
+ Classically, the value function $V$ is computed iteratively and used to search for better policies via methods of dynamic programming \cite{sutton2018}. Classical dynamic programming is however of limited utility for two main reasons \cite{sutton2018,brunton2022}, namely $(i)$ the assumption of a perfect model, i.e. \emph{a priori} knowledge of the environmental transition dynamics $P(\nvec{s}', r \,|\, \nvec{s}, a)$, and $(ii)$ memory and computational constraints for handling large and combinatorial state spaces. One approach to dealing with these issues is to apply a function approximator to model the value function based on gathered experiences \cite{sutton2018,mnih2015}. There exist various RL algorithms in the literature \cite{almahamid2021}, each suited for different environment types. The choice of RL algorithm depends mainly on $(i)$ the number of states and $(ii)$ the action types. For this work, which involves $(i)$ an environment comprising an \emph{unlimited} number of states, $(ii)$ an agent that performs \emph{discrete} actions, and $(iii)$ no \emph{a priori} knowledge of the environment dynamics, the DQL algorithm by \cite{mnih2015} is best suited. The equations in the next sections will be formulated deterministically (i.e. $P(\nvec{s}', r \,|\, \nvec{s}, a)=1$).
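To make the return concrete, here is a minimal Python sketch (illustrative only, not part of the package) that computes the discounted return for a short reward sequence, both directly from its definition and via the recursion G_t = R_{t+1} + gamma * G_{t+1} used in the value function above:

def discounted_return(rewards, gamma=0.99):
    """G_0 = sum_k gamma^k * R_{1+k}, computed directly from the definition."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def returns_per_step(rewards, gamma=0.99):
    """All returns G_0, ..., G_{T-1}, computed backwards via G_t = R_{t+1} + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

rewards = [1.0, 1.0, 1.0]            # rewards received after each of three steps
print(discounted_return(rewards))    # 1 + 0.99 + 0.99**2 = 2.9701
print(returns_per_step(rewards))     # [2.9701, 1.99, 1.0]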
@@ -0,0 +1,66 @@
1
+ To begin, the \emph{quality function}, also referred to as the \emph{action-value} function \cite{watkins1992,mnih2015,sutton2018,almahamid2021} is defined as $Q_{\policy} (\nvec{s}, a) = R(\nvec{s}, a) + \gamma V_{\policy}(\nvec{s}')$, which describes the \emph{joint-desirability} of performing the action $a$ for the given state $\nvec{s}$. Following this formulation, the agent selects the action $a$ that yields the maximum $Q$-value for the given state $\nvec{s}$ as
2
+ \begin{equation}
3
+ \policy (\nvec{s}) = \argmax_{a} Q_{\policy} (\nvec{s}, a) \, .
4
+ \label{eq:QLearningAction}
5
+ \end{equation}
6
+ The goal in \emph{Q-Learning} \cite{watkins1992,almahamid2021} is for the agent to learn an optimal policy that maximizes the action-value function
7
+ \begin{equation}
8
+ \begin{aligned}
9
+ Q_{\optpolicy} (\nvec{s}, a) = \max_{\policy} Q_{\policy} (\nvec{s}, a) &= R(\nvec{s}, a) & &+ \, \gamma \max_{a'} Q_{\optpolicy} (\nvec{s}', a') \\
10
+ &= r & &+ \, \gamma \max_{a'} Q_{\optpolicy} (\nvec{s}', a') \,,
11
+ \end{aligned}
12
+ \end{equation}
13
+ which yields the following intuition. If the optimal value $Q_{\optpolicy} (\nvec{s}', a')$ for the state $\nvec{s}'$ at the next time-step is known for all possible actions $a'$, then the optimal strategy is to simply select the action $a'$ that maximizes the value of $r + \gamma Q_{\optpolicy} (\nvec{s}', a')$ \cite{mnih2015}.
14
+
15
+ In the original $Q$-Learning, the optimal action-value function is obtained by maintaining the $Q$-values in a $Q$-Table and updating them iteratively \cite{watkins1992,almahamid2021}. \cite{mnih2015} proposed DQL, which uses a deep convolutional neural network to approximate the action-value function $Q_{\policy} (\nvec{s}, a) \approx Q_{\policy} (\nvec{s}, a; \boldsymbol{\theta})$ through some parameterization $\boldsymbol{\theta}$. The neural network function approximator with parameters $\boldsymbol{\theta}$ is referred to as the \emph{$Q$-network}, where $\boldsymbol{\theta}$ is updated by minimizing the loss function (\ref{eq:DQNLossFunction})
16
+ \begin{equation}
17
+ \min_{\boldsymbol{\theta}} \dfrac{1}{|\mathcal{B}|} \sum_{\boldsymbol{e} \in \mathcal{B}} \left[ \left( r + \gamma \max_{a'} Q_{\policy} (\nvec{s}', a'; \boldsymbol{\theta}) \right) - Q_{\policy} (\nvec{s}, a; \boldsymbol{\theta}) \right]^2 \, ,
18
+ \label{eq:DQNLossFunction}
19
+ \end{equation}
20
+ over a batch of samples $\mathcal{B}$ \cite{mnih2015}. The term $\left( r + \gamma \max_{a'} Q_{\policy} (\nvec{s}', a'; \boldsymbol{\theta}) \right)$ is referred to as the \emph{target value}. To deal with the well-known instability of using deep neural networks in RL, \cite{mnih2015} introduced two key ideas, namely $(i)$ updating the neural network over \emph{randomly sampled experiences} of the agent-environment interactions and $(ii)$ only \emph{periodically updating} the neural network towards the target values. The first idea involves storing the agent's experiences $\boldsymbol{e}_t$ at each time-step $t$ into a \emph{replay memory} $D_t = \left\{ \boldsymbol{e}_1\,, \boldsymbol{e}_2\,, \cdots\,, \boldsymbol{e}_t \right\}$, from which a batch of experiences $\mathcal{B}_{D} \subseteq D$ is randomly sampled to update the $Q$-network \cite{mnih2015}. This helps break the correlations between consecutive experiences, thus preventing undesired feedback loops during learning \cite{mnih2015}. The second idea is implemented by using a clone of the $Q$-network, termed the \emph{target network} $\hat{Q}_{\policy}$, to generate the target values, whose parameters $\boldsymbol{\theta}^{-}$ follow the parameters $\boldsymbol{\theta}$ of the $Q$-network with a slight delay. This helps the learning converge \cite{mnih2015}. Following the suggestion by \cite{lillicrap2015}, the parameters $\boldsymbol{\theta}^{-}$ are updated ``softly'' according to $\boldsymbol{\theta}^{-} = \tau \boldsymbol{\theta} + \left(1 - \tau \right) \boldsymbol{\theta}^{-}$, where $\tau$ denotes the \emph{soft target update rate}. The $Q$-network's weights are thereby updated according to
21
+ \begin{equation}
22
+ \min_{\boldsymbol{\theta}} \dfrac{1}{|\mathcal{B}_{D}|} \sum_{\boldsymbol{e} \in \mathcal{B}_D} \left[ \left( r + \gamma \max_{a'} \hat{Q}_{\policy} (\nvec{s}', a'; \boldsymbol{\theta}^{-}) \right) - Q_{\policy} (\nvec{s}, a; \boldsymbol{\theta}) \right]^2 \, .
23
+ \label{eq:DQNLossFunction2}
24
+ \end{equation}
25
+ To promote exploration, the agent's action is selected according to an \emph{$\epsilon$-greedy} algorithm, where the parameter $\epsilon$ denotes the probability of the agent performing a random action instead of the maximizing action according to (\ref{eq:QLearningAction}) \cite{mnih2015,brunton2022}. As the $Q$-function improves over the course of training, $\epsilon$ decays exponentially according to $\epsilon = \left( {\epsilon}_{\text{initial}} - {\epsilon}_{\text{final}} \right) e^{-\frac{t_c}{{\epsilon}_{\text{decay}}}} + {\epsilon}_{\text{final}}$, allowing the agent to increasingly choose the maximizing action \cite{brunton2022}. Here, $t_c$ denotes the cumulative number of time-steps over episodes, whereas ${\epsilon}_{\text{initial}}$, ${\epsilon}_{\text{final}}$, and ${\epsilon}_{\text{decay}}$ denote the initial value, final value, and decay rate of $\epsilon$, respectively.
26
+
27
+ % Algorithm \ref{alg:DQN} shows how DQL is implemented in this work, combined with experience replay and an $\epsilon$-greedy algorithm for selecting actions. The $Q$-network is updated at every time-step $t$, provided the memory $D$ contains at least the number of user-specified batch size for training $|\mathcal{B}_D|$. Moreover, the memory $D$ is implemented in practice as a finite-sized cache which stores only the $N$ most recent experiences, discarding the oldest samples as new ones are added \cite{mnih2015,lillicrap2015}.
28
+ % \begin{algorithm}[!t]
29
+ % \caption{DQL with experience replay, combined with an $\epsilon$-greedy algorithm for promoting random exploration \cite{mnih2015}. The notations $\nvec{s}_{k,t}$, $a_{k,t}$, $\nvec{s}_{k,t+1}$, $r_{k,t}$, and $y_{k,t}$ denote the state, action, next state, reward, and target values of the $k$-th episode, at time-step $t$ respectively.}
30
+ % \label{alg:DQN}
31
+ % \begin{algorithmic}
32
+ % \State Initialize number of episodes $K$
33
+ % \State Initialize replay memory $D$ with capacity $N$
34
+ % \State Initialize discount rate $\gamma$
35
+ % \State Initialize batch size $|\mathcal{B}_{D}|$ for updating parameters $\boldsymbol{\theta}$
36
+ % \State Initialize $Q$-network $Q_{\policy}$ with random weights $\boldsymbol{\theta}$
37
+ % \State Initialize target network $\hat{Q}_{\policy}$ with weights $\boldsymbol{\theta}^{-} = \boldsymbol{\theta}$
38
+ % \State Initialize soft target update rate $\tau$
39
+ % \State Initialize $\epsilon$ with parameters ${\epsilon}_{\text{initial}}$, ${\epsilon}_{\text{final}}$, and ${\epsilon}_{\text{decay}}$ for random exploration
40
+ % \State Initialize counter for cumulative time-steps over episodes $t_c = 1$
41
+ % \For{$k := 1$ to $K$} \Comment for each $k$-th episode
42
+ % \State Initialize time-step $t=1$
43
+ % \State Initialize initial state $\nvec{s}_{k,t=1}$
44
+ % \While{$T_{\text{end}}$ is false} \Comment termination condition for $k$-th episode not fulfilled
45
+ % \State With probability of $\epsilon$ select a random action $a_{k,t}$,
46
+ % \State $\quad$ otherwise $a_{k,t} = \max_{a_{k,t}} Q(\nvec{s}_{k,t}, a_{k,t})$
47
+ % \State Execute action $a_{k,t}$, and observe reward $r_{k,t}$ and next state $\nvec{s}_{k,t+1}$
48
+ % \State Store episode $\boldsymbol{e}_{k,t} = \left( \nvec{s}_{k,t} \, , a_{k,t} \, , \nvec{s}_{k,t+1} \, , r_{k,t} \right)$ in replay memory $D$
49
+ % \If{$|D| \geq |\mathcal{B}_{D}|$} \Comment if the number of stored experiences is at least the batch size
50
+ % \State Sample minibatch of random episodes $\boldsymbol{e}_{j} = \left( \nvec{s}_{j} \, , a_{j} \, , \nvec{s}_{j} \, , r_{j} \right)$ from $D$
51
+ % \If{$T_{end}$ is true} \Comment termination condition for $k$-th episode fulfilled
52
+ % \State $y_{j} = r_{j}$
53
+ % \Else
54
+ % \State $y_{j} = r_{j} + \gamma \max_{a_{j}'} \hat{Q}_{\policy} (\nvec{s}_{j}', a_{j}'; \boldsymbol{\theta}^{-})$
55
+ % \EndIf
56
+ % \State Perform a gradient descent step on $\left( y_{j} - Q_{\policy} (\nvec{s}_j, a_j; \boldsymbol{\theta}) \right)^2$
57
+ % \State $\quad$ with respect to $Q$-network parameters $\boldsymbol{\theta}$
58
+ % \EndIf
59
+ % \State Update parameters of target network $\boldsymbol{\theta}^{-}$ towards $\boldsymbol{\theta}$ according to (\ref{eq:TargetUpdates})
60
+ % \State Update $\epsilon$ according to (\ref{eq:EpsilonExpDecay})
61
+ % \State Update time-step counter $t = t + 1$
62
+ % \State Update cumulative time-step counter $t_c = t_c + 1$
63
+ % \EndWhile
64
+ % \EndFor
65
+ % \end{algorithmic}
66
+ % \end{algorithm}
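The update described above (replay memory, epsilon-greedy exploration with exponential decay, squared-error loss against the target network, and the soft target update) can be sketched in PyTorch roughly as follows. This is a minimal illustration under the stated ideas, not the implementation shipped in src/ltfmselector/ltfmselector.py; all class and function names are assumptions.

import math
import random
from collections import deque, namedtuple

import torch
import torch.nn as nn

Experience = namedtuple("Experience", ("state", "action", "next_state", "reward", "done"))

class ReplayMemory:
    """Finite-sized cache storing only the N most recent experiences."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Experience(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

def select_action(q_net, state, n_actions, t_c,
                  eps_initial=0.9, eps_final=0.05, eps_decay=1000.0):
    """Epsilon-greedy selection with exponentially decaying epsilon over cumulative steps t_c."""
    eps = (eps_initial - eps_final) * math.exp(-t_c / eps_decay) + eps_final
    if random.random() < eps:
        return random.randrange(n_actions)                          # explore
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())  # exploit

def dql_update(q_net, target_net, optimizer, memory,
               batch_size=128, gamma=0.99, tau=0.005):
    """One gradient step on the Q-network followed by a soft target-network update."""
    if len(memory) < batch_size:
        return
    batch = Experience(*zip(*memory.sample(batch_size)))
    s = torch.stack(batch.state)                        # (B, state_dim)
    a = torch.tensor(batch.action).unsqueeze(1)         # (B, 1)
    s_next = torch.stack(batch.next_state)
    r = torch.tensor(batch.reward, dtype=torch.float32)
    done = torch.tensor(batch.done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a).squeeze(1)             # Q(s, a; theta)
    with torch.no_grad():                               # target values come from the target network
        y = r + gamma * (1.0 - done) * target_net(s_next).max(1).values

    loss = nn.functional.mse_loss(q_sa, y)              # squared-error loss of the DQL objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft target update: theta_minus <- tau * theta + (1 - tau) * theta_minus
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), q_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)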
@@ -0,0 +1,93 @@
1
+ Consider the classical example of balancing an inverted pendulum on a cart by applying a series of forces to the cart. The \emph{environment} at a given time-step $t$ is represented by the state $\nvec{s}_t = \begin{smallmatrix} \begin{bmatrix} x_{t} & \dot{x}_{t} & \varphi_{t} & \dot{\varphi}_{t} \end{bmatrix}^T \end{smallmatrix}$, where the variables $x_{t}$, $\dot{x}_{t}$, $\varphi_{t}$, and $\dot{\varphi}_{t}$ denote the cart's position, the cart's velocity in the $x$-direction, the pendulum's angle with respect to the vertical, and the pendulum's angular velocity at time-step $t$, respectively. The available actions $a \in \mathcal{A} = \left\{ -F \,, +F \right\}$ are to apply a constant force $F$ to the cart in either the left or right direction. According to the performed action $a_{t} = {\policy}(\nvec{s}_{t} ; \boldsymbol{\theta}) = \argmax_{a} Q_{\policy} (\nvec{s}_{t}, a; \boldsymbol{\theta})$, the environmental state evolves to the next state $\nvec{s}_{t+1}$, as governed by the dynamical equations of the cart and pendulum shown in (\ref{eq:AngularAccCartPole}) and (\ref{eq:AccCartPole}). Frictional effects are neglected for the sake of simplicity.
2
+ \begin{equation}
3
+ \begin{split}
4
+ \ddot{\varphi}_{t} &= \frac{a_{t}\cos{{\varphi}_t} + m_p \dot{{\varphi}_t}^2 \ell \sin{{\varphi}_t} \cos{{\varphi}_t} - (m_c + m_p)g\sin{{\varphi}_t}}
5
+ { \dfrac{4}{3} \ell (m_c + m_p) - m_p \ell \cos{{\varphi}_t}^2}
6
+ \end{split}
7
+ \label{eq:AngularAccCartPole}
8
+ \end{equation}
9
+ \begin{equation}
10
+ \ddot{x}_{t} = \frac{1}{\cos{{\varphi}_t}} \left[ \dfrac{4}{3} \ell \ddot{{\varphi}_t} + g\sin{{\varphi}_t} \right]
11
+ \label{eq:AccCartPole}
12
+ \end{equation}
13
+ The mass of the cart is denoted by $m_c$, and the pendulum is modelled as a massless rod of length $\ell$ with a point mass $m_p$ fixed on one end, as shown in Figure \ref{fig:RLSchematic}, and the other end attached to the cart by a revolute joint. For this example, the environment evolves deterministically and the next state $\nvec{s}_{t+1}$ can be obtained via numerical integration (e.g. the explicit Euler method). The reward function is defined as shown in (\ref{eq:RewardFunctionCartPole})
14
+ \begin{equation}
15
+ R(\nvec{s}_{t+1}) =
16
+ \begin{cases}
17
+ +1 & \text{if } |{\varphi}_{t+1}| < \varphi^* \\
18
+ 0 & \text{if } |{\varphi}_{t+1}| \geq \varphi^*
19
+ \end{cases} \,,
20
+ \label{eq:RewardFunctionCartPole}
21
+ \end{equation}
22
+ and the termination condition in (\ref{eq:TerminationConditionCartPole})
23
+ \begin{equation}
24
+ T_{\text{end}} =
25
+ \left\{
26
+ \begin{array}{rll}
27
+ \text{true} & \text{if } |{\varphi}_{t+1}| \geq \varphi^* & \text{(pendulum falls over)} \\
28
+ \text{true} & \text{if } |{x}_{t+1}| \geq x^* & \text{(positional limit of cart)} \\
29
+ \text{true} & \text{if } t = 500 & \text{(end simulation due to time-constraint)} \\
30
+ \text{false} & \text{otherwise} & \text{(pendulum kept balanced)}
31
+ \end{array}
32
+ \right. \,,
33
+ \label{eq:TerminationConditionCartPole}
34
+ \end{equation}
35
+ where $\varphi^*$ and $x^*$ denote thresholds for the angle of the pendulum with respect to the vertical and the position of the cart in the $x$-direction, respectively. Upon implementation, the duration for which the pendulum is kept upright increases over the course of training and ultimately reaches the user-defined maximum number of time-steps. One can see how the learning eventually converges to an optimal policy $\optpolicy$, as implied by the converging $Q$-values in Figure \ref{fig:QValuesProgression}.
36
+
37
+ % To implement this example, the environmental parameters, as well as the DQL agent hyperparameters are initialized with values as shown in Table \ref{tab:CartPoleParameters}. The policy and target networks were implemented as multilayer perceptron (MLP) with two hidden layers, each with 128 neurons. The agent is then subsequently trained according to Algorithm \ref{alg:DQN} over 750 episodes, where the parameters $\boldsymbol{\theta}$ of the $Q$-network are optimized using an AdamW optimizer \cite{loshchilov2017}, with the learning rate $l_r$. The learning was carried out on a 5.3 GHz Intel\textsuperscript{\textregistered{}} Core\texttrademark{} i9-10900K CPU, and implemented with the deep learning framework PyTorch \cite{pytorch}, as well as other libraries for applications in science and data analysis (e.g. pandas \cite{pandas}, SciPy \cite{scipy}, NumPy \cite{numpy}) in the Python programming language.
38
+
39
+ % As shown in Figure \ref{fig:PendulumDuration}, the duration (i.e. total time-steps) of the pendulum kept balanced increases over the course of training and even ultimately reaches the maximum permitted number of time-steps as set in (\ref{eq:TerminationConditionCartPole}). One can also observe how the agent progressively improves, and its learning eventually converges to an optimal policy $\optpolicy$, as implied in Figure \ref{fig:QValuesProgression} with the converging $Q$-Values.
40
+ % \begin{table}[H]
41
+ % \centering
42
+ % \begin{tabular}{p{0.65\textwidth}p{0.055\textwidth}p{0.1\textwidth}}
43
+ % \hline
44
+ % \multicolumn{3}{l}{\textbf{Environmental Parameters}} \\
45
+ % \hline
46
+ % Mass of cart & $m_c$ & 1.0 \si{kg} \\
47
+ % Mass of point mass on end of pole & $m_p$ & 0.1 \si{kg} \\
48
+ % Length of pole & $\ell$ & 1.0 \si{m} \\
49
+ % Magnitude of force applied to cart in $x$-direction & $F$ & 10 \si{N} \\
50
+ % Gravitational acceleration & $g$ & 9.8 \si{m/s^2} \\
51
+ % Threshold angle with respect to vertical & $\varphi^*$ & \ang{12} \\
52
+ % Threshold position of cart in $x$-direction & $x^*$ & 2.4 \si{m} \\
53
+ % Step size & $\Delta t$ & 0.02 \si{s} \\
54
+ % \hline
55
+ % \multicolumn{3}{l}{\textbf{Agent Hyperparameters}} \\
56
+ % \hline
57
+ % Number of episodes & $K$ & 750 \\
58
+ % Number of experiences stored in replay memory & $N$ & 10000 \\
59
+ % Discount rate & $\gamma$ & 0.99 \\
60
+ % Batch size of experiences drawn from replay memory & $|\mathcal{B}_{D}|$ & 128 \\
61
+ % Learning rate of AdamW optimizer & $l_r$ & \num{1e-4} \\
62
+ % Soft target update rate & $\tau$ & 0.005 \\
63
+ % Initial probability $\epsilon$ for random exploration & $\epsilon_{\text{initial}}$ & 0.9 \\
64
+ % Final probability $\epsilon$ for random exploration & $\epsilon_{\text{final}}$ & 0.05 \\
65
+ % Decay rate of probability $\epsilon$ for random exploration & $\epsilon_{\text{decay}}$ & 1000 \\
66
+ % \hline
67
+ % \end{tabular}
68
+ % \caption{Environmental parameters for the example of balancing an inverted pendulum on a cart, and the learning hyperparameters of the DQL agent.}
69
+ % \label{tab:CartPoleParameters}
70
+ % \end{table}
71
+ % \begin{figure}[H]
72
+ % \centering
73
+ % \vspace{-1.0em}
74
+ % \includegraphics[width=1.0\textwidth]{PendulumDuration}
75
+ % % Matplotlib Customized Settings
76
+ % % figsize=(6.25, 3.5)
77
+ % % loc='left', fontsize='large'
78
+ % \vspace{-1.5em}
79
+ % \caption{Duration of pendulum kept upright over training episodes.}
80
+ % \label{fig:PendulumDuration}
81
+ % \end{figure}
82
+ % \begin{figure}[H]
83
+ % \centering
84
+ % \vspace{-1.0em}
85
+ % \includegraphics[width=1.0\textwidth]{QValuesPendulum}
86
+ % % Matplotlib Customized Settings
87
+ % % figsize=(5.5, 3.5)
88
+ % % ax.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
89
+ % % loc='left', fontsize='large
90
+ % \vspace{-1.5em}
91
+ % \caption{Progression of $Q$-values over the course of training.}
92
+ % \label{fig:QValuesProgression}
93
+ % \end{figure}
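A minimal Python sketch of one environment step for this cart-pole example, using the dynamics (AngularAccCartPole)-(AccCartPole), the reward (RewardFunctionCartPole), and the first two termination conditions of (TerminationConditionCartPole) with explicit Euler integration. Parameter values follow the (commented) parameter table; the 500-step time limit would be enforced by the surrounding training loop, and all names are illustrative.

import math

# Parameter values as listed in the commented parameter table of this section
M_C, M_P, L, F, G = 1.0, 0.1, 1.0, 10.0, 9.8
PHI_STAR, X_STAR, DT = math.radians(12.0), 2.4, 0.02

def step(state, force):
    """One explicit-Euler step of the cart-pole dynamics.

    state = (x, x_dot, phi, phi_dot); force = +F or -F (the chosen action).
    Returns (next_state, reward, terminated).
    """
    x, x_dot, phi, phi_dot = state
    sin_phi, cos_phi = math.sin(phi), math.cos(phi)

    # Angular acceleration of the pendulum, eq. (AngularAccCartPole)
    phi_ddot = (force * cos_phi
                + M_P * phi_dot ** 2 * L * sin_phi * cos_phi
                - (M_C + M_P) * G * sin_phi) / ((4.0 / 3.0) * L * (M_C + M_P) - M_P * L * cos_phi ** 2)
    # Acceleration of the cart, eq. (AccCartPole)
    x_ddot = ((4.0 / 3.0) * L * phi_ddot + G * sin_phi) / cos_phi

    # Explicit Euler integration with step size DT
    next_state = (x + DT * x_dot,
                  x_dot + DT * x_ddot,
                  phi + DT * phi_dot,
                  phi_dot + DT * phi_ddot)

    reward = 1.0 if abs(next_state[2]) < PHI_STAR else 0.0                    # eq. (RewardFunctionCartPole)
    terminated = abs(next_state[2]) >= PHI_STAR or abs(next_state[0]) >= X_STAR
    return next_state, reward, terminated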
@@ -0,0 +1 @@
1
+ Drawing some inspiration from the described example, DQL is applied in this work to select an optimal subset of the extracted gait features, coupled with a corresponding prediction (supervised learning) model to automatically assess gait poststroke. Earlier works by \cite{lee2020,lee2021} have applied DQL to develop a clinical decision support system (CDSS) that automatically assesses a patient's ability in performing functional exercises, while delivering patient-specific relevant features for each corresponding task. In contrast, the method developed here is $(i)$ applied to the context of gait assessment poststroke and $(ii)$ extended to include model selection. Model selection in this work covers both the selection of a \emph{learning algorithm} and the subsequent \emph{hyperparameter tuning}. The issue of model selection is often described as more an art than a strict science \cite{raschka2020}, as there exists no single ``best'' optimization algorithm across all problem spaces or datasets. Consequently, the selection of a model should be guided by the characteristics of the dataset. The learning algorithm and its corresponding optimal hyperparameters will be referred to as the \emph{prediction model} (PM) in the remainder of this paper.
@@ -0,0 +1,58 @@
1
+ The environment is formulated as an episodic partially observable Markov decision process (POMDP), where an agent gradually learns by autonomously exploring different combinations of feature subsets and PMs to assess a patient's gait poststroke based on the extracted gait features. At each $k$-th episode, a sample is randomly selected, yielding a $D$-dimensional \emph{sample} $\sample_{k}$ comprised of the extracted gait features, alongside the corresponding medical-board's gait assessment $y_k$. The \emph{feature values} $x_{k,d}$ of the feature ${\feature}_{d} \in \fset$ make up the elements of $\sample_{k}$. The environment at a given time-step $t$ of episode $k$ is represented by the state formulated in (\ref{eq:ltfmstate})
2
+ \begin{equation}
3
+ \begin{split}
4
+ \nvec{s}_{t} &= \begin{bmatrix} {\nvec{\mathtt{X}}_t}^T & {\nvec{\mathtt{F}}_t}^T & \mathtt{G}_t \end{bmatrix}^T \\
5
+ &= \begin{bmatrix} \mathtt{X}_{1,t} & \mathtt{X}_{2,t} & \cdots & \mathtt{X}_{D,t} & \mathtt{F}_{1,t} & \mathtt{F}_{2,t} & \cdots & \mathtt{F}_{D,t} & \mathtt{G}_{t} \end{bmatrix}^T \,.
6
+ \end{split}
7
+ \label{eq:ltfmstate}
8
+ \end{equation}
9
+ The episodic index will be dropped here for purposes of brevity. The vectors $\nvec{\mathtt{X}}_t$ and $\nvec{\mathtt{F}}_t$ denote the \emph{observed values} and \emph{observed features}, respectively. The \emph{selected PM} is denoted by $\mathtt{G}_{t}$. The agent is allowed to perform actions $a \in \mathcal{A}$, where the set $\mathcal{A}$ comprises three types of actions as shown in (\ref{eq:ltfmactions})
10
+ \begin{equation}
11
+ \mathcal{A} = \begin{Bmatrix}
12
+ \overbrace{a_0}^{\text{make prediction}} &
13
+ \underbrace{\begin{matrix} a_1 & a_2 & \cdots & a_D \end{matrix}}_{\text{recruit feature}} &
14
+ \underbrace{\begin{matrix} a_{D+1} & a_{D+2} & \cdots & a_{D+|\mathcal{G}|} \end{matrix}}_{\text{select PM}}
15
+ \end{Bmatrix} \,.
16
+ \label{eq:ltfmactions}
17
+ \end{equation}
18
+ The action $a_0$ fits the chosen PM $\mathtt{G}$ to the training dataset, which has been trimmed to only include the selected subset of features $\fsubset$. The fitted PM is then used to predict $\hat{y}_k$, followed by computing the absolute error between the predicted and the medical-board's gait assessment, $\left| \hat{y}_k - y_k \right|$, which is used to formulate the reward function. Actions $\left\{ a_{d} \,|\, 1 \leq d \leq D \right\}$ recruit feature ${\feature}_d$ to the subset of selected features $\fsubset$, whereas actions $\left\{ a_{j} \,|\, D+1 \leq j \leq D+|\mathcal{G}| \right\}$ choose a PM $g_{j-D} \in \pmset$. The set ${\pmset} = \begin{Bmatrix} g_1 & g_2 & \cdots & g_{|\mathcal{G}|}\end{Bmatrix}$ denotes the set of possible PMs from which the agent can choose. The dynamics of the POMDP evolve deterministically according to (\ref{eq:ltfmdynamics}).
19
+ \begin{align}
20
+ P(\nvec{s}, a) &=
21
+ \begin{cases}
22
+ \mathtt{X}_{d} = x_{d} \text{ and } \mathtt{F}_{d} = 1 & \text{if } a = a_{d} \text{, where } 1 \leq d \leq D \\
23
+ \mathtt{G} = g_{j-D} & \text{if } a = a_{j} \text{, where } D+1 \leq j \leq D+|\mathcal{G}| \\
24
+ T_{\text{end}} & \text{if } a = a_0 \\
25
+ \end{cases}
26
+ \label{eq:ltfmdynamics}
27
+ \end{align}
28
+ The episode begins with no features recruited and the initial state $\nvec{s}_0$ as shown in (\ref{eq:ltfminitialstate}).
29
+ \begin{equation}
30
+ \begin{split}
31
+ \nvec{s}_0 &= \begin{bmatrix} {\nvec{\mathtt{X}}_0}^T & {\nvec{\mathtt{F}}_0}^T & \mathtt{G}_{0} \end{bmatrix}^T \\
32
+ &= \begin{bmatrix} \mathtt{X}_{1,0} & \mathtt{X}_{2,0} & \cdots & \mathtt{X}_{D,0} & \mathtt{F}_{1,0} & \mathtt{F}_{2,0} & \cdots & \mathtt{F}_{D,0} & \mathtt{G}_{0} \end{bmatrix}^T \\
33
+ &= \begin{bmatrix} \bar{x}_{1} & \bar{x}_{2} & \cdots & \bar{x}_{D} & 0 & 0 & \cdots & 0 & \mathtt{G}_{0} \end{bmatrix}^T \,,
34
+ \end{split}
35
+ \label{eq:ltfminitialstate}
36
+ \end{equation}
37
+ where $\bar{x}_d$ denotes the average feature value of the feature ${\feature}_d$ computed from the training dataset. The observed feature $\mathtt{F}_d$ assumes the value $0$ to indicate the exclusion of feature ${\feature}_d$ from the set of selected features $\fsubset$, and the PM $\mathtt{G}_{0}$ is initialized with a randomly chosen prediction model $g$ from the set $\mathcal{G}$. Consider, as an example, a POMDP comprised of three features and two PM options. At the current time-step $t={\xi}$, the recruited features include $\fsubset = \left\{ {\feature}_{1} \,, {\feature}_{3} \right\}$, and the PM $\mathtt{G}_{\xi} = g_1$ is chosen. This would consequently yield the state $\nvec{s}_{\xi}$
38
+ \begin{equation}
39
+ \begin{split}
40
+ \nvec{s}_{\xi} &= \begin{bmatrix} \mathtt{X}_{1,\xi} & \mathtt{X}_{2,\xi} & \mathtt{X}_{3,\xi} & \mathtt{F}_{1,\xi} & \mathtt{F}_{2,\xi} & \mathtt{F}_{3,\xi} & \mathtt{G}_{\xi} \end{bmatrix}^T \\
41
+ &= \begin{bmatrix} x_{1} & \bar{x}_{2} & x_{3} & 1 & 0 & 1 & g_1 \end{bmatrix}^T \,.
42
+ \end{split}
43
+ \label{eq:ltfmexamplestate}
44
+ \end{equation}
45
+ The reward function is as given in (\ref{eq:ltfmreward}).
46
+ \begin{align}
47
+ R(\nvec{s}, a) &=
48
+ \begin{cases}
49
+ -\lambda c \left( {\feature}_{d} \right) & \text{if } a = a_{d} \text{, where } 1 \leq d \leq D \\
50
+ -c_{\fsubset} & \text{if } a = a_{d} \text{ and } {\feature}_{d} \in \fsubset \text{, where } 1 \leq d \leq D \\
51
+ -\lambda_{g} c_{g} \left( g_{j-D} \right) & \text{if } a = a_{j} \text{, where } D+1 \leq j \leq D+|\mathcal{G}| \\
52
+ 0 & \text{if } a = a_0 \text{, with } \left| \hat{y}_k - y_k \right| < \Delta \\
53
+ -\left| \hat{y}_k - y_k \right| & \text{if } a = a_0 \text{, with } \left| \hat{y}_k - y_k \right| \geq \Delta \\
54
+ -c_{g, \emptyset} & \text{if } a = a_0 \text{, with } \fsubset = \emptyset
55
+ \end{cases}
56
+ \label{eq:ltfmreward}
57
+ \end{align}
58
+ The penalties of recruiting a feature ${\feature}_d$ and choosing a PM $g_{j-D}$ are denoted by $c \left( {\feature}_{d} \right)$ and $c_{g} \left( g_{j-D} \right)$, respectively. They are multiplied by their respective factors $\lambda$ and $\lambda_{g}$. The agent is penalized by $c_{\fsubset}$ if it decides to select a feature ${\feature}_d$ that has already been recruited into $\fsubset$. When the agent decides to make a prediction, it is penalized by $\left| \hat{y}_k - y_k \right|$ if the absolute error is larger than or equal to a user-determined error tolerance $\Delta$. Otherwise, the agent is not penalized. However, if the agent decides to make a prediction without recruiting any features, a penalty of $c_{g, \emptyset}$ is incurred. The chosen PM is then fitted to the entire training dataset, which is in turn used to perform inference on a ``background'' example comprised of the average feature values $\bar{\nvec{\mathtt{X}}} = \left[ \bar{x}_{1} \ \bar{x}_{2} \ \cdots \ \bar{x}_{D}\right]$, delivering a prediction $\hat{y}_k$.
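The reward logic of (ltfmreward) can be sketched as follows. This is an illustrative reimplementation with assumed names and an action encoding of 0 = predict, 1..D = recruit feature, D+1..D+|G| = select PM; it is not the environment implemented in src/ltfmselector/env.py.

def ltfm_reward(action, D, n_models, F_selected, abs_error=None,
                c_feat=0.01, c_model=0.01, c_repeat=1.0, c_empty=5.0, delta=0.5):
    """Reward for a single action, following eq. (ltfmreward); names are illustrative.

    F_selected : list of booleans marking the currently recruited features
    abs_error  : |y_hat - y|, only required when action == 0 (make a prediction)
    c_feat, c_model already include their multiplicative factors lambda and lambda_g.
    """
    if 1 <= action <= D:                     # recruit feature f_d
        d = action - 1
        return -c_repeat if F_selected[d] else -c_feat
    if D + 1 <= action <= D + n_models:      # select a prediction model g_{j-D}
        return -c_model
    # action == 0: make a prediction
    if not any(F_selected):                  # no features recruited yet
        return -c_empty
    return 0.0 if abs_error < delta else -abs_error

# e.g. ltfm_reward(action=3, D=680, n_models=3, F_selected=[False] * 680)  ->  -0.01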
@@ -0,0 +1,7 @@
1
+ The dataset used for modeling consists of 100 hemiparetic stroke patients who received a clinical examination and a full-body instrumented gait analysis. An interdisciplinary board of medical experts assigned each patient a Stroke Mobility Score (SMS) \cite{raab2020}, a multiple-cue clinical observational score comprised of six sub-scores, each pertaining to a functional criterion of gait. The medical-board's gait assessments (see Figure \ref{fig:LTFMvsBoard}) were computed at subscore level as the mode of all individual recommendations. If the mode could not be defined, the subscore not in contention for the highest count was used as a tiebreaker. From the measurements, 904 measured stride pairs of 100 patients were obtained, 680 gait features extracted, and the dataset split 70/30 for training and testing. As a preprocessing step, expert knowledge was used to trim the features accordingly, followed by filtering out statistically non-discriminatory features. Detailed descriptions of these steps can be found in \cite{liaw2025}.
2
+
3
+ For each SMS subscore, the agent was trained over 1500 episodes with a batch size of $|\mathcal{B}_{D}|=256$, sampled from a replay memory storing the last $N=10000$ experiences. The reward function was formulated with the penalties $c \left( {\feature}_d \right) = c_{g} \left( g_{j-D} \right) = 0.01$ and their respective factors $\lambda = \lambda_{g} = 1.0$, $c_{\fsubset} = 1.0$, $c_{g, \emptyset} = 5.0$, and $\Delta = 0.5$. The set $\mathcal{G}$ included random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP) regression models. The RF and SVM regression models were implemented using scikit-learn \cite{scikit}, while the MLP regression model was implemented using PyTorch \cite{pytorch}.
4
+
5
+ The optimal hyperparameters of each PM are selected via grid search, evaluating each hyperparameter combination with 3-fold cross-validation and keeping the combination that maximizes the cross-validation estimate of $R^2$. The SVM was implemented with a radial basis function kernel and grid-searched across the regularization parameter $C=\left\{ 0.1 \,, 0.5 \,, 1.0\right\}$, and the RF with 100 estimators and grid-searched across the maximum tree depth $\left\{ 3 \,, 4 \,, 5 \right\}$. The MLP was implemented with four hidden layers with $\left\{ 32 \,, 32 \,, 16 \,, 8 \right\}$ units, trained with a batch size of 128 across 300 training epochs and an AdamW learning rate of \num{1e-4}, and grid-searched across the weight decay $\lambda = \left\{ 0.01 \,, 0.1 \right\}$ and dropout probability $\left\{ 0.01 \,, 0.05 \right\}$.
6
+
7
+ The remaining hyperparameters are left at their default values as implemented in scikit-learn and PyTorch. The policy and target networks were implemented as MLPs with two hidden layers, each with 1024 neurons, with the discount rate $\gamma = 0.99$ and soft target update rate $\tau = $ \num{5e-4}. The parameters of the $Q$-network are optimized using an AdamW optimizer with the learning rate $l_r = $ \num{1e-5}. For the $\epsilon$-greedy algorithm, the agent selects a random action with the probability $\epsilon = \left( 0.9 - 0.05 \right) e^{-\frac{t_c}{1000}} + 0.05$. The algorithm was implemented using PyTorch \cite{pytorch}, alongside other standard Python packages such as NumPy \cite{numpy} and pandas \cite{pandas}. The training was performed on a computing system equipped with a 3.6 GHz AMD Ryzen\texttrademark{} 7 3700X CPU and an NVIDIA\textsuperscript{\textregistered{}} GeForce\textsuperscript{\textregistered{}} GTX 1650 GPU.
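The grid search described above for the SVM and RF regressors corresponds roughly to the following scikit-learn sketch (a minimal illustration with assumed variable names; the PyTorch MLP search and all remaining defaults are omitted):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Grid-searched hyperparameters as described in the text; everything not listed
# stays at the scikit-learn defaults.
svm_search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 0.5, 1.0]},
    cv=3,
    scoring="r2",
)
rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=100),
    param_grid={"max_depth": [3, 4, 5]},
    cv=3,
    scoring="r2",
)

# Usage on the currently recruited feature columns, e.g.:
#   svm_search.fit(X_train_selected, y_train); svm_search.best_estimator_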
@@ -0,0 +1,111 @@
1
+ The agent was trained on the training dataset and evaluated on the test dataset at patient-level. The performance of the resulting predictions for the SMS subscores and their summed SMS on the test dataset in terms of $R^2$ is shown in Table \ref{tab:ltfmresults}. This performance is comparable to the agreement of each expert recommendation with the collective decisions of the medical-board (see Figure \ref{fig:LTFMvsBoard}). For each evaluation of the test data, the agent selected a feature set and a PM tailored to each given stride-pair, yielding good predictive performance and delivering patient-specific key features, which could help medical experts identify personalized key therapeutic targets.
2
+ \begin{table}[H]
3
+ \caption{Performance of RL agents on the test dataset in terms of $R^2$, and corresponding ICC\textsubscript{1.1} of the medical-board assessments.}
4
+ \label{tab:ltfmresults}
5
+ \begin{tabular}{|l|l|l|l|}
6
+ \hline
7
+ \textbf{SMS} & \textbf{Feature subset} & \textbf{$R^2$} & \textbf{ICC\textsubscript{1.1}} \\
8
+ \textbf{subscore} & $|\fsubset|$ & & \\
9
+ \hline
10
+ Trunk-SMS & 241 of 680 features & 0.59 $\,$ & 0.65 \\
11
+ Leg-SMS & 188 of 356 pre-selected features & 0.56 $\,$ & 0.73 \\
12
+ Arm-SMS & 99 of 330 pre-selected features & 0.54 $\,$ & 0.72 \\
13
+ Speed-SMS & 31 of 32 pre-selected features & 0.78 $\,$ & 0.72 \\
14
+ Fluency-SMS & 263 of 680 features & 0.73 $\,$ & 0.72 \\
15
+ Stability-SMS & 238 of 680 features & 0.82 $\,$ & 0.83 \\
16
+ \hline
17
+ \multicolumn{2}{|l|}{Combination of subscore models to predict the SMS} & 0.83 & 0.88 \\
18
+ \hline
19
+ \end{tabular}
20
+ \end{table}
21
+ \begin{figure}[!h]
22
+ \includegraphics[width=\textwidth]{DoctorsvsBoard_TestSplit_and_LTFMJRMCC.pdf}
23
+ \caption{Scatterplots showing how the individual experts (left) and RL agent (right) compare with the medical-board's gait assessment (abscissa) in terms of the SMS.} \label{fig:LTFMvsBoard}
24
+ \end{figure}
25
+ To obtain a general overview of the salient features for the varying degrees of gait impairment, inference is performed on the training dataset, and the frequency of recruited features is summed per SMS subscore. The intuition here is that patients assigned the same SMS subscore should exhibit similar underlying physical characteristics, represented by a subset of key features. Upon observing a patient, a well-trained agent then recruits the features that make up the underlying pattern learned from patients of similar physical status. Consequently, the recruitment frequency of a feature can serve as an indication of the feature's representativeness with respect to a given SMS subscore group. Taking Fluency-SMS as an example, the top seven features and progression of the averaged $Q$-values per sampled batch during training are shown in Table \ref{tab:featuresRecruitmentFrequency} and Figure \ref{fig:QValuesProgression}, respectively.
26
+ \begin{table}[H]
27
+ \caption{Top features by recruitment count for predicting SMS-Fluency. NAV denotes the normalized angular velocity, as described in \cite{liaw2025}.}
28
+ \label{tab:featuresRecruitmentFrequency}
29
+ \begin{tabular}{|p{0.075\textwidth}|p{0.23\textwidth}|p{0.23\textwidth}|p{0.23\textwidth}|p{0.23\textwidth}|}
30
+ \hline
31
+ \multirow{2}{*}{\textbf{Rank}} & \multicolumn{4}{c|}{\textbf{SMS-Fluency}} \\ \cline{2-5}
32
+ & \textbf{Score 0} & \textbf{Score 1} & \textbf{Score 2} & \textbf{Score 3} \\ \hline
33
+ 1 & Thorax Rotation & Shoulder Flex./Ex. & Pelvis Tilt & Pelvis Tilt \\
34
+ & Angle contra. & NAV ipsi. & NAV ipsi. & NAV ipsi. \\
35
+ & (Swing min.) & (Stride median) & (Stride min.) & (Stride min.) \\ \hline
36
+ 2 & Ankle Dorsiflexion & Spine Rotation & Shoulder Flex./Ex. & Shoulder Flex./Ex. \\
37
+ & Angle contra & NAV ipsi. & NAV ipsi. & NAV ipsi. \\
38
+ & (Stance max.) & (Stride max.) & (Stride median) & (Stride median) \\ \hline
39
+ 3 & Shoulder Flex./Ex. & Ankle Dorsiflexion & Ankle Dorsiflexion & Ankle Dorsiflexion \\
40
+ & NAV ipsi. & Angle contra. & Angle contra. & Angle contra. \\
41
+ & (Stride median) & (Stance max.) & (Stance max.) & (Stance max.) \\ \hline
42
+ 4 & Pelvis Tilt & Spine Side Tilt & Spine Side Tilt & Spine Side Tilt \\
43
+ & NAV ipsi. & Angle contra. & Angle contra. & Angle contra. \\
44
+ & (Stride min.) & (Stance max.) & (Stance max.) & (Stance max.) \\ \hline
45
+ 5 & Spine Side Tilt & Pelvis Rotation & Pelvis Tilt & Pelvis Tilt \\
46
+ & Angle contra. & Angle contra. & NAV contra. & NAV contra. \\
47
+ & (Stance max.) & (Stride min.) & (Stance max.) & (Stance max.) \\ \hline
48
+ 6 & Thorax Tilt & Pelvis Tilt & Elbow Flex./Ex. & Elbow Flex./Ex. \\
49
+ & NAV ipsi. & NAV ipsi. & Angle ipsi. & Angle ipsi. \\
50
+ & (Stride max.) & (Stride min.) & (Stride median) & (Stride median) \\ \hline
51
+ 7 & Pelvis Tilt & Ankle Dorsiflexion & Spine Rotation & Spine Rotation \\
52
+ & NAV contra. & Angle contra. & NAV ipsi. & NAV ipsi. \\
53
+ & (Stance max.) & (Stride max.) & (Stride max.) & (Stride max.) \\ \hline
54
+ \end{tabular}
55
+ \end{table}
56
+ \begin{figure}[!h]
57
+ \includegraphics[width=0.75\textwidth]{QValuesFluency_and_Pendulum.pdf}
58
+ \caption{Averaged $Q$-Values per sampled batch at each training iteration for the example of balancing an inverted pendulum (top) and predicting the Fluency-SMS (bottom).} \label{fig:QValuesProgression}
59
+ \end{figure}
60
+
61
+ % %% Stride-Pair 1::
62
+ % %% Stride-Pair 2::
63
+ % Consider the following two stride-pairs, which will be referred to as SPI and SPII from the test dataset. SPI is a measured stride-pair of a critically affected patient (SMS of XX), and SPII, that of a mildly affected patient (SMS of XX). Both these handpicked examples have correctly predicted SMS, but not all predicted SMS subscores are necessarily correct. The true SMS subscores are displayed in brackets, next to the predicted subscores in Table \ref{tab:handpickedLTFMExamples} and as one can see, the selected key features vary depending on the physical status of a patient. The features listed in Table \ref{tab:handpickedLTFMExamples} are sorted in accordance to its relevance in terms of distinguishing samples of the selected stride-pair's assigned subscore from the remaining three subscores, quantified by the mutual information (cite??).
64
+
65
+ % \begin{table}[H]
66
+ % \centering
67
+ % \begin{tabular}{@{}l@{\hspace{0.25em}}|p{0.4\textwidth}p{0.4\textwidth}}
68
+ % \hline
69
+ % \textbf{Sub-} & \textbf{<Example A>} & \textbf{<Example B>} \\
70
+ % \textbf{score} & $\quad$ \textbf{SMS: XX (critical)} & $\quad$ \textbf{SMS: XX (mild)} \\
71
+ % \hline
72
+ % \multirow{5}{*}{$\hspace{0.95em}$\makebox[0.75em]{\rotatebox{90}{\textbf{Trunk-SMS }}}}
73
+ % & \textbf{Predicted score: XX} (XX) & \textbf{Predicted score: XX} (XX) \\ \cline{2-3}
74
+ % & Feature A.2 & Feature B.2 \\
75
+ % & Feature A.3 & Feature B.3 \\
76
+ % & Feature A.4 & Feature B.4 \\
77
+ % & $\cdots$ (XX features in total) & $\cdots$ (XX features in total) \\ \hline
78
+ % \multirow{5}{*}{$\hspace{0.95em}$\makebox[0.75em]{\rotatebox{90}{\textbf{Leg-SMS }}}}
79
+ % & \textbf{Predicted score: XX} (XX) & \textbf{Predicted score: XX} (XX) \\ \cline{2-3}
80
+ % & Feature A.2 & Feature B.2 \\
81
+ % & Feature A.3 & Feature B.3 \\
82
+ % & Feature A.4 & Feature B.4 \\
83
+ % & $\cdots$ (XX features in total) & $\cdots$ (XX features in total) \\ \hline
84
+ % \multirow{5}{*}{$\hspace{0.95em}$\makebox[0.75em]{\rotatebox{90}{\textbf{Arm-SMS }}}}
85
+ % & \textbf{Predicted score: XX} (XX) & \textbf{Predicted score: XX} (XX) \\ \cline{2-3}
86
+ % & Feature A.2 & Feature B.2 \\
87
+ % & Feature A.3 & Feature B.3 \\
88
+ % & Feature A.4 & Feature B.4 \\
89
+ % & $\cdots$ (XX features in total) & $\cdots$ (XX features in total) \\ \hline
90
+ % \multirow{5}{*}{$\hspace{0.95em}$\makebox[0.75em]{\rotatebox{90}{\textbf{Speed-SMS }}}}
91
+ % & \textbf{Predicted score: XX} (XX) & \textbf{Predicted score: XX} (XX) \\ \cline{2-3}
92
+ % & Feature A.2 & Feature B.2 \\
93
+ % & Feature A.3 & Feature B.3 \\
94
+ % & Feature A.4 & Feature B.4 \\
95
+ % & $\cdots$ (XX features in total) & $\cdots$ (XX features in total) \\ \hline
96
+ % \multirow{5}{*}{$\hspace{0.95em}$\makebox[0.75em]{\rotatebox{90}{\textbf{Fluency-SMS }}}}
97
+ % & \textbf{Predicted score: XX} (XX) & \textbf{Predicted score: XX} (XX) \\ \cline{2-3}
98
+ % & Feature A.2 & Feature B.2 \\
99
+ % & Feature A.3 & Feature B.3 \\
100
+ % & Feature A.4 & Feature B.4 \\
101
+ % & $\cdots$ (XX features in total) (XX) & $\cdots$ (XX features in total) (XX) \\ \hline
102
+ % \multirow{5}{*}{$\hspace{0.95em}$\makebox[0.75em]{\rotatebox{90}{\textbf{Stability-SMS }}}}
103
+ % & \textbf{Predicted score: XX} & \textbf{Predicted score: XX} \\ \cline{2-3}
104
+ % & Feature A.2 & Feature B.2 \\
105
+ % & Feature A.3 & Feature B.3 \\
106
+ % & Feature A.4 & Feature B.4 \\
107
+ % & $\cdots$ (XX features in total) & $\cdots$ (XX features in total) \\ \hline
108
+ % \end{tabular}
109
+ % \caption{Two examples from the test dataset, each with their respective subset of features recruited by the agent to predict each SMS subscores}
110
+ % \label{tab:handpickedLTFMExamples}
111
+ % \end{table}
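The recruitment-frequency tally described before the feature-ranking table can be reproduced roughly with the following pandas sketch, assuming one has logged, for every training sample, the recruited features together with that sample's Fluency-SMS subscore. Column names and the toy records are purely illustrative.

import pandas as pd

# One row per (sample, recruited feature); "subscore" is the sample's Fluency-SMS
# label. The records below are toy values; in practice they would be logged
# during inference on the training dataset.
records = pd.DataFrame({
    "subscore": [0, 0, 1, 2, 3, 3],
    "feature": ["ThoraxRotAngle_contra_SwingMin", "AnkleDorsiAngle_contra_StanceMax",
                "ShoulderFlexEx_NAV_ipsi_StrideMedian", "PelvisTilt_NAV_ipsi_StrideMin",
                "PelvisTilt_NAV_ipsi_StrideMin", "ShoulderFlexEx_NAV_ipsi_StrideMedian"],
})

# Recruitment count of each feature within each subscore group, ranked per group
counts = (records.groupby(["subscore", "feature"]).size()
                 .rename("count").reset_index()
                 .sort_values(["subscore", "count"], ascending=[True, False]))
print(counts.groupby("subscore").head(7))   # top recruited features per subscore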
@@ -0,0 +1,40 @@
1
+ The SMS and the Stability-SMS models perform very well ($R^2$ $> 0.8$), while the other models perform well ($R^2$ $> 0.5$). Unsurprisingly, the performances correlate strongly [Cohen \cite{cohen1988}, $p=0.08$, $r=0.71$] with the corresponding interrater reliabilities ICC\textsubscript{1.1} of the medical-board. Though the individual subscore models do not perform perfectly, they compensate for one another when formulating the SMS. There are nonetheless a few shortcomings and issues that should be addressed. The first is the number of features recruited by the agent to make a prediction. While penalizing the agent every time it recruits a feature has helped produce a reduced feature subset, one can nonetheless see in Table \ref{tab:featuresRecruitedReduction} that the number of recruited features is still considerably high.
2
+ \begin{table}[H]
3
+ \caption{Number of original features, average number of features recruited by the agent (rounded to closest integer), and the reduction in percentage.}
4
+ \label{tab:featuresRecruitedReduction}
5
+ \begin{tabular}{|l|l|l|l|}
6
+ \hline
7
+ \textbf{SMS} & \multicolumn{3}{c|}{\textbf{Number of features}} \\ \cline{2-4}
8
+ \textbf{Subscore} & \textbf{Original} $|\fset|$ & \textbf{Average} $|\fsubset|$ & \textbf{Reduction} $[\%]$ \\ \hline
9
+ Trunk-SMS & 241 & 140 & 41.9 \\
10
+ Leg-SMS & 188 & 113 & 39.9 \\
11
+ Arm-SMS & 99 & 59 & 40.4 \\
12
+ Speed-SMS & 31 & 21 & 32.2 \\
13
+ Fluency-SMS & 253 & 141 & 44.3 \\
14
+ Stability-SMS & 238 & 133 & 44.1 \\ \hline \hline
15
+ \multicolumn{3}{|c|}{\textbf{Average reduction of number of features}} & 40.5 \\ \hline
16
+ \end{tabular}
17
+ \end{table}
18
+ This could be counterproductive and pose the risk of overwhelming clinicians with information \cite{lee2020,lee2021}. It would thus be beneficial to reduce the number of recruited features to an amount that lies within established cognitive processing limits (e.g. $7 \pm 2$) \cite{miller1994}. While this might exclude a few key features, interpretability is often prioritized over predictive accuracy in developing CDSSs, as they should serve as an aid rather than a replacement for human experts. Secondly, the agent was also observed to alternate between PMs numerous times within an episode before finally deciding to make a prediction. This seems to reflect ``indecisiveness'', akin to that of a human ML practitioner. Thirdly, medical diagnostic errors are often asymmetrical. While an overpessimistic result (a false positive) may lead to unnecessary follow-up testing or temporary patient anxiety, an overoptimistic result (a false negative) can result in the catastrophic delay of life-saving treatment, or in doctors recommending futile, aggressive care instead of more beneficial palliative care \cite{christakis2000}.
19
+
20
+ A key highlight of this RL-based approach is the versatility it offers: the reward function can be reformulated to account for these issues. For instance, to encourage the agent to limit the number of recruited features to around seven, the penalty for recruiting a feature can be progressively increased once the number of recruited features $|\fsubset|$ exceeds a user-defined threshold. If a feature possesses an undesired characteristic, such as being difficult to interpret or challenging to obtain, one could assign it a higher penalty. Similarly, one could penalize the agent more for choosing an MLP over less computationally taxing models such as an SVM. To steer the agent away from overoptimistic diagnostic errors, errors where the predicted medical score is lower than the true score can be weighted more heavily. Shown in (\ref{eq:HypotheticalRewardFunction}) is a hypothetical reward function that could account for the aforementioned issues.
21
+ \begin{align}
22
+ R(\nvec{s}, a) &=
23
+ \begin{cases}
24
+ -\lambda_{d} c \left( {\feature}_{d} \right) & \text{if } a = a_{d} \text{, where } 1 \leq d \leq D , \,\, \lambda_{d} \in \nvec{\lambda}_{\fset}\\
25
+ -\lambda c^{*}(|\fsubset|) & \text{if } a = a_{d} \text{ and } |\fsubset| > |{\fsubset}^{*}| \text{, where } 1 \leq d \leq D \\
26
+ -c_{\fsubset} & \text{if } a = a_{d} \text{ with } {\feature}_{d} \in \fsubset \text{, where } 1 \leq d \leq D \\
27
+ -\lambda_{j} c_{g} \left( g_{j-D} \right) & \text{if } a = a_{j} \text{, where } D+1 \leq j \leq D+|\pmset| , \,\, \lambda_{j} \in \nvec{\lambda}_{\pmset}\\
28
+ 0 & \text{if } a = a_0 \text{, with } \left| \hat{y}_k - y_k \right| < \Delta \\
29
+ -\left| \hat{y}_k - y_k \right| & \text{if } a = a_0 \text{, with } \left| \hat{y}_k - y_k \right| \geq \Delta \text{ and } \hat{y}_k > y_k \\
30
+ -\lambda_{y} \left| \hat{y}_k - y_k \right| & \text{if } a = a_0 \text{, with } \left| \hat{y}_k - y_k \right| \geq \Delta \text{ and } \hat{y}_k < y_k\\
31
+ -c_{g, \emptyset} & \text{if } a = a_0 \text{, with } \fsubset = \emptyset
32
+ \end{cases}
33
+ \label{eq:HypotheticalRewardFunction}
34
+ \end{align}
35
+ The penalty factors for recruiting a feature or a PM can be stored in their respective lookup tables $\nvec{\lambda}_{\fset} = \left[ {\lambda}_{1} \ {\lambda}_{2} \ \cdots \ {\lambda}_{d} \ \cdots \ {\lambda}_{D}\right]$ and $\nvec{\lambda}_{\pmset} = \left[ {\lambda}_{1} \ {\lambda}_{2} \ \cdots \ {\lambda}_{j} \ \cdots \ {\lambda}_{|\pmset|}\right]$, where each penalty $\lambda$ is user-defined. To steer the agent towards selecting a user-desired number of features $|{\fsubset}^{*}|$, the penalty for recruiting a feature can be progressively increased by $c^{*}(|\fsubset|)$, a function of the number of recruited features $|\fsubset|$. In the case of a prediction error larger than the error tolerance $\Delta$, the penalty for a predicted score larger than the true score ($\hat{y}_k > y_k$) is the absolute difference, whereas the penalty for a predicted score smaller than the true score ($\hat{y}_k < y_k$) is the absolute difference multiplied by a user-defined factor $\lambda_y$.
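As a minimal illustration of how (\ref{eq:HypotheticalRewardFunction}) could be realized in code, the following Python sketch evaluates the hypothetical reward for a single action. All names, penalty magnitudes, and the linear form of the progressive penalty $c^{*}(|\fsubset|)$ are illustrative assumptions rather than part of the ltfmselector implementation; the progressive penalty is interpreted here as an additional cost on top of the base feature penalty.
\begin{verbatim}
# Sketch of the hypothetical reward function; all values and helper
# names are assumptions for illustration, not the actual implementation.

def progressive_cost(n_recruited, n_target, slope=0.1):
    # Assumed linear ramp c*(|F'|) once |F'| exceeds the desired size |F'*|
    return slope * max(0, n_recruited - n_target)

def hypothetical_reward(action, recruited, lambda_feat, feat_cost,
                        lambda_pm, pm_cost, y_true=None, y_pred=None,
                        n_target=7, c_repeat=1.0, c_empty=10.0,
                        delta=0.5, lambda_y=2.0):
    """action is ('feature', d), ('model', j) or ('predict', None)."""
    kind, idx = action

    if kind == 'feature':
        if idx in recruited:                  # feature recruited a second time
            return -c_repeat
        penalty = lambda_feat[idx] * feat_cost[idx]
        penalty += progressive_cost(len(recruited) + 1, n_target)
        return -penalty

    if kind == 'model':                       # selecting a prediction model
        return -lambda_pm[idx] * pm_cost[idx]

    # kind == 'predict'
    if not recruited:                         # prediction without any feature
        return -c_empty
    err = abs(y_pred - y_true)
    if err < delta:                           # within the error tolerance
        return 0.0
    if y_pred > y_true:                       # overpessimistic prediction
        return -err
    return -lambda_y * err                    # overoptimistic, weighted heavier
\end{verbatim}
In practice, the penalty magnitudes would have to be tuned alongside the remaining hyperparameters.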
36
+
37
+ One limitation that should be accounted for is the unbalanced dataset on which the agent is trained. This is especially problematic during training because a sample is randomly chosen at every iteration; such stochastic sampling might produce an agent that is heavily biased toward the majority class and fails to generalize to rarer but potentially more critical scenarios. To counter this, however, each sample is weighted to account for two sources of bias, as described in \cite{liaw2025}. A second issue is a technical one, namely the sheer scale of the required iterations. The associated computational cost is further multiplied once hyperparameter tuning is factored in. Therefore, to effectively explore the hyperparameter space and ensure model robustness, one should leverage high-performance or cloud computing resources that enable massive parallelization.
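As a minimal sketch of such weighting, assuming hypothetical labels and a simple inverse-class-frequency scheme rather than the actual two-bias weighting of \cite{liaw2025}, a training sample could be drawn at each episode as follows:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical subscore labels of the training set (placeholder values)
y_train = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3])

# Weight each sample by the inverse frequency of its class ...
classes, counts = np.unique(y_train, return_counts=True)
inv_freq = dict(zip(classes, 1.0 / counts))
weights = np.array([inv_freq[y] for y in y_train])
weights /= weights.sum()   # ... and normalise into a probability distribution

# Draw the sample index for one training episode, biased towards rare classes
episode_idx = rng.choice(len(y_train), p=weights)
\end{verbatim}
Sampling in this way keeps rare but clinically critical score levels represented during training without altering the underlying data.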
38
+
39
+ One lesson learned from applying DQL to balance an inverted pendulum, as described in the earlier section, is the importance of beginning with a reasonably ``good'' initial estimate for the solution to converge. Given the simple implementation described earlier, the controller would be very unlikely to ever swing the pendulum into its upright position if the pendulum initially simply hung from the revolute joint. Similarly, the agent here would be very unlikely to find a good subset of features if it were presented with all 680 features. Moreover, if the original set of features were all included, the number of features selected by the agent would be far too large to effectively help a clinician.
40
+ % Run a round of ltfm with all 680 feature to prove the point.
@@ -0,0 +1 @@
1
+ In conclusion, the gait assessments in terms of the SMS, prescribed by an interdisciplinary medical board, were well reproduced from gait data by training an agent that dynamically selects salient features and a corresponding PM. More importantly, the agent was able to select a subset of features specific to each patient. These patient-specific features could potentially aid clinicians in designing personalized therapy, especially as earlier research \cite{lee2020,pistacchi2017,huang2016,biase2020} indicates the importance of accounting for inter-patient variability and of relevant biomarkers that evolve with disease progression. Such synergistic interactions between the system and experts may improve the quality of diagnosis and help objectify therapeutic targets.
@@ -1,5 +1,4 @@
1
- % Abstract should be limited to 150--250 words.
2
- Designing personalized therapy for poststroke gait rehabilitation often involves the effort of an interdisciplinary medical team and tedious assessments. An automated gait assessment tool based on gait measurements and interdisciplinary knowledge could help experts with faster gait assessments, while providing objective feedback. Gait measurements are however high-dimensional, making the development of such tools challenging. Inspired by the application of Deep Q-Learning in solving physical problems, this study presents a method for dynamic feature and model selection. The search space is formulated as a partially observable Markov Decision Process, where the agent iteratively explores the 680 extracted gait features and various prediction models, to learn optimal patient-specific combinations of feature subsets and prediction models. The model was developed using a dataset of 904 stride pairs from 100 hemiparetic stroke patients. Each patient was evaluated by an interdisciplinary board using the Stroke Mobility Score, a multiple-cue clinical observational score comprised of six subscores, each pertaining to a functional criterion of gait. The agent was trained to approximate optimal decision-making, receiving rewards for accurate predictions and efficient feature selection. Results demonstrated excellent predictive performance, achieving a coefficient of determination ($R^2$) of 0.83 on the test set. Crucially, the tool identifies patient-specific key features, that could help clinicians by highlighting specific therapeutic targets tailored to individual needs, thus potentially providing a solution for personalized poststroke therapy.
1
+ Designing personalized therapy for poststroke gait rehabilitation often involves the effort of an interdisciplinary medical team and tedious assessments. An automated gait assessment tool based on gait measurements and interdisciplinary knowledge could help experts with faster gait assessments, while providing objective feedback. The high dimensionality of gait data, however, makes subsequent analyses challenging. Inspired by the application of Deep Q-Learning in solving multi-step procedures, this study presents a method for dynamic feature and model selection. The search space is formulated as a partially observable Markov Decision Process, in which the agent iteratively explores the 680 extracted gait features and various prediction models to learn optimal patient-specific combinations of feature subsets and prediction models. The model was developed using a dataset of 904 stride pairs from 100 hemiparetic stroke patients. Each patient was evaluated by an interdisciplinary medical board using the Stroke Mobility Score, a multiple-cue clinical observational score composed of six functional subscores. The agent was trained to approximate optimal decision-making, receiving rewards for accurate predictions and efficient feature selection. Results demonstrated excellent predictive performance, achieving a coefficient of determination ($R^2$) of 0.83 on a held-out test dataset. More importantly, the tool identifies patient-specific key features that could help highlight specific therapeutic targets for designing personalized poststroke therapy.
3
2
 
4
3
  %% ORIGINAL
5
4
  % Designing personalized therapy for poststroke gait rehabilitation often involves the effort of an interdisciplinary medical team and tedious assessments. An automated gait-assessment tool based on gait measurements and interdisciplinary knowledge could help experts with faster gait assessments, while providing objective feedback. However, developing such a tool based on gait data can be challenging due to the high dimensionality of the training datasets typically derived from gait measurements.
@@ -14,4 +13,4 @@ Designing personalized therapy for poststroke gait rehabilitation often involves
14
13
  %
15
14
  % The agent is then rewarded accordingly for each action, before transitioning onto a next state where the process is repeated iteratively. Over the course of many iterations, the agent eventually learns to select an optimal set of actions, given a set of features of a stride pair measurement.
16
15
  %
17
- % The agent is trained using a Deep Q-Learning algorithm that approximates the Bellman equation by training a deep neural network iteratively on a batch of randomly chosen transitions. The trained agent tested on the test dataset yielded excellent predictive performance, showing a coefficient of determination of 0.85, while delivering patient-specific key features. The delivered patient-specific key features could help clinicians focus on key therapeutic targets, specifically tailored to a patient's needs.
16
+ % The agent is trained using a Deep Q-Learning algorithm that approximates the Bellman equation by training a deep neural network iteratively on a batch of randomly chosen transitions. The trained agent tested on the test dataset yielded excellent predictive performance, showing a coefficient of determination of 0.85, while delivering patient-specific key features. The delivered patient-specific key features could help clinicians focus on key therapeutic targets, specifically tailored to a patient's needs.