Summer 2022 | Emory REU Computational Mathematics for Data Science

Comparing Reinforcement Learning to Optimal Control Methods on the Continuous Mountain Car Problem

Fri, 05 Aug 2022 00:00:00 +0000

What is the best way to get a car out of the bottom of a hill?

Introduction

This blog post was written by Jacob Mantooth, Dewan Chowdhury, and Arjun Sethi-Olowin and published with minor edits. The team was advised by Dr. Lars Ruthotto. In addition to this post, the team has also given a midterm presentation, filmed a poster blitz video, created a poster, published code, and written a paper.

The word around town is that reinforcement learning is the top dog and has the answers to all our problems. We wanted to see if that really was the case, so this summer we took a trip to Emory University, where we looked at the continuous mountain car problem to see if reinforcement learning really was the best. The continuous mountain car problem is an example of an optimal control problem. In the image below you can see an example of what the continuous mountain car problem looks like.

You may be asking yourself: what is an optimal control problem? An optimal control problem consists of controlling a dynamical system to minimize (or maximize) a given objective function. In our case, the continuous mountain car can be modeled as an ODE where $u$ is some controllable function. In the continuous mountain car problem our control, $u$, is whether the car accelerates left or right. In our case we minimize the objective function, which has two parts: the running cost, which penalizes the car for accelerating, and the terminal cost, which penalizes the car for not reaching the goal, the top of the hill, in time.

Why This Problem?

You may be wondering: why choose the continuous mountain car problem? Here are a couple of reasons why we picked this example:

Established benchmark for RL models
2-D state-space allows for good plots and visualizations
Both RL and optimal control problem
Finite horizon (time)
Continuous state and motion

The whole reason we are doing this is that we want to compare three different ways of solving the continuous mountain car problem and see which one really is the best. Our three approaches are:

Local solution using numerical ODE solvers and nonlinear optimization (baseline)
Reinforcement learning with actor-critic algorithm (data-based approach)
Optimal control using both model and data

Our Three Approaches

A Local Method

The first method we looked at during this REU was the local method. We tried to find the optimal control $u_h$ by formulating an optimization problem.

We first discretize the control, state and the Lagrangian.

Setting $z_h^{(0)}=z_t$ and $\ell_h^{(0)}=0$ allows us to use a forward Euler scheme for some control $u$.
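To make the forward Euler step concrete, here is a minimal Python sketch. The post does not spell out the ODE, so the right-hand side below is an assumption modeled on the OpenAI Gym continuous mountain car dynamics (constants `power` and `gravity`), and the running cost integrand `0.5 * u**2` is likewise an assumed stand-in for the acceleration penalty. The function name `rollout` is hypothetical.

```python
import math

def rollout(z0, controls, h=1.0, power=0.0015, gravity=0.0025):
    """Forward Euler rollout of an assumed mountain-car-like ODE.

    z0       : (position, velocity) at the start time
    controls : list of control values u_k, one per time step
    Returns the list of visited states and the accumulated running cost.
    """
    x, v = z0
    ell = 0.0                      # running cost starts at zero
    states = [(x, v)]
    for u in controls:
        # Right-hand side of the assumed ODE, evaluated at the current state.
        dx = v
        dv = power * u - gravity * math.cos(3.0 * x)
        # Assumed running cost integrand: penalize acceleration with 0.5 * u^2.
        ell += h * 0.5 * u * u
        # Forward Euler: advance both components using the old state.
        x, v = x + h * dx, v + h * dv
        states.append((x, v))
    return states, ell
```

Calling `rollout((-0.5, 0.0), [1.0] * 10)` returns eleven states (the initial one plus one per step) together with the accumulated running cost for that control sequence.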

We then approximate our objective function which yields the optimization problem

To solve our optimization problem, we used gradient descent: we take an initial guess for $u_h$ and repeatedly update it using the gradient of the objective function and a step size $\alpha$.

$(u_h)_0 = \vec0$

$\vdots$

$(u_h)_6 = (u_h)_5 - \alpha( \nabla J((u_h)_5))$

$\vdots$

$(u_h)_* = (u_h)_{19} - \alpha( \nabla J((u_h)_{19}))$
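The iteration above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gradient is approximated by central finite differences (the post does not say how the gradient was computed), and `toy_objective` is a made-up quadratic standing in for the real mountain car objective $J$.

```python
def numerical_gradient(J, u, eps=1e-6):
    """Central finite-difference approximation of the gradient of J at u."""
    grad = []
    for i in range(len(u)):
        up = list(u)
        um = list(u)
        up[i] += eps
        um[i] -= eps
        grad.append((J(up) - J(um)) / (2.0 * eps))
    return grad

def gradient_descent(J, u0, alpha=0.1, iterations=20):
    """Repeat u <- u - alpha * grad J(u) for a fixed number of iterations."""
    u = list(u0)
    for _ in range(iterations):
        g = numerical_gradient(J, u)
        u = [ui - alpha * gi for ui, gi in zip(u, g)]
    return u

def toy_objective(u):
    """Hypothetical stand-in objective (not the mountain car cost)."""
    return (u[0] - 1.0) ** 2 + (u[1] + 2.0) ** 2
```

Starting from the zero vector, `gradient_descent(toy_objective, [0.0, 0.0])` performs the same twenty updates as the scheme above and moves toward the minimizer of the toy objective at $(1, -2)$.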

Below is a nice visual example of what all this math means. When the tail reaches the dotted line, it means our car has reached the top of the hill.

The graph shows the position vs velocity of the car. The black dot marks the starting time $t$, and the tail of the plot, where the x-position is 0.45, marks the final time $T$. The color changes between red and blue: blue is where the car is accelerating and the control is positive, and red is otherwise.

Our next goal was to create a nice visualization of what the actual solution looks like.

We see in this video what our optimal local solution looks like. A couple of things should be noted. If we move the car to a new position, this local solution may no longer work. The same can be said if the car slows down or speeds up a bit: the solution may not even get the car to the top of the mountain. Another downside of the local method is that the optimization problem is nonlinear and non-convex, which makes this method slow. This local solution will serve as a baseline against which we compare the other methods to see which one is really the best.

Global Methods

Now that we have established a baseline, we will discuss our other two methods. Both are global methods: the first is a reinforcement learning method and the second is an optimal control method. You may be asking yourself what the difference is. Reinforcement learning is a data-driven approach, while the optimal control method is a hybrid approach, using both a model and data.

Reinforcement Learning

Our first stop in exploring global methods is reinforcement learning. We use reinforcement learning with an actor-critic algorithm. This approach is completely data-based: it has no knowledge of the model and only considers the objective function. In reinforcement learning we would like to maximize a reward, so in our case we maximize the negative cost. Reinforcement learning is stochastic in two ways, in the initial position and in the action space, which allows for exploration. The goal is to estimate an optimal control policy. One of the big things that we have yet to discuss is: what is the actor-critic algorithm? The actor-critic (AC) architecture for RL is well-suited for a continuous action space, as in the continuous mountain car problem [1]. In the actor-critic algorithm, the critic must learn about and critique whatever policy is currently being followed by the actor. We worked in the OpenAI Gym mountain car environment, so we were able to find preexisting code for our project. We adapted the TD advantage actor-critic algorithm from here. The preexisting code did not match our problem exactly, so we had to modify it. After these modifications, the following video shows the results we were able to get after many training cycles.

In this video we see that RL gave us a suboptimal solution compared to the local solution. You may also notice that with the reinforcement learning method our car takes an extra swing backwards to get to the top of the hill. It took thousands of episodes just to get the car to our goal. We saw that reinforcement learning is very fragile: a couple of changes saw our success rate go from 70% to barely making it at all. The picture below is position vs velocity of the car.

As you can see, compared to the local solution, the RL solution is very suboptimal.

Optimal control method

Our last two methods were vastly different with reinforcement learning using a data driven approach and the local method using a model-based approach. We will now be looking at the optimal control method which combines both model and data driven approaches. In this approach, we aim to adhere to the method discussed here. We will estimate the corresponding value function with neural network approximators utilizing feedback from the Hamilton-Jacobi-Bellman equation and Hamiltonian.

Using the OC method we were able to produce the following:

Once again, we created a position vs velocity graph of the car. As you can see, this solution is suboptimal compared to the local method. Compared to the RL method, however, we see how much better OC was for our problem. Our testing of the RL method shows that there are some drawbacks to forgetting the model and being driven purely by data.

Our Experiences

Week 1

In week one we decided to make a game plan for the following weeks. We would work on the local method for just a week since it was basically finished. For the other two methods we would spend two weeks on each method. Lastly, we would save the last week to wrap up all three methods and anything else that is left over. During the first week we wanted to look at the local method and explore it some.

Week 2 & 3

We decided to spend two weeks on reinforcement learning, our first global method. During these two weeks we produced a slide deck in Beamer for our mid-week presentation. Dr. Ruthotto handed us some prewritten code to play around with. It should be noted that this prewritten code only worked maybe 50% of the time. Looking deeper into the code, we realized that we would have to modify it to match our problem. After we made these changes, we saw how fragile reinforcement learning is: instead of working 50% of the time, our code barely worked at all. During this time we were also able to produce renderings of both our local and RL methods, adding a nice touch to the presentation that we gave.

Week 4 & 5

During weeks 4 and 5 we looked at the optimal control method. During week 4 we produced a rough draft of the website while also taking a deeper look into the optimal control method. In week 5 we created a rough and final draft of our poster. While working on our poster we were able to produce a graph for OC so we could compare it to our other methods.

Week 6

We wrapped up any unfinished work, including our paper, the OC method, and this website. After struggling with the code for method three, we finally found that our OC method had better results than RL. During week six we also gave a poster talk to the Emory staff and students. Our group ended up winning best poster, and we each won some Amazon gift cards. We will continue working on method three to try to get it to work for 150 steps, and we also look to fine-tune our code for methods two and three.

More About Our Team

Reference

U. M. Ascher and C. Greif. “Chapter 9 (Optimization).” A First Course on Numerical Methods. SIAM, 2011.

U. M. Ascher and C. Greif. “Chapter 14 (Numerical time integrators).” A First Course on Numerical Methods. SIAM, 2011.

Data assimilation for Glacier Modeling

Mon, 27 Jun 2022 11:36:49 -0400

This post was written by Emily Corcoran, Hannah Park-Kaufmann, and Logan Knudsen. The project was advised by Dr. Talea Mayo. Our team has also created a midterm presentation, blitz video, poster, paper, and has published their code.

Glaciers

Research has shown that climate change will likely impact storm surge inundation and make modeling this process more difficult. Sea-level rise caused by climate change plays a part in this impact. To better model sea-level rise, glaciers can be modeled. Marine-Terminating Glaciers have a natural flow towards the ocean, which contributes to sea level rise. By the year 2300, the Antarctic ice sheet is projected to cause up to 3 meters of sea level rise globally. Due to the severe impacts of glacial melting, modeling changes in ice sheets is an important task. There are challenges to modeling sea level rise, as ice sheet instability leads to significant sea-level rise uncertainty.

Image by W. Bulach, used under Creative Commons Attribution-Share Alike 4.0 International License

Modeling Glaciers

Our group is collaborating with Dr. Robel, a glaciologist, climate scientist, and applied mathematician from Georgia Tech, and working with the glacier model described in his 2018 paper. This ice sheet model aims to describe the changes in ice mass of marine-terminating glaciers, which may be impacted over time by climate change.

Image used with permission from Dr. Alexander Robel

A glacier can be represented with a simplified box model that has a length $L$, precipitation $P$, and height and flux at the grounding line $h_g$ and $Q_g$. This model is the best approximation for one variable and describes the dominant mode of the glacial system.

The two-stage model that our group is using incorporates a nested box into the system. This new box has a thickness, $H$, and an interior flux, $Q$. The change in length and height of the glacier can be described with these differential equations: $$\ \dfrac{dH}{dt}=P-\dfrac{Q_g}{L}-\dfrac{H}{h_gL}(Q-Q_g)$$ $$\ \dfrac{dL}{dt}=\dfrac{1}{h_g}(Q-Q_g)$$
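As a rough illustration of how these two equations can be stepped forward in time, here is a minimal Python sketch of one fourth-order Runge-Kutta step. It is a simplification under stated assumptions: in the full model $Q$, $Q_g$, and $h_g$ depend on the glacier state, whereas here they are treated as constants supplied by the caller, and the function names `glacier_rhs` and `rk4_step` are hypothetical.

```python
def glacier_rhs(H, L, P, h_g, Q, Q_g):
    """Right-hand side of the two-stage model for thickness H and length L."""
    dH_dt = P - Q_g / L - (H / (h_g * L)) * (Q - Q_g)
    dL_dt = (Q - Q_g) / h_g
    return dH_dt, dL_dt

def rk4_step(H, L, dt, P, h_g, Q, Q_g):
    """One classical RK4 step; P, h_g, Q, Q_g are held fixed (an assumption)."""
    k1H, k1L = glacier_rhs(H, L, P, h_g, Q, Q_g)
    k2H, k2L = glacier_rhs(H + 0.5 * dt * k1H, L + 0.5 * dt * k1L, P, h_g, Q, Q_g)
    k3H, k3L = glacier_rhs(H + 0.5 * dt * k2H, L + 0.5 * dt * k2L, P, h_g, Q, Q_g)
    k4H, k4L = glacier_rhs(H + dt * k3H, L + dt * k3L, P, h_g, Q, Q_g)
    H_new = H + dt * (k1H + 2.0 * k2H + 2.0 * k3H + k4H) / 6.0
    L_new = L + dt * (k1L + 2.0 * k2L + 2.0 * k3L + k4L) / 6.0
    return H_new, L_new
```

A quick sanity check on the equations: when $Q = Q_g$ and $P = Q_g / L$, both derivatives vanish, so the glacier stays at a steady state.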

Sensitivity Analysis

Sensitivity analyses study how various sources of uncertainty in a mathematical model contribute to the model’s overall uncertainty. This allows us to understand the model better.

Why do we do this? Why do we care about the uncertainty of a model? And where do the uncertainties even come from? In this ice sheet model, just like in any model, there are always going to be simplifications, and these lead to uncertainties. We need to have a good idea of which uncertainties matter the most, so that we better know the limits of where our model does a good job of simulating the real world.

The basic idea is this: We check sensitivity by using different distributions for the input parameters. If the outputs vary significantly, then the output is sensitive to the specification of the input distributions. Hence these should be defined with particular care. We can also look at the sensitivity of the model parameters to inform which parameter we’re going to work with in the data assimilation. We want to be working with the most sensitive parameter, because it has the most promise for things we vary later on to matter, in questions like: “if your data is from billions of years ago does that matter? Is it important to have your data from the last 60 years?” or “how much will noise impact the predictions?”

The uncertain model parameters we considered are initial conditions, sill parameters, and SMB values. For consistency’s sake, we vary each parameter by ±10 percent of the nominal values originally given in our model code. Below you can see three graphs, one per group of varied parameters, for each of “time vs H(t)” (height of the glacier at time $t$) and “time vs L(t)” (length of the glacier at time $t$).

Looking at the distributions, we see that varying the initial conditions (leftmost) seems to produce the greatest spread, but the slopes of the lines there are all very similar. Varying the sill parameters (middle) produces a smaller spread than varying the initial conditions; however, there is greater variation in the slope of the lines. Finally, when varying the SMB data (rightmost), the result doesn’t change much and is quite stable. Thus, according to our analysis, the model is least sensitive to the SMB parameters, and between the initial conditions and the sill parameters the judgement depends on what you care about more: spread or slope.

Data Assimilation

Data assimilation is a method to move models closer to reality using real world observations by readjusting the model state at specified times.

Image used with permission from Dr. Talea Mayo.

In this example we used the ensemble Kalman filter (EnKF) to perform our data assimilation. In basic terms, we initialized an ensemble (a series of model runs with perturbed initial conditions), performed data assimilation on each of the ensemble members, and then took the mean of the ensemble to get our final analysis.

The program used to model the glacier behavior and assimilate the data begins with choosing a set of initial conditions. The initial conditions are input to the model, which takes a step using a fourth-order Runge-Kutta method; the result is then plugged into a data assimilation method. Our main method is EnKF, as previously mentioned. Finally, the analyzed data from the assimilation is output and plugged back into the model. It should be noted that at some times the forecast output of the model is the same as the analyzed data.
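To give a feel for the analysis step, here is a minimal Python sketch for a single scalar state that is observed directly. The post does not say which EnKF variant was used, so this assumes the stochastic (perturbed observation) form; the function name `enkf_analysis` and its arguments are hypothetical.

```python
import random

def enkf_analysis(forecast, observation, obs_var, rng):
    """Perturbed-observation EnKF update for a scalar, directly observed state.

    forecast    : list of forecast ensemble members (at least two)
    observation : the observed value
    obs_var     : observation error variance
    rng         : a random.Random instance used to perturb the observation
    """
    n = len(forecast)
    mean_f = sum(forecast) / n
    # Sample variance of the forecast ensemble.
    var_f = sum((x - mean_f) ** 2 for x in forecast) / (n - 1)
    # Scalar Kalman gain: how much weight the observation receives.
    gain = var_f / (var_f + obs_var)
    analysis = []
    for x in forecast:
        # Each member sees its own noisy copy of the observation.
        perturbed_obs = observation + rng.gauss(0.0, obs_var ** 0.5)
        analysis.append(x + gain * (perturbed_obs - x))
    return analysis
```

The final analysis described above is then the mean of the returned ensemble, for example `sum(analysis) / len(analysis)`.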

Square Difference

The error measure we use to choose the best ensemble size and observation scheme is the square difference, $d^2$, which we define as $$\ d_t^2 = \left(x_t - x^a_t\right)^2 $$ where $x_t$ is the true state from the truth simulation at time $t$ and $x^a_t$ is the analysis state at time $t$.

Ensemble Size

In the interest of lowering computational costs, we use the square difference in order to minimize ensemble size while also minimizing error. To do this, we choose an ensemble size, calculate the square difference at each $t$, and then calculate the mean of all these square differences. We ran this calculation for ensemble sizes from 2 up to 75 and found that ensembles of size 7-10 were ideal, as this is the point where the average square difference levels off around the same value.
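The averaged error measure can be written as a short Python helper. This is only a sketch: the function name `mean_square_difference` is hypothetical, and the truth and analysis series are assumed to be sampled at the same times.

```python
def mean_square_difference(truth, analysis):
    """Mean over time of d_t^2 = (x_t - x^a_t)^2."""
    if len(truth) != len(analysis):
        raise ValueError("truth and analysis must have the same length")
    return sum((xt - xa) ** 2 for xt, xa in zip(truth, analysis)) / len(truth)
```

Evaluating this helper once per candidate ensemble size gives the curve whose leveling off is used to pick the size.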

Observation Scheme

We ran the model for various observation schemes to find the best one, i.e. the time frames and frequencies that produce a sufficiently small average square difference over the course of the model run. We found that for the period before 1900 the best observation frequency, while still using a small number of observations, would be every 19 years, for a total of 100 observations. Similarly, for the time frame of 1950-2300 we found that yearly observations, for a total of 350 observations, is the best frequency.

Model Runs

Using the facts we established in the previous two sections, we ran the model using EnKF for the time frame of 0-2022 in order to project $H$ and $L$ into the future up to the year 2300. The following plots show the results of this experiment, which we will use to help calculate $Q$ and $Q_g$ over time, and in turn use it to calculate sea level rise.

Sea Level Rise

Using the formulas for $Q$ and $Q_g$ we can calculate the volume lost across the grounding line $$\ V_{gz} = W(Q-Q_g)t $$ We then used this to calculate the volume out at all times and added it up to get the accumulated volume loss. We assume that the width of the glacier is 50 km (at least in the case we show here). To convert this to sea level rise, we note that 394.67 km$^3$ of ice is equivalent to $1$ mm of sea level, and get the following projection of sea level rise.
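The accumulation and unit conversion can be sketched in Python as follows. The units are an assumption made for illustration: fluxes are taken per unit width in km$^2$ per year and the time step in years, so that each volume increment comes out in km$^3$. The function name `sea_level_rise_mm` is hypothetical.

```python
KM3_ICE_PER_MM_SEA_LEVEL = 394.67  # conversion factor quoted in the post

def sea_level_rise_mm(Q, Q_g, dt, width_km=50.0):
    """Accumulated sea level rise (mm) after each time step.

    Q, Q_g   : lists of fluxes at each step (assumed km^2 per year)
    dt       : time step length (assumed years)
    width_km : glacier width W in km
    """
    total_volume = 0.0
    rise = []
    for q, qg in zip(Q, Q_g):
        # Volume increment from V = W * (Q - Q_g) * t over one step.
        total_volume += width_km * (q - qg) * dt
        rise.append(total_volume / KM3_ICE_PER_MM_SEA_LEVEL)
    return rise
```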

Next Steps

Using data assimilation can help to inform the glacier modelers and glaciologists who collect data about how to collect data in an efficient way. This can help researchers to more efficiently utilize funding and avoid unnecessarily expensive data collection that does not significantly improve glacier models. Data assimilation should be explored within more complicated glacier models, as the model used here is quite simplified. If more research is performed on this technique, it could greatly improve the practice of glacier modeling. Data assimilation can also be used for many geophysical modeling tasks, such as weather forecasting and hurricane storm surge modeling. Going forward, we plan to integrate the output of the glacier model into the ADCIRC hurricane storm surge model to predict the impact of glacier model on storm surge inundation.

References

The long future of Antarctic melting
Marine ice sheet instability amplifies and skews uncertainty in projections of future sea-level rise
Projected climate change impacts on hurricane storm surge inundation in the coastal United States
Response of marine-terminating glaciers to forcing: time scales, sensitivities, instabilities, and stochastic dynamics

About the Team

Emily Corcoran

Emily Corcoran is a junior at New Jersey Institute of Technology, majoring in Mathematical Sciences with a concentration in Applied Statistics and Data Analysis. Before this REU, she has worked as a research assistant in her school’s Visual Perception Lab. She is a student in the Albert Dorman Honors College and is an active member of NJIT’s school yearbook and Knit ’n Crochet club. When she is not in class, she can be found reading, listening to music, or attending a local play.

Logan Knudsen

Logan Knudsen is a senior at Texas A&M University majoring in Mathematics with minors in Oceanography and Meteorology. Before this REU, he has worked doing research on Data Analysis using Benford’s Law and as a Teaching Assistant. Logan is currently the President of Texas A&M’s Math Club and a member of student radio, KANM. When not in class, he can be found reading, playing the guitar or playing video games with his friends.

Hannah Park-Kaufmann

Hannah Park-Kaufmann is a junior at Bard College and Conservatory, majoring in Mathematics and Piano Performance. Before this REU, she conducted research on Numerical Semigroups and Polyhedra, and on Identifying Universal Traits in Healthy Pianistic Posture using Depth Data. She tutors math in the Bard Prison Initiative (BPI). When not doing math, she can be found playing piano, reading scores, reading literature and/or eating.

Fast Training of Implicit Networks with Applications in Inverse Problems

Mon, 27 Jun 2022 11:36:49 -0400

This post was written by Linghai Liu, Shuaicheng Tong, and Lisa Zhao and published with minor edits. The team was advised by Dr. Samy Wu Fung. In addition to this post, the team has also given a midterm presentation, filmed a poster blitz video, created a poster, published code, and written a paper.

What are Inverse Problems?

Inverse problems consist of recovering a signal $x^\ast$ (e.g. an image, a parameter of a PDE, etc.) from indirect, noisy measurements $d$. These problems arise in many applications such as medical imaging, computer vision, geophysical imaging, etc.

This measurement process is usually modeled as an operator $\mathcal{A}$, satisfying the following equation: $$ d = \mathcal{A} x^\ast + \boldsymbol{\varepsilon}, $$ where $\mathcal{A}$ is a mapping from signal space $\mathbb{R}^n$ of original images to measurement space $\mathbb{R}^m$. Since our project deals with image deblurring, we have the following variables:

$d \in \mathbb{R}^{m}$: blurred image with noise
$x^\ast \in \mathbb{R}^{n}$: original image
$\boldsymbol{\varepsilon} \in \mathbb{R}^{m}$: random unknown noise

Solving Inverse Problems from a Classical Approach

Using a direct inverse, we have: $$ d = \mathcal{A} x^\ast + \boldsymbol{\varepsilon} \Longrightarrow x^\ast = \mathcal{A}^{-1} d - \mathcal{A}^{-1} \boldsymbol{\varepsilon} $$ However, since $\boldsymbol{\varepsilon}$ is unknown, directly inverting may end up amplifying this noise factor. Because of this noise corruption, the reconstructed image ends up being unrecognizable.

To better visualize this, we have the following set of pictures:

Original Image

Blurred Noisy Image

Direct Inverse

In order to minimize the noise factor, we want to formulate a regularized optimization problem.

We essentially want to find the minimum distance between the reconstructed image and the observed blurred image, plus a regularizer $R(x)$.

This regularizer is chosen based on prior knowledge of the data; this can often lead to inaccuracies—meaning the reconstructed image will be a bit blurry.

For example, using a gradient descent scheme where we handpick a regularizer to help stabilize the reconstruction, we have the following set of pictures:

Original Image

Blurred Noisy Image

Gradient Descent

We see that the reconstructed image using gradient descent is a huge improvement from direct inverse; however, there are still blurry areas we can improve on. In search of a better method, we turn towards implicit learning.
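For readers who want to experiment, here is a small Python sketch of regularized gradient descent on a one-dimensional toy signal. Everything specific in it is an assumption: the objective is taken to be $\tfrac12\|Ax-d\|^2 + \lambda R(x)$ with the handpicked Tikhonov regularizer $R(x)=\tfrac12\|x\|^2$, and `blur_matrix` is a made-up three-point averaging blur, not the operator used in our experiments.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def blur_matrix(n):
    """Toy three-point averaging blur (a stand-in for the true blur operator)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                A[i][j] = 1.0 / 3.0
    return A

def objective(A, x, d, lam):
    """0.5 * ||A x - d||^2 + lam * 0.5 * ||x||^2 (assumed Tikhonov regularizer)."""
    residual = [r - di for r, di in zip(matvec(A, x), d)]
    return 0.5 * sum(r * r for r in residual) + 0.5 * lam * sum(xi * xi for xi in x)

def regularized_gradient_descent(A, d, lam=0.01, step=0.5, iterations=200):
    """Minimize the regularized objective starting from the zero signal."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        residual = [r - di for r, di in zip(matvec(A, x), d)]
        # Gradient: A^T (A x - d) + lam * x
        grad = [g + lam * xi for g, xi in zip(matvec(At, residual), x)]
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x
```

The step size of 0.5 is safe for this toy blur because its rows sum to at most one, which keeps the largest eigenvalue of $A^TA$ at or below one.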

Implicit Deep Learning

The issue with the classical approach is that the regularizer is chosen heuristically. To combat this, our approach now is to utilize data to learn and train the regularizer.

To do this we mimic gradient descent, but replace the gradient of the regularizer, $\lambda \nabla_x R$, with a trainable network.

However, this creates some problems concerning memory cost and the number of layers, $K$, in our neural network. The memory grows linearly as $K$—chosen heuristically—increases.

With implicit deep learning we send $K \to \infty$ until we find a fixed point of a single layer $T_\Theta(\cdot)$.

Implicit Backpropagation

Suppose now we have found a fixed point $x^\ast$ for a single layer. Then, $$ x^\ast = T_\Theta (x^\ast) $$

Using implicit differentiation on the equation above we have,

$$\frac{d x^\ast}{d \Theta} = \left( I - \frac{d T_\Theta (x^\ast)}{d x^\ast}\right)^{-1} \frac{\partial T_\Theta (x^\ast)}{\partial \Theta}$$

However, solving this is very expensive because of the inverse term.

To circumvent this issue, we use a recently proposed method called Jacobian-Free Backpropagation.

Jacobian-Free Backpropagation (JFB)

The goal of JFB is to alleviate memory requirement and avoid high computational cost in implicit networks.

The key idea is to replace the problematic Jacobian $$\left( I - \frac{d T_\Theta (x^\ast)}{d x^\ast}\right)$$ with the identity matrix $I$.

For a comparison, if we were to calculate the true gradient using implicit networks we have the following equation:

$$\nabla_\Theta \ell = \frac{d \ell}{d x^\ast} \left( I - \frac{d T_\Theta (x^\ast)}{d x^\ast}\right) ^{-1} \frac{\partial T_\Theta (x^\ast)}{\partial \Theta}$$

Using JFB to approximate the gradient we only need to solve: $$p_\Theta = \frac{d \ell}{d x^\ast} \frac{\partial T_\Theta (x^\ast)}{\partial \Theta}$$ which is a descent direction for the loss $\ell$.

Utilizing JFB, we avoid computing the Jacobian term. As a result, implicit networks are trained faster and more easily implemented.
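The difference between the two gradient formulas can be seen on a scalar toy problem. The Python sketch below is purely illustrative: the layer `T`, the loss $\ell = \tfrac12 (x^\ast - \text{target})^2$, and all names are hypothetical choices, picked so that $T$ is a contraction in $x$ (its derivative is at most $0.5$).

```python
import math

def T(x, theta, d):
    """Toy single-layer map; a contraction in x because |dT/dx| <= 0.5."""
    return 0.5 * math.tanh(x) + theta * d

def fixed_point(theta, d, tol=1e-12, max_iter=1000):
    """Iterate x <- T(x) until successive iterates agree to within tol."""
    x = 0.0
    for _ in range(max_iter):
        x_new = T(x, theta, d)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def gradients(theta, d, target):
    """Return (true implicit gradient, JFB direction) for the toy loss."""
    x_star = fixed_point(theta, d)
    dl_dx = x_star - target
    dT_dx = 0.5 * (1.0 - math.tanh(x_star) ** 2)
    dT_dtheta = d
    # True gradient keeps the inverse term (I - dT/dx)^{-1}.
    true_grad = dl_dx * dT_dtheta / (1.0 - dT_dx)
    # JFB replaces the inverse term with the identity.
    jfb = dl_dx * dT_dtheta
    return true_grad, jfb
```

In this scalar setting the two quantities always share a sign, since $1 - dT/dx$ is positive for a contraction, which is the one-dimensional picture of why the JFB direction is still a descent direction.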

Note: the JFB approach relies on the following conditions holding:

$T_\Theta$ is a contraction mapping with Lipschitz constant $\gamma$
$T_\Theta$ is continuously differentiable w.r.t. $\Theta$
$M := \frac{\partial T_\Theta}{\partial \Theta}$ has full column rank
$M$ is well-conditioned, i.e., $\kappa (M^T M) < \frac{1}{\gamma}$

Numerical Experiments

In our project we used the CelebA dataset, which consists of annotated celebrity faces. The images are categorized into various sections based on specific features that the celebrities have, for example whether or not they have bangs, wear glasses, have a pointy nose, etc.

Results

The results are as follows

From the graph we see that the loss is decreasing as the number of epochs increases. An epoch is one complete pass of the entire dataset through our algorithm.

Note: Two metrics are commonly used for assessing the quality of reconstructed images: the peak-signal-to-noise ratio (PSNR, a positive number, best at $+\infty$) and the structural similarity index measure (SSIM, also positive, best at $1$).
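As a small illustration of the first metric, here is a Python sketch of PSNR using its standard definition, $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. Images are assumed to be flattened into lists of pixel values in $[0, 1]$; the function name `psnr` and the `max_value` argument are hypothetical and this is not the code used to produce our results.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in decibels between two flattened images."""
    if len(reference) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images: the best possible value
    return 10.0 * math.log10(max_value ** 2 / mse)
```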

Acknowledgements

We sincerely thank our mentor, Dr. Samy Wu Fung, for his guidance, and the other mentors at Emory University for the opportunity.

More About the Team

Linghai Liu is a rising senior at Brown University, double concentrating in applied mathematics - computer science and mathematics. His main interests lie at the intersection of statistical theory, machine learning, and optimization. Outside of work, he enjoys reading novels and watching anime.

Shuaicheng Tong is a rising junior at the University of California, Los Angeles, majoring in applied mathematics and minoring in statistics. He is interested in optimization and machine learning. He volunteers at the UCLA Statistics Club where he tutors mathematics and statistics. Outside of school, he enjoys working out, hiking, and watching Star Wars shows.

Lisa Zhao is a rising sophomore at the University of California, Berkeley, double majoring in statistics and economics. She is interested in learning about how statistics is used as a powerful tool in finance. Outside of work, she enjoys swimming, drawing, and watching TV shows.

Learning Ordinary Differential Equations from Data

Mon, 27 Jun 2022 11:36:49 -0400

This post was written by Emma Hayes, Mathias Heider, and Carrie Vanty and published with minor edits. The team was advised by Dr. Deepanshu Verma.

In addition to this post, the team has also created slides for a midterm presentation, a poster blitz video, and a poster.

Project Overview:

Imagine a spring mass system (Figure 1). What if you wanted to find the location of the mass at any given time point? In order to find this information, you must first understand the dynamics of the system. A spring mass system is an example of simple harmonic motion where total energy is conserved. This means that you can model the dynamics using a Hamiltonian Ordinary Differential Equation, which has the quality of energy conservation. To solve our problem, we use neural networks utilizing Hamiltonians in the forward propagation to predict our coordinates.

Figure 1 - Spring Mass by Oleg Alexandrov (public domain)

Our project aims to compute the value of the Hamiltonian for any given time and set of initial conditions using Hamiltonian Inspired neural networks. We will first introduce the mathematical background of our project and the novel technique we implemented for our forward propagation. Results will then be presented and analyzed. Lastly, we will discuss how our project expands upon both the Ruthotto [3] and Greydanus [6] papers.

Background:

Often, neural networks are thought of as a black box, where the actual inner workings are not the main focus. However, since we are mathematicians, we want to understand how the network functions in order to best optimize it. For this reason, we began by looking at why and how ODEs were first used in neural networks. Ordinary differential equations were first used in Residual Neural Networks due to the similarity between the forward propagation equation and the discretization of an ordinary differential equation. The only difference is multiplication by the step size, which we denote as $\mathbf{h}$. $$Y_{j+1} = Y_j + \mathbf{h}\sigma(Y_j K_j + b_j)$$

In the context of our residual neural network, using the ODE as forward propagation means that each layer of the network moves one time step forward in the discretization of our network ODE. The weights and biases, $K$ and $b$, may change between layers depending on the given values in the network ODE. The output of our network is the Hamiltonian value at the given time, and from that we are able to approximate position and velocity values for the mass.

When estimating coordinates of a Hamiltonian system, or the value of the Hamiltonian itself, the Hamiltonian relationships are important for forward propagation. Hamiltonian systems intrinsically conserve energy, meaning the network is better able to learn conservation laws and predict examples with energy conservation. Without considering these relationships, it would be much more difficult for the network to learn conservation, which can cause a buildup of error. Several studies [3, 6] have found that using the Hamiltonian equations on Hamiltonian data sets decreases error. We plan to investigate this further and find which discretization methods and algorithms perform best.

Learning Hamiltonians from Data

To learn about Hamiltonian dynamics from data, we use neural networks, specifically a modified version of the Residual Neural Network (RNN), which we call a Hamiltonian Inspired Neural Network (HINN), drawn from [3]. To create this HINN, we primarily used two packages, PyTorch and hessQuik. The difference between our HINN and both the traditional RNN and the HINN of Ruthotto et al. is in our forward propagation method and how we input values into our MSE loss function. The forward propagation uses the autograd feature to calculate both $\frac{\partial H_{\theta}}{\partial \mathbf{p}}$ and $\frac{\partial H_{\theta}}{\partial \mathbf{q}}$, where $\theta$ are the network parameters we wish to optimize and $H_{\theta}$ is our network output. We then use these values in discretizing $\mathbf{p_{\theta}}$ and $\mathbf{q_{\theta}}$. $$\mathbf{p_{\theta+1}} = \mathbf{p_{\theta}} + h\frac{\partial H_{\theta}}{\partial \mathbf{q}} $$ $$\mathbf{q_{\theta+1}} = \mathbf{q_{\theta}} - h\frac{\partial H_{\theta}}{\partial \mathbf{p}}$$
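To see what a single update of this form does, here is a minimal Python sketch that follows the two update equations exactly as written above. It is not our network: the learned $H_{\theta}$ is replaced by a known toy Hamiltonian $H = \tfrac12 p^2 + \tfrac12 q^2$ (a unit spring mass system), and the partial derivatives are written out by hand as a stand-in for autograd. The names `hamiltonian` and `euler_step` are hypothetical.

```python
def hamiltonian(p, q):
    """Toy Hamiltonian standing in for the learned H_theta."""
    return 0.5 * p * p + 0.5 * q * q

def euler_step(p, q, h):
    """One update of p and q using the partial derivatives of H."""
    dH_dp = p  # hand-written stand-in for autograd
    dH_dq = q
    p_next = p + h * dH_dq
    q_next = q - h * dH_dp
    return p_next, q_next
```

For this toy Hamiltonian each step multiplies the energy by $1 + h^2$, so a plain Euler update only conserves energy approximately, and small step sizes matter.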

The new values are then plugged into our MSE loss function. Using these techniques we created two HINNs for two different examples, those being the Simple Spring Mass System and the Two Body Problem.

Results:

Figure 2 - Spring Mass System 1. Training Loss Graph 2. Learned Position Values over True Position Values 3. Relationship of Learned p and q over ground truth

Figure 3 - Two Body Problem 1. Training Loss Graph 2. Learned Trajectories Graph over True Trajectories 3. Learned Energy over True Energy 4. Learned Position over True Position.

What’s Next

Currently, our results look promising, as our learned values closely match the ground truth values. Moving forward, we would like to add more complexity to our current examples: instead of fixed time steps, we would like to attempt variable time steps. We would also like to take on the Three Body Problem, which unlike our current examples has no analytical solution.

Information about Us

Mathias Heider is a rising senior at the University of Delaware, majoring in Computer Science and Mathematics and Economics. His interests are in machine learning and data science, specifically as they relate to datasets with bioinformatics applications. Outside of class, Mathias likes to ski, hang out with friends, and play video games.

Carrie Vanty is a rising senior at Middlebury College, majoring in mathematics. She loves working with ordinary differential equations in applied math. Outside of school, Carrie likes to ski, hike, and craft.

Emma Hayes is a rising junior at Carnegie Mellon University, majoring in Computational and Applied Mathematics. She enjoys learning about new topics in mathematics and incorporating computer science into her work. Outside of class, Emma likes hiking, baking, and making art.

References

[1] https://en.wikipedia.org/wiki/Effective_mass_(spring%E2%80%93mass_system)#/media/File:Simple_harmonic_oscillator.gif

[2] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[3] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 22, 2018.

[4] Brian de Silva, Kathleen Champion, Markus Quade, Jean-Christophe Loiseau, J. Kutz, and Steven Brunton. Pysindy: A python package for the sparse identification of nonlinear dynamical systems from data. Journal of Open Source Software, 5(49):2104, 2020.

[5] Alan A. Kaptanoglu, Brian M. de Silva, Urban Fasel, Kadierdan Kaheman, Andy J. Goldschmidt, Jared Callaham, Charles B. Delahunt, Zachary G. Nicolaou, Kathleen Champion, Jean-Christophe Loiseau, J. Nathan Kutz, and Steven L. Brunton. Pysindy: A comprehensive python package for robust sparse system identification. Journal of Open Source Software, 7(69):3994, 2022.

[6] Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. ArXiv, abs/1906.01563, 2019.

Low-Precision Algorithms for Image Processing

Mon, 27 Jun 2022 11:36:49 -0400

This post was written by Xiaoyun Gong, Yizhou Chen, and Xiang Ji and published with minor edits. The team was advised by Dr. James Nagy. In addition to this post, the team has also created slides for a midterm presentation, a poster blitz video, code, and a paper.

In One Sentence:

Our group works on experimenting with iterative methods for solving inverse problems at different precision levels.

Background: Why Low Precision?

What is the most important aspect of an excellent gaming experience? A lot of people would answer: real-time performance! Everyone wants their games to be fast, and it is always a bummer when the screen freezes during critical combat. This is why we are investigating low precision arithmetic: to decrease the computation time and speed things up.

Nowadays, most computer systems operate on double precision (64-bit) arithmetic. However, if we decrease the number of bits for each number to 16 or even fewer, the processing time can be significantly reduced, although the benefit comes at the cost of a loss of accuracy.

Source: Using tensor cores for mixed-precision scientific computing, Oct. 2021. https://developer.nvidia.com/blog/tensor-cores-mixed-precision-scientific-computing/

Simulating Low Precision

MATLAB function chop

To simulate low precision arithmetic on our 64-bit computers, we use a MATLAB package called chop. The toolbox allows us to explore single precision, half precision, and other customized formats. Each input needs to be converted to the target format, but the real work comes from chopping the result of every operation. A toy example is computing $x + y \times z$ in half precision with chop: the inputs are chopped first, then the product $y \times z$ is chopped, and finally the sum is chopped.
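The same idea can be imitated in Python. The sketch below is a NumPy analogue, not the MATLAB chop code itself: the helper `chop_half` is a hypothetical stand-in that rounds to IEEE half precision via `np.float16`, and the input values are arbitrary.

```python
import numpy as np

def chop_half(value):
    """Round a number to the nearest IEEE half-precision value (NumPy stand-in for chop)."""
    return np.float16(value)

x, y, z = 0.1, 3.3, 7.7  # arbitrary example inputs

# Step 1: chop the inputs.
xh, yh, zh = chop_half(x), chop_half(y), chop_half(z)
# Step 2: chop after every operation, not only at the end.
prod = chop_half(yh * zh)
result_half = chop_half(xh + prod)

# Reference value in double precision.
result_double = x + y * z
print(result_half, result_double, abs(float(result_half) - result_double))
```

The printed difference shows the accuracy lost by rounding the inputs and each intermediate result to 16 bits.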

Blocking

When a number is chopped from double precision to half precision, a lot of bits are dropped (from 64 bits to 16 bits). This certainly causes some inaccuracy, so in order to reduce the errors, a method called blocking is used. Blocking breaks a large operation into smaller chunks, each of which is computed independently; the partial results are then summed.

For each precision and block size, we compute the inner product 20 times and take the average. The errors are calculated as the differences between the result of the chopped version of the inner product function and that of the built-in function in MATLAB.

On the left-hand side of the graph, where the size of the vector is 1000, the errors of half precision are the largest because it has the fewest bits. If we take a closer look at only half precision, we get the graph on the right with different vector sizes. The errors decrease sharply when the blocking method is introduced. However, larger block sizes do not necessarily mean lower errors, as the graph suggests: the errors increase again as the block size keeps growing. That is because a very large block size is the same as doing no blocking at all. For example, for a size-500 vector, once the block size reaches 500, the whole vector is put into the first block, just as when blocking is not used. Therefore, the line becomes flat from 500 onward. We use 256 as our default block size in our code because the matrix dimension is rather large in our problem.
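The effect of blocking can be reproduced with a small NumPy experiment. This is a sketch under assumptions: `np.float16` stands in for chop, the function names are hypothetical, and an all-ones vector of length 4096 is chosen because it makes the rounding behavior easy to follow.

```python
import numpy as np

def dot_half(x, y):
    """Inner product with every product and partial sum rounded to half precision."""
    acc = np.float16(0.0)
    for xi, yi in zip(np.asarray(x, dtype=np.float16), np.asarray(y, dtype=np.float16)):
        acc = np.float16(acc + np.float16(xi * yi))
    return acc

def blocked_dot_half(x, y, block_size):
    """Split the vectors into blocks, take each block's inner product, then sum the partial results."""
    total = np.float16(0.0)
    for start in range(0, len(x), block_size):
        partial = dot_half(x[start:start + block_size], y[start:start + block_size])
        total = np.float16(total + partial)
    return total

n = 4096
x = np.ones(n)
y = np.ones(n)

unblocked = dot_half(x, y)
blocked = blocked_dot_half(x, y, 256)
print(unblocked, blocked)
```

Without blocking, the running sum stalls at 2048, because 2049 is not representable in half precision and 2048 + 1 rounds back to 2048. With a block size of 256, every partial sum and every running total is exactly representable, so the blocked result is the exact value 4096. Choosing a block size equal to the vector length reduces to the unblocked computation, matching the flat part of the error curve described above.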

Inverse Problems and Iterative Methods

Inverse problems are problems where our goal is to find the internal or hidden information (inputs) from outside measurements (results). The internal data can be approximated by iterative methods, which repeatedly apply the same set of update equations, with the variables updated in each round, in the hope that each iterate gets closer to the true value we desire.

Conjugate Gradient Method

The Conjugate Gradient algorithm (CG) aims to solve the linear system $Ax = b$, where $A$ is SPD (symmetric and positive definite), by recasting it as an optimization problem in which we minimize $\phi(x)=\frac{1}{2}x^{T}Ax-x^{T}b$. The equivalence can be seen from $\nabla \phi (x) = 0 \implies Ax-b=0$.

In each step, the method provides us with a search direction and a step length so that the error of this iteration is $A$-orthogonal to the search direction of the previous iteration. Eventually, it converges to the minimizer. The CGLS algorithm is the least-squares version of the CG method, applied to the normal equations $A^TAx = A^Tb$. However, CGLS requires computing inner products, which can overflow for large-scale problems in low precision.
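To make the role of the inner products concrete, here is a minimal double-precision sketch of the standard CGLS recurrence in NumPy. It is not the modified IR Tools code; the function name `cgls`, the stopping tolerance, and the random test problem are assumptions made for illustration.

```python
import numpy as np

def cgls(A, b, n_iter=50, tol=1e-20):
    """Textbook CGLS: CG applied to A^T A x = A^T b without forming A^T A."""
    x = np.zeros(A.shape[1])
    r = b - A @ x
    s = A.T @ r
    p = s.copy()
    gamma = s @ s                # inner product 1 (can overflow in low precision)
    for _ in range(n_iter):
        if gamma <= tol:         # normal-equations residual is negligible
            break
        q = A @ p
        alpha = gamma / (q @ q)  # inner product 2 gives the step length
        x = x + alpha * p
        r = r - alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p  # new search direction
        gamma = gamma_new
    return x

# Small random least-squares problem, for illustration only.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x_cgls = cgls(A, b)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_cgls - x_ref))
```

The two quantities `s @ s` and `q @ q` are sums of squares over the whole vector, which is why they are the natural place for overflow when the vectors are long and the format has a small range.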

Chebyshev Semi-Iterative Method

The Chebyshev Semi-Iterative (CS) Method requires no inner product computation, which is great because inner products can easily cause overflow in low precision. But there is always a trade-off! The CS method requires the user to have an idea of the range of the eigenvalues of the matrix A. The result given by CS is a linear combination of the solutions from all iterations, and the weights are obtained from the Chebyshev polynomials, which have the favorable property of keeping the error at each iteration of CS below a known upper bound.
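The post does not spell out the recurrence, so the sketch below uses the textbook Chebyshev iteration for an SPD system whose eigenvalues lie in a known interval. The function name, the diagonal test matrix, and the eigenvalue bounds are assumptions for illustration; applying the method to the deblurring problem (for example through the normal equations) would need bounds for that operator instead.

```python
import numpy as np

def chebyshev_iteration(A, b, lam_min, lam_max, n_iter=100):
    """Chebyshev iteration for SPD A with eigenvalues in [lam_min, lam_max].

    Only matrix-vector products and vector updates are used: no inner products.
    """
    theta = 0.5 * (lam_max + lam_min)  # center of the eigenvalue interval
    delta = 0.5 * (lam_max - lam_min)  # half-width of the interval
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    d = r / theta
    for _ in range(n_iter):
        x = x + d
        r = r - A @ d
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        d = rho_next * rho * d + (2.0 * rho_next / delta) * r
        rho = rho_next
    return x

# Test problem with exactly known eigenvalues 1, 2, ..., 10.
A = np.diag(np.arange(1.0, 11.0))
b = np.ones(10)

x_cheb = chebyshev_iteration(A, b, lam_min=1.0, lam_max=10.0)
print(np.linalg.norm(A @ x_cheb - b))
```

All scalars in the recurrence depend only on the eigenvalue bounds, which is the trade-off mentioned above: no inner products are needed, but the interval must be supplied by the user.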

Experiment

IR Tools

We modify the CGLS method in the IR Tools package in MATLAB so that it can operate in lower precision, and we use two test problems from the same package to investigate how the method performs at lower precision, mainly half precision.

Image Deblurring Using CGLS

First, we use our modified version of CGLS without regularization to solve the image deblurring problem. In this application, we solve Ax = b, where b is an observed blurred image, A is a matrix that models the blurring operation, and x is the desired clean image. At first, we did not add any noise to b; the results are shown below.

We use our modified version of CGLS for single and half precision, and the image in single precision is similar to the image in double precision. However, for half precision, the background is not the same as that in the double-precision or single-precision image; it contains more artifacts.

We also plot the error norms of the solution at each iteration using different precision levels.

From the graph, all three error norms overlap from the beginning until around the 20th iteration, where the half-precision errors begin to deviate from those in single and double precision. The difference is due to the round-off errors of half precision, which accumulate and take over. In addition, the error norms for half precision terminate at the 28th iteration because overflow in the inner products causes NaNs (Not a Number) to be computed during the iteration.

After investigating the idealized situation where there is no noise in the observed image, we then apply our code to problems that contain additive random noise to see how it is likely to perform in real life. That is, we try to compute x from the observed image b = Ax + noise.

For half precision, with 0.1% noise, the picture looks almost the same as the one that contains no noise. However, if the noise level is increased to 1%, the background has substantially more artifacts, while the middle object is still identifiable. Noise has taken over the black background but not the satellite yet. Eventually, with 10% noise, the whole image is flooded with noisy artifacts; the picture no longer contains any meaningful information. Notice that the results below are generated using x from the best iteration, that is, the iteration with the smallest error norm, not from the last iteration.

Now we turn our attention to the error norm, the difference between the original image and the one our algorithm generates at each iteration. When 0.1% noise is added, the error norm decreases significantly as the number of iterations goes up, across all three formats. Intriguingly, for images with 1% or 10% noise, the best reconstruction is not the last iteration but somewhere in the middle (around the 50th iteration for 1% and the 10th for 10%). The reason behind this phenomenon is that as we invert the observed image, b, the noise blended into it also gets inverted at each iteration. Eventually, the inverted noise accumulates and dominates the solution. We show the results where the error norm is the smallest to see the best possible solution we can compute. However, in reality the true x is not known, meaning we don’t know the error norms, so in practice we could only use results from the last iteration, not from the best iteration.

Image Deblurring Using CS

In order to prevent overflow, we experiment with the CS algorithm (where no inner products are needed) and use chop for lower precision. We apply Tikhonov regularization to CS after finding that the algorithm performs poorly when A is ill-conditioned, because of its close-to-zero singular values. We now solve $$\min_{x} \left\{\|Ax-b\|_2^2+\lambda^2\|x\|_2^2\right\}$$ where $\lambda$ is a parameter that needs to be chosen. Here we show experiments for the case with 10% noise, and we use $\lambda = 0.199$.
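One way to see what the regularized problem does is to note that it is an ordinary least-squares problem for the stacked system obtained by appending $\lambda I$ to $A$ and zeros to $b$, whose normal equations are $(A^TA + \lambda^2 I)x = A^Tb$. The NumPy sketch below checks this equivalence on a small random problem; the function name and test data are assumptions, and how the team's code applies the regularization inside CS is not shown in this post.

```python
import numpy as np

def tikhonov_solve(A, b, lam):
    """Solve min ||Ax - b||^2 + lam^2 ||x||^2 as least squares on a stacked system."""
    n = A.shape[1]
    A_aug = np.vstack([A, lam * np.eye(n)])   # [A; lam*I]
    b_aug = np.concatenate([b, np.zeros(n)])  # [b; 0]
    return np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]

# Small random problem, for illustration only.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
lam = 0.199  # the value quoted in this post

x_reg = tikhonov_solve(A, b, lam)
# Same solution from the regularized normal equations.
x_normal = np.linalg.solve(A.T @ A + lam**2 * np.eye(8), A.T @ b)
# Unregularized solution, for comparison of norms.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_reg - x_normal), np.linalg.norm(x_reg), np.linalg.norm(x_ls))
```

The shifted matrix $A^TA + \lambda^2 I$ has all eigenvalues at least $\lambda^2$, which removes the close-to-zero singular values that caused trouble.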

From the graph below, it is clear that even with 10% noise, the half-precision image looks very similar to that in double precision, better than what we have using CGLS (results from the last iteration).

For the image deblurring problem, we further confirm the similarity by plotting the error norms.

We can see that the error norms of the three precision levels overlap, illustrating that the result in half precision is close to that in double precision.

Image Deblurring Using CGLS with regularization

To fairly compare CGLS and CS, we add Tikhonov regularization to CGLS and run the test problem again. The diagrams are listed below.

The diagram for half precision looks much better than that produced by CGLS without regularization. However, differences are still present in the background between half and double precision, and toward the end of the iterations the error norms for half precision increase again. If we zoom in on the graph of the error norms for CS and CGLS with regularization, we can see that the error norms at half precision converge for CS but increase rapidly for CGLS, suggesting that for half precision, CS is a better choice, especially when the noise level is high. When the noise level is close to zero, CS becomes susceptible to the accumulation of round-off errors. For double precision, however, the CGLS method with regularization is clearly more stable. Therefore, CGLS with regularization is more suitable for double precision.

We performed similar experiments with an image reconstruction problem from tomography; see our paper for further details!

More about us

Yizhou Chen

Hi! My name is Yizhou, but I go by Riley as well. I'm a rising junior at Emory University, double majoring in Applied Math and Physics. My interest is in computational Math & computational Physics. Outside work I enjoy watching sitcoms and my favourite one is Frasier! I am an animal person and I have a toy poodle who's nine years old. I also love cycling and hiking.

Xiaoyun Gong

Hello I am Xiaoyun Gong. I am a rising senior majoring in Applied Mathematics and Statistics. I am interested in math and I also enjoy coding!! In my free time I like watching anime (most recent favorite is Made in Abyss) and drawing. I like sweet food and I am a cat person. 🐱

Xiang Ji

Hi, I am Xiang Ji, but you can also call me Zoe. I am a rising junior at Emory University who is double majoring in applied mathematics and statistics and art history. My research interest is in computational mathematics and image processing. I enjoy going to art museums and watching movies. It's quite fun doing research this summer!

Model-based approaches to neuronal network firing and its subsequent validation with a previously recorded in-vivo dataset

Mon, 27 Jun 2022 11:36:49 -0400

The research featured in this blog post was performed by Carly Ferrell, Qile Jiang, and Olivia Leu, and the team was advised by Dr. Michael Caiola. This blog post was written by Carly Ferrell and published with minor edits. In addition to this post, the team has also given a midterm presentation, made a poster blitz video, and created a poster. They are currently working on a paper with the aim to publish it in an academic journal.

Mathematical Modeling of Healthy and Parkinsonian Firing Patterns in the Primate Thalamocortical Motor Circuit

Parkinson’s disease (PD) is a slowly progressing neurodegenerative disease featuring motor symptoms such as bradykinesia, muscular rigidity, and resting tremors.¹ In industrialized countries, PD affects 0.3% of all people and 1% of people over age 60.⁶ The basal ganglia, motor thalamus, and motor cortex are three main components of the brain’s motor circuit and are responsible for movement planning and execution; movement disorders such as PD can develop when the typical activity of this circuit is disrupted.²,³ Specifically, PD is associated with the loss of dopaminergic neurons and altered neuronal oscillations in the beta-band (13-30 Hz).⁴ Other projects, such as the 2019 paper by M. Caiola and M. Holmes, have investigated the changes in the basal ganglia neuronal activity from a mathematical modeling perspective, but little research has been done on the parkinsonism-associated changes in the areas of the thalamus and cortex which are involved in the motor circuit.⁵ We employ a mathematical model to investigate network connection changes within the thalamocortical motor circuit to better understand the transition from healthy to parkinsonian states in the brain.

Firing Rate Model

We choose to use a firing rate model to describe our system. This approach can successfully represent networks, since each unit in the model can represent a population of neurons receiving input (average firing rates) from other neuron populations.

A simplified circuit diagram of the thalamocortical motor circuit network is shown below, and provides the neuroscience basis for our model. The rounded squares each represent a population of neurons, which are connected by either excitatory (arrow-tipped lines) or inhibitory (circle-tipped lines) synaptic weights. The green circle represents the interneuron population of the thalamus.

GPi (y₁)	globus pallidus internal
TC (y₂)	thalamocortical neurons
CT5 (y₃)	corticothalamic layer 5
CT6 (y₄)	corticothalamic layer 6
RTN (y₅)	thalamic reticular nucleus
IN (γ)	thalamic interneuron population

Treating the interneuron population as a “relay,” γ, we can establish the following system of equations:

τ₁y’₁ = −y₁ + f₁(β₁ + h)

τ₂y’₂ = −y₂ + f₂(w₁₂y₁ + w₃₂y₃ + w₄₂y₄ − w₅₂y₅ + γ + b₂)

τ₃y’₃ = −y₃ + f₃(w₂₃y₂ + w₄₃y₄ + b₃)

τ₄y’₄ = −y₄ + f₄(w₃₄y₃ + b₄)

τ₅y’₅ = −y₅ + f₅(w₄₅y₄ + b₅)

γ = −w₆₂(−w₁₆y₁ − w₅₆y₅ + w₄₆y₄ + b₆)

y_i	average neuronal population firing rate
w_jk	weight of the connection between populations j and k
h	constant basal ganglia input
τ_i	membrane time constant
f_i	activation function

Note that w₂₃ represents the difference between the excitatory and inhibitory inputs from TC to CT5. Note also that w_jk > 0 and τ_i > 0.

This can be represented with vectors and matrices as:

Ty’ = −y + F(x) ⟹ Ty’ = Ay + B

Activation Function Selection

Neurons traditionally respond to inputs sigmoidally.⁷,⁸,⁹ However, this model creates a nonlinear system of equations for which it is impossible to solve for eigenvalues analytically. In order to attain eigenvalues and be able to comment on the behavior of the model as a whole, we must establish a simpler activation function that still manages to approximate experimental neuron discharge behavior.⁵ A piecewise linear (PWL) activation function is ideal in our case, as it allows us to break down a complex system into linear pieces which can be solved and manipulated:

We can break down this system into 3⁵ = 243 distinct linear regions in space, each with its own steady state (fixed point in space which the solution tends to as time increases). Out of these 243 regions, only the region in which each activation function is between 0 spikes/sec and its maximum firing rate contains a physiologically realistic steady state, further denoted as the middle region (outlined in green in the diagram below).
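A firing rate model of this form is straightforward to simulate. The sketch below is generic and makes several assumptions that are not taken from the post: the PWL activation is assumed to be zero below 0, the identity between 0 and the maximum rate $M_i$, and saturated at $M_i$ above it; the two-population weights, inputs, and time constants are toy values; and forward Euler is used as the time stepper.

```python
import numpy as np

def pwl(x, M):
    """Assumed piecewise linear activation: 0 below 0, identity up to M, saturated at M."""
    return np.clip(x, 0.0, M)

def simulate(W, b, tau, M, y0, dt=1e-4, n_steps=20000):
    """Forward Euler for tau_i * y_i' = -y_i + f_i((W y)_i + b_i)."""
    y = np.array(y0, dtype=float)
    for _ in range(n_steps):
        y = y + dt * (-y + pwl(W @ y + b, M)) / tau
    return y

# Toy two-population network (hypothetical values, not the post's weights).
W = np.array([[0.0, -0.5],    # population 2 inhibits population 1
              [0.5,  0.0]])   # population 1 excites population 2
b = np.array([20.0, 10.0])    # constant inputs
tau = np.array([0.02, 0.015]) # time constants in seconds
M = np.array([200.0, 200.0])  # maximum firing rates

y_final = simulate(W, b, tau, M, y0=[0.0, 0.0])
print(y_final)
```

For these toy values the steady state lies strictly between 0 and the maximum rates, so it sits in the analogue of the middle region, where the activation is the identity and the system is linear.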

Data Matching

This semi-linear firing rate model has a number of constant values that we must locate in experimental data and incorporate, namely the baseline firing rates, maximum firing rates, and membrane time constants for each neuron population involved in our simplified motor circuit model. We were able to find values for these parameters through literature review, although some required that we make estimates informed by information from areas of the brain that behave similarly or data on these parameters from mice, rats, or cats. However, there does not seem to be data that documents the baseline firing rate for the thalamic interneuron population in the primate brain. Given our uncertainty about the true baseline firing rate value for the primate thalamic interneuron population, we decided to create two models, one with the low and one with the high baseline. The parameter values are shown in the table below:

Neuron Population	b_i	M_i	τ_i
GPi (y₁)	55 Hz⁵,¹⁰,¹¹,¹²,¹³	200 Hz¹⁴	8 ms¹¹,¹⁵,¹⁶
TC (y₂)	18.5 Hz¹⁷	300 Hz	25 ms
CT5 (y₃)	7.25 Hz¹⁸	200 Hz	20 ms
CT6 (y₄)	7.25 Hz¹⁸	200 Hz	15 ms
RTN (y₅)	25 Hz¹⁷	500 Hz¹⁷	16.51 ms
IN (γ)	Low: 6 Hz¹⁹ High: 22.7 Hz²⁰	N/A	N/A

Stability and Steady State Conditions

No matter the disease state of our model, the neurons should not be at a state of maximal firing or absent firing for an extended period of time. Additionally, in Parkinsonian solutions, we should expect oscillations of firing rates. Thus the following must hold:

Middle region contains its own steady state, and trajectories must not stabilize in another region.
Healthy: Middle region is stable ⟶ trajectories are thus forced to stabilize in the middle region, making the system globally asymptotically stable.
Parkinsonian: Middle region is unstable ⟶ trajectories are thus forced to oscillate around the middle region, forming a globally stable limit cycle.

To determine stability, we use the fact that the PWL activation function makes the system linear within each of the 243 regions, so the eigenvalues of each region can be examined separately. We found 3 possible cases:

The region is stable regardless of weights.
The region’s stability is conditional on weight values.
The region (including the middle region) has eigenvalues that cannot be solved for analytically. Therefore, we used the Routh-Hurwitz Stability Criterion (RH) to derive 3 stability conditions.
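The Routh-Hurwitz criterion decides stability from the coefficients of the characteristic polynomial alone: all roots have negative real part exactly when every leading principal minor of the Hurwitz matrix is positive. The sketch below implements this general test numerically; it does not reproduce the three conditions derived for the model, and the function names and the small example matrix are assumptions.

```python
import numpy as np

def hurwitz_matrix(coeffs):
    """Hurwitz matrix of a0*s^n + a1*s^(n-1) + ... + an, with coeffs = [a0, ..., an]."""
    n = len(coeffs) - 1
    H = np.zeros((n, n))
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            k = 2 * j - i  # entry (i, j) holds a_{2j-i}, or 0 if out of range
            if 0 <= k <= n:
                H[i - 1, j - 1] = coeffs[k]
    return H

def is_hurwitz_stable(coeffs):
    """True if all leading principal minors are positive (assumes a0 > 0)."""
    H = hurwitz_matrix(coeffs)
    n = H.shape[0]
    return all(np.linalg.det(H[:k, :k]) > 0 for k in range(1, n + 1))

# (s+1)(s+2)(s+3) = s^3 + 6s^2 + 11s + 6: all roots negative.
stable_poly = is_hurwitz_stable([1.0, 6.0, 11.0, 6.0])
# (s-1)(s+2)(s+3) = s^3 + 4s^2 + s - 6: one positive root.
unstable_poly = is_hurwitz_stable([1.0, 4.0, 1.0, -6.0])

# For a linear region y' = A y + c, use the characteristic polynomial of A.
A = np.array([[-1.0, 2.0],
              [-2.0, -1.0]])  # hypothetical matrix with eigenvalues -1 ± 2i
stable_region = is_hurwitz_stable(np.poly(A))
print(stable_poly, unstable_poly, stable_region)
```

The appeal of the criterion for a model with symbolic weights is that the minors are polynomial expressions in the weights, so stability turns into explicit inequalities even when the eigenvalues themselves have no closed form.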

Weight Search

The current literature does not specify the baseline firing rate for the interneuron population, b₆, so we took two estimates: b₆ = 6 for the low estimate, and b₆ = 22.7 for the high estimate.

Comparing our data to the values predicted by our model, we minimized the sum of squared errors between the two and found a healthy solution for both the low and the high estimates of b₆. The outputs for the low b₆ are shown below:

Healthy Solution:

w₁₂ = 1.520384442	w₁₆ = 1.621278311	w₂₃ = 0.4962387866	w₃₂ = 1.117631687
w₃₄ = 0.1540248925	w₄₂ = 1.217895798	w₄₃ = 0.0672671083	w₄₅ = 1.542582263
w₄₆ = 9.049109867	w₅₂ = 4.5350845	w₅₆ = 0.3330689302	w₆₂ = 7.127373038

The firing rate outputs using these weights are shown below.

Here, all firing rates tend toward a specific value as time increases, so they are stable solutions.

Parkinsonian Solution:

w₁₂ = 1.520384442	w₁₆ = 1.621278311	w₂₃ = 0.8691494663	w₃₂ = 0.5043792396
w₃₄ = 0.1540248925	w₄₂ = 1.217895798	w₄₃ = 0.0672671083	w₄₅ = 1.542582263
w₄₆ = 9.049109867	w₅₂ = 4.5350845	w₅₆ = 0.3330689302	w₆₂ = 7.127373038

The firing rate outputs using these weights are shown below.

Here, several firing rates oscillate, indicating a limit cycle solution.

Weight Space

We were interested in the role of the thalamus in parkinsonian dysfunction, so we explored the relationship between w₂₃ and w₃₂, which represent the excitatory and inhibitory connections between TC and CT5. Forcing all corresponding weights to be equal in the healthy and parkinsonian solutions except w₂₃ and w₃₂, we found a healthy solution that could be forced into a parkinsonian state by altering only w₂₃ and w₃₂. In a parkinsonian solution, at least one of the Routh-Hurwitz stability conditions must be broken. We examined which condition or combination of conditions is broken when the system moves from a healthy to a parkinsonian state for different values of w₂₃ and w₃₂. The region plot for the low b₆ is shown below.

Conclusions and Future Directions

Our model can represent the average firing rates of healthy and parkinsonian states in the thalamocortical motor circuit.
We established stability and steady state conditions for the system to be healthy or parkinsonian.
We have found multiple sets of weights that both satisfy the conditions and match the neuronal firing patterns in our in-vivo primate dataset.
We discovered that changing only the connection strength between TC and CT5 can force the system from a healthy to a parkinsonian state.
Next steps:

Examine the transition from both healthy to parkinsonian and parkinsonian to healthy.
Explore methods of biologically validating our model by pharmacologically manipulating the weights of motor circuit network connections.

More About the Team

Carly Ferrell is a rising senior at Mississippi State University majoring in mathematics and minoring in statistics and music with a concentration in voice. She is interested in utilizing her skills in applied mathematics and statistics to research music, specifically music theory and sight singing. Outside class, she enjoys reading, dancing, singing, and composing music.

Qile Jiang is a rising junior at Brown University majoring in Applied Mathematics. His primary research area is in applied dynamical systems, but he also has a keen interest in pure math topics such as algebra. Outside of school, he spends his time training boxing, painting, and going to operas and classical concerts.

Margaret Olivia Leu is a junior at Pomona College double majoring in mathematics and politics. She is interested in working on ways to use mathematics as a tool in the fields of politics and social justice work, and hopes to pursue a career that combines these two interests. Outside academics, she enjoys crocheting, cooking, and listening to music.

References

Sveinbjornsdottir, S. (2016). The clinical symptoms of Parkinson’s disease. Journal of Neurochemistry, 139(1), 318-324. https://doi.org/10.1111/jnc.13691.
DeLong, M. R., & Wichmann, T. (2007). Circuits and circuit disorders of the basal ganglia. Archives of Neurology, 64(1), 20–24. https://doi.org/10.1001/archneur.64.1.20.
Alexander, G. E., DeLong, M.R., & Strick, P.L. (1986). Parallel Organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9(1), 357-381. https://doi.org/10.1146/annurev.ne.09.030186.002041
Galvan, A., Devergnas, A., & Wichmann, T. (2015). Alterations in neuronal activity in basal ganglia-thalamocortical circuits in the parkinsonian state. Frontiers in Neuroanatomy, 9, 5. https://doi.org/10.3389/fnana.2015.00005.
Caiola, M., & Holmes, M. H. (2019). Model and analysis for the onset of parkinsonian firing patterns in a simplified basal ganglia. International Journal of Neural Systems, 29(1). https://doi.org/10.1142/S0129065718500211.
de Lau, L. M. L., & Breteler, M. M. B. (2006). Epidemiology of Parkinson’s disease. The Lancet Neurology, 5(6), 525-535. https://doi.org/10.1016/S1474-4422(06)70471-9.
Rall, W. (1955). Experimental monosynaptic input-output relations in the mammalian spinal cord. Journal of Cellular and Comparative Physiology, 46(3), 413–437. https://doi.org/10.1002/jcp.1030460303
Wilson, C. J., & Bevan, M. D. (2011). Intrinsic dynamics and synaptic inputs control the activity patterns of subthalamic nucleus neurons in health and in Parkinson’s disease. Neuroscience, 198, 54–68. https://doi.org/10.1016/j.neuroscience.2011.06.049
Nambu, A., & Llinás, R. (1994). Electrophysiology of globus pallidus neurons in vitro. Journal of Neurophysiology, 72(3), 1127–1139. https://doi.org/10.1152/jn.1994.72.3.1127
Kita, H., Tachibana, Y., Nambu, A., & Chiken, S. (2005). Balance of Monosynaptic Excitatory and Disynaptic Inhibitory Responses of the Globus Pallidus Induced after Stimulation of the Subthalamic Nucleus in the Monkey. Journal of Neuroscience, 25(38), 8611–8619. https://doi.org/10.1523/JNEUROSCI.1719-05.2005
Hikosaka, O. (2007). GABAergic output of the basal ganglia. Progress in Brain Research, 160, 209–226. https://doi.org/10.1016/S0079-6123(06)60012-5
Wichmann, T., Bergman, H., Starr, P. A., Subramanian, T., Watts, R. L., & DeLong, M. R. (1999). Comparison of MPTP-induced changes in spontaneous neuronal discharge in the internal pallidal segment and in the substantia nigra pars reticulata in primates. Experimental Brain Research, 125(4), 397–409. https://doi.org/10.1007/s002210050696
Bergman, H., Wichmann, T., Karmon, B., & DeLong, M. R. (1994). The primate subthalamic nucleus. II. Neuronal activity in the MPTP model of parkinsonism. Journal of Neurophysiology, 72(2), 507–520. https://doi.org/10.1152/jn.1994.72.2.507
Hashimoto, T., Elder, C. M., Okun, M. S., Patrick, S. K., & Vitek, J. L. (2003). Stimulation of the Subthalamic Nucleus Changes the Firing Pattern of Pallidal Neurons. The Journal of Neuroscience, 23(5), 1916–1923. https://doi.org/10.1523/JNEUROSCI.23-05-01916.2003
Nakanishi, H., Tamura, A., Kawai, K., & Yamamoto, K. (1997). Electrophysiological studies of rat substantia nigra neurons in an in vitro slice preparation after middle cerebral artery occlusion. Neuroscience, 77(4), 1021–1028. https://doi.org/10.1016/s0306-4522(96)00555-6
Nambu, A. (2007). Globus pallidus internal segment. Progress in Brain Research, 160, 135–150. https://doi.org/10.1016/S0079-6123(06)60008-3
van Albada, S. J., & Robinson, P. A. (2009). Mean-field modeling of the basal ganglia-thalamocortical system. I Firing rates in healthy and parkinsonian states. Journal of Theoretical Biology, 257(4), 642–663. https://doi.org/10.1016/j.jtbi.2008.12.018
Opris, I., Hampson, R. E., Stanford, T. R., Gerhardt, G. A., & Deadwyler, S. A. (2011). Neural Activity in Frontal Cortical Cell Layers: Evidence for Columnar Sensorimotor Processing. Journal of Cognitive Neuroscience, 23(6), 1507–1521. https://doi.org/10.1162/jocn.2010.21534
Ison, M. J., Mormann, F., Cerf, M., Koch, C., Fried, I., & Quiroga, R. Q. (2011). Selectivity of pyramidal cells and interneurons in the human medial temporal lobe. Journal of Neurophysiology, 106(4), 1713–1721. https://doi.org/10.1152/jn.00576.2010
Putrino, D. F., Chen, Z., Ghosh, S., & Brown, E. N. (2011). Motor Cortical Networks for Skilled Movements Have Dynamic Properties That Are Related to Accurate Reaching. Neural Plasticity, 2011, 1–15. https://doi.org/10.1155/2011/413543

Shallow vs. Deep Brain Network Models for Mental Disorder Analysis

Mon, 27 Jun 2022 11:36:49 -0400

This post was written by Erica Choi, Sally Smith, and Ethan Young and published with minor edits. The team was advised by Professor Carl Yang. In addition to this post and the paper, the team has also created slides for a midterm presentation, a poster blitz video, and a poster. This work was accepted at the BrainNN workshop at IEEE BigData 2022 and slides are available here.

Comparing Shallow vs. Deep Brain Network Models

Our shallow models use graph kernels to compare structural similarity of brain network data. Plugging those kernels into support vector machines allows us to classify patients’ brain scans. These kernel methods are called “shallow” because they do not require many layers of computation, unlike their “deep” model counterparts. Our deep models are graph neural networks, machine learning (ML) models that can exploit the local information of nodes in graph data to perform classification. In this project, our models are classifying brain scans as either diseased or healthy. We are comparing the two types of models, shallow and deep, to further determine which might be more useful in analyzing neuroimaging data, as well as working on using the models in conjunction with one another to leverage the strengths of both.

For useful background and definitions refer to Preliminaries.

Datasets

We are working with 2 datasets, one documenting human immunodeficiency virus (HIV) patients and one documenting bipolar disorder (BP) patients. Each dataset consists of functional magnetic resonance imaging (fMRI) scans, diffusion tensor imaging (DTI) scans, and classification labels in the form of integers, where 1 indicates a healthy patient and -1 indicates an unhealthy patient. Both datasets have been processed for us, as detailed in Section 3 of the paper authored by Cui et al.

Problem Formulation

The DTI and fMRI brain scans of each patient $i$ are represented as weighted adjacency matrices $\mathbf{W}_i \in \mathbb{R}^{M \times M}$. The adjacency matrix is constructed from the brain scan and is a natural way of mathematically representing graph data. Nodes in the brain network represent regions of interest (ROIs), and edge links between nodes indicate the strength of the connection between the two regions. In general, fMRI scans are considered to be more robust than DTI scans; specifically, fMRI scans are less affected by noise caused by data collection. Thus, our experiments prioritize working with the fMRI scans.

Classification Task

The standard graph classification task considers the problem of classifying graphs into two or more categories. The goal is to learn a model that maps graphs in the set of graphs $G$ to a set of labels $Y$. In this project, our set of graphs $G$ is the set of brain scans from patients and our set of labels $Y$ consists of two labels: diseased and healthy. The goal of our models is to classify brain scans accurately and improve model interpretability.
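As a compact restatement of the task above, the following uses the label convention from the Datasets section ($+1$ healthy, $-1$ unhealthy). The symbols $f$, $N$ (number of patients), and $y_i$ (the label of patient $i$) are notation introduced here for illustration and do not appear in the original write-up.

$$
f : G \to Y, \qquad Y = \{+1, -1\}, \qquad \mathrm{acc}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ f(\mathbf{W}_i) = y_i \right]
$$

Here $\mathrm{acc}(f)$ is the fraction of brain scans that the model $f$ labels correctly.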

Implementation

To implement support vector machines (SVMs) with graph kernels, we used threshold rounding to remove edge weights and sparsify the adjacency matrices. That is, the entries of each adjacency matrix were rounded against a threshold, which yields simpler, unweighted matrices. While this results in information loss, it preserves the overall structure of the adjacency matrices, makes them usable for this particular method, and makes the computation less expensive. Further manipulation creates a list of graph objects that are compatible with the Python package GraKeL.
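To make the idea of threshold rounding concrete, here is a minimal pure-Python sketch. It is not our actual preprocessing code: the function names, the cutoff of 0.5, the use of absolute values, and the decision to drop self-loops are all assumptions made for illustration, and the matrix is a made-up toy example.

```python
def threshold_round(weights, threshold=0.5):
    """Binarize a weighted adjacency matrix: 1 where |w| >= threshold, else 0.

    The default threshold and the use of |w| are illustrative assumptions.
    """
    n = len(weights)
    binary = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Drop self-loops and keep only sufficiently strong connections.
            if i != j and abs(weights[i][j]) >= threshold:
                binary[i][j] = 1
    return binary


def edge_list(binary):
    """List each undirected edge (i, j) with i < j exactly once."""
    n = len(binary)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if binary[i][j]]


# Toy symmetric matrix over 4 ROIs (made-up values, not from the HIV or BP data).
W = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.6, 0.2],
    [0.1, 0.6, 1.0, 0.7],
    [0.0, 0.2, 0.7, 1.0],
]
A = threshold_round(W, threshold=0.5)
edges = edge_list(A)
print(edges)
```

The resulting unweighted matrix or edge list is the kind of object that can then be wrapped into whatever graph type a kernel library expects.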

For implementation of graph convolutional networks (GCNs), we followed BrainGB’s code to create a data type that can be used with the Python package PyG.

To implement kernel graph neural networks (KerGNNs), we followed KerGNN’s code and applied threshold rounding before running experiments. The motivation for threshold rounding is the same as for the SVM implementation.

Methods

1. Graph Kernels

Fig.1 - Support Vector Machines with Kernels

We computed three kernels to plug into SVM: Weisfeiler-Lehman (WL), Weisfeiler-Lehman Optimal Assignment (WLOA), and propagation (Prop). The choice of these kernels is motivated by exploiting structural information (i.e., subgraphs) in the brain networks. We tested these graph kernels to find which ones were most effective. On average, the propagation kernel classified HIV best and the WLOA kernel classified bipolar disorder best.
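For readers unfamiliar with how a graph kernel measures structural similarity, below is a small pure-Python sketch of the Weisfeiler-Lehman subtree kernel idea. It is an illustration only, not the GraKeL implementation we used: the function names are hypothetical, the number of iterations is arbitrary, and using node degree as the initial node label is an assumption.

```python
from collections import Counter


def degree_labels(adjacency):
    """Initial node labels; using the degree is an illustrative assumption."""
    return [sum(row) for row in adjacency]


def wl_features(adjacency, labels, iterations=2):
    """Count the node labels seen over `iterations` rounds of WL relabeling."""
    counts = Counter(labels)
    current = list(labels)
    for _ in range(iterations):
        refined = []
        for node, row in enumerate(adjacency):
            neighbor_labels = sorted(current[j] for j, a in enumerate(row) if a)
            # New label = own label plus the sorted multiset of neighbor labels.
            refined.append((current[node], tuple(neighbor_labels)))
        current = refined
        counts.update(current)
    return counts


def wl_kernel(graph_a, graph_b, iterations=2):
    """WL subtree kernel value: dot product of the two label-count vectors."""
    fa = wl_features(*graph_a, iterations)
    fb = wl_features(*graph_b, iterations)
    return sum(count * fb[label] for label, count in fa.items())


# Two toy unweighted graphs: a 3-node path and a triangle.
path_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
tri_adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = (path_adj, degree_labels(path_adj))
triangle = (tri_adj, degree_labels(tri_adj))

print(wl_kernel(path, path), wl_kernel(triangle, triangle), wl_kernel(path, triangle))
```

Each graph is more similar to itself than to the other, as expected. Evaluating such a kernel on every pair of graphs gives a Gram matrix, which is what an SVM with a precomputed kernel consumes.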

2. Graph Convolutional Networks (GCNs)

Fig.2 - BrainGB Framework

The deep models that we experimented with in this project are graph convolutional networks (GCNs). They count as “deep” because GCNs fall under the umbrella of “deep learning”: they are modern ML algorithms that pass information through several layers and do more extensive computations than shallow models. Note, however, that our GCNs (and GNNs in general) are shallow in the sense that they have few layers. Machine learning is still a very active field of research, and recent interest in graph data has led to major strides in graph-based ML.

We implement message passing GNNs (MPGNNs), a type of GCN, using the BrainGB Python package, which is built on the PyTorch and PyTorch Geometric libraries. MPGNNs use message passing schemes to aggregate information from a node’s neighbors. Figure 2, adapted from Cui et al., visualizes the MPGNN architecture.
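As a rough picture of what "aggregating information from a node’s neighbors" means, here is a single message passing round written in plain Python. This is a generic sketch, not BrainGB’s code or any of the specific schemes we tested: the function name is hypothetical, the mean aggregation and additive update are assumptions, and a real layer would also apply learned weights and a nonlinearity.

```python
def message_passing_step(adjacency, features):
    """One round of mean-aggregation message passing (no learned weights)."""
    updated = []
    for node, row in enumerate(adjacency):
        neighbors = [j for j, a in enumerate(row) if a]
        dim = len(features[node])
        if neighbors:
            # Message: element-wise mean of the neighbors' feature vectors.
            message = [
                sum(features[j][d] for j in neighbors) / len(neighbors)
                for d in range(dim)
            ]
        else:
            # Isolated nodes receive an all-zero message.
            message = [0.0] * dim
        # Update: add the aggregated message to the node's own features.
        updated.append([own + msg for own, msg in zip(features[node], message)])
    return updated


# Toy 3-node path graph with one scalar feature per node.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [2.0], [3.0]]
new_features = message_passing_step(adjacency, features)
print(new_features)
```

Stacking several such rounds lets information from nodes that are further away reach each node, which is the sense in which these models exploit local structure.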

3. Merging Graph Kernels and GNNs

To leverage the higher-order structural information given by graph kernels and the local information given by GCNs, we implement GNNs that incorporate various graph kernels (WL, WLOA, etc.) and benchmark their performance on our datasets. The frameworks of particular interest to us are:

the graph kernel convolution (GKC) layer proposed by Cosmo et al., visualized in Figure 3, and
the kernel graph neural network (KerGNN) proposed by Feng et al., visualized in Figure 5.

Fig.3 - GKNN Framework

Fig.5 - KerGNN Framework

Results

Our most successful model was a graph attention network (GAT) model that used a node concatenation message passing scheme. This model was able to classify HIV patients as healthy or diseased with 81% accuracy on average. In general, our highest performing models were classifying HIV data, particularly using deep models.

Our highest performing model for bipolar disorder prediction used support vector classifiers (SVCs) with propagation WLOA kernels and had an average accuracy of 63%. The differences in performance between kernels are minor; furthermore, the mean performance of every kernel had a high standard deviation.

Our preliminary results from combining kernel methods and GNNs do not yet outperform our HIV-GAT(node) model, but we are seeing some improvements in classifying bipolar disorder with the hybrid models, particularly with KerGNN.

Discussion

In general, we found that our models were better able to classify HIV patients than BP patients. Cui et al. observe that HIV affects both the visual network (VN) and default mode network (DMN), while bipolar disorder mainly affects the bilateral limbic network (BLN). It is possible that HIV was easier to model because it significantly affects multiple networks in the brain, while BP was more elusive, with only one major network significantly affected.

For more details and discussion of our results, see our manuscript (coming soon).

Limitations

Because our datasets each consist of fewer than 100 patients, our results may not generalize well beyond these specific datasets. If this study were to be replicated, a larger dataset would be ideal, but the expense of brain imaging data and its processing requirements will limit any study that uses it to some degree. Additionally, brain imaging data is a highly protected data type due to the right to privacy of the patients whose brain scans are used in these experiments. This means that much of the information about the patients is kept private, so it can be challenging to find confounding variables or alternative explanations for statistical results from this data.

Another notable limitation is the structure of the brain networks themselves. Specifically, it remains unclear which subgraphs and higher-order information are relevant in classifying brain scans as belonging to diseased or healthy individuals. GNNs also have limitations. For example, GNNs are prone to overfitting, especially with datasets as small as our own. This is an issue that could potentially be alleviated with access to a larger dataset.

Future Work

There are many avenues along which future research in brain network classification could proceed. In particular, there are many ways of incorporating graph kernels into GNNs that improve the interpretability of the model, which in turn gives insight into the key underlying structures that help to classify brain networks.