MetabolismGraph/model.qmd at main · allierc/MetabolismGraph · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
---
title: "Model: Metabolic Network Dynamics"
---

## Metabolic Networks as Bipartite Graphs

A metabolic network can be represented as a **bipartite graph** connecting metabolites to reactions:

```{mermaid}
%%| fig-width: 9
%%{init: {'theme': 'neutral'}}%%
flowchart LR
    subgraph met["Metabolites"]
        direction TB
        A((Glucose))
        B((ATP))
        C((Pyruvate))
        D((ADP))
    end

    subgraph rxn["Reactions"]
        direction TB
        R1["R₁ · k₁"]
        R2["R₂ · k₂"]
    end

    A -->|"−1"| R1
    B -->|"−1"| R1
    R1 -->|"+2"| C
    R1 -->|"+1"| D

    C -->|"−1"| R2
    R2 -->|"+1"| B

    style A fill:#e1f5fe,stroke:#0277bd
    style B fill:#e1f5fe,stroke:#0277bd
    style C fill:#e1f5fe,stroke:#0277bd
    style D fill:#e1f5fe,stroke:#0277bd
    style R1 fill:#fff3e0,stroke:#ef6c00
    style R2 fill:#fff3e0,stroke:#ef6c00
    style met fill:none,stroke:#0277bd,stroke-dasharray: 5 5
    style rxn fill:none,stroke:#ef6c00,stroke-dasharray: 5 5
```

Each edge has a **stoichiometric coefficient**:

- **Negative** coefficients: substrates (consumed by the reaction)
- **Positive** coefficients: products (produced by the reaction)

## The Stoichiometric Matrix

The stoichiometric matrix $\mathbf{S}$ is an $(n_{\text{metabolites}} \times n_{\text{reactions}})$ matrix where entry $S_{ij}$ indicates how metabolite $i$ participates in reaction $j$:

$$
\mathbf{S} = \begin{pmatrix}
-1 & 0 & \cdots \\
-1 & +1 & \cdots \\
+2 & -1 & \cdots \\
+1 & 0 & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}
$$

**Properties of S:**

- **Sparse**: most entries are zero (each reaction involves only 2-6 metabolites)
- **Integer-valued**: entries are typically in $\{-2, -1, 0, +1, +2\}$
- **Mass conservation**: column sums should be zero for balanced reactions

**Example:**

Consider the diagram above with 2 reactions:

$$
\begin{aligned}
\text{R1}: \quad & \text{Glucose} + \text{ATP} \longrightarrow 2\,\text{Pyruvate} + \text{ADP} \\
\text{R2}: \quad & \text{Pyruvate} \longrightarrow \text{ATP}
\end{aligned}
$$

The corresponding stoichiometric matrix is:

$$
\begin{array}{c|cc}
 & \text{R1} & \text{R2} \\
\hline
\text{Glucose} & -1 & 0 \\
\text{ATP} & -1 & +1 \\
\text{Pyruvate} & +2 & -1 \\
\text{ADP} & +1 & 0
\end{array}
$$

## Reaction Kinetics

### Mass-Action Law

The rate of a chemical reaction is proportional to the product of the concentrations of its substrates, each raised to the power of its stoichiometric coefficient. For a reaction $j$ with substrates $k$:

$$v_j = k_j \cdot \prod_{k \in \text{sub}(j)} c_k^{|S_{kj}|}$$

This is the **law of mass action** (Guldberg & Waage, 1864): a reaction proceeds faster when its substrates are more abundant. The exponent $|S_{kj}|$ reflects how many molecules of substrate $k$ are consumed — a reaction requiring 2 molecules of ATP depends quadratically on ATP concentration ($c_{\text{ATP}}^2$), while a reaction consuming 1 molecule depends linearly ($c_{\text{ATP}}^1$).

### Substrate Aggregation

Each reaction consumes multiple substrates. The **aggregation rule** determines how substrate contributions combine to form the reaction rate:

| Aggregation | Formula | Physical meaning |
|-------------|---------|-----------------|
| **Multiplicative** (product) | $v_j = k_j \cdot \prod_k c_k^{\|S_{kj}\|}$ | True mass-action: all substrates must be present simultaneously. Rate drops to zero if any substrate is absent. |
| **Additive** (sum) | $v_j = k_j \cdot \sum_k c_k^{\|S_{kj}\|}$ | Approximate: substrates contribute independently. Useful when reactions are not strictly mass-action (e.g., enzyme-mediated). |

**Multiplicative aggregation** is the physically correct form for mass-action kinetics — it encodes the requirement that a reaction can only proceed if *all* its substrates are available. It also produces richer dynamics: multiplicative coupling between metabolites creates nonlinear feedback loops that can sustain oscillations.

**Additive aggregation** is a simplification where each substrate contributes independently to the rate. It cannot produce sustained oscillations from autocatalytic cycles because it lacks the multiplicative coupling needed for positive feedback.

## Model Evolution

### 1. Pure Reaction

Pure reaction dynamics without homeostasis, using **additive aggregation**:

$$\frac{dc_i}{dt} = \sum_j S_{ij} \cdot k_j \cdot \sum_{k \in \text{sub}(j)} c_k^{|S_{kj}|}$$

| Parameter | Value |
|-----------|-------|
| n_metabolites | 100 |
| n_reactions | 64 |
| Stoichiometry | Random |
| Aggregation | Sum (additive) |
| $\lambda$ (homeostatic strength) | 0.0 |
| Rate constants $k_j$ | $[10^{-3}, 10^{-1}]$ |
| Initial concentrations | [2.5, 7.5] |
| Flux limiting | Enabled |

![Pure reaction dynamics: concentrations evolve under additive mass-action kinetics without homeostasis. Most metabolites equilibrate within ~100 time steps.](graphs_data/simulation_pure/concentrations.png){width=100%}

### 2. Homeostasis

With homeostatic regulation pulling concentrations toward baseline, using **additive aggregation**:

$$\frac{dc_i}{dt} = \underbrace{-\lambda_i \cdot (c_i - c_i^{\text{baseline}})}_{\text{homeostasis}} + \underbrace{\sum_j S_{ij} \cdot k_j \cdot \sum_{k \in \text{sub}(j)} c_k^{|S_{kj}|}}_{\text{reactions}}$$

| Parameter | Value |
|-----------|-------|
| n_metabolites | 100 |
| n_reactions | 64 |
| Stoichiometry | Random |
| Aggregation | Sum (additive) |
| $\lambda$ (homeostatic strength) | 0.01 |
| $c^{\text{baseline}}$ | 5.0 |
| Rate constants $k_j$ | $[10^{-3}, 10^{-1}]$ |
| Initial concentrations | [2.5, 7.5] |
| Flux limiting | Enabled |

![Homeostatic regulation: concentrations are pulled toward a baseline ($c^{\text{baseline}} = 5.0$) by a linear restoring force. Dynamics converge to steady state.](graphs_data/simulation_homeostatic/concentrations.png){width=100%}

### 3. Oscillatory

Autocatalytic cycles with **mass-action kinetics** (multiplicative aggregation) for sustained oscillations:

$$\frac{dc_i}{dt} = \sum_j S_{ij} \cdot k_j \cdot \prod_{k \in \text{sub}(j)} c_k^{|S_{kj}|}$$

With multiplicative aggregation, autocatalytic cycles create positive feedback: in $A + B \to 2B$, the rate $v = k \cdot c_A \cdot c_B$ increases with both $A$ and $B$, so producing more $B$ accelerates the reaction — a nonlinear feedback loop that sustains oscillations when cycles are closed ($A \to B \to C \to A$).

| Parameter | Value |
|-----------|-------|
| n_metabolites | 100 |
| n_reactions | 256 |
| Stoichiometry | 100% autocatalytic 3-cycles |
| Aggregation | Product (multiplicative) |
| $\lambda$ (homeostatic strength) | 0.0 |
| Rate constants $k_j$ | $[10^{-2.5}, 10^{-1}]$ |
| Initial concentrations | [1.0, 9.0] |
| Flux limiting | Disabled |

![Oscillatory dynamics (activity rank 20): autocatalytic 3-cycles with multiplicative aggregation produce sustained oscillations. The wide rate constant range $[10^{-2.5}, 10^{-1}]$ means ~40% of reactions have $k < 0.01$ and contribute negligibly.](graphs_data/simulation_oscillatory_rank_20/concentrations.png){width=100%}

## Activity Rank

To quantify the complexity of concentration dynamics, we compute the **activity rank** using singular value decomposition (SVD) of the concentration matrix $\mathbf{C} \in \mathbb{R}^{T \times n}$ (time frames × metabolites).

The **rank at 99% variance** is the number of singular values needed to capture 99% of the total variance:

$$\text{rank}_{99} = \min \left\{ k : \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{n} \sigma_i^2} \geq 0.99 \right\}$$

**Interpretation:**

- **Low rank** (1-5): concentrations are highly correlated, dynamics are simple (equilibration or uniform decay)
- **High rank** (>20): metabolites evolve independently with rich, complex dynamics

### Increasing Activity Rank

With the baseline oscillatory config ($k_j \in [10^{-2.5}, 10^{-1}]$, 256 reactions), the activity rank is **20**. Activity rank is controlled by the **number of actively contributing reactions per metabolite**:

| Config | Reactions | $k_j$ range | Activity Rank |
|--------|:---------:|:-----------:|:---:|
| Baseline (rank_20) | 256 | $[10^{-2.5}, 10^{-1}]$ | **20** |
| Narrow $k$ (rank_50) | 256 | $[10^{-2.0}, 10^{-1}]$ | **47** |
| More reactions (rank_70) | 512 | $[10^{-2.0}, 10^{-1}]$ | **70** |

The most impactful change is **narrowing the rate constant range** from $[10^{-2.5}, 10^{-1}]$ to $[10^{-2.0}, 10^{-1}]$. With the wider range, ~40% of reactions have $k < 0.01$ and contribute negligibly. Removing these inert reactions more than doubles the activity rank (20 → 47). Doubling the number of reactions further increases overlap per metabolite (47 → 70).

### GNN Training Dataset

For GNN training and rate constant recovery, we use the **rank_50** config (256 reactions, activity rank 47) — complex enough to test the inverse problem while keeping it tractable:

| Parameter | Value |
|-----------|-------|
| n_metabolites | 100 |
| n_reactions | 256 |
| Stoichiometry | 100% autocatalytic 3-cycles |
| Rate constants $k_j$ | $[10^{-2.0}, 10^{-1}]$ |
| All other parameters | Same as oscillatory above |

![Oscillatory dynamics with narrowed rate constants (activity rank 47). Eliminating slow inert reactions produces richer dynamics. This is the primary dataset for GNN training.](graphs_data/simulation_oscillatory_rank_50/concentrations.png){width=100%}

## Summary: The Full Model

The complete metabolic dynamics:

$$
\frac{dc_i}{dt} = \underbrace{-\lambda_i \cdot (c_i - c_i^{\text{baseline}})}_{\text{homeostasis}} + \underbrace{\sum_{j=1}^{m} S_{ij} \cdot v_j}_{\text{reaction dynamics}}
$$

where the reaction rate follows mass-action kinetics:

$$
v_j = k_j \cdot \prod_{k \in \text{sub}(j)} c_k^{|S_{kj}|}
$$

## The Inverse Problem

Given observed concentration time series $\mathbf{C}(t)$ and the bipartite graph structure, the goal is to recover the model components that generated the dynamics. A **graph neural network** (GNN) operates on the bipartite metabolite–reaction graph and learns **two functions and a set of scalar parameters**:

### What the GNN Learns

The forward model is:

$$
\frac{dc_i}{dt} = \underbrace{\text{MLP}_{\text{node}}(c_i, a_i)}_{\text{learns } -\lambda_i(c_i - c_i^{\text{baseline}})} + \sum_{j=1}^{m} S_{ij} \cdot \underbrace{k_j \cdot \prod_{k \in \text{sub}(j)} \text{MLP}_{\text{sub}}(c_k, |S_{kj}|)}_{\text{learns } k_j \text{ and } c_k^{|S_{kj}|}}
$$

The GNN replaces the known functions with learnable MLPs, and the rate constants with learnable parameters. It must recover all three simultaneously from concentration data alone:

**1. Substrate function** $f_{\text{sub}} \to \text{MLP}_{\text{sub}}(c_k, |S_{kj}|)$

A neural network that learns how each substrate concentration contributes to the reaction rate. The ground-truth function is the mass-action power law $c_k^{|S_{kj}|}$, but the GNN does not know this — it must discover the functional form from data:

$$
\text{MLP}_{\text{sub}}(c_k, |S_{kj}|) \;\overset{?}{\approx}\; c_k^{|S_{kj}|}
$$

This is a **function discovery** problem: the MLP receives concentration and stoichiometric coefficient as inputs and must learn to output the correct power-law relationship. Since the stoichiometric coefficients are typically 1 or 2, the MLP must learn to distinguish between linear ($c^1$) and quadratic ($c^2$) dependencies.

**2. Homeostasis function** $f_{\text{node}} \to \text{MLP}_{\text{node}}(c_i)$

A neural network that learns the per-metabolite self-regulation term. The ground truth is a linear function $-\lambda_{\text{type}(i)} (c_i - c_i^{\text{baseline}})$ with small magnitude, but the GNN must discover this from data:

$$
\text{MLP}_{\text{node}}(c_i) \;\overset{?}{\approx}\; -\lambda_{\text{type}(i)} \cdot (c_i - c_i^{\text{baseline}})
$$

This function captures how each metabolite is regulated independently of reactions — pulling concentrations back toward a baseline level. The challenge is that homeostatic forces are small compared to reaction rates, so $\text{MLP}_{\text{node}}$ must learn a subtle signal without absorbing information that belongs to the reaction terms.

**3. Rate constants** $k_j$ **(256 learnable scalars)**

Per-reaction rate constants learned in log-space. Unlike the MLPs, these are not functions but a vector of 256 scalar parameters — one per reaction — that scale the reaction fluxes.

### Identifiability Challenges

The three components interact and can compensate for each other:

- **Scale ambiguity**: $k_j \cdot \text{MLP}_{\text{sub}}$ is invariant under $k \to \alpha k$, $\text{MLP}_{\text{sub}} \to \text{MLP}_{\text{sub}} / \alpha$. Without anchoring, the MLP can absorb a global scale factor and shift all $k$ values. Regularization ($\texttt{coeff\_k\_center}$) anchors $\text{mean}(\log k)$ to the known range center.

- **Function compensation**: If $\text{MLP}_{\text{sub}}$ learns a wrong functional form (e.g., $c^{1.5}$ instead of $c^2$), the rate constants can partially compensate by adjusting their values. This leads to degenerate solutions with high prediction accuracy but poor parameter recovery.

- **Homeostasis absorption**: $\text{MLP}_{\text{node}}$ can grow large and absorb dynamics that should be explained by the reaction terms, masking the true rate constants.

### Learning Modes

| Mode | S matrix | Primary metric | Challenge |
|------|----------|----------------|-----------|
| **S learning** | Learnable | stoichiometry $R^2$ | Recovering integer coefficients and sparsity |
| **S given** | Frozen from GT | rate constants $R^2$ | Disentangling $k$, $\text{MLP}_{\text{sub}}$, $\text{MLP}_{\text{node}}$ |