Designing lipid nanoparticles using a transformer-based neural network


COMET details

This section describes the model architecture and training algorithms of COMET. Pseudocode for inference is provided in Algorithm S1.

COMET model architecture

Lipid molecular structures are encoded into high-dimensional vectors (molecular embeddings), whereas scalar compositional features are encoded using a Gaussian-based encoder53. Continuous formulation-wide parameters (for example, N/P ratio and volumetric mix ratio) are encoded with Gaussian layers; categorical inputs use one-hot embeddings.

The transformer uses a [CLS] token to aggregate input features across multiple attention layers. For multitask learning, each cell type is assigned a separate [CLS] token and prediction head, enabling task-specific outputs while sharing LNP-level representation learning.

Molecular encoder

COMET is compatible with various molecular encoders; here we use Uni-Mol11, pretrained to recover masked atom types and corrupted three-dimensional coordinates. It offers strong property prediction performance and is used with default hyperparameters (from https://github.com/dptech-corp/Uni-Mol/tree/main/unimol). Pretrained weights are frozen during COMET training. Each compound is encoded into a 512-dimensional vector using atom types and coordinates.

Lipid molar percentages are encoded into 128-dimensional vectors using a shared Gaussian layer. Each component is further assigned a 128-dimensional one-hot embedding (\({z}_{k}^{\mathrm{type}}\)) to distinguish lipid classes. These are concatenated and projected via a two-layer MLP into a 256-dimensional component representation.

N/P ratio and volumetric ratio

N/P ratio is encoded using a separate 256-dimensional Gaussian layer (zN/P). Aqueous/organic ratios, treated as categorical variables, are one-hot encoded (zphase) with 256 dimensions.

CLS token and prediction head

Each cell type uses a learned [CLS] token (zCLS) of dimension 256. These tokens aggregate component and formulation-wide token representations across Nblock transformer layers via attention54. Final predictions are made by passing the [CLS] token through a two-layer MLP (MLPpredict).

Transformer blocks

Each block follows a Pre-LayerNorm structure55 composed of layernorm → self-attention → MLP with residual connections.

Training details

The model is trained with a binary ranking objective56 where, given a pair of LNP samples, the model learns to predict a larger efficacy score for the LNP that has the higher efficacy label value than the other LNP:

$${\mathcal{L}}_{\mathrm{rank}}=-\log \left(\sigma \left({f}_{\theta }({x}_{\mathrm{h}})-{f}_{\theta }({x}_{\mathrm{l}})\right)\right)$$

(1)

where σ is the sigmoid function, xh and xl are the high- and low-efficacy LNPs and fθ is COMET’s scoring function. Training uses a batch size of 64 (2,016 pairwise comparisons per batch).
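As an illustration of equation (1), the following minimal sketch computes the pairwise ranking loss for a batch of scalar scores in plain Python. The function name, the averaging over pairs and the skipping of tied labels are assumptions made for this example and are not taken from the COMET code.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores, labels):
    """Mean of -log(sigmoid(f(x_h) - f(x_l))) over all pairs in a batch."""
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # assumption: tied labels carry no ranking signal
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        diff = scores[hi] - scores[lo]
        # -log(sigmoid(d)) = log(1 + exp(-d)), written in a stable form
        losses.append(max(-diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    # assumption: average over pairs; return 0 if every label is tied
    return sum(losses) / len(losses) if losses else 0.0


# A batch of 64 samples yields 64 * 63 / 2 unordered pairs.
pairs_per_batch = math.comb(64, 2)
```

With a batch size of 64, `pairs_per_batch` reproduces the 2,016 pairwise comparisons quoted above.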

Conflict-averse gradient descent

Conflict-averse gradient descent (CAGrad)33 mitigates conflicting gradients in multitask settings. We apply CAGrad with a coefficient of 0.2 to stabilize training across tasks.

Noise augmentation

To address noise in the experimental data, especially in the fluid handling process, we augment each molar percentage with Gaussian noise proportional to its value, where the standard deviation of the noise is 10% of the actual molar percentage.
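A minimal sketch of this augmentation is given below. The function and parameter names are hypothetical, and the text does not say whether the noisy values are clipped or renormalized afterwards, so neither is done here.

```python
import random


def augment_molar_percentages(molar_pct, rel_std=0.10, rng=None):
    """Add zero-mean Gaussian noise whose std is rel_std times each value."""
    rng = rng or random.Random()
    # A component at 0% receives noise with std 0 and therefore stays at 0%.
    return [p + rng.gauss(0.0, rel_std * p) for p in molar_pct]


# Example: a hypothetical four-component composition (values in mol%).
noisy = augment_molar_percentages([50.0, 38.5, 10.0, 1.5], rng=random.Random(0))
```

Because the noise scales with the value, a component at 50 mol% is perturbed with a standard deviation of 5 mol%, whereas one at 1.5 mol% is perturbed with a standard deviation of 0.15 mol%.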

Label margin

From the label values, we can tell not only which LNP is better than another but also by how much. To train the model to learn this extra information, we include a margin term57 in the binary ranking objective:

$${\mathcal{L}}_{\mathrm{rank}}=-\log \left(\mathrm{Sigmoid}\left({f}_{\theta }({x}_{\mathrm{h}})-{f}_{\theta }({x}_{\mathrm{l}})-{\lambda }_{\mathrm{margin}}({y}_{\mathrm{h}}-{y}_{\mathrm{l}})\right)\right)$$

(2)

where yh and yl are the (efficacy) label values of the more efficacious and less efficacious LNP, respectively, and λmargin controls how much this objective dominates the training. We use λmargin = 0.01 in our experiments.
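The effect of the margin term in equation (2) can be seen in a small sketch for a single pair. The function name is hypothetical; only the formula and the value λmargin = 0.01 come from the text.

```python
import math


def margin_ranking_loss(score_h, score_l, y_h, y_l, lam_margin=0.01):
    """-log(sigmoid(f(x_h) - f(x_l) - lam_margin * (y_h - y_l))) for one pair."""
    d = score_h - score_l - lam_margin * (y_h - y_l)
    # Stable evaluation of log(1 + exp(-d))
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


# The same score gap is penalized more when the labels are further apart.
loss_close = margin_ranking_loss(0.5, 0.0, y_h=0.6, y_l=0.5)
loss_far = margin_ranking_loss(0.5, 0.0, y_h=0.9, y_l=0.1)
```

The margin shifts the point at which the loss equals log 2 from a score gap of zero to a gap of λmargin(yh − yl), so pairs with a larger label difference need a larger predicted gap to reach the same loss.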

Ensembling

For in silico evaluation (Fig. 3e–l), the ensemble is formed by Nmodel models trained with the same hyperparameters and dataset (train/valid/test split) but with weights initialized with different random seeds. For the ensemble deployed to infer virtual LNPs, five different train (80%)/valid (20%) splits are made in a fivefold manner and each model in the ensemble is trained on a different fold. To ensure that ensembled scores are not biased towards models with high variance, the predicted scores from each model are normalized by making their scores for the LANCE LNPs fit a normal distribution with mean 0 and standard deviation 1 before ensembling. More specifically, for each model, this is done by inferring the predicted scores on all the LANCE LNPs and using the mean (\({\mathrm{mean}}_{i}\)) and standard deviation (\({\mathrm{std}}_{i}\)) of the LANCE LNPs’ scores to compute the normalized scores (\({y}_{i}^{{\prime} \mathrm{normalized}}\)) via

$${y}_{i}^{{\prime} \mathrm{normalized}}=\frac{{y}_{i}^{\prime }-{\mathrm{mean}}_{i}}{{\mathrm{std}}_{i}},\quad i\in \{1,\ldots ,{N}_{\mathrm{model}}\}$$

(3)

The final ensemble score is the mean of all models’ normalized scores:

$${y}^{{\prime} \mathrm{ensemble}}=\frac{1}{{N}_{\mathrm{model}}}\mathop{\sum }\limits_{i=1}^{{N}_{\mathrm{model}}}{y}_{i}^{{\prime} \mathrm{normalized}}$$

(4)
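Equations (3) and (4) can be sketched as follows, assuming each model's raw scores are available as plain lists. The function name and the use of the population standard deviation are assumptions; the text does not state which estimator is used.

```python
from statistics import fmean, pstdev


def ensemble_scores(candidate_scores, lance_scores):
    """Normalize each model by its LANCE statistics, then average over models.

    candidate_scores[i][j]: raw score of model i for candidate j
    lance_scores[i]:        raw scores of model i on all LANCE LNPs
    """
    normalized = []
    for cand, lance in zip(candidate_scores, lance_scores):
        mean_i = fmean(lance)
        std_i = pstdev(lance)  # assumption: population standard deviation
        normalized.append([(y - mean_i) / std_i for y in cand])  # equation (3)
    n_model = len(normalized)
    return [sum(col) / n_model for col in zip(*normalized)]  # equation (4)


# Two toy models; the second is the first rescaled by 10 and shifted by 3.
lance = [[0.0, 1.0, 2.0, 3.0], [3.0, 13.0, 23.0, 33.0]]
cands = [[1.5, 3.0], [18.0, 33.0]]
ensembled = ensemble_scores(cands, lance)
```

In this toy case both models give identical normalized scores, so the model with the larger output scale does not dominate the average.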

COMET is implemented in PyTorch and trained with NVIDIA V100 GPUs.

k-Nearest neighbours and random forest model details

The k-nearest neighbours and random forest models are implemented with the scikit-learn (https://scikit-learn.org/) package, with default hyperparameters. More specifically, the k-nearest neighbours model uses n = 5 nearest neighbours while the random forest model uses n = 100 estimators (trees).

LANCE dataset details

LANCE comprises four parts spanning orthogonal LNP design dimensions: lipid component identities, molar percentages, synthesis parameters (for example, N/P and aqueous/organic volumetric ratios) and high-resolution molar sweeps.

Seven ionizable lipids, three sterols, two helper lipids and two PEG lipids were used (Supplementary Table 14), reflecting the focus of current research12,42. To study molar percentage effects, we designed 13 lipid ratios by varying one lipid class at a time from a reference BASE ratio (Fig. 1d), based on ref. 20. For instance, ratios I1–I4 modify ionizable lipid %, C1–C3 alter cholesterol (compensated by helper lipid), and P1–P3 alter PEG lipid %, while the remaining ratios modify multiple components (Supplementary Table 13).

Part 1 (lipid selection)

To examine lipid identity effects, we generated 84 combinations from all permutations of 7 ionizable lipids, 3 sterols, 2 helper lipids and 2 PEG lipids. Paired with 13 molar ratios, this results in 1,092 possible LNPs; 1,066 were tested. After removing 91 overlapping with part 2, this part yielded 975 unique LNPs.

Part 2 (ionizable lipid synergy)

Following studies suggesting synergy from dual-ionizable lipid formulations58, we created LNPs with 60:40 molar splits across all ionizable lipid pairs, distributed across 13 lipid ratios. This yielded 637 additional LNPs.
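The counts quoted for parts 1 and 2 can be checked with a short enumeration. The lipid names below are placeholders. Reading the 637 LNPs of part 2 as ordered pairs of the seven ionizable lipids (self-pairs included) across 13 ratios is an assumption that happens to match both 637 and the 91 overlapping LNPs; the text does not state how the pairs were enumerated.

```python
from itertools import product

# Placeholder identifiers; the actual lipids are listed in the supplementary tables.
ionizable = [f"IL{i}" for i in range(1, 8)]  # 7 ionizable lipids
sterols = ["S1", "S2", "S3"]
helpers = ["H1", "H2"]
pegs = ["P1", "P2"]
n_ratios = 13

# Part 1: every lipid combination at every molar ratio.
part1_combos = list(product(ionizable, sterols, helpers, pegs))
part1_possible = len(part1_combos) * n_ratios  # 84 * 13
part1_unique = 1066 - 91  # tested LNPs minus those overlapping with part 2

# Part 2 (assumed reading): ordered pairs of ionizable lipids, self-pairs included.
ordered_pairs = list(product(ionizable, repeat=2))
part2_count = len(ordered_pairs) * n_ratios  # 49 * 13
self_pair_count = sum(a == b for a, b in ordered_pairs) * n_ratios  # 7 * 13
```

Under this reading, the self-pairs are single-ionizable formulations, which would explain why 91 LNPs overlap between parts 1 and 2.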

Part 3 (key synthesis parameters)

To explore synthesis effects, we introduced variation in ionizable lipid/RNA weight ratios (10:1, 15:1 and 20:1) and aqueous/organic phase ratios (1:1 and 3:1). Weight ratios were adjusted by molar mass to maintain equal molar %. These parameters were later converted to N/P ratios for model input. This part comprises 924 LNPs.

Part 4 (molar percentage sweeps)

To study finer-grained molar percentage effects, we created 24 evenly spaced intervals from 10% to 80% for ionizable lipid, cholesterol and helper lipid, producing 492 LNPs across 3 focused sweeps.

Formulation ratios

Single-ionizable LNPs span 18 unique N/P ratios, derived from 3 ionizable lipid/RNA weight ratios and 7 ionizable lipids. Dual-ionizable formulations add 63 more, totalling 81 N/P ratios. In molar terms, 13 base lipid ratios and 72 sweep ratios (24 per lipid class) result in 85 total molar compositions.

LNP synthesis

LNPs were synthesized by mixing lipid–ethanol and mRNA–citrate buffer phases, followed by incubation at 4 °C for 10 min. Automated handling was performed on the Tecan Fluent platform. For animal studies, LNPs were mixed, incubated on ice for 10 min and dialysed overnight at 4 °C in PBS (Slide-A-Lyzer, ThermoFisher).

Materials

FLuc mRNA (L-7202, Trilink); lipids (Cayman Chemical, Avanti); luciferase assay (Steady-Glo, E2550) and Agilent BioTek plate reader for readout. alamarBlue was used for viability assays.

Data processing

Each 96-well plate included a ‘standard’ LNP. Raw luminescence values were normalized to the standard and averaged across four replicates (two biological, two technical). Mean values were log-transformed and min–max normalized to [0, 1].
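The processing steps above can be sketched as a small pipeline. Several details are assumptions made for illustration: normalization to the standard is taken to be a simple ratio, the logarithm is taken to be base 10, and the function and argument names are hypothetical.

```python
import math


def normalize_labels(replicates, standard_signal):
    """Turn raw luminescence replicates into labels in [0, 1].

    replicates[k]:      raw readings for LNP k (four replicates in the text)
    standard_signal[k]: reading of the standard LNP on the same plate
    """
    means = []
    for reps, std_signal in zip(replicates, standard_signal):
        ratios = [r / std_signal for r in reps]  # assumption: ratio to the standard
        means.append(sum(ratios) / len(ratios))
    logs = [math.log10(m) for m in means]  # assumption: base-10 logarithm
    lo, hi = min(logs), max(logs)
    return [(v - lo) / (hi - lo) for v in logs]  # min-max scaling


# Toy readings spanning two orders of magnitude relative to the standard.
labels = normalize_labels(
    [[10.0] * 4, [100.0] * 4, [1000.0] * 4],
    [10.0, 10.0, 10.0],
)
```

The log transform compresses the wide dynamic range of luminescence, so an LNP that is tenfold brighter than the weakest one and tenfold dimmer than the strongest one lands at the midpoint of the scale.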

We have represented several key features of the LANCE dataset in Fig. 2. Below, we explain how these key features were extracted from LANCE. For Fig. 2a, part 1 formulations were selected. For each of the four ionizable lipids (ALC-0315, DLin-MC3-DMA, C12-200 and CKK-E12), we had 156 formulations containing 2 helper lipids, 3 sterol lipids and 2 PEG lipids (that is, 2 × 2 × 3 = 12 combinations) at 13 molar ratios (12 × 13 = 156 formulations).

For Fig. 2b,c, part 3 formulations containing one ionizable lipid, cholesterol and C14-PEG were selected. Two molar ratios of the lipid components (which are shown in the figure) were studied. The ionizable lipid to mRNA molar ratio was 10,162. The aqueous to organic volume ratio was varied. For Fig. 2d, part 3 formulations containing one ionizable lipid, DOPE, cholesterol and C14-PEG were selected. Two molar ratios of the lipid components (which are shown in the figure) were studied. The organic to aqueous volume ratio was held at 1:3.

Figure 2e was generated from part 2 data. Only formulations containing DOPE, cholesterol and C14-PEG were used for the graph. The name of the first ionizable lipid was listed as the title of the graph and the second ionizable lipid name was the row name. The total molar content of the ionizable lipids was the column name. The molar ratio of ionizable lipid 1/ionizable lipid 2 was 1.5. The molar ratio of DOPE/cholesterol was 0.34. The molar percentage of C14-PEG was 2.5%. The molar ratio of ionizable lipid/mRNA was 10,162. The entire library was used to construct Fig. 2f. We calculated the normalized transfection efficacy for the 30th and 70th percentile formulations in B16-F10 and DC2.4 cells. These values were as follows: 70th percentile, B16-F10 = 0.43887; 30th percentile, B16-F10 = 0.24315; 70th percentile, DC2.4 = 0.64623; 30th percentile, DC2.4 = 0.30946. Formulations above and below these values in the respective cell lines were selected and are plotted in Fig. 2f.

In vitro validation details

The LNPs are named according to the groups to which they belong. A summary of the prefixes used here is given in Supplementary Table 16.

Clinically approved LNP baselines

The recipes for the three clinical LNP baselines are based on the literature39, and the LNPs were synthesized at an aqueous/organic volumetric ratio of 3:1, following what is commonly used in previous work.

Top LANCE LNP hits baselines

To find strong and reliable LNP baselines from LANCE, we randomly selected 10 LNP formulations from the 90th percentile for each cell line and screened them again with the respective cell line to check for reproducibility. Among these ten formulations, the three LNPs whose normalized efficacy values were closest to their original LANCE efficacy label values were chosen as LANCE baseline LNPs.

Exploratory LNP library

To span a vast formulation space, the virtual library was generated by enumerating through possible LNP features such as lipid choices, their molar percentages and key synthesis parameters such as N/P ratios and aqueous/organic volumetric ratios, according to Supplementary Table 15. To find LNPs that are different from the hits in the LANCE dataset, formulations within a 10% L1 distance lipid molar percentage neighbourhood of any of the top 10% most efficacious LANCE hits were excluded. After this step, the exploratory library has 27,354,600 and 34,539,960 formulations for DC2.4 and B16-F10, respectively. An ensemble of five COMET models predicted efficacy in both cell lines. The top 0.1% highest-scoring LNPs were selected (34,529 B16-F10 and 27,354 DC2.4).
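The neighbourhood exclusion can be sketched as a simple filter. This example assumes that each formulation is a vector of molar percentages with a fixed component ordering and that "within 10%" means an L1 distance of at most 10 percentage points; the function names and the toy compositions are hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two molar-percentage vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def exclude_near_hits(candidates, top_hits, radius=10.0):
    """Keep only candidates further than `radius` (in mol%) from every top hit."""
    return [
        c for c in candidates
        if all(l1_distance(c, h) > radius for h in top_hits)
    ]


top_hits = [(50.0, 38.5, 10.0, 1.5)]
candidates = [
    (52.0, 36.5, 10.0, 1.5),  # L1 distance 4: inside the neighbourhood
    (35.0, 46.5, 16.0, 2.5),  # L1 distance 30: outside the neighbourhood
]
kept = exclude_near_hits(candidates, top_hits)
```

Only the second candidate survives, because the first differs from the known hit by just 4 percentage points in total.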

The next step removes formulations based on uncertainty in the COMET prediction. We capture the level of uncertainty by first computing the standard deviation (σ) between the models’ predictions (\({y}_{i}^{{\prime} \mathrm{normalized}}\) in equation (3)) within the ensemble. We then scale the standard deviation by dividing it by a non-negative predicted efficacy term to get a relative uncertainty value (urel):

$${u}_{\mathrm{rel}}=\frac{\sigma }{{\hat{y}}^{\mathrm{ensemble}}},\quad {\hat{y}}^{\mathrm{ensemble}}={y}^{{\prime} \mathrm{ensemble}}-{y}^{{\prime} \mathrm{ensemble,min,LANCE}}$$

(5)

where \({y}^{{\prime} \mathrm{ensemble,min,LANCE}}\) is the minimum ensemble score among the LANCE LNPs. Any formulations with a negative \({\hat{y}}^{\mathrm{ensemble}}\) term were dropped. Supplementary Fig. 13 shows the distribution of this relative uncertainty value. Formulations with the largest 50% of relative uncertainty values were removed, leaving 17,269 B16-F10 and 13,677 DC2.4 formulations.
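A minimal sketch of the uncertainty filter in equation (5) follows. The function name, the use of the population standard deviation, the handling of a zero denominator and the rounding for odd counts are assumptions that the text does not specify.

```python
from statistics import fmean, pstdev


def filter_by_relative_uncertainty(normalized_scores, lance_min_ensemble,
                                   keep_fraction=0.5):
    """Return indices of candidates kept after the relative-uncertainty filter.

    normalized_scores[j]: per-model normalized scores for candidate j
    lance_min_ensemble:   minimum ensemble score among the LANCE LNPs
    """
    u_rel = {}
    for j, per_model in enumerate(normalized_scores):
        y_hat = fmean(per_model) - lance_min_ensemble
        if y_hat <= 0:
            # Negative terms are dropped as in the text; zero is also dropped
            # here to avoid division by zero (assumption).
            continue
        u_rel[j] = pstdev(per_model) / y_hat  # assumption: population std
    ranked = sorted(u_rel, key=u_rel.get)  # most certain first
    return ranked[: int(len(ranked) * keep_fraction)]


scores = [
    [1.0, 1.0, 1.0],     # models agree exactly
    [1.0, 2.0, 3.0],     # moderate disagreement
    [2.0, 2.1, 1.9],     # small disagreement
    [0.0, 4.0, 2.0],     # large disagreement
    [-3.0, -3.0, -3.0],  # below the LANCE minimum, dropped
]
kept_idx = filter_by_relative_uncertainty(scores, lance_min_ensemble=-1.0)
```

Dividing by the shifted ensemble score means that a given spread between models counts for less when the predicted efficacy is high.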

To promote chemical diversity, K-means clustering (on 14-dimensional vectors encoding lipid molar percentages) grouped these candidates into 10 clusters. Clustering was repeated 1,000 times to stabilize assignments. The highest-scoring formulation in each cluster was selected, resulting in ten diverse in silico hits per cell line (Supplementary Tables 17 and 18).
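The final selection step can be sketched as picking the best-scoring member of each cluster. The clustering itself is not shown: the cluster labels are assumed to come from any K-means implementation, and how the 1,000 repetitions are combined into one assignment is not described in the text. The function name is hypothetical.

```python
def top_per_cluster(cluster_labels, scores):
    """Map each cluster label to the index of its highest-scoring candidate."""
    best = {}
    for idx, (label, score) in enumerate(zip(cluster_labels, scores)):
        if label not in best or score > scores[best[label]]:
            best[label] = idx
    return best


# Five toy candidates in three clusters.
cluster_labels = [0, 0, 1, 1, 2]
ensemble_score = [0.1, 0.9, 0.5, 0.4, 0.3]
hits = top_per_cluster(cluster_labels, ensemble_score)
```

Selecting one candidate per cluster, instead of the overall top ten, prevents the hits from concentrating in a single region of composition space.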

Lead optimization LNP library

For each cell type, three top LANCE hits (from ‘Top LANCE LNP hits baselines’ section) were used as starting points. Around each, virtual candidates were generated by (1) exploring within a 20% L1 molar percentage distance, (2) substituting at least one lipid (6 ionizable lipids, 2 sterols, 1 helper and 1 PEG) and (3) changing the N/P ratio.

To generate three diverse candidates per lead, we segmented the neighbourhood into three zones: (1) molar percentage segment (within 20% L1, no lipid changes), (2) substitute-lipid segment (within 20% L1, but with at least one different lipid) and (3) N/P ratio segment (differing N/P ratio). From each zone, the top predicted LNP was selected (Fig. 3d, right). This yielded three optimized LNPs per lead. The virtual library size ranged from 1.5 million (single-ionizable lipid) to 9 million (dual-ionizable lipid) candidates. The sixfold increase in dual-ionizable lipid cases arises from combinatorial enumeration: each minor ionizable lipid was paired with six major ones. By contrast, single-ionizable lipid compositions require no pairing. The final selected formulations for validation are listed in Supplementary Tables 19 and 20.
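The three-zone selection can be sketched as follows. The dictionary layout, the function names, the toy values and the order in which the zone rules are checked (N/P ratio first) are assumptions for illustration; the text does not say whether the N/P ratio segment is also restricted to the 20% L1 neighbourhood.

```python
def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_segment(candidate, lead, radius=20.0):
    """Return the zone a candidate falls in, or None if it is outside all zones."""
    if candidate["np_ratio"] != lead["np_ratio"]:
        return "np_ratio"  # assumption: checked first, no distance limit
    if l1_distance(candidate["molar_pct"], lead["molar_pct"]) > radius:
        return None
    if candidate["lipids"] != lead["lipids"]:
        return "substitute_lipid"
    return "molar_pct"


def top_per_segment(candidates, scores, lead):
    """Pick the highest-scoring candidate in each zone around one lead."""
    best = {}
    for cand, score in zip(candidates, scores):
        seg = assign_segment(cand, lead)
        if seg is not None and (seg not in best or score > best[seg][0]):
            best[seg] = (score, cand)
    return {seg: cand for seg, (score, cand) in best.items()}


lead = {"lipids": ("IL1", "S1", "H1", "P1"),
        "molar_pct": (50.0, 38.5, 10.0, 1.5), "np_ratio": 6.0}
candidates = [
    dict(lead, molar_pct=(45.0, 43.5, 10.0, 1.5)),   # molar percentage zone
    dict(lead, lipids=("IL2", "S1", "H1", "P1")),    # substitute-lipid zone
    dict(lead, np_ratio=9.0),                        # N/P ratio zone
    dict(lead, molar_pct=(20.0, 58.5, 20.0, 1.5)),   # too far from the lead
    dict(lead, molar_pct=(48.0, 40.5, 10.0, 1.5)),   # molar zone, lower score
]
picked = top_per_segment(candidates, [0.7, 0.6, 0.9, 0.99, 0.5], lead)
```

The fourth candidate has the highest score but is ignored because it lies 60 percentage points from the lead, outside every zone.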

PBAE synthesis

The compositions and molar ratios of amines, diacrylates and branching agents are listed in Supplementary Table 21. To synthesize PBAE polymers, combinations of the amines, diacrylates and branching agents were used. In brief, the full weight of diacrylate and branching agent was added to a 20 ml glass vial. Then, the solvent (dimethylformamide) was added to the reaction mixture, and the reaction vials were placed on a hotplate at 90 °C. After 24 h, the vials were removed from the hotplate and cooled to room temperature. The amines were added to the reaction vials, which were placed back on the hotplate at 90 °C, and the reaction was allowed to proceed for 48 h. Finally, the vials were removed from the hotplate and allowed to cool to room temperature. Then, the reaction mixture was added (drop-by-drop) into a beaker containing ~150 ml ice-cold diethyl ether (~10× excess volume). The collected samples were transferred to 50 ml tubes and centrifuged at 1,000 × g for 3 min to pellet the polymer. The supernatant was then removed and the pellet was dissolved in the minimum possible volume of dimethylformamide. This purification step was repeated three times. Final polymers were dried under vacuum and their solubility was tested in ethanol.

Representing PBAEs in COMET

PBAEs were represented as a combination of their diacrylate–amine repeating unit and branching agent, each with unique component-type embeddings. The repeating unit was treated as a fifth molar component type alongside lipids, with its molar concentration estimated from polymer weight and molecular weight. Total molar percentages of PBAE and lipids sum to 100%. Inference proceeds as in lipid-only LNPs (‘COMET details’ section).

COMET PBAE LNP lead optimization hits

Two top-performing PBAE LNPs per cell type were used as starting points. Around each, virtual candidates were generated by (1) exploring within a 20% L1 molar percentage neighbourhood, (2) substituting lipids (6 ionizable lipids, 2 sterols, 1 helper and 1 PEG) and (3) converting to dual-ionizable compositions. To select three diverse candidates, we defined three non-overlapping segments: one within the 20% L1 distance and with the same lipid choices, one with at least one different lipid compound and one with a dual-ionizable lipid configuration. The top predicted LNP from each segment was selected (Fig. 3d, right). Final hits are detailed in Supplementary Tables 22 and 23.

Human IL-15 screening

The IL-15 mRNA is synthesized via in vitro transcription with a HiScribe T7 mRNA kit with CleanCap Reagent AG (E2080S) from New England Biolabs, with 5-methoxy-UTP (N-1093) from Trilink. The LNP transfection is done at an mRNA concentration of 0.25 µg ml−1 in the 96-well plate format. The human IL-15 expression level is measured with the Human IL-15 Uncoated ELISA (88-7620) procured from Invitrogen, after 16 h of incubation of HepG2 cells with LNP. Raw efficacy data are normalized, similarly to the bioluminescence data mentioned above, before being used as the dataset for machine learning experiments. A random 20% of this dataset is held out as the test set, while the remainder is used as the train and validation sets.

Lyophilization of LNPs

The LNPs are synthesized in a tris buffer (5 mM tris buffer, pH 8). After synthesis, the LNP formulations are frozen at −80 °C for 2 h before undergoing the following lyophilization process: equilibrate at −40 °C for 2 h, in atmosphere → −40 °C for 21 h, in vacuum → 25 °C for 2 h, in vacuum. A Labconco FreeZone 6 l with a Stoppering Tray Dryer was used for lyophilization.

Degradation in efficacy

Post-lyophilization efficacy values were computed and normalized similarly to the LANCE label values (‘Data processing’ section). The degradation of efficacy owing to lyophilization was calculated by subtracting the post-lyophilization efficacy values from the LANCE B16-F10 values.

Animal experiments

Animal experiments for this study were approved by the Massachusetts Institute of Technology Institutional Animal Care and Use Committee and were consistent with local, state and federal regulations as applicable. Female C57BL/6J mice (000664, The Jackson Laboratory) were used in the experiments. For imaging, d-luciferin (LUCK-1G, Gold Biotechnology) solubilized in PBS was administered via intraperitoneal injection and the mice were imaged using an IVIS imaging system (PerkinElmer).

t-SNE visualization

We selected the COMET model most correlated (Spearman) with ensemble scores across a random virtual LNP subset. LNP features for t-SNE were the final [CLS] token representations. To ensure even distribution across ionizable types, dual-ionizable lipid LNPs were treated as a distinct category, and 1,250 LNPs per category (8 in total) were randomly sampled (10,000 in total).

Integrated gradients implementation

To run integrated gradients (IG) with COMET’s multimodal inputs, we adapted the Captum library. IG computes attribution by integrating gradients along a path from reference to input. Feature attributions were computed per LNP, baseline-subtracted and averaged across each group. Non-PBAE LANCE LNPs were used as the baseline. Attribution scores were normalized (max = 1) and averaged across ensemble models.
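The path integral behind IG can be illustrated on a toy function with a known gradient. This sketch does not use Captum and is not the adaptation described above; the function names, the midpoint Riemann rule and the toy model are assumptions chosen to keep the example self-contained.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of integrated gradients.

    grad_fn(point) must return the gradient of the model output at `point`.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # position along the straight-line path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Attribution = (input - baseline) * average gradient along the path
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Toy model f(v) = sum(v_i ** 2), whose gradient is 2 * v.
def f(v):
    return sum(t * t for t in v)


def grad_f(v):
    return [2.0 * t for t in v]


x, baseline = [1.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

For this toy model the attributions sum to f(x) − f(baseline), which is the completeness property that makes IG attributions interpretable as a decomposition of the change in output.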

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.