GraphML News (June 15th) - ICML’24 graph papers, musings on AF3, more Flow Matching
🎉 ICML 2024 papers (including orals and spotlights) are now visible on OpenReview (however, without search). If you don’t want to scroll through 100 pages of accepted papers manually or write a custom parser, Azmine Toushik Wasi compiled a collection of accepted Graph ML papers with a nice categorization.
👨🔬 More blogs on AlphaFold 3 and reflections about the future of TechBio: Charlie Harris focuses more on the technical side, whereas Carlos Outeiral presents the CompBio perspective, highlighting some cases where AF3 still underperforms.
🔀 Flow Matching continues to reach new heights with recently released papers: Variational Flow Matching (you didn’t forget ELBO and KL divergence, right?) by the UvA team of Floor Eijkelboom, Grigory Bartosh, et al (feat. Max Welling) derives a generalized flow matching formulation that naturally allows for categorical data (😼 CatFlow) and graph generation - the model outperforms DiGress and other diffusion baselines. At the same time, the NYU team of Boffi et al propose Flow Map Matching - pretty much the Consistency Models for FMs, enabling generation in one step instead of 20-100. Finally, Ross Irwin et al from AstraZeneca come up with MolFlow - flow matching for generating 3D conformations of molecules, showing compelling results on QM9 and Geom-Drugs.
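For context, here is a minimal sketch of the vanilla conditional flow matching objective (straight-line probability path, continuous data) that these works build upon - the variational, categorical, and map-matching variants all modify this basic recipe; the toy velocity net below is purely illustrative:

```python
import torch
import torch.nn as nn

def cfm_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Minimal linear-path conditional flow matching loss on continuous data.
    model(x_t, t) predicts a velocity field; x1 is a batch of data samples."""
    x0 = torch.randn_like(x1)              # sample from the noise prior
    t = torch.rand(x1.size(0), 1)          # random time in [0, 1] per sample
    xt = (1 - t) * x0 + t * x1             # point on the straight-line path x0 -> x1
    target_velocity = x1 - x0              # velocity of that linear path
    return ((model(xt, t) - target_velocity) ** 2).mean()

# Toy usage with a hypothetical 2-D velocity net
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
model = lambda x, t: net(torch.cat([x, t], dim=-1))
print(cfm_loss(model, torch.randn(128, 2)))
```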
📚 Weekend reading (no flow matching):
GraphStorm: all-in-one graph machine learning framework for industry applications by Da Zheng and AWS - we wrote about a new GNN framework for enterprises back in 2023, here is the full paper with details.
CRAG -- Comprehensive RAG Benchmark from Meta (and a Kaggle competition for $30k) - a factual QA benchmark that simulates queries to knowledge graphs and APIs. Vanilla RAG yields only 44% accuracy and fancy industrial models barely reach 63% - so there is plenty of room for improvement.
Explainable Graph Neural Networks Under Fire - by Zhong Li feat. Stephan Günnemann. Turns out most GNN explainers utterly fail and cannot be trusted in the presence of simple adversarial perturbations. Let us know if you have ever found a case where a GNN explainer actually works 🤭
GraphML News (June 22nd) - $30M seed for CuspAI, Graph Foundation Models, MoML 2024
💸 A new startup CuspAI by Max Welling and Chad Edwards, focusing on materials discovery and design for clean energy and sustainability, raised a $30M seed round (led by Hoxton, Basis Set, and Lightspeed). The support from the godfathers is significant - Geoff Hinton is a board advisor and Yann LeCun commented on the collaboration with the FAIR and OpenCatalyst teams on OpenDAC. The materials design area is getting hotter - not as hot as drug discovery and protein design, though - but is steadily growing. In addition to Radical AI, Orbital Materials, and the new CuspAI, a fresh Entalpic by ex-Mila founders raised $5M+.
🔖 Together with Michael Bronstein, we released a new blog post on Graph Foundation Models. First, we define what GFMs are and outline the key design challenges covering heterogeneous model expressivity, scaling laws, and data scarcity. Then, we describe several successful examples of recent generalist models that can be considered GFMs in a particular area, e.g., GraphAny for node classification, ULTRA for KG reasoning, and MACE MP-0 as universal potentials. We made sure to include all the recent references, including position papers to appear at ICML’24!
🧬 The Molecular ML 2024 conference took place in Montreal this week (concluding the ML for Drug Discovery summer school) and featured talks on drug discovery and drug design. The recording is already available - check out talks by Jian Tang (BioGeometry) on geometric DL for proteins and by Max Jaderberg (Chief AI Officer at Isomorphic Labs) on AlphaFold 3. Might be one of the first public talks on AF3!
Weekend reading:
More benchmarks (brought to you by the NeurIPS Datasets & Benchmarking track deadline).
Temporal Graph Benchmark 2.0 by Gastinger, Huang et al - the first large-scale benchmark for temporal KGs and heterogeneous graphs
Text-space Graph Foundation Models by Chen et al feat. Anton Tsitsulin and Bryan Perozzi - a collection of text-attributed graphs for node classification, link prediction, and graph-level tasks
Towards Neural Scaling Laws for Foundation Models on Temporal Graphs by Shirzadkhani, Ngo, Shamsi et al - perhaps the first evidence that one temporal GNN can generalize to different temporal graphs (here those are token transactions in Ethereum)
RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design by Rishabh Anand, our own Chaitanya K. Joshi, et al - equivariant flow matching for generating 3D RNA structures.
GraphML News (June 29th) - ESM 3, TDC 2, AI 4 Genomics Conference
🧬 Evolutionary Scale (formerly a team at Meta, now a standalone startup) released ESM 3 - the next version of the SOTA protein LM, pretty much the GPT-4 of pLMs. Now it’s not only a sequence model, but also a structure and function model. Following best LLM practices, ESM 3 even employs RLHF for alignment! Besides, the model features SE(3)-invariant geometric attention based on distances between frames (equivariance is not dead!) and a VQ-VAE to tokenize structures and functions. The ESM 3 family is available in three sizes: 1.4B is open weights, 8B and 98B are available in the API (it’s time to embrace that). The preprint is quite informative about training data, pre-/post-training, and RLHF details - kudos for not sweeping them under the rug. The model code is also available, so you only need 10,000 H100s to train it on your own 🙂
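To illustrate the invariance idea (a generic sketch, not ESM 3’s exact geometric attention, which operates on per-residue frames), here is a toy attention layer whose logits are biased by pairwise distances - distances are unchanged by rotations and translations, so the layer is SE(3)-invariant by construction:

```python
import torch
import torch.nn as nn

class DistanceBiasedAttention(nn.Module):
    """Toy SE(3)-invariant attention: logits depend on features plus pairwise distances."""
    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.dist_bias = nn.Sequential(nn.Linear(1, 16), nn.SiLU(), nn.Linear(16, n_heads))
        self.out = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # h: (B, N, dim) residue features; coords: (B, N, 3) positions (e.g. C-alpha atoms)
        B, N, _ = h.shape
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (x.view(B, N, self.n_heads, self.d_head).transpose(1, 2) for x in (q, k, v))
        dists = torch.cdist(coords, coords).unsqueeze(-1)      # (B, N, N, 1), rotation/translation invariant
        bias = self.dist_bias(dists).permute(0, 3, 1, 2)       # (B, heads, N, N)
        attn = (q @ k.transpose(-2, -1) / self.d_head ** 0.5 + bias).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)

layer = DistanceBiasedAttention()
print(layer(torch.randn(2, 10, 64), torch.randn(2, 10, 3)).shape)  # (2, 10, 64)
```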
💊 The team of Harvard, MIT, and Stanford researchers led by Marinka Zitnik released Therapeutic Data Commons 2, adding even more datasets and modalities: over 1,000 single-cell datasets covering over 85M cells, the first protein-peptide binding dataset, drug-target interaction data, clinical trial data, and much more, spanning 10+ modalities. TDC-2 packages several pre-trained embeddings and can be used for evaluating a variety of models - from LLMs to GNNs. TDC-1 received some criticism from drug discovery folks back in the day; let’s see if TDC-2 closes those gaps.
The AI for Genomics and Health conference will be held in Boston, Oct 17-18th, with a stellar lineup of speakers including Shekoofeh Azizi (DeepMind, the author of Med-PaLM), Mo Lotfollahi (Sanger Institute), Sergey Ovchinnikov (MIT), Marinka Zitnik (Harvard), and James Zou (Stanford).
🔮 A small update: MatterSim by MSR AI4Science became SOTA on MatBench Discovery, beating the recent GNoME from DeepMind - competition works wonders even in such advanced scientific topics as materials discovery, ML potentials, and molecular dynamics.
Weekend reading:
Multimodal Graph Benchmark by Zhu et al feat. Danai Koutra - three datasets combining graphs, texts, and images for node classification and link prediction tasks.
Transformers meet Neural Algorithmic Reasoners by Bounsi et al feat. Petar Veličković - a Transformer cross-attending to a pre-trained Triplet-GMPNN solves algorithmic reasoning tasks (CLRS-Text) better than the vanilla Transformer (though it still struggles with OOD generalization); a generic sketch of the cross-attention idea follows this list.
Clifford-Steerable Convolutional Neural Networks by Maksim Zhdanov et al - ConvNets go spacetime: equivariant to the Lorentz group and useful for electrodynamics. The thread by Maurice Weiler explains the work in much more detail. Someday (after another PhD in math and physics) I will be able to understand the math behind this paper.
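As a generic illustration of the Transformers-meet-NAR item above (our sketch, not the paper’s exact architecture), here is a block whose language tokens cross-attend to frozen node embeddings from a pre-trained GNN:

```python
import torch
import torch.nn as nn

class HybridNARBlock(nn.Module):
    """Toy block: tokens self-attend, then cross-attend to frozen GNN node embeddings."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tokens: torch.Tensor, node_embs: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, d) text tokens; node_embs: (B, N, d) outputs of a pre-trained (frozen) GNN
        t = self.norm1(tokens)
        h = tokens + self.self_attn(t, t, t)[0]                           # standard self-attention
        h = h + self.cross_attn(self.norm2(h), node_embs, node_embs)[0]   # read from the GNN
        return h + self.ffn(self.norm3(h))

block = HybridNARBlock()
print(block(torch.randn(2, 16, 256), torch.randn(2, 10, 256)).shape)  # (2, 16, 256)
```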
GraphML News (July 7th) - ICML Workshops, AI4Science Lectures, GraphRAG release
📚 ICML workshops started publishing their accepted papers on OpenReview. Remember that workshop papers are a good signal of future full papers at the next big conferences, so you might find something interesting! Among others, check out:
- GRaM workshop (Geometry-grounded Representation Learning and Generative Modeling),
- AI4Science,
- TF2M (Theoretical Foundations of Foundation Models)
- SPIGM (Structured Probabilistic Inference & Generative Modeling)
📺 The Simons Institute at Berkeley recently organized the workshop AI≡Science: Strengthening the Bond Between the Sciences and Artificial Intelligence with a stellar lineup including Tess Smidt, Mohammed AlQuraishi, Rafael Gomez-Bombarelli, and many others. All lecture recordings are now available.
🚒 Microsoft Research released GraphRAG, their take on graph-enriched RAG, on GitHub along with an accompanying blog post. The repo received 6k stars in just 5 days 📈.
Weekend reading:
Foundations and Frontiers of Graph Learning Theory by Yu Huang et al. feat. Muhan Zhang - a survey of GNN theory that can accompany the recent ICML position paper by Morris et al.
Aligning Target-Aware Molecule Diffusion Models with Exact Energy Optimization by Siyi Gu, Minkai Xu et al feat. Jure Leskovec - perhaps the first diffusion model for ligand generation (conditioned on the pocket) with the DPO alignment (RLHF without H).
This year's ICML will finally have a tutorial on graphs! Adrian Arnaiz-Rodriguez and Ameya Velingker will present a tutorial on Graph Learning: Principles, Challenges, and Open Directions.
🗓️ Date: Monday, July 22
🕒 Time: 15:30 CEST - 17:30 CEST
📍 ICML In-person Event: Hall A8, ICML Venue
📍 Virtual attendance: https://icml.cc/virtual/2024/tutorial/35233
What to expect?
- Intro to Graph Learning and GNNs: introduction to traditional graph representations, Graph Neural Networks (GNNs), Message Passing Neural Networks (MPNNs), Graph Transformers (GTs), and spectral quantities.
- Expressiveness and Generalizability: GNN expressivity linked with the WL test, generalizability of MPNNs, and their performance implications.
- Challenges in GNNs: Understanding and addressing under-reaching, over-smoothing, over-squashing, and graph rewiring techniques.
- Panel Discussion on Future Directions: panel discussion with Michael Bronstein, Bryan Perozzi, Christopher Morris, and more panelists TBC. We will discuss GNN limitations, graph foundation models, and integrating GNNs with large language models (LLMs).
This tutorial balances introductory content and advanced insights, aimed at both general audiences and experts. Don’t miss this opportunity to deepen your understanding of GNNs!
GraphML News (July 13th) - Recursion goes brrr, Acquisition of Graphcore, Illustrated AF3
💸 Recursion and NVIDIA launched BioHive-2, a GPU cluster made of 504 H100s - roughly 500 petaflops of dense BF16 compute and perhaps sub-$50M in cost. Some napkin math indicates it could train and fine-tune a full AlphaFold 3-like model in about 4 days. Except for ESM-3, we haven’t yet seen drug discovery models trained on such compute - congrats to Recursion, Valence, and the researchers and engineers who can now really go brrr.
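The napkin math behind the throughput figure, with assumed, rounded numbers:

```python
# Back-of-the-envelope cluster throughput (all numbers are assumptions, rounded)
n_gpus = 504                 # H100s in BioHive-2
tflops_per_gpu_bf16 = 990    # approximate peak dense BF16 TFLOPS of one H100 SXM
mfu = 0.4                    # assumed model FLOPs utilization achievable in practice
peak_pflops = n_gpus * tflops_per_gpu_bf16 / 1000
print(f"peak: ~{peak_pflops:.0f} PFLOPS, sustained: ~{peak_pflops * mfu:.0f} PFLOPS")
# -> peak: ~499 PFLOPS, sustained: ~200 PFLOPS
```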
💸 Graphcore, a UK hardware startup offering its own hardware platform (BOW IPUs), was acquired by SoftBank for a rumored $500M (back in 2020 the valuation was about $2.8B). Former employees likely lost their vested options ($500M is still less than the $600M originally invested in the company), but let’s hope the future will now be more stable for Graphcore and we will see more successful products.
🧬 The Illustrated AlphaFold by Elana Simon and Jake Silberg from Stanford (inspired by the Illustrated Transformer) visually explains the main building blocks of the model - from the input data through the PairFormer and triangular attention to the diffusion module and the training losses. Things indeed get much simpler when you know which tensor shapes are involved at each particular step.
Weekend reading:
Link Prediction with Untrained Message Passing Layers by Lisi Qarkaxhija, Anatol E. Wegner, and Ingo Scholtes - the unreasonable effectiveness of untrained MPNNs strikes back
SE(3)-Hyena Operator for Scalable Equivariant Learning by Artem Moskalev et al - FFT with Clifford MLPs enables an equivariant Hyena on long sequences of up to 3.5M tokens on a single GPU
On the Expressive Power of Sparse Geometric MPNNs by Yonatan Sverdlov, Nadav Dym - enabling equivariant GNNs on sparse graphs (usually EGNNs work on fully-connected graphs)
GraphML News (July 20th) - Pinder and Plinder, LAB bench, ICML 2024
🎙️ ICML 2024 starts next week - enjoy the conference and Vienna if you are participating this year! Besides the main program, Monday will feature the graph learning tutorial, while Thursday and Friday host a handful of graph-related workshops.
🧬 VantAI together with MIT, NVIDIA, UniBasel, and SIB introduce two novel large-scale benchmarks: Pinder (Protein INteraction Dataset and Evaluation Resource) and Plinder (Protein-Ligand Interaction Dataset and Evaluation Resource). Pinder includes 500x more data than PPIRef, and Plinder is roughly 10x larger than DockGen - the previously largest datasets in the area, which were susceptible to test set leakage. Re-training SOTA diffusion models on Pinder and Plinder yields much lower numbers, indicating that saturation is far away (at least for the coming year). Besides, it is great to see an industrial company (from a highly competitive CompBio area) contributing to the field with open datasets. Pinder and Plinder will be the main datasets for the upcoming ML for Structural Bio challenge at NeurIPS 2024, so prepare your GPUs and diffusion models.
🔬 FutureHouse released LAB-Bench for studying LLMs in biology and chemistry. The benchmark includes 8 categories where LLMs have to deal with figures, images, scientific literature, databases, and designing protocols. Recent LLMs and VLMs (GPT-4o, Claude, and Llama-3) all show rather underwhelming results on those tasks - finally a new unsaturated benchmark for the LLM crowd! The authors held out some data to check training contamination of future models (e.g., whether the training data for the next generation of such models includes the validation and test splits of the benchmark).
Weekend reading:
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures by Sophia Sanborn, Johan Mathe, Mathilde Papillon, et al - a massive survey with amazing illustrations
PINDER: The protein interaction dataset and evaluation resource by Daniel Kovtun, Mehmet Akdel, and VantAI folks feat. Michael Bronstein
PLINDER: The protein-ligand interactions dataset and evaluation resource by Janani Durairaj, Yusuf Adeshina, and VantAI folks
LAB-Bench: Measuring Capabilities of Language Models for Biology Research by Jon M. Laurent, Joseph D. Janizek, et al feat. Andrew White
GraphML News (July 27th) - LLMs in Chemistry, Discrete Flow Matching
ICML kept most of the community busy (Saturday is the last day of workshops), while in other news Llama 3.1, SearchGPT, AlphaProof, and AlphaGeometry 2 took over the headlines about the approaching AGI singularity. Anyway, August will likely be a quieter month.
Some fresh works for the weekend reading:
A Review of Large Language Models and Autonomous Agents in Chemistry by Mayk Ramos, Christopher Collison, and Andrew White - a massive survey on what the current generation of LLMs can do in chemistry - from property and synthesis prediction to tool-augmented and multi-modal frontier models for orchestrating automated discovery labs. (paying respects to the LLM week)
Discrete Flow Matching by Itai Gat and Meta FAIR, including Ricky Chen and Yaron Lipman - the OG authors of (Riemannian) Flow Matching. Discrete FM is now competitive with Llama 2/3 on coding tasks - so expect this module to appear in generative models for molecules, proteins, and crystals around ICLR’25 submissions and later.
Generative Modeling of Molecular Dynamics Trajectories by Bowen Jing and Hannes Stärk - MD via stochastic interpolants, supports accurate forward simulation, upsampling, interpolation between two states in the trajectory, and even inpainting of the simulated structure.
Seminar on Graph-based Causal Discovery in Computational Biology
🎓 Topic: "Causal discovery from multivariate information in biological and biomedical data"
👨🔬 Who: Hervé Isambert, The Isambert Lab, CNRS, Institut Curie, Paris
⌚ When: Monday, July 29th, 5pm CEST
Abstract: In this webinar, I will present the principles and limitations of graph-based causal discovery methods and their improvement using multivariate information decomposition, recently developed in my lab. Applications will range from gene expression data in single cells to nationwide medical databases of cancer patients. I will then discuss the theoretical link between graph-based causality and temporal (Granger-Schreiber) causality, which can both be expressed in terms of conditional multivariate information. While temporal causality is shown to imply graph-based causality, the converse may not be true (see Figure). An application to time series data concerns the analysis of video images of reconstituted tumor ecosystems, which uncovered a novel antagonistic effect of cell-cell interactions under therapeutically relevant conditions.
The Zoom link will appear in this channel shortly before 5pm
GraphML News (August 3rd) - NeurIPS workshops, MoML @ MIT, RUM and GraM
⛷️ NeurIPS’24 announced 56 accepted workshops (brace yourself, Vancouver convention center). In addition to a good bunch of LLM, VLM, and foundation model-focused events, graph and geometric learning folks might be interested in:
- AI for New Drug Modalities
- Machine Learning in Structural Biology
- Symmetry and Geometry in Neural Representations
- Multimodal Algorithmic Reasoning
- Machine Learning and the Physical Sciences
- AI for Accelerated Materials Design
🧬 The second part of MoML 2024 (Molecular ML) will be happening at MIT on November 5, you can submit short papers until October 10th. The authors of accepted papers get free admission!
💎 The GRaM workshop at ICML’24 published its accepted blog posts with some hidden gems like a JAX implementation of EGNN, an intro to equivariant neural fields, and a study of how consistency models don’t work for 3D molecule generation. Check out the others as well - most of them require only entry-level background.
📈 Non-convolutional Graph Neural Networks by Yuanqing Wang and Kyunghyun Cho (the OG of GRUs) introduces RUM (random walk with unified memory) nets free of convolutions. Practically, the RUM recipe is: sample random walks together with their anonymized node ID sequences (tracking the first occurrence of each node ID in the walk), encode both sequences via RNNs (sure, you can drop in your fav Mamba here), concatenate both vectors, and put an MLP on top - a toy sketch follows below. The authors show RUMs are more expressive than 1-WL GNNs while not suffering from oversmoothing and oversquashing (and they beat the baselines on a bunch of benchmarks). Interestingly, RUMs look like DeepWalk on steroids with several improvements. Is Bryan Perozzi the Noam Shazeer of graph learning? 🤔
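A toy sketch of the RUM recipe as we read it (hypothetical layer sizes, not the authors’ code):

```python
import torch
import torch.nn as nn

class ToyRUM(nn.Module):
    """Encode random walks and their anonymized ID sequences with RNNs, then merge with an MLP."""
    def __init__(self, feat_dim: int, hidden: int = 64, walk_len: int = 8):
        super().__init__()
        self.id_emb = nn.Embedding(walk_len, hidden)   # anonymous IDs are bounded by the walk length
        self.feat_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.id_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    @staticmethod
    def anonymize(walk):
        # Replace raw node IDs with the index of their first occurrence in the walk
        first_seen = {}
        return [first_seen.setdefault(v, len(first_seen)) for v in walk]

    def forward(self, x: torch.Tensor, walks: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, feat_dim) node features; walks: (num_walks, walk_len) node indices
        anon = torch.tensor([self.anonymize(w.tolist()) for w in walks])
        _, h_feat = self.feat_rnn(x[walks])           # encode the walk's node features
        _, h_id = self.id_rnn(self.id_emb(anon))      # encode the anonymized structure
        return self.mlp(torch.cat([h_feat[-1], h_id[-1]], dim=-1))   # one embedding per walk

model = ToyRUM(feat_dim=16)
x, walks = torch.randn(100, 16), torch.randint(0, 100, (32, 8))
print(model(x, walks).shape)  # (32, 64)
```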
More weekend reading:
Spatio-Spectral Graph Neural Networks by Simon Geisler et al feat. Stephan Günnemann - spectral GNNs can be strong performers, too - just to contrast with RUMs
Learning production functions for supply chains with graph neural networks by Serina Chang et al feat Jure Leskovec - a cool work that frames supply chains as temporal graphs, shows significant gains in prediction accuracy, and releases the data simulator
What Are Good Positional Encodings for Directed Graphs? by Yinan Huang, Haoyu Wang, and Pan Li. The answer is the Magnetic Laplacian with multiple potential factors (multi-q) - your best choice for DAGs (a toy single-q sketch follows below).
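Positional encodings are easy to play with, so here is a toy single-q numpy sketch of magnetic Laplacian PEs for a directed graph (the paper advocates combining several q values):

```python
import numpy as np

def magnetic_laplacian_pe(A: np.ndarray, q: float = 0.25, k: int = 2) -> np.ndarray:
    """Eigenvectors of the magnetic Laplacian as direction-aware positional encodings."""
    A_sym = (A + A.T) / 2.0                    # symmetrized adjacency
    D = np.diag(A_sym.sum(axis=1))             # symmetrized degree matrix
    theta = 2.0 * np.pi * q * (A - A.T)        # edge-wise phases encode direction
    L = D - A_sym * np.exp(1j * theta)         # Hermitian magnetic Laplacian
    _, eigvecs = np.linalg.eigh(L)             # real eigenvalues, complex eigenvectors
    pe = eigvecs[:, :k]                        # k lowest-frequency eigenvectors
    return np.concatenate([pe.real, pe.imag], axis=1)   # (n, 2k) real-valued features

# Example: a 3-node directed cycle
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
print(magnetic_laplacian_pe(A).shape)  # (3, 4)
```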
GraphML News (August 10th) - Summer School recordings, DD merger
🖥️ Recordings from the ML for Drug Discovery Summer School are now available covering 5 days of talks with 28 videos - from basics of GNNs for chemistry and equivariance to protein folding, ML potentials, simulations, protein-protein (-ligand) binding, to generative modeling and causal discovery.
🖥️ The Eastern European ML Summer School’24 also published their recordings - 25 videos covering a more general area of deep learning including LLMs, reasoning, VLMs, RL, generative models, Bayesian DL, and many more. Notebooks from the practical sessions are available on GitHub.
Both schools feature the most up-to-date material from the top experts in the field, quite the gems to watch during the summer break 💎.
⚛️ Continuing with the quality content, Sophia Tang published a massive, 2.5h-read guide to spherical equivariant graph transformers, deriving them from first principles - from spherical harmonics to Tensor Field Networks to the SE(3)-Transformer. Lots of illustrations with accompanying code. The best tutorial on the topic so far.
💸 News from the Geometric Wall Street Journal: a huge merger between Recursion and Exscientia (focusing on precision oncology) - actually, Recursion bought Exscientia for $688M in stock, continuing its acquisition spree (besides the BioHive-2 with its 504 H100s). (Not stonks advice.)
Weekend reading:
The Heterophilic Graph Learning Handbook: Benchmarks, Models, Theoretical Analysis, Applications and Challenges by Sitao Luan feat. Rex Ying and Stefanie Jegelka - everything you wanted to know about heterophilic graphs in 2024
When Heterophily Meets Heterogeneity: New Graph Benchmarks and Effective Methods by Junhong Lin et al - introduces H2DB, a collection of known and new heterophilic and heterogeneous graphs, much larger than existing datasets.
GraphML News (August 17th) - Spanner Graph, some new papers
🔧 Google announced Spanner Graph - an infinitely scalable graph database (like vanilla Spanner) with all the bells and whistles GDBMSs have in 2024: support for both the Graph Query Language (GQL, finally standardized by ISO in April after 8 years of work) and SQL, vector and full-text search, and basic graph algorithms at query time.
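If you want to poke at it from Python, a rough sketch might look like the following - the instance, database, graph, and schema are all hypothetical, and we assume GQL strings can be sent through the regular query API of the Python Spanner client:

```python
from google.cloud import spanner

client = spanner.Client()
db = client.instance("my-instance").database("my-db")   # hypothetical names

# GQL query over a hypothetical property graph
gql = """
GRAPH FinGraph
MATCH (p:Person)-[:Owns]->(a:Account)
RETURN p.name AS name, a.balance AS balance
"""

with db.snapshot() as snapshot:
    for name, balance in snapshot.execute_sql(gql):
        print(name, balance)
```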
Otherwise, it’s mid-August and vacation time, so probably no major news for the next few weeks.
Weekend reading:
Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability from the large DeepMind team - turns out reducing hallucinations when training LLMs on KGs (i.e., recalling training triples) requires an order of magnitude more compute than Chinchilla scaling laws suggest. Lots of qualitative results - have a look! Besides, it is one of the accepted papers at COLM - a new conference specifically tailored for LLM research (rip, ACL/EMNLP).
Topological Blind Spots: Understanding and Extending Topological Deep Learning Through the Lens of Expressivity by Yam Eitan et al. feat. Haggai Maron - one of the first studies of the expressive power of topological (higher-order) MPNNs. Turns out standard models based on simplicial or cellular complexes cannot distinguish many common topological patterns like a Möbius strip vs a cylinder. The authors then derive provably more powerful, scalable multi-cell networks.
Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure by Amy X. Lu et al feat. Pieter Abbeel and Kyunghyun Cho - a deep dive into the latent space of ESMFold, which happens to be quite sparse: it can be compressed by 128x without losing prediction performance.
GraphML News (August 24th) - Psiformer, ML potentials arena, Single-cell foundation models
⚛️ DeepMind announced the updated version of Psiformer (together with the paper in Science, a twitter thread, and source code in JAX) - a transformer for quantum physics tasks. The new model can approximate excited states of molecules on par with or better than existing gold-standard methods. Excited energy states are responsible for lasers, semiconductors, solar panels, fluorescence, and many other phenomena - huge potential for Psiformer in industrial applications.
🏆 Continuing with energy states - you probably know that the ultimate LLM benchmark these days is the Elo rating on the Chatbot Arena. Yuan Chiang started a similar effort for ML potential models (MLIP Arena) featuring three tasks: two atoms of the same type (the only leaderboard for now) and two molecular dynamics tasks (loading time is slow). The supported models for now are Equiformer V2, CHGNet, MACE MP, M3GNet, SevenNet, and a GPAW baseline from the DFT world.
🎻 Single-cell foundation models are getting more attention. The new scCello by Mila is a transformer trained on the masked LM task together with an alignment loss using the Cell Ontology. In the zero-shot inference regime, scCello outperforms end-to-end trained models on tasks like cell type classification, marker gene prediction, and batch integration. If you’d like to learn more, have a look at the fresh survey on transformers in single-cell omics.
Weekend reading: more foundation models and materials science:
A foundation model for clinician-centered drug repurposing by Kexin Huang et al feat. Jure Leskovec and Marinka Zitnik - introduces TxGNN, a graph foundation model for drug repurposing trained on a medical KG of 17k diseases and 8k drugs, strong zero-shot performance included. The model code and example weights are already on GitHub.
Microsoft published the source code of Aurora - a foundation model for atmospheric forecasting that consists of a Perceiver encoder/decoder with a Swin Transformer backbone.
Crystalline Material Discovery in the Era of Artificial Intelligence by Zhenzhong Wang et al (thanks to Wanyu Lin for highlighting the work) - a survey on predictive and generative models for crystals, with a GitHub repo of relevant papers.
From Text to Insight: Large Language Models for Materials Science Data Extraction + an online tutorial book by Mara Schilling-Wilhelmi, Martiño Ríos-García et al. LLMs are surprisingly strong at generating 3D structures of solid-state materials (ICLR 2024), on par with fancy equivariant diffusion models; this survey studies how much MatSci data LLMs could possibly extract to feed such models.
GraphML News (August 31st) - When GNNs help, randomized transformers, and new papers
August is a dry month in terms of news, but soon we’ll start to see upcoming ICLR submissions!
🔨 Meanwhile, have a look at the Measuring and Exploiting Network Usable Information blog post by Meng-Chieh Lee (based on the spotlight ICLR 2024 paper) that touches upon the question asked every day in industrial labs - will GNNs outperform MLPs on my data? Are there any hints or data characteristics (well, apart from the homophily ratio) that could indicate which model would be better, without training one? The authors introduce the notion of Network Usable Information (NUI) as a function of structural embeddings, node features, and neighbors’ features, and find some correlations between the new score and performance on node classification and link prediction.
We submitted a position paper to ICML’24 studying a similar question, but it didn’t get through because the reviewers demanded more experiments (in the position track, yeah).
🎰 Learning Randomized Algorithms with Transformers by Google and ETH Zurich - an intriguing blend of theoretical CS, math, and randomized algorithms with the expressiveness of transformers. Experiments show that randomized transformers can solve small graph coloring problems and explore grid worlds.
More weekend reading:
💊 Graph Artificial Intelligence in Medicine by Ruth Johnson, Michelle Li, feat Marinka Zitnik - a massive survey on GNNs in clinical applications.
Do Graph Neural Networks Work for High Entropy Alloys? by Zhang et al - the answer is yes, but with proper modeling. High-entropy alloys are unordered at the atomic scale but can be represented as sets of graphs (each graph is a local environment within an alloy). Practically, a set pooling function on top of a GNN, roughly DeepSet(GNN(set of graphs)), is what we are looking for - see the sketch after this list.
Expressive Power of Temporal Message Passing by Przemysław Wałega and Michael Rawson - Weisfeiler and Leman Go Temporal! Another fun fact about temporal GNNs: two models named DyG-Mamba (one, two; both add Mamba on top of GNN encoders) were submitted to arXiv a few days apart.
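The promised sketch of the DeepSet(GNN(set of graphs)) idea for alloys - hypothetical layer choices, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class AlloySetModel(nn.Module):
    """Encode each local-environment graph with a GNN, then pool the set of graph embeddings."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.conv1, self.conv2 = GCNConv(in_dim, hidden), GCNConv(hidden, hidden)
        self.phi = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())    # per-graph transform
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))                     # set-level readout

    def forward(self, x, edge_index, batch):
        # x, edge_index: all local-environment graphs of one alloy batched together;
        # batch maps each node to its local-environment graph
        h = self.conv2(self.conv1(x, edge_index).relu(), edge_index).relu()
        graph_embs = global_mean_pool(h, batch)            # one embedding per local environment
        return self.rho(self.phi(graph_embs).sum(dim=0))   # permutation-invariant sum over the set
```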
GraphML News (September 7th) - AF 3 reproductions, AlphaProteo, ORB, Entalpic round
Just the first week of September, but already so much news in protein design and materials science!
🧬 Two AlphaFold 3 reproductions are now available: HelixFold 3 from Baidu (tech report) and AF3 from Ligo Bioscience (no tech report yet). Training HelixFold 3 on PDB and custom data yields results roughly similar to the OG AlphaFold 3 on PoseBusters and CASP 15 - good news for science and reproducibility (and for Nature editors, hehe). Getting more data will be the key to a full reproduction - probably no other lab has as large and diverse a dataset as DM and Iso.
Meanwhile, Google DeepMind announced AlphaProteo - a generative model for binders conditioned on the target protein and possible binding sites. The preprint has no information about the generative model itself (an educated guess would be either an autoregressive transformer or discrete diffusion as a backbone), but the training dataset is similar to that of the full AlphaFold 3. Experimentally, AlphaProteo generates plausible binders in several use cases like the Epstein-Barr virus protein, the COVID-19 spike protein, and proteins involved in cancer.
🔮 In computational materials science, Orbital Materials announced ORB - a family of force field models to compute energies, forces, and stresses of atomistic systems (like bulk materials or semiconductors). ORB, trained on Alexandria and Materials Project trajectories with a denoising objective (improved Noisy Nodes), yields SOTA on MatBench Discovery, outperforming the big boys - MatterSim from MSR and GNoME from DeepMind. The authors highlight that ORB models are non-equivariant GNNs - in fact, the backbone is very similar to the Graph Network Simulator from 2020 with an optional attention interaction. It will be fun to watch equivariant vs non-equivariant folks beating each other's SOTA in the next few months 🍿
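For reference, a Noisy-Nodes-style coordinate denoising objective (the general idea only, with a hypothetical model API) is as simple as:

```python
import torch

def coordinate_denoising_loss(model, pos: torch.Tensor, batch, sigma: float = 0.1) -> torch.Tensor:
    """Perturb atom positions with Gaussian noise and train the GNN to predict that noise."""
    noise = sigma * torch.randn_like(pos)     # per-atom 3D Gaussian noise
    pred_noise = model(pos + noise, batch)    # hypothetical model returning a per-atom 3D vector
    return ((pred_noise - noise) ** 2).mean()
```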
💸 Entalpic, a French materials discovery startup with founders who graduated from Mila, announced an €8.5M seed round co-led by Breega, Cathay Innovation, and Felicis - congrats to Mathieu, Victor, and Alexandre! Entalpic joins CuspAI and Orbital Materials in the emerging market of DL-based materials discovery companies - we'll be keeping an eye on their advances.
Weekend reading:
Two papers from Shuiwang Ji’s lab on SE(3)-invariant 1D tokenization of 3D molecules for autoregressive generation:
Geometry Informed Tokenization of Molecules for Language Model Generation - for small molecules on QM9 and Geom-Drugs.
Fragment and Geometry Aware Tokenization of Molecules for Structure-Based Drug Design Using Language Models - for generating ligands for protein pockets.
Talking about autoregressive molecule generation, Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees is another strong baseline improving spanning tree-based graph generation.
GraphML News (September 17th) - Chai-1, GenMS
🍓 This week offered a significant portion of strawberries that might result in major improvements in scientific applications. For now, let's check what's out there beyond the berries.
🧬 Chai Discovery emerged from stealth and released Chai-1 - a reproduction of AlphaFold 3 with trained weights (thanks to a month on 128 A100s, saving you roughly $500k), a tech report, an open inference server, and inference code (interestingly, no model code). Initial experiments report numbers close to AF3. Chai is backed by OpenAI and many famous VCs, so it might become a strong new player in the industry - we'll keep an eye on it.
🔮 Google DeepMind announced GenMS: Generative Hierarchical Materials Search by Sherry Yang, Simon Batzner, and the team that brought us UniMat last year. GenMS employs three components: (1) Gemini 1.5 samples candidate formulae from a natural language query, e.g., “give me the formula for a stable chalcogenide with atom ratio 1:1:2 that's not in the ICSD database”; samples are filtered through rule-based heuristics and re-ranked by an LLM; (2) the best candidates are sent to a diffusion model (a non-equivariant, attention-based 3D U-Net) to generate 3D structures; (3) the structures are scored by a pre-trained ML potential (NequIP) - if they are stable and exhibit the target characteristics, they are added as tree branches for the next LLM iteration. GenMS excels at perovskite, pyrochlore, and spinel crystals, with structures confirmed by DFT formation energy calculations. Almost no geometric DL whatsoever 🙀
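The three components amount to a propose-generate-score loop; a hedged sketch is below, where `propose`, `rerank`, `generate`, and `score` are placeholder callables standing in for the LLM, the heuristic filters plus LLM re-ranker, the diffusion model, and the ML potential (none of these names or thresholds come from the paper).

```python
# A hedged sketch of a GenMS-style hierarchical search loop. All callables and the
# structure of their outputs are assumptions made for illustration, not a real API.
def genms_search(query, propose, rerank, generate, score,
                 n_rounds: int = 3, top_k: int = 10, hull_threshold: float = 0.1):
    hits = []
    candidates = propose(query)                         # (1) LLM samples candidate formulae
    for _ in range(n_rounds):
        candidates = rerank(candidates, query)[:top_k]  # rule-based filters + LLM re-ranking
        structures = generate(candidates)               # (2) diffusion model -> 3D structures
        scored = score(structures)                      # (3) pre-trained ML potential scores stability
        # Assume each scored item is a dict with an "e_above_hull" entry.
        hits += [s for s in scored if s["e_above_hull"] < hull_threshold]
        candidates = propose(query, seeds=hits)         # stable hits seed the next round (tree expansion)
    return hits
```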
Weekend reading:
Recurrent Aggregators in Neural Algorithmic Reasoning by Kaijia Xu and Petar Veličković - the first model capable of solving quickselect from the CLRS benchmark turns out to be a Triplet MPNN with a non-permutation-invariant LSTM aggregator (GraphSAGE vibes; a minimal sketch follows this list). Back in January, in our annual review post, quickselect was the most unlikely candidate for progress - and it looks like it is almost solved now!
On the design space between molecular mechanics and machine learning force fields by Yuanqing Wang and a huge collaboration of physicists and chemists led by NYU (feat. Kyunghyun Cho) - a nice intro to molecular mechanics, force fields, and potentials, approachable for folks without a degree in physics. The survey includes a discussion of foundational ML potential models and “a nihilistic epilogue” worth checking out.
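As promised above, a minimal sketch of a GraphSAGE-style LSTM aggregator (our illustration, not the paper's code): neighbor messages are consumed by an LSTM in some fixed but arbitrary order, so the aggregation is deliberately not permutation-invariant.

```python
# A minimal sketch of an LSTM neighbor aggregator in the GraphSAGE spirit: the final
# hidden state of the LSTM over the (ordered) neighbor messages is the aggregate.
import torch
import torch.nn as nn

class LSTMAggregator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)

    def forward(self, neighbor_messages: torch.Tensor) -> torch.Tensor:
        # neighbor_messages: (num_neighbors, dim) for a single node, in some arbitrary order
        seq = neighbor_messages.unsqueeze(0)     # (1, num_neighbors, dim)
        _, (h_n, _) = self.lstm(seq)             # final hidden state summarizes the sequence
        return h_n.squeeze(0).squeeze(0)         # (dim,) aggregated message for this node
```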
GraphML News (September 21st) - AITHYRA, Fragrance 2o, LOG meetups
🧬 The Austrian Academy of Sciences, together with the Boehringer Ingelheim Foundation, launched AITHYRA - the Institute for Biomedical AI - with a generous €150M of funding over the next 12 years as part of the Vienna BioCenter, and with Michael Bronstein as the first scientific director! AITHYRA plans to host 10-15 research groups, supporting them with compute resources and a robotic lab. Chances are AITHYRA becomes the European counterpart of the Institute for Protein Design (behold, David Baker) and a hub for Geometric Deep Learning research. Big win for Vienna 👏
👃 Osmo, a generative fragrance startup founded by ex-Google researchers who worked on the Principal Odor Map, revealed a few more details about the Fragrance 2o platform - essentially, molecule search / generation for potential fragrance molecules with further conditional generation capabilities. It would certainly be exciting to discover a personalized scent like “of a sweaty researcher submitting an ICLR paper while camping in Yosemite forests”. We will keep you up to date on whether GNNs conquer the perfume and beauty industries, and on when Fragrantica starts to list LLM prompts as ingredients.
🍻 One of the unique ideas of the Learning on Graphs conference is local meetups on graph learning research. To date, seven meetups spanning October-December have been announced: Tel Aviv, New Jersey, Aachen, Amsterdam, Paris, Kunshan, and Siena - feel free to attend or organize one at your place!
Weekend reading:
Accelerating Training with Neuron Interaction and Nowcasting Networks by Boris Knyazev et al, a collaboration between Samsung and Mila - pretty amazing work where, on every k-th optimization step, the model weights are predicted by a graph transformer conditioned on the neural net architecture (supporting convnets, GPT-2, BERT, Llama, and ViTs), bringing up to 50% speed-ups in optimization (a rough sketch follows this list).
The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof by Derek Lim, Moe Putterman, feat. Haggai Maron - another interesting work on neural parameter symmetries. It turns out that fixing weights in MLPs via freezing or non-linearities breaks parameter symmetries and enables better model merging (you can interpolate between pre-trained models to get even better performance).
Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study by Nikolai Merkel et al (VLDB 2025) - the answer is yes, with an average speedup of 25%. The idea of partitioning the graph into several components to optimize memory reads is similar to the findings of Graph Segment Pre-training (by Google) and Sequential Aggregation and Rematerialization (Intel).
Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods by Constantin Ahlmann-Eltze et al 🫳🎤
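As a rough illustration of the nowcasting idea from the Knyazev et al paper above (our own sketch with hypothetical names, not the released code): train normally, but every k-th step replace the optimizer update with weights predicted by a nowcaster model from recent parameter snapshots.

```python
# A hedged sketch of weight nowcasting during training. `nowcaster` is an opaque
# callable standing in for the paper's graph transformer over the parameter graph;
# the snapshot frequency and jump schedule are illustrative, not from the paper.
import torch

def train_with_nowcasting(model, nowcaster, loader, optimizer, loss_fn, k: int = 1000):
    history = []                                        # sparse history of parameter snapshots
    for step, (x, y) in enumerate(loader):
        if step > 0 and step % k == 0 and len(history) > 1:
            predicted = nowcaster(history)              # predict "future" weights from the history
            with torch.no_grad():
                for p, p_new in zip(model.parameters(), predicted):
                    p.copy_(p_new)                      # jump ahead instead of taking a gradient step
        else:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if step % 100 == 0:                             # record a snapshot every 100 steps
            history.append([p.detach().clone() for p in model.parameters()])
    return model
```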
Discrete Neural Algorithmic Reasoning
Guest post by Gleb Rodionov
Paper: https://www.arxiv.org/abs/2402.11628
Blog: https://research.yandex.com/blog/discrete-neural-algorithmic-reasoning
Code: https://github.com/yandex-research/dnar
In this paper, we focus on generalizable and interpretable neural algorithmic reasoners. Starting with an attention-based GNN, we inspect the reasons for generalization errors and propose several architectural modifications: feature discretization, hard attention, and separating the discrete and continuous data flows. All of these blocks are important for generalization (a minimal sketch of two of them appears at the end of this post):
⁃ State discretization prevents the model from relying on complex and redundant dependencies in the data;
⁃ Hard attention ensures that attention weights are not annealed on larger graphs; it also limits the set of possible messages each node can receive;
⁃ Separating discrete and continuous flows is needed to ensure that state discretization does not lose information about continuous data.
As a result, we obtain a model that provably imitates the execution of several algorithms on any test data when trained with hints. Practically, on SALSA-CLRS, a model trained on problems of 16 nodes achieves perfect graph- and node-level scores, generalizing to problems of up to 1600 nodes.
For future work, it would be interesting to extend the expressivity of the proposed model to a broader set of algorithms and to investigate whether it is possible to train these models without hints.
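For the curious, here is a minimal sketch of two of the blocks described above - hard attention and state discretization - written as an illustration of the idea rather than taken from the released code (the shapes and the straight-through trick are our assumptions).

```python
# Illustrative sketches of hard attention and state discretization for a
# discrete algorithmic reasoner; simplified and not the paper's implementation.
import torch
import torch.nn.functional as F

def hard_attention(scores: torch.Tensor, messages: torch.Tensor) -> torch.Tensor:
    # scores: (num_nodes, num_neighbors), messages: (num_nodes, num_neighbors, dim).
    # Each node receives exactly one message - the argmax neighbor - which keeps
    # attention sharp regardless of graph size.
    idx = scores.argmax(dim=-1)
    return messages[torch.arange(messages.size(0)), idx]      # (num_nodes, dim)

def discretize_states(logits: torch.Tensor) -> torch.Tensor:
    # logits: (num_nodes, num_states). Forward pass snaps states to one-hot vectors;
    # the straight-through estimator keeps gradients flowing through the soft values.
    soft = F.softmax(logits, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), num_classes=soft.size(-1)).float()
    return hard + soft - soft.detach()
```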
GraphML News (September 28th) - AlphaChip, Generate + Novartis deal, MolPhenix
NeurIPS results for both tracks have arrived - congrats to those who made it; the datasets track this year was particularly harsh, with a hard cutoff rejecting papers below an average score of 6.3. Good luck with the final ICLR push and see you in Vancouver!
💻 Google DeepMind presented AlphaChip - an improved version of the famous 2021 Nature paper that introduced an RL agent using edge-level GNNs for chip placement, that is, placing dozens of smaller blocks (often implementing a certain logical function) on a canvas to optimize common design metrics such as HPWL or PPA. The addendum highlights that pre-training with large compute is crucial and reports that AlphaChip has been successfully used for several generations of TPUs (25 RL-designed blocks in the latest TPU) as well as by external customers such as MediaTek. The paper gained a controversial reputation in the chip design community, and some professors even argued for retracting the work from Nature over a lack of clarity and reproducibility. Over time, however, it looks more like a skill issue of those who tried to replicate it - generally, the level of ML expertise in the chip design community is pretty low (some accepted papers at top venues like DAC are just 🫣) and most university teams are stuck somewhere between MLPs and convnets. Professors gonna hate, Google gonna continue making impactful real-world products, and we will get new pre-trained checkpoints of AlphaChip with some Colab tutorials 🍿.
💸 Generate:Biomedicines (the authors of Chroma, a generative model for protein design) announced a collaboration with Novartis resulting in $65M in upfront payments and $1B in biobucks (royalties and other performance-based milestones, typically spread across many years).
🐦 Valence Labs announced MolPhenix, a CLIP-like model for studying phenomics (how cells respond to perturbations). Practically, it is trained on pairs of microscopy images and molecules, using a ViT as the image encoder and MolGPS as the molecule encoder. Experiments report a massive 10x improvement in Top-1% recall of active molecules over the previous SOTA 👏.
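For reference, a CLIP-style objective of this kind boils down to a symmetric InfoNCE loss over paired image and molecule embeddings; a minimal sketch is below (the encoders are assumed to be given - the paper uses a ViT and MolGPS - and the temperature value is illustrative).

```python
# A hedged sketch of a symmetric CLIP/InfoNCE loss between phenomics image
# embeddings and molecule embeddings; not MolPhenix's actual training code.
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, mol_emb: torch.Tensor, temperature: float = 0.07):
    # image_emb, mol_emb: (batch, dim), row i of each tensor is a matching pair.
    image_emb = F.normalize(image_emb, dim=-1)
    mol_emb = F.normalize(mol_emb, dim=-1)
    logits = image_emb @ mol_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2m = F.cross_entropy(logits, targets)           # image -> molecule direction
    loss_m2i = F.cross_entropy(logits.t(), targets)       # molecule -> image direction
    return 0.5 * (loss_i2m + loss_m2i)
```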
Weekend reading:
TabGraphs: A Benchmark and Strong Baselines for Learning on Graphs with Tabular Node Features by Gleb Bazhenov et al - a fresh collection of new graph datasets with interpretable (numerical and categorical) features - a stark contrast to boring text-attributed graphs or Planetoid datasets with bag-of-words features.
Design of Ligand-Binding Proteins with Atomic Flow Matching by Junqi Liu et al feat. Jian Tang - generates a docked protein-ligand 3D structure with flow matching, conditioned only on the 2D ligand graph and the protein sequence. Outperforms RFDiffusionAA on several metrics.
GraphML News (Oct 5th) - ICLR 2025 Graph and Geometric DL Submissions
📚 Brace yourselves, for your browser is about to endure 50+ new tabs. All accepted NeurIPS 2024 papers are now visible (titles and abstracts), and a new batch of goodies from ICLR’25 has just arrived. We tried to select papers that haven’t yet appeared during the ICML/NeurIPS cycles. PDFs will be available on the respective OpenReview pages shortly:
Towards Graph Foundation Models:
GraphProp: Training the Graph Foundation Models using Graph Properties
GFSE: A Foundational Model For Graph Structural Encoding
Towards Neural Scaling Laws for Foundation Models on Temporal Graphs
Graph Generative Models:
Quality Measures for Dynamic Graph Generative Models
Improving Graph Generation with Flow Matching and Optimal Transport
Equivariant Denoisers Cannot Copy Graphs: Align Your Graph Diffusion Models
Topology-aware Graph Diffusion Model with Persistent Homology
Hierarchical Equivariant Graph Generation
Smooth Probabilistic Interpolation Benefits Generative Modeling for Discrete Graphs
GNN Theory:
Towards a Complete Logical Framework for GNN Expressiveness
Rethinking the Expressiveness of GNNs: A Computational Model Perspective
Learning Efficient Positional Encodings with Graph Neural Networks
Equivariant GNNs:
Improving Equivariant Networks with Probabilistic Symmetry Breaking
Does equivariance matter at scale?
Beyond Canonicalization: How Tensorial Messages Improve Equivariant Message Passing
Spacetime E(n) Transformer: Equivariant Attention for Spatio-temporal Graphs
Rethinking Efficient 3D Equivariant Graph Neural Networks
Generative modeling with molecules (hundreds of them actually):
AssembleFlow: Rigid Flow Matching with Inertial Frames for Molecular Assembly
RoFt-Mol: Benchmarking Robust Fine-tuning with Molecular Graph Foundation Models
Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages
MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra
Reaction Graph: Toward Modeling Chemical Reactions with 3D Molecular Structures
Accelerating 3D Molecule Generation via Jointly Geometric Optimal Transport
Generative modeling with proteins (hundreds of them as well):
EquiJump: Protein Dynamics Simulation via SO(3)-Equivariant Stochastic Interpolants
Design of Ligand-Binding Proteins with Atomic Flow Matching
RapidDock: Unlocking Proteome-scale Molecular Docking
Deep Learning for Protein-Ligand Docking: Are We There Yet?
ProteinBench: A Holistic Evaluation of Protein Foundation Models
Fast and Accurate Blind Flexible Docking
Solving Inverse Problems in Protein Space Using Diffusion-Based Priors
Crystals and Materials:
Flow Matching for Accelerated Simulation of Atomic Transport in Materials
MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks
Learning the Hamiltonian of Disordered Materials with Equivariant Graph Networks
Designing Mechanical Meta-Materials by Learning Equivariant Flows
SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models
Rethinking the role of frames for SE(3)-invariant crystal structure modeling
A Periodic Bayesian Flow for Material Generation
ECD: A Machine Learning Benchmark for Predicting Enhanced-Precision Electronic Charge Density in Crystalline Inorganic Materials
Wyckoff Transformer: Generation of Symmetric Crystals
PDDFormer: Pairwise Distance Distribution Graph Transformer for Crystal Material Property Prediction