Graph Machine Learning
Everything about graph theory, computer science, machine learning, etc.


If you have something worth sharing with the community, reach out @gimmeblues, @chaitjo.

Admins: Sergey Ivanov; Michael Galkin; Chaitanya K. Joshi
GraphML News (April 27th) - 🧬 The Protein Edition: OpenCRISPR, Xaira, ScaleFold

✂️ 🧬 Profluent Bio announced OpenCRISPR - an initiative to share CRISPR-Cas-like proteins generated by protein LMs (a-la ESM 2). Profluent managed to generate rather novel proteins hundreds of mutations away from the known ones, and the new proteins work surprisingly well - check out the thread by Ali Madani and a fresh preprint for more details. CRISPR is a genome editing tool that was awarded the 2020 Nobel Prize in Chemistry and was recently approved by the FDA as a therapy for sickle cell disease (with huge potential in other areas as well). Jennifer Doudna, one of the OG authors, gave a keynote at ICML’23 and even attended the graph learning and comp bio workshops!

💸 A new biotech startup Xaira Therapeutics was established with $1B+ funding, with David Baker as a co-founder. Investors include ARCH, Sequoia, Two Sigma, and other prominent VC bros. Perhaps we could hypothesize that the scaled-up technology stack behind RF Diffusion (both ML and lab) is going to play a key role in Xaira. In related news, Max Welling announced his departure from MSR and the co-founding of a new startup on molecular and materials discovery together with Chad Edwards.

📈 ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours - you only need 2,080 H100s to train AlphaFold in 7 hours (that’s roughly $130M given the $500K price tag for a DGX with 8 H100 GPUs). Gross extrapolation suggests that GPU-rich places like Meta could train a few AlphaFolds in parallel in less than an hour. Next milestone: train an AlphaFold-like model during a coffee break 👀.
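For reference, here is the napkin math behind that estimate as a rough sketch (using the DGX price assumption from the post, not a vendor quote):

```python
# Back-of-the-envelope cost of a 2,080-GPU H100 cluster (assumed DGX list price).
h100_gpus = 2080          # GPUs used in the ScaleFold run
gpus_per_dgx = 8          # one DGX H100 node
dgx_price_usd = 500_000   # assumed price per DGX H100

num_nodes = h100_gpus // gpus_per_dgx        # 260 nodes
cluster_cost = num_nodes * dgx_price_usd     # ≈ $130M
print(f"{num_nodes} DGX nodes ≈ ${cluster_cost / 1e6:.0f}M")
```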

📉 Artificial Intelligence Driving Materials Discovery? Perspective on the Article: Scaling Deep Learning for Materials Discovery - a critical look at the recently published GNoME database of discovered crystalline structures. The two main points are (1) many of those structures contain radioactive elements, making them impractical for real-world use; (2) many of those structures are isomorphic to well-known structures in crystallographic terms, e.g., replacing one element with another from a similar group, which induces pretty much the same crystal structure.

Weekend reading:

The GeometricKernels library by Viacheslav Borovitskiy et al. that implements kernels for Riemannian manifolds, graphs, and meshes with TensorFlow, PyTorch, and JAX bindings.

Learning with 3D rotations, a hitchhiker's guide to SO(3) by A. René Geist et al - a great introductory paper and resource for studying geometric rotations, a perfect companion to the Hitchhiker’s guide to Geometric GNNs

From Local to Global: A Graph RAG Approach to Query-Focused Summarization by Darren Edge and MSR - we mentioned GraphRAG a few times and here is the full preprint.

STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases by Shirley Wu, Shiyu Zhao et al feat. Jure Leskovec - a new benchmark for question answering on texts and structured sources
GraphML News (May 3rd) - The ICLR Week, new blogs

🎉 ICLR’24 starts in Vienna next Tuesday (May 7th)! There will be a ton of graph learning papers, geometric DL workshops, and, more importantly, the authors and folks who constitute the community. Michael and Chaitanya will be there, feel free to reach out to chat!

A few new blogposts:

- The TeraHAC algorithm by Google (to be presented at SIGMOD’24) for approximate clustering of graphs with trillions of edges in quasi-linear time.
- Adventures of Pop – the undruggable protein by Dom Beaini (Valence Labs) - a spectacular ELI5 read about drug discovery where a celebrity protein Pop (the cause of a bad disease) has to eat a banana 🍌 (the ligand with a potential drug that would inhibit the protein). With this yummy vocabulary at hand, the post explains several key concepts like protein-ligand binding, free energy, molecular dynamics, DMPK optimization, and more.

Weekend reading:

Uncertainty for Active Learning on Graphs by Dominik Fuchsgruber, Tom Wollschläger et al feat. Stephan Günnemann (all TU Munich)

Parameter-Efficient Tuning Large Language Models for Graph Representation Learning by Qi Zhu and AWS team feat. George Karypis - on using GNNs for producing soft prompts to be sent to LLMs

4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs by Minjie Wang, Quan Gan feat. Muhan Zhang - a new benchmark for graph learning on relational DBs similar to a recent RelBench, but including more tasks like link prediction. Some GNNs seem to outperform XGBoost (Kaggle GMs are anxious and frowning)
GraphML News (May 11th) - AlphaFold 3

🧬 Google DeepMind and Isomorphic Labs announced AlphaFold 3 going beyond proteins and extending structure prediction capabilities to RNA, DNA, and small molecules (ligands). AF3 employs Pairformer (improved Evoformer) as an encoder and a diffusion model for generating 3D coordinates. Yes, AF3 demonstrates huge gains in structural biology tasks compared to previous models, but perhaps the hottest take from the Nature preprint is:

> Similarly to some recent work, we find that no invariance or equivariance with respect to global rotations and translation of the molecule are required in the architecture and so we omit them to simplify the machine learning architecture.

🔥 For reference, AF2 used SE(3)-equivariant attention that spun off a great deal of research in equivariance and geometry for structural biology. The new statement took researchers at ICLR by storm: do we need to invest time and effort into complex math and group theory if a vanilla non-equivariant transformer and diffusion trained on 48 random augmentations can beat other geometric models with baked-in equivariances? AF3 used rather modest compute (compared to LLMs) - 256 A100s for 10 days of pretraining and 10 days of finetuning (overall roughly $420K on Azure) - and that seems to be enough to send a wake-up call to the Geometric DL community.

🤔 Does the bitter lesson strike again? Is it easier to learn symmetries from data and augmentations (see the classical 2016 paper by Taco Cohen and Max Welling) rather than enforcing those constraints in the model? Maybe it’s the task (DNA and RNA structure prediction) that does not have explicit symmetries to bake into a model? It is quite likely that equivariant models can achieve a similar result - but with higher compute and inference costs - is it still worth it? The inference argument looks quite plausible: foundation models (be it LLMs or AF) run billions of inference passes, so if you can save 2x inference time by not doing expensive math and just using longer pre-training, the total serving costs are also reduced.
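For concreteness, here is the simplest form of the augmentation-instead-of-equivariance recipe (a toy sketch under our assumptions, not AF3’s actual pipeline): sample random global rotations and translations per training example and feed the transformed coordinates to a plain, non-equivariant model.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def augment_coords(coords: np.ndarray, n_augs: int = 48) -> np.ndarray:
    """Apply n_augs random global rotations and translations to 3D coordinates
    of shape (N, 3), returning (n_augs, N, 3). A plain transformer trained on
    such copies can learn approximate SE(3) symmetry from data instead of
    having it baked into the architecture."""
    rots = Rotation.random(n_augs).as_matrix()      # (n_augs, 3, 3)
    shifts = np.random.randn(n_augs, 1, 3)          # random global translations
    return np.einsum("bij,nj->bni", rots, coords) + shifts

coords = np.random.randn(100, 3)      # e.g., 100 atoms
augmented = augment_coords(coords)    # (48, 100, 3), ready for a vanilla model
```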

Those will likely be the main questions in the community, on social media and at conferences, throughout 2024.

Besides that, researchers can use the AlphaFold Server for custom inference jobs - we welcome comp bio folks into the world (thanks, OpenAI and Anthropic) of paid API access and proprietary models 😉 Still, given the pace of the open-source community (at least two ongoing re-implementations 1, 2), a relatively simple model, and modest training compute, it might take <6 months to replicate a model similar to AF3 in performance.
GraphML News (May 18th) - MatterSim, new workshops

🔮 Continuing the success of MatterGen, a diffusion model for material generation, MSR announced MatterSim (blog), an ML force field for atomistic simulations. A single MatterSim model supports a wide range of temperatures (0-5000 K) and pressures (up to 1000 GPa) and thus could be seen as a competitor to the recent MACE-MP-0 - in fact, the authors compare against MACE-MP-0 and observe significant improvements on certain tasks. Practically, MatterSim ships with M3GNet or Graphormer backbones (equivariance lives!) so you can select one depending on the available compute. MatterSim could be especially useful in active learning scenarios as a quick proxy when filtering generated candidates.

👷 A few upcoming summer schools and workshops:

- Machine Learning for Chemistry 2024 CZS Summer School (Sept 9-13th in Karlsruhe) with invited speakers from Google, MSR, Mila, TU Munich, KIT, and EPFL. Early bird registration lasts until June 13th.
- 21st Machine Learning on Graphs (MLG) workshop (Sept 9th or 13th, co-located with ECML PKDD 2024 in Vilnius) accepts submissions until June 15th. Invited speakers include Yllka Velaj (Uni Vienna) and Haggai Maron (NVIDIA & Technion).

Weekend reading:

Improving Subgraph-GNNs via Edge-Level Ego-Network Encodings by Nurudin Alvarez-Gonzalez et al.

AdsorbDiff: Adsorbate Placement via Conditional Denoising Diffusion by Adeesh Kolluru and John R. Kitchin - perhaps the first diffusion model for this task (uses EquiformerV2 and GemNet-OC)

MiniMol: A Parameter-Efficient Foundation Model for Molecular Learning by Kerstin Kläser, Błażej Banaszewski, and Valence Labs - a 10M-parameter model encompassing most tasks on 2D molecules (where you have SMILES and graphs)
GraphML News (May 25th) - Aurora, primer on MD, PoET for proteins

The main NeurIPS deadline has finally passed - congrats to those who made it to the submission, you deserve some decompression time! (And reviewers, behold: 20K submissions are coming.) We can probably expect a flurry of preprints on arXiv in the coming weeks - we’ll keep you posted about the most interesting ones.

🌍 MSR AI 4 Science presented Aurora - a foundation model of the atmosphere that works for weather forecasting, air pollution, and predicting rare weather events. Aurora improves over the recent GraphCast and does so with plain vanilla Perceivers and ViTs, no equivariance involved 🥲

⚛️ Abhishaike Mahajan prepared a great primer on molecular dynamics for complete beginners, gradually introducing the most important concepts (with illustrations), from force fields to equilibration to computational simulation methods. Finally, the article touches upon some successful use cases of MD in industry. A highly recommended read to grasp the basics.

✍️ Meanwhile, folks returning from ICLR share some reflections on their fields - for instance, Patrick Schwab (GSK) on ML for Drug Discovery papers, and Lindsay Edwards (Relation) on why AI for DD is difficult.

🧬 OpenProtein released PoET (the Protein Evolution Transformer) - a protein LM that significantly outperforms ESM-2 in zero-shot prediction on ProteinGym while being much smaller. The authors project that a 200M PoET model can be equivalent to a 500B ESM model (by extrapolating scaling laws a bit). The checkpoint and inference code are publicly available.

Weekend reading:

Deep Learning for Protein-Ligand Docking: Are We There Yet? by Alex Morehead et al. - introduces the PoseBench benchmark for docking and evaluates a handful of modern baselines (DiffDock-L leads in most cases)

Explaining Graph Neural Networks via Structure-aware Interaction Index (ICML’24) by Ngoc Bui et al. feat. Rex Ying - the Myerson-Taylor interaction index instead of Shapley-based methods

Fisher Flow Matching for Generative Modeling over Discrete Data by Oscar Davis feat. Michael Bronstein and Joey Bose - flow matching for discrete data, already outperforms a recent discrete FM model DirichletFM
GraphML News (June 1st) - GNNs for Automotive Vision, NeurIPS submissions

A fresh example of applying GNNs to real-world problems is provided in the Nature paper Low-latency automotive vision with event cameras by Daniel Gehrig and Davide Scaramuzza from Uni Zurich. There, GNNs help to parse temporal events (like the appearance of a pedestrian on the road) and save a lot of compute by updating only the local neighborhoods of changed patches. The model (with an efficient CUDA implementation) works in real time in cars! Code and a video demo are available.

The week brought a handful of cool new papers formatted with the NeurIPS template (what could that mean 🤔) - let’s see:

🧬 Genie 2 by AlQuraishi lab - better protein diffusion model now supporting multi-motif scaffolding, outperforms RFDiffusion, FrameFlow, and Chroma, code is available.

🦄 LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters by Xinyu Zhou feat. Boris Knyazev - the next iteration of the Graph HyperNetwork (GHN-3) that directly predicts parameters of neural networks, now with an efficient module for transformer-sized matrices. The model can predict weights of GPT-2- and ViT-sized networks! Code

🍭 Understanding Transformer Reasoning Capabilities via Graph Algorithms by Clayton Sanford and the Google team feat. Anton Tsitsulin and Bryan Perozzi - a theoretical study of transformers and their ability to solve graph problems. The study reveals, e.g., that depth has to scale as O(log(V+E)) with the graph size for parallelizable problems, with additional width scaling for search problems. Besides, there is a comparison between GNNs and Transformers (trained from scratch and fine-tuned T5) on the GraphQA benchmark. Prompting LLMs doesn’t really work.

🤓 Two papers on flow matching from Michael Bronstein’s lab: Fisher Flow Matching for Generative Modeling over Discrete Data by Oscar Davis et al - the best discrete FM model so far, and Metric Flow Matching for Smooth Interpolations on the Data Manifold by Kacper Kapusniak et al - improvement of the OT-CFM (conditional flow matching with optimal transport).

We’ll be posting more new cool papers in the coming days!
GraphAny: A Foundation Model for Node Classification on Any Graph

by Jianan Zhao, Hesham Mostafa, Michael Galkin, Michael Bronstein, Zhaocheng Zhu, Jian Tang

🚀 We have just released a new work!

Pre-trained on one graph (Wisconsin, with 120 labeled nodes), GraphAny generalizes to any unseen graph with arbitrary feature and label spaces - 30 new graphs - with an average accuracy of 67.26% in a fully inductive manner, surpassing GCN and GAT trained individually on each graph in the supervised regime.

GraphAny treats inference on a new graph as an analytical solution to LinearGNNs and thus enjoys inductive (training-free) inference on arbitrary feature and label spaces. The model learns inductive attention scores for each node to fuse the predictions of multiple LinearGNNs. It adaptively picks the most important LinearGNN channels by transforming distance features between LinearGNN predictions, e.g., high-pass filters are preferred on heterophilic graphs.
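A minimal sketch of the idea (our simplified reading with made-up names, not the official implementation): each LinearGNN channel is a closed-form ridge regression on propagated features, and a learned attention MLP fuses the per-channel predictions.

```python
import torch

def lineargnn_predict(A_hat, X, Y, train_mask, reg=1e-2, hops=1):
    """One 'LinearGNN' channel: propagate features `hops` times with a
    normalized adjacency A_hat, then solve ridge regression analytically on
    the labeled nodes and predict soft labels for all nodes."""
    H = X
    for _ in range(hops):
        H = A_hat @ H                                   # feature propagation
    H_tr, Y_tr = H[train_mask], Y[train_mask]
    d = H_tr.shape[1]
    W = torch.linalg.solve(H_tr.T @ H_tr + reg * torch.eye(d), H_tr.T @ Y_tr)
    return torch.softmax(H @ W, dim=-1)                 # (num_nodes, num_classes)

def fuse_channels(preds, dist_feats, attn_mlp):
    """Fuse predictions of several channels (e.g., low-pass, high-pass, identity)
    with per-node attention computed from prediction-distance features -
    the only learnable (and hence inductive) part."""
    scores = torch.softmax(attn_mlp(dist_feats), dim=-1)   # (num_nodes, num_channels)
    stacked = torch.stack(preds, dim=1)                    # (num_nodes, num_channels, num_classes)
    return (scores.unsqueeze(-1) * stacked).sum(dim=1)
```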

Unlike LLM-based models that can’t scale to large graphs, GraphAny can be efficiently trained on one graph and evaluated on 30 others - 3M nodes & 244M edges in total - in just 10 minutes. It works great on any 16GB GPU or even a CPU.

Finally, you can train a model on Cora and run inductive node classification on Citeseer, Pubmed, and actually any graph!

Paper, Code
Internship/Visiting period at NEC Labs Europe, Heidelberg

Guest post by Federico Errica

Who: Federico Errica is hiring a PhD student or Postdoc for a 6-month collaboration in the form of an internship or a visiting research period.

What: the collaboration will focus on improving and designing message passing methods that address long-range propagation issues, with applications to computational science problems.

How to apply: through the official website or LinkedIn
GraphML News (June 8th) - LOG’24, FoldFlow 2, more new papers

🎙️ The biggest announcement of the week is that LOG’24 will happen virtually once more before going physical at UCLA in 2025. The dates are Nov 26-29th, 2024, and the submission deadline is September 11th. LOG is known for much higher review quality - a considerable part of the whole budget is dedicated to monetary rewards for reviewers (one of the few venues that actually appreciates good reviews).

🧬 The Dreamfold team announced FoldFlow 2 - an improved version of the protein structure generative model that made Riemannian flow matching a mainstream topic. FoldFlow 2 adds an ESM2 encoder for protein sequences and is trained on a much bigger dataset (featuring filtered synthetic structures from SwissProt and AlphaFold 2 DB). Experimentally, FoldFlow 2 substantially improves over previous SOTA big guys, RFDiffusion and Chroma, on unconditional and conditional (motif scaffolding) generation tasks.

Besides, it’s never too late to remind you that Federico Errica is hiring interns and visiting researchers at NEC Labs in Heidelberg.

📚 The weeks after the NeurIPS deadline continue to bring cool submissions and accepted ICML papers!

- Topological GNNs went equivariant all the way:

Topological Neural Networks go Persistent, Equivariant, and Continuous (ICML’24) by Yogesh Verma et al
E(n) Equivariant Topological Neural Networks by Claudio Battiloro et al
E(n) Equivariant Message Passing Cellular Networks by Veljko Kovač et al feat. Erik Bekkers

- Theory on graph transformers and spectral GNNs (all will be at ICML’24)

What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding by Hongkang Li et al
Aligning Transformers with Weisfeiler–Leman by Luis Müller and Chris Morris
On the Expressive Power of Spectral Invariant Graph Neural Networks by Bohang Zhang et al feat. Haggai Maron

- Transformers through the graph lens (both featuring Petar Veličković)

Transformers need glasses! Information over-squashing in language tasks by Federico Barbero et al - the old friend over-squashing is confirmed to be present in transformers
The CLRS-Text Algorithmic Reasoning Language Benchmark by Markeeva, McLeish, Ibarz et al - the text version of CLRS for all you LLM folks, a fresh unsaturated benchmark

- Combinatorial optimization with GNNs

Towards a General GNN Framework for Combinatorial Optimization by Frederik Wenkel, Semih Cantürk, et al
A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization by Sebastian Sanokowski et al
GraphML News (June 15th) - ICML’24 graph papers, musings on AF3, more Flow Matching

🎉 ICML 2024 papers (including orals and spotlights) are now visible on OpenReview (however, without search). If you don’t want to scroll through 100 pages of accepted papers manually or write a custom parser, Azmine Toushik Wasi compiled a collection of accepted Graph ML papers with a nice categorization.

👨‍🔬 More blogs on AlphaFold 3 and reflections on the future of TechBio: Charlie Harris focuses more on the technical side, whereas Carlos Outeiral presents the CompBio perspective, highlighting some cases where AF3 still underperforms.

🔀 Flow Matching continues to reach new heights with recently released papers: Variational Flow Matching (you didn’t forget ELBO and KL divergence, right?) by the UvA team of Floor Eijkelboom, Grigory Bartosh, et al (feat. Max Welling) derives a generalized flow matching formulation that naturally allows for categorical data (😼 CatFlow) and graph generation - the model outperforms DiGress and other diffusion baselines. At the same time, the NYU team of Boffi et al proposes Flow Map Matching - pretty much the Consistency Models of FMs, enabling generation in one step instead of 20-100. Finally, Ross Irwin et al from AstraZeneca come up with MolFlow - flow matching for generating 3D conformations of molecules, showing compelling results on QM9 and GEOM-Drugs.

📚 Weekend reading (no flow matching):

GraphStorm: all-in-one graph machine learning framework for industry applications by Da Zheng and AWS - we wrote about this GNN framework for enterprises back in 2023; here is the full paper with details.

CRAG - Comprehensive RAG Benchmark from Meta (and a Kaggle competition for $30K) - a factual QA benchmark that simulates queries to knowledge graphs and APIs. Vanilla RAG yields only 44% accuracy and fancy industrial models barely reach 63% - so there is plenty of room for improvement.

Explainable Graph Neural Networks Under Fire by Zhong Li et al. feat. Stephan Günnemann. Turns out most GNN explainers utterly fail and cannot be trusted in the presence of simple adversarial perturbations. Let us know if you ever found a case where GNN explainers actually work 🤭
GraphML News (June 22nd) - $30M seed for CuspAI, Graph Foundation Models, MoML 2024

💸 A new startup CuspAI by Max Welling and Chad Edwards, focusing on materials discovery and design for clean energy and sustainability, raised a $30M seed round (led by Hoxton, Basis Set, and Lightspeed). The support from the godfathers is significant - Geoff Hinton is a board advisor, and Yann LeCun commented on the collaboration with the FAIR and OpenCatalyst teams on OpenDAC. The materials design area is getting hotter - not as hot as drug discovery and protein design, but steadily growing: in addition to Radical AI, Orbital Materials, and the new CuspAI, a fresh Entalpic founded by ex-Mila folks raised $5M+.

🔖 Together with Michael Bronstein, we released a new blog post on Graph Foundation Models. First, we define what GFMs are and what the key design challenges are, covering heterogeneous model expressivity, scaling laws, and data scarcity. Then, we describe several successful examples of recent generalist models that can be considered GFMs in a particular area, e.g., GraphAny for node classification, ULTRA for KG reasoning, and MACE-MP-0 for universal potentials. We made sure to include all the recent references, including position papers to appear at ICML’24!

🧬 The Molecular ML 2024 conference took place in Montreal this week (concluding the ML for Drug Discovery summer school) and featured talks on drug discovery and drug design. The recording is already available - check out talks by Jian Tang (BioGeometry) on geometric DL for proteins and by Max Jaderberg (Chief AI Officer at Isomorphic Labs) on AlphaFold 3. Might be one of the first public talks on AF3!

Weekend reading:

More benchmarks (brought to you by the NeurIPS Datasets & Benchmarking track deadline).

Temporal Graph Benchmark 2.0 by Gastinger, Huang et al - the first large-scale benchmark for temporal KGs and heterogeneous graphs

Text-space Graph Foundation Models by Chen et al feat. Anton Tsitsulin and Bryan Perozzi - a collection of text-attributed graphs for node classification, link prediction, and graph-level tasks

Towards Neural Scaling Laws for Foundation Models on Temporal Graphs by Shirzadkhani, Ngo, Shamsi et al - perhaps the first evidence that one temporal GNN can generalize to different temporal graphs (here those are token transactions in Ethereum)

RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design by Rishabh Anand, our own Chaitanya K. Joshi, et al - equivariant flow matching for generating 3D RNA structures.
GraphML News (June 29th) - ESM 3, TDC 2, AI 4 Genomics Conference

🧬 EvolutionaryScale (formerly a team at Meta, now a standalone startup) released ESM 3 - the next version of the SOTA protein LM, pretty much the GPT-4 of pLMs. Now it’s not only a sequence model, but also a structure and function model. Following best LLM practices, ESM 3 even employs RLHF for alignment! Besides, the model features SE(3)-invariant geometric attention based on distances between frames (equivariance is not dead!) and a VQ-VAE to tokenize structures and functions. The ESM 3 family comes in three sizes: 1.4B with open weights, 8B and 98B available via the API (it’s time to embrace that). The preprint is quite informative about training data and pre-/post-training and RLHF details - kudos for not sweeping them under the rug. The model code is also available, so you only need 10,000 H100s to train it on your own 🙂

💊 The team of Harvard, MIT, and Stanford researchers led by Marinka Zitnik released Therapeutics Data Commons 2, adding even more datasets and modalities: over 1,000 single-cell datasets spanning 85M+ cells, the first protein-peptide binding dataset, drug-target interaction data, clinical trials data, and much more, covering 10+ modalities. TDC-2 packages several pre-trained embeddings and can be used for evaluating a variety of models - from LLMs to GNNs. TDC-1 received some criticism from drug discovery folks back in the day, let’s see if TDC-2 closes those gaps.

The AI for Genomics and Health conference will be held in Boston, Oct 17-18th, with a stellar lineup of speakers including Shekoofeh Azizi (DeepMind, the author of Med-PaLM), Mo Lotfollahi (Sanger Institute), Sergey Ovchinnikov (MIT), Marinka Zitnik (Harvard), and James Zou (Stanford).

🔮 A small update: MatterSim by MSR AI4Science became SOTA on MatBench Discovery, beating the recent GNoME from DeepMind - competition works wonders even in such advanced scientific topics as materials discovery, ML potentials, and molecular dynamics.

Weekend reading:

Multimodal Graph Benchmark by Zhu et al feat. Danai Koutra - three datasets combining graphs, texts, and images for node classification and link prediction tasks.

Transformers meet Neural Algorithmic Reasoners by Bounsi et al feat. Petar Veličković - a Transformer cross-attending to the pre-trained Triplet-GMPNN solves algorithmic reasoning tasks (CLRS-Text) better than the vanilla Transformer (but still struggles with OOD generalization)

Clifford-Steerable Convolutional Neural Networks by Maksim Zhdanov et al - ConvNets go spacetime: equivariant to the Lorentz group and useful for electrodynamics. The thread by Maurice Weiler explains the work in much more detail. Someday (after another PhD in math and physics) I will be able to understand the math behind this paper.
GraphML News (July 7th) - ICML Workshops, AI4Science Lectures, GraphRAG release

📚 ICML workshops started publishing their accepted papers on OpenReview. Remember that workshop papers are a good signal of upcoming full papers at the next big conferences, so you might find something interesting! Among others, check out:

- GRaM workshop (Geometry-grounded Representation Learning and Generative Modeling),
- AI4Science,
- TF2M (Theoretical Foundations of Foundation Models)
- SPIGM (Structured Probabilistic Inference & Generative Modeling)

📺 The Simons Institute at Berkeley recently organized a workshop, AI≡Science: Strengthening the Bond Between the Sciences and Artificial Intelligence, with a stellar lineup including Tess Smidt, Mohammed AlQuraishi, Rafael Gomez-Bombarelli, and many others. All lecture recordings are now available.

🚒 Microsoft Research released GraphRAG, their take on graph-enriched RAG, on GitHub along with an accompanying blogpost. The repo received 6K stars in just 5 days 📈.

Weekend reading:

Foundations and Frontiers of Graph Learning Theory by Yu Huang et al. feat. Muhan Zhang - a survey on GNN theory that can accompany the recent ICML position paper by Morris et al.

Aligning Target-Aware Molecule Diffusion Models with Exact Energy Optimization by Siyi Gu, Minkai Xu et al feat. Jure Leskovec - perhaps the first diffusion model for ligand generation (conditioned on the pocket) with the DPO alignment (RLHF without H).
This year's ICML will finally have a tutorial on graphs! Adrian Arnaiz-Rodriguez and Ameya Velingker will present a tutorial on Graph Learning: Principles, Challenges, and Open Directions.
🗓️ Date: Monday, July 22
🕒 Time: 15:30 CEST - 17:30 CEST
📍 ICML In-person Event: Hall A8, ICML Venue
📍 Virtual attendance: https://icml.cc/virtual/2024/tutorial/35233

What to expect?
- Intro to Graph Learning and GNNs: introduction to traditional graph representations, Graph Neural Networks (GNNs), Message Passing Neural Networks (MPNNs), Graph Transformers (GTs), and spectral quantities.
- Expressiveness and Generalizability: GNN expressivity linked with the WL test, generalizability of MPNNs, and their performance implications.
- Challenges in GNNs: Understanding and addressing under-reaching, over-smoothing, over-squashing, and graph rewiring techniques.
- Panel Discussion on Future Directions: a panel with Michael Bronstein, Bryan Perozzi, Christopher Morris, and more panelists TBC. We will discuss GNN limitations, graph foundation models, and integrating GNNs with large language models (LLMs).

This tutorial balances introductory content and advanced insights, aimed at both general audiences and experts. Don’t miss this opportunity to deepen your understanding of GNNs!
GraphML News (July 13th) - Recursion goes brrr, Acquisition of Graphcore, Illustrated AF3

💸 Recursion and NVIDIA launched BioHive-2, a GPU cluster made of 504 H100s, which is roughly equivalent to an exaflop in FP16/BF16 and perhaps sub-$50M in cost. Some napkin math indicates it could train and fine-tune a full AlphaFold 3-like model in about 4 days. Except for ESM-3, we haven’t yet seen drug discovery models trained on such compute - congrats to Recursion, Valence, and the researchers and engineers who can now really go brrr.
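The napkin math behind that figure, as a rough sketch (the per-GPU peak number is an assumption based on published specs):

```python
# Back-of-the-envelope peak throughput of 504 H100s (assumed dense BF16 peak per GPU).
h100_bf16_dense_tflops = 989          # ~0.99 PFLOPS dense BF16 per H100 SXM (assumed)
gpus = 504
cluster_pflops = gpus * h100_bf16_dense_tflops / 1000
print(f"~{cluster_pflops:.0f} PFLOPS dense BF16")
# ≈ 500 PFLOPS dense, i.e. roughly an exaflop when counting 2:4 structured sparsity.
```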

💸 Graphcore, a UK hardware startup offering its own hardware platform (BOW IPUs), was acquired by SoftBank for a rumored $500M (back in 2020 its valuation was about $2.8B). Former employees likely lost their vested options ($500M is still less than the $600M originally invested into the company), but let’s hope the future will now be more stable for Graphcore and we will see more successful products.

🧬 The Illustrated AlphaFold by Elana Simon and Jake Silberg from Stanford (inspired by the Illustrated Transformer) visually explains the main building blocks of the model - from the input data through Pairformer and triangular attention to the diffusion module and the training losses. Things get much simpler indeed when you know which tensor shapes are involved at each particular step.

Weekend reading:

Link Prediction with Untrained Message Passing Layers by Lisi Qarkaxhija, Anatol E. Wegner, and Ingo Scholtes - the unreasonable effectiveness of untrained MPNNs strikes back

SE(3)-Hyena Operator for Scalable Equivariant Learning by Artem Moskalev et al - FFT with Clifford MLPs enables an equivariant Hyena on long sequences of up to 3.5M tokens on a single GPU

On the Expressive Power of Sparse Geometric MPNNs by Yonatan Sverdlov, Nadav Dym - enabling equivariant GNNs on sparse graphs (usually EGNNs work on fully-connected graphs)
GraphML News (July 20th) - Pinder and Plinder, LAB bench, ICML 2024

🎙️ ICML 2024 starts next week - enjoy the conference and Vienna if you are participating this year! Besides the main program, Monday will feature the graph learning tutorial, and Thursday and Friday have a handful of graph-related workshops.

🧬 VantAI, together with MIT, NVIDIA, UniBasel, and SIB, introduces two novel large-scale benchmarks: Pinder (Protein INteraction Dataset and Evaluation Resource) and Plinder (Protein-Ligand Interaction Dataset and Evaluation Resource). Pinder includes 500x more data than PPIRef, and Plinder is roughly 10x larger than DockGen - the previously largest datasets in the area, which were susceptible to test set leakage. Re-training SOTA diffusion models on Pinder and Plinder yields much lower results, indicating that saturation is far away (at least for the coming year). Besides, it is great to see an industrial company (from a highly competitive CompBio area) contributing to the field with open datasets. Pinder and Plinder will be the main datasets for the upcoming ML for Structural Bio challenge at NeurIPS 2024, so prepare your GPUs and diffusion models.

🔬 FutureHouse released LAB-Bench for studying LLMs in biology and chemistry. The benchmark includes 8 categories where LLMs have to deal with figures, images, scientific literature, databases, and designing protocols. Recent LLMs and VLMs (GPT-4o, Claude, and Llama-3) all show rather underwhelming results on those tasks - finally, a new unsaturated benchmark for the LLM crowd! The authors held out some data to check training contamination of future models (e.g., whether the training data of the next generation of such models includes validation and test splits of the datasets).

Weekend reading:

Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures by Sophia Sanborn, Johan Mathe, Mathilde Papillon, et al - a massive survey with amazing illustrations

PINDER: The protein interaction dataset and evaluation resource by Daniel Kovtun, Mehmet Akdel, and VantAI folks feat. Michael Bronstein

PLINDER: The protein-ligand interactions dataset and evaluation resource by Janani Durairaj, Yusuf Adeshina, and VantAI folks

LAB-Bench: Measuring Capabilities of Language Models for Biology Research by Jon M. Laurent, Joseph D. Janizek, et al feat. Andrew White
GraphML News (July 27th) - LLMs in Chemistry, Discrete Flow Matching

ICML kept most of the community busy (Saturday is the last day of workshops), while in other news Llama 3.1, SearchGPT, AlphaProof, and AlphaGeometry 2 took the headlines of the approaching AGI singularity. Anyway, August will likely be a quieter month.

Some fresh works for the weekend reading:

A Review of Large Language Models and Autonomous Agents in Chemistry by Mayk Ramos, Christopher Collison, and Andrew White - a massive survey on what the current generation of LLMs can do in chemistry - from property and synthesis prediction to tool-augmented and multi-modal frontier models for orchestrating automated discovery labs (paying respects to the LLM week).

Discrete Flow Matching by Itai Gat and Meta FAIR, including Ricky Chen and Yaron Lipman - the OG authors of (Riemannian) Flow Matching. Discrete FM is now competitive with Llama 2/3 on coding tasks - so we should expect this module in all generative models for molecules, proteins, and crystals around ICLR’25 submissions and later.

Generative Modeling of Molecular Dynamics Trajectories by Bowen Jing and Hannes Stärk - MD via stochastic interpolants, supports accurate forward simulation, upsampling, interpolation between two states in the trajectory, and even inpainting of the simulated structure.
Seminar on Graph-based Causal Discovery in Computational Biology

🎓 Topic: "Causal discovery from multivariate information in biological and biomedical data"
👨‍🔬 Who: Hervé Isambert, The Isambert Lab, CNRS, Institut Curie, Paris
When: Monday, July 29th, 5pm CEST

Abstract: In this webinar, I will present the principles and limitations of graph-based causal discovery methods and their improvement using multivariate information decomposition, recently developed in my lab. Applications will range from gene expression data in single cells to nationwide medical databases of cancer patients. I will then discuss the theoretical link between graph-based causality and temporal (Granger-Schreiber) causality, which can both be expressed in terms of conditional multivariate information. While temporal causality is shown to imply graph-based causality, the converse may not be true (see Figure). An application to time series data concerns the analysis of video images of reconstituted tumor ecosystems, which uncovered a novel antagonistic effect of cell-cell interactions under therapeutically relevant conditions.

The Zoom link will appear in this channel shortly before 5pm
GraphML News (August 3rd) - NeurIPS workshops, MoML @ MIT, RUM and GraM

⛷️ NeurIPS’24 announced 56 accepted workshops (brace yourself, Vancouver convention center). In addition to a good bunch of LLM, VLM, and foundation model-focused events, graph and geometric learning folks might be interested in:

- AI for New Drug Modalities
- Machine Learning in Structural Biology
- Symmetry and Geometry in Neural Representations
- Multimodal Algorithmic Reasoning
- Machine Learning and the Physical Sciences
- AI for Accelerated Materials Design

🧬 The second part of MoML 2024 (Molecular ML) will be happening at MIT on November 5th; you can submit short papers until October 10th. The authors of accepted papers get free admission!

💎 The GRaM workshop at ICML’24 published accepted blogposts with some hidden gems like a JAX implementation of EGNN, an intro to equivariant neural fields, and a study of how consistency models don’t work for 3D molecule generation. Check out the others as well - most of them require only entry-level background.

📈 Non-convolutional Graph Neural Networks by Yuanqing Wang and Kyunghyun Cho (the OG of GRUs) introduces RUM (random walk with unified memory) nets, free of convolutions. Practically, the RUM recipe is: sample random walks together with anonymous node-ID sequences (tracking the first occurrence of each node ID in the walk), encode both sequences via RNNs (sure, you can drop in your fav Mamba here), and concat both vectors with an MLP on top - a rough sketch is below. The authors show RUMs are more expressive than 1-WL GNNs while not suffering from oversmoothing and oversquashing (and beat the baselines on a bunch of benchmarks). Interestingly, RUMs look like DeepWalk on steroids with several improvements. Is Bryan Perozzi the Noam Shazeer of graph learning? 🤔
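A toy paraphrase of that recipe (made-up names and shapes, not the authors' code):

```python
import torch
import torch.nn as nn

def sample_walk(adj_list, start, length):
    """Sample one random walk plus its anonymized version, where every node
    is replaced by the index of its first occurrence in the walk."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = adj_list[walk[-1]]
        walk.append(nbrs[torch.randint(len(nbrs), (1,)).item()])
    first_seen, anon = {}, []
    for v in walk:
        first_seen.setdefault(v, len(first_seen))
        anon.append(first_seen[v])
    return walk, anon

class RUMLayer(nn.Module):
    """Encode the node-feature sequence and the anonymous-ID sequence of a walk
    with two RNNs, then merge both summaries with an MLP."""
    def __init__(self, in_dim, hid_dim, max_walk_len=16):
        super().__init__()
        self.feat_rnn = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.anon_emb = nn.Embedding(max_walk_len, hid_dim)
        self.anon_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))

    def forward(self, walk_feats, anon_ids):
        # walk_feats: (batch, walk_len, in_dim); anon_ids: (batch, walk_len)
        _, h_feat = self.feat_rnn(walk_feats)
        _, h_anon = self.anon_rnn(self.anon_emb(anon_ids))
        return self.mlp(torch.cat([h_feat[-1], h_anon[-1]], dim=-1))
```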

More weekend reading:

Spatio-Spectral Graph Neural Networks by Simon Geisler et al feat. Stephan Günnemann - spectral GNNs can be strong performers, too - just to contrast with RUMs

Learning production functions for supply chains with graph neural networks by Serina Chang et al feat Jure Leskovec - a cool work that frames supply chains as temporal graphs, shows significant gains in prediction accuracy, and releases the data simulator

What Are Good Positional Encodings for Directed Graphs? by Yinan Huang, Haoyu Wang, and Pan Li. The answer is the Magnetic Laplacian with multiple potential factors (multi-q) - your best choice for DAGs.
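For reference, a small numpy sketch of the (single-q) magnetic Laplacian whose eigenvectors serve as directed positional encodings (our simplified version; the paper uses multiple potentials q):

```python
import numpy as np

def magnetic_laplacian(A, q=0.25):
    """Magnetic Laplacian of a directed graph with adjacency A (n x n, 0/1).
    The symmetrized graph gets a complex phase encoding edge direction; the
    eigenvectors of this Hermitian matrix can be used as positional encodings."""
    A_sym = ((A + A.T) > 0).astype(float)     # undirected support
    theta = 2 * np.pi * q * (A - A.T)         # +/- phase depending on edge direction
    H = A_sym * np.exp(1j * theta)            # Hermitian "magnetic" adjacency
    D = np.diag(A_sym.sum(axis=1))
    return D - H

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)        # tiny DAG: 0 -> 1 -> 2
eigvals, eigvecs = np.linalg.eigh(magnetic_laplacian(A, q=0.25))
```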
GraphML News (August 10th) - Summer School recordings, DD merger

🖥️ Recordings from the ML for Drug Discovery Summer School are now available covering 5 days of talks with 28 videos - from basics of GNNs for chemistry and equivariance to protein folding, ML potentials, simulations, protein-protein (-ligand) binding, to generative modeling and causal discovery.

🖥️ The Eastern European ML Summer School’24 also published their recordings - 25 videos covering a more general area of deep learning including LLMs, reasoning, VLMs, RL, generative models, Bayesian DL, and many more. Notebooks from the practical sessions are available on GitHub.

Both schools feature the most up-to-date material from the top experts in the field, quite the gems to watch during the summer break 💎.

⚛️ Continuing with the quality content, Sophia Tang published a massive, 2.5-hour-read guide to spherical equivariant graph transformers, deriving them from first principles - from spherical harmonics to Tensor Field Networks to the SE(3)-Transformer. Lots of illustrations with accompanying code. The best tutorial on the topic so far.

💸 News from the Geometric Wall Street Journal: a huge merger between Recursion and Exscientia (focusing on precision oncology) - actually, Recursion bought Exscientia for $688M in stock, continuing its acquisition spree (besides BioHive-2 with its 504 H100s). (Not stonks advice.)

Weekend reading:

The Heterophilic Graph Learning Handbook: Benchmarks, Models, Theoretical Analysis, Applications and Challenges by Sitao Luan feat. Rex Ying and Stefanie Jegelka - everything you wanted to know about heterophilic graphs in 2024

When Heterophily Meets Heterogeneity: New Graph Benchmarks and Effective Methods by Junhong Lin et al - introduces H2DB, a collection of known and new heterophilic and heterogeneous graphs, much larger than existing datasets.
GraphML News (August 17th) - Spanner Graph, some new papers

🔧 Google announced Spanner Graph - an infinitely scalable graph database (like vanilla Spanner) with all the bells and whistles GDBMSs have in 2024: support for both the Graph Query Language (GQL, finally standardized by ISO in April after 8 years of work) and SQL, vector search and full-text search, and basic graph algorithms at query time.

Otherwise, it’s mid-August and vacation time, so probably no major news for the next few weeks.

Weekend reading:

Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability by a large DeepMind team - turns out that reducing hallucinations when training LLMs on KGs (i.e., recalling training triples) requires an order of magnitude more compute than Chinchilla scaling laws suggest. Lots of qualitative results - have a look! Besides, it is one of the accepted papers at COLM - a new conference specifically tailored for LLM research (rip, ACL/EMNLP).

Topological Blind Spots: Understanding and Extending Topological Deep Learning Through the Lens of Expressivity by Yam Eitan et al. feat. Haggai Maron - one of the first studies of the expressive power of topological (higher-order) MPNNs. Turns out standard models based on simplicial or cellular complexes cannot distinguish many common topological patterns, like a Möbius strip vs a cylinder. The authors then derive provably more powerful, scalable multi-cell networks.

Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure by Amy X. Lu et al feat. Pieter Abbeel and Kyunghyun Cho - a deep dive into the latent space of ESMFold, which happens to be quite sparse: it can be reduced by 128x without losing prediction performance.