Artem Ryblov’s Data Science Weekly
@artemfisherman’s Data Science Weekly: Elevate your expertise with a standout data science resource each week, carefully chosen for depth and impact.

Long-form content: https://artemryblov.substack.com
Prompt Engineering Guide by OpenAI

This guide shares strategies and tactics for getting better results from large language models (sometimes referred to as GPT models) like GPT-4. The methods described here can sometimes be deployed in combination for greater effect. We encourage experimentation to find the methods that work best for you.

Some of the examples demonstrated here currently work only with our most capable model, gpt-4. In general, if you find that a model fails at a task and a more capable model is available, it's often worth trying again with the more capable model.
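
The "try again with a more capable model" advice can be sketched as a simple fallback loop. This is only an illustration: `call_model`, `is_acceptable`, and the model names below are hypothetical placeholders, not part of any real client library, and a real acceptance check would depend on the task.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_acceptable_answer(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Try models from least to most capable; return the first passing answer."""
    for model in models:
        answer = call_model(model, prompt)
        if is_acceptable(answer):
            return model, answer
    return None  # every model failed the check


# Stand-in for a real API client (hypothetical): the weaker model "fails" here.
def fake_call_model(model: str, prompt: str) -> str:
    return "4" if model == "larger-model" else "not sure"


result = first_acceptable_answer(
    prompt="What is 2 + 2? Answer with a single number.",
    models=["smaller-model", "larger-model"],
    call_model=fake_call_model,
    is_acceptable=str.isdigit,
)
print(result)  # ('larger-model', '4')
```

Ordering the list from cheapest to most capable means the stronger model is called only when the weaker one fails the check.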

Link: https://platform.openai.com/docs/guides/prompt-engineering

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #llm #openai #prompts #promptengineering #gpt #gpt3 #gpt4

@data_science_weekly
Channel name was changed to «Data Science Weekly»
Machine Learning Engineering Online Book by Stas Bekman

An open collection of methodologies to help with successful training of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators. That is, the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of Stas's experiences training Large Language Models (LLMs) and VLMs; much of the know-how was acquired while training the open-source BLOOM-176B model in 2022 and the IDEFICS-80B multi-modal model in 2023. Currently, he is working on developing and training open-source Retrieval Augmented models at Contextual.AI.

Table of Contents
Part 1. Insights
- The AI Battlefield Engineering - What You Need To Know
Part 2. Key Hardware Components
- Accelerator - the work horses of ML - GPUs, TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs (WIP)
- Network - intra-node and inter-node connectivity, calculating bandwidth requirements
- IO - local and distributed disks and filesystems
- CPU - cpus, affinities (WIP)
- CPU Memory - how much CPU memory is enough - the shortest chapter ever.
Part 3. Performance
- Fault Tolerance
- Performance
- Multi-Node networking
- Model parallelism
Part 4. Operating
- SLURM
- Training hyper-parameters and model initializations
- Instabilities
Part 5. Development
- Debugging software and hardware failures
- And more debugging
- Reproducibility
- Tensor precision / Data types
- HF Transformers notes - making small models, tokenizers, datasets, and other tips
Part 6. Miscellaneous
- Resources - LLM/VLM chronicles

Link: https://github.com/stas00/ml-engineering

Navigational hashtags: #armknowledgesharing #armbooks #armrepo
General hashtags: #llm #gpt #gpt3 #gpt4 #ml #engineering #mlsystemdesign #systemdesign #reproducibility #performance

@data_science_weekly
The Incredible PyTorch

This is a curated list of tutorials, projects, libraries, videos, papers, books and anything related to the incredible PyTorch.

Table Of Contents
- Tutorials
- Large Language Models (LLMs)
- Tabular Data
- Visualization
- Explainability
- Object Detection
- Long-Tailed / Out-of-Distribution Recognition
- Activation Functions
- Energy-Based Learning
- Missing Data
- Architecture Search
- Continual Learning
- Optimization
- Quantization
- Quantum Machine Learning
- Neural Network Compression
- Facial, Action and Pose Recognition
- Super resolution
- Synthesizing Views
- Voice
- Medical
- 3D Segmentation, Classification and Regression
- Video Recognition
- Recurrent Neural Networks (RNNs)
- Convolutional Neural Networks (CNNs)
- Segmentation
- Geometric Deep Learning: Graph & Irregular Structures
- Sorting
- Ordinary Differential Equations Networks
- Multi-task Learning
- GANs, VAEs, and AEs
- Unsupervised Learning
- Adversarial Attacks
- Style Transfer
- Image Captioning
- Transformers
- Similarity Networks and Functions
- Reasoning
- General NLP
- Question and Answering
- Speech Generation and Recognition
- Document and Text Classification
- Text Generation
- Text to Image
- Translation
- Sentiment Analysis
- Deep Reinforcement Learning
- Deep Bayesian Learning and Probabilistic Programming
- Spiking Neural Networks
- Anomaly Detection
- Regression Types
- Time Series
- Synthetic Datasets
- Neural Network General Improvements
- DNN Applications in Chemistry and Physics
- New Thinking on General Neural Network Architecture
- Linear Algebra
- API Abstraction
- Low Level Utilities
- PyTorch Utilities
- PyTorch Video Tutorials
- Community
- To be Classified
- Links to This Repository
- Contributions

Link: The Incredible PyTorch (repository)

Navigational hashtags: #armknowledgesharing #armrepo
General hashtags: #dl #deeplearning #pytorch

@data_science_weekly
What are embeddings? by Vicki Boykis

Over the past decade, embeddings — numerical representations of machine learning features used as input to deep learning models — have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications has exploded, creating approaches specifically tailored to scale has become increasingly important.

Google’s Word2Vec paper made an important step in moving from simple statistical representations to the semantic meaning of words. The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods, has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.
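
To make the "traditional approaches" concrete, here is a minimal pure-Python sketch of one-hot encoding and TF-IDF. The toy corpus is made up for illustration, and the IDF formula is the plain log(N / df) variant; libraries often use smoothed versions, so exact numbers will differ.

```python
import math
from collections import Counter

# Made-up toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}


def one_hot(token):
    """Return a vector with a single 1 at the token's vocabulary position."""
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec


def tf_idf(doc_tokens):
    """Weight each term by its in-document frequency and its rarity in the corpus."""
    counts = Counter(doc_tokens)
    vec = [0.0] * len(vocab)
    for tok, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for doc in tokenized if tok in doc)
        idf = math.log(len(tokenized) / df)  # unsmoothed variant
        vec[index[tok]] = tf * idf
    return vec


first_doc = tf_idf(tokenized[0])
print(one_hot("cat"))
print(first_doc[index["cat"]], first_doc[index["the"]])
```

Both representations have one dimension per vocabulary word, and "cat" and "cats" end up as unrelated dimensions. That lack of context and similarity is the limitation learned embeddings such as Word2Vec address.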

Link: https://vickiboykis.com/what_are_embeddings/index.html

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #dl #deeplearning #pytorch #embeddings #tfidf #svd #pca #word2vec #cbow #skipgram #bert #gpt #llm #transformers

@data_science_weekly
The CL (changelist) author’s guide to getting through code review by Google

The pages in this section contain best practices for developers going through code review. These guidelines should help you get through reviews faster and with higher-quality results. You don’t have to read them all, but they are intended to apply to every Google developer, and many people have found it helpful to read the whole set.

- Writing Good CL Descriptions
- Small CLs
- How to Handle Reviewer Comments

Link: https://google.github.io/eng-practices/review/developer/

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #git #commit #pr #changelist #cl #review #pullrequest

@data_science_weekly
Google Machine Learning Education

Learn to build ML products with Google's Machine Learning Courses.

Foundational courses
The foundational courses cover machine learning fundamentals and core concepts. They recommend taking them in the order below.

1. Introduction to Machine Learning
A brief introduction to machine learning.
2. Machine Learning Crash Course
A hands-on course to explore the critical basics of machine learning.
3. Problem Framing
A course to help you map real-world problems to machine learning solutions.
4. Data Preparation and Feature Engineering
An introduction to preparing your data for ML workflows.
5. Testing and Debugging
Strategies for testing and debugging machine learning models and pipelines.

Advanced Courses
The advanced courses teach tools and techniques for solving a variety of machine learning problems. The courses are structured independently. Take them based on interest or problem domain.

- Decision Forests
Decision forests are an alternative to neural networks.
- Recommendation Systems
Recommendation systems generate personalized suggestions.
- Clustering
Clustering is a key unsupervised machine learning strategy to associate related items.
- Generative Adversarial Networks
GANs create new data instances that resemble your training data.
- Image Classification
Is that a picture of a cat or is it a dog?
- Fairness in Perspective API
Hands-on practice debugging fairness issues.

Guides
Their guides offer simple step-by-step walkthroughs for solving common machine learning problems using best practices.

- Rules of ML
Become a better machine learning engineer by following these machine learning best practices used at Google.
- People + AI Guidebook
This guide assists UXers, PMs, and developers in collaboratively working through AI design topics and questions.
- Text Classification
This comprehensive guide provides a walkthrough to solving text classification problems using machine learning.
- Good Data Analysis
This guide describes the tricks that an expert data analyst uses to evaluate huge data sets in machine learning problems.
- Deep Learning Tuning Playbook
This guide explains a scientific way to optimize the training of deep learning models.

Link: https://developers.google.com/machine-learning?hl=en

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #machinelearning #ml #google #course #courses #featureengineering #recsys #clustering #gan

@data_science_weekly
Supervised Machine Learning for Science. How to stop worrying and love your black box by Christoph Molnar & Timo Freiesleben

Machine learning has revolutionized science, from folding proteins and predicting tornadoes to studying human nature. While science has always had an intimate relationship with prediction, machine learning amplified this focus. But can this hyper-focus on prediction models be justified? Can a machine learning model be part of a scientific model? Or are we on the wrong track?

In this book, the authors explore and justify supervised machine learning in science. However, a naive application of supervised learning won’t get you far because machine learning in raw form is unsuitable for science. After all, it lacks interpretability, uncertainty quantification, causality, and many more desirable attributes. Yet, we already have all the puzzle pieces needed to improve machine learning, from incorporating domain knowledge and ensuring the representativeness of the training data to creating robust, interpretable, and causal models. The problem is that the solutions are scattered everywhere.

In this book, the authors bring together the philosophical justification and the solutions that make supervised machine learning a powerful tool for science.

The book consists of two parts:
- Part 1 discusses the relationship between science and machine learning.
- Part 2 addresses the shortcomings of supervised machine learning.

Link: https://ml-science-book.com/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #science #supervised

@data_science_weekly
Large Language Model Course

The LLM course is divided into three parts:

🧩 LLM Fundamentals covers essential knowledge about mathematics, Python, and neural networks.
🧑‍🔬 The LLM Scientist focuses on building the best possible LLMs using the latest techniques.
👷 The LLM Engineer focuses on creating LLM-based applications and deploying them.

Links:
- Direct link
- Content Guide link
- Topic Guide link

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #llm #llms #largelanguagemodel #largelanguagemodels #transformer #transformers #deeplearning #dl #nlp #naturallanguageprocessing

@data_science_weekly
Exceptional Resources for Data Science Interview Preparation. Part 1: Live Coding

In this article, we will look at what a live coding interview is and how to prepare for it.

This blog post is primarily aimed at Data Scientists and ML engineers, though some sections, such as Algorithms and Data Structures, will suit any IT specialist who has to go through a live coding section.

Table of contents
- Preparing for an Algorithmic Interview
- Resources
- Algorithms and Data Structures
- Programming in Python
- Solving a Practical Data Science Problem
- Hybrid
- Learning How to Learn
- Let’s sum it up
- What’s next?

NB:
I'm the author of the article.
It was initially published in Russian (on habr.com). For the medium.com version, I removed the Russian-language resources and added extra English-language resources in their place.
So, I recommend that Russian speakers read the Russian version and English speakers read the English version. Both will benefit from starring the repository, which will be maintained and updated as new resources become available.

Links:
- Medium (eng)
- Habr (rus)

Navigational hashtags: #armknowledgesharing #armarticles
General hashtags: #interview #interviewpreparation #livecoding #leetcode #algorithms #algorithmsdatastructures #datastructures #python #sql #kaggle

@data_science_weekly
System Design
Learn how to design systems at scale and prepare for system design interviews

What is system design?
System design is the process of defining the architecture, interfaces, and data for a system that satisfies specific requirements. System design meets the needs of your business or organization through coherent and efficient systems. It requires a systematic approach to building and engineering systems. A good system design requires us to think about everything, from infrastructure all the way down to the data and how it's stored.

Table of contents

- Getting Started
What is system design?
- Chapter I
IP, OSI Model, TCP and UDP, Domain Name System (DNS), Load Balancing, Clustering, Caching, Content Delivery Network (CDN), Proxy, Availability, Scalability, Storage
- Chapter II
Databases and DBMS, SQL databases, NoSQL databases, SQL vs NoSQL databases, Database Replication, Indexes, Normalization and Denormalization, ACID and BASE consistency models, CAP theorem, PACELC Theorem, Transactions, Distributed Transactions, Sharding, Consistent Hashing, Database Federation
- Chapter III
N-tier architecture, Message Brokers, Message Queues, Publish-Subscribe, Enterprise Service Bus (ESB), Monoliths and Microservices, Event-Driven Architecture (EDA), Event Sourcing, Command and Query Responsibility Segregation (CQRS), API Gateway, REST, GraphQL, gRPC, Long polling, WebSockets, Server-Sent Events (SSE)
- Chapter IV
Geohashing and Quadtrees, Circuit breaker, Rate Limiting, Service Discovery, SLA, SLO, SLI, Disaster recovery, Virtual Machines (VMs) and Containers, OAuth 2.0 and OpenID Connect (OIDC), Single Sign-On (SSO), SSL, TLS, mTLS
- Chapter V
System Design Interviews, URL Shortener, WhatsApp, Twitter, Netflix, Uber
- Appendix
Next Steps, References

Links:
- Direct link to the site with the course
- Direct link to the repository for the course
- Content Guide link
- Topic Guide link

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #systemdesign

@data_science_weekly
Prompt Engineering Guide

Generative AI is the world's hottest buzzword, and they have created the most comprehensive (and free) guide on how to use it. This course is tailored to non-technical readers, who may not have even heard of AI, making it the perfect starting point if you are new to Generative AI and Prompt Engineering. Technical readers will find valuable insights within their later modules.

Generative AI refers to tools that can be used to create new content such as articles or images, just like humans can. It is expected to significantly change the way we work (read: your job may be affected). With so much buzz floating around about Generative AI (Gen AI) and Prompt Engineering (PE), it is hard to know what to believe.

They have scoured the internet to find the best techniques and tools for their 1.3 million readers from companies like OpenAI, Brex, and Deloitte. They are constantly refining their guide to ensure that they provide you with the latest information.

Links:
- Direct link to the site with the guide
- Content Guide link
- Topic Guide link

Navigational hashtags: #armknowledgesharing #armtutorial
General hashtags: #promptengineering #prompt #prompting #genai #generativeai

@data_science_weekly
Designing Machine Learning Systems by Chip Huyen

Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they're data dependent, with data varying wildly from one use case to the next. In this book, you'll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Author Chip Huyen, co-founder of Claypot AI, considers each design decision--such as how to process and create training data, which features to use, how often to retrain models, and what to monitor--in the context of how it can help your system as a whole achieve its objectives. The iterative framework in this book uses actual case studies backed by ample references.

This book will help you tackle scenarios such as:
- Engineering data and choosing the right metrics to solve a business problem
- Automating the process for continually developing, evaluating, deploying, and updating models
- Developing a monitoring system to quickly detect and address issues your models might encounter in production
- Architecting an ML platform that serves across use cases
- Developing responsible ML systems

Link: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearningsystemdesign #systemdesign #machinelearning #ml #designingmachinelearningsystems

@data_science_weekly
MLU-EXPLAIN
Visual explanations of core machine learning concepts

Machine Learning University (MLU) is an education initiative from Amazon designed to teach machine learning theory and practical application.

As part of that goal, MLU-Explain exists to teach important machine learning concepts through visual essays in a fun, informative, and accessible manner.

Available articles:
- Neural Networks
- Equality of Dots
- Logistic Regression
- Linear Regression
- Reinforcement Learning
- ROC & AUC
- Cross-validation
- Train, Test, and Validation Sets
- Precision & Recall
- Random Forest
- Decision Trees
- The Bias Variance Tradeoff
- Double Descent

Link:
- Direct Link

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #machinelearning #ml #visualisation

@data_science_weekly
Exceptional Resources for Data Science Interview Preparation. Part 2: Classic Machine Learning

In the previous article, I shared materials for preparing for one of the most daunting (for many) stages — Live Coding.

In this article, we will look at materials that can be used to prepare for the section on classic machine learning.

Table of contents
- Classic Machine Learning
- Resources
- Books
- Courses
- Sites
- Cheatsheets
- Other
- Let’s sum it up
- What’s next?

NB:
I'm the author of the article.
It was initially published in Russian (on habr.com), then I published it on medium.com. So, I recommend that Russian speakers read the Russian version and English speakers read the English version. Both will benefit from starring the repository, which will be maintained and updated as new resources become available.

Links:
- Medium (eng)
- Habr (rus)

Navigational hashtags: #armknowledgesharing #armarticles
General hashtags: #interview #interviewpreparation #machinelearning #ml

@data_science_weekly
Interpretable Machine Learning. A Guide for Making Black Box Models Explainable by Christoph Molnar

Machine learning has great potential for improving products, processes and research. But computers usually do not explain their predictions, which is a barrier to the adoption of machine learning. This book is about making machine learning models and their decisions interpretable.

After exploring the concepts of interpretability, you will learn about simple, interpretable models such as decision trees, decision rules and linear regression. The focus of the book is on model-agnostic methods for interpreting black box models such as feature importance and accumulated local effects, and explaining individual predictions with Shapley values and LIME. In addition, the book presents methods specific to deep neural networks.

All interpretation methods are explained in depth and discussed critically. How do they work under the hood? What are their strengths and weaknesses? How can their outputs be interpreted? This book will enable you to select and correctly apply the interpretation method that is most suitable for your machine learning project. Reading the book is recommended for machine learning practitioners, data scientists, statisticians, and anyone else interested in making machine learning models interpretable.
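
As a taste of the model-agnostic idea, here is a minimal sketch of permutation feature importance: shuffle one feature column and measure how much the model's error grows. The "black box" and the data are made up for illustration, and this is a simplified single-shuffle version rather than the book's exact procedure.

```python
import random


def mse(model, X, y):
    """Mean squared error of the model's predictions on (X, y)."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, feature, seed=0):
    """Increase in error after shuffling one feature column."""
    rng = random.Random(seed)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    # Rebuild each row with the shuffled value in place of the original one.
    X_perm = [
        row[:feature] + [value] + row[feature + 1:]
        for row, value in zip(X, column)
    ]
    return mse(model, X_perm, y) - mse(model, X, y)


# Hypothetical "black box": it only ever looks at feature 0.
def black_box(row):
    return 3.0 * row[0]


# Made-up data: feature 0 drives the target, feature 1 is ignored by the model.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [black_box(row) for row in X]

for feature in (0, 1):
    print(feature, permutation_importance(black_box, X, y, feature))
```

The method only needs to call the model, never to inspect it, which is what makes it model-agnostic. Shuffling the ignored feature changes nothing, so its importance is zero.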

Link:
- Direct Link

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #interpretation #explanation #interpretability #blackbox

@data_science_weekly
Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal and Cheng Soon Ong

The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines.

For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts.

Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book's web site.

Table of Contents
Part I: Mathematical Foundations
1. Introduction and Motivation
2. Linear Algebra
3. Analytic Geometry
4. Matrix Decompositions
5. Vector Calculus
6. Probability and Distributions
7. Continuous Optimization
Part II: Central Machine Learning Problems
8. When Models Meet Data
9. Linear Regression
10. Dimensionality Reduction with Principal Component Analysis
11. Density Estimation with Gaussian Mixture Models
12. Classification with Support Vector Machines

Link: Direct Link

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #math #mathematics #maths #calculus #algebra #probability #geometry #optimization #machinelearning #ml

@data_science_weekly
The Pragmatic Engineer

The #1 technology newsletter on Substack. Highly relevant for software engineers and engineering managers, useful for those working in tech. Written by engineering manager and software engineer Gergely Orosz, who was previously at Uber, Skype/Microsoft, and at startups.

What to expect:
- Big Tech and startups, from the inside. Tech is accelerating rapidly, but some fast-moving companies are ahead of the pack. What are they doing differently and why? He talks with people working at these companies to get insights and details.
- Actionable advice for engineering managers, software engineers and tech workers. Topics covered are relevant to those working at tech companies. Get tools and insights to become a more efficient engineering leader. If you use just one approach to make your project, team, or company more efficient, the weekly newsletter already pays for itself.
- A pulse on the tech market and trends worth knowing about. What is happening in tech, and why? How is the market changing? What does this mean for hiring managers and for those navigating their careers? He covers patterns and trends heard within Big Tech and high-growth startups in the series The Pulse.

Link: Direct Link

Navigational hashtags: #armknowledgesharing #armnewsletters
General hashtags: #technology #engineering #efficiency

@data_science_weekly
MLOps Guide by Arthur Olga, Gabriel Monteiro, Guilherme Leite and Vinicius Lima

This site is intended to be an MLOps guide that helps projects and companies build a more reliable MLOps environment. It covers the theory behind MLOps and an implementation that should fit most use cases.

What is MLOps?
MLOps is a methodology of operation that aims to facilitate the process of bringing an experimental Machine Learning model into production and maintaining it efficiently. MLOps focuses on bringing the methodology of DevOps used in the software industry to the Machine Learning model lifecycle.

In that way we can define some of the main features of an MLOps project:
- Data and Model Versioning
- Feature Management and Storing
- Automation of Pipelines and Processes
- CI/CD for Machine Learning
- Continuous Monitoring of Models

What does this guide cover?
- Introduction to MLOps Concepts
- Tutorial for Building an MLOps Environment

Link: Direct

Navigational hashtags: #armknowledgesharing #armguides
General hashtags: #mlops #ml #operations

@data_science_weekly
Lessons in Statistical Thinking by Daniel Kaplan

One of the oft-stated goals of education is the development of “critical thinking” skills. Although it is rare to see a careful definition of critical thinking, widely accepted elements include framing and recognizing coherent arguments, the application of logic patterns such as deduction, the skeptical evaluation of evidence, consideration of alternative explanations, and a disinclination to accept unsubstantiated claims.

“Statistical thinking” is a variety of critical thinking involving data and inductive reasoning directed to draw reasonable and useful conclusions that can guide decision-making and action.

Surprisingly, many university statistics courses are not primarily about statistical reasoning. They do cover some technical methods used in statistical reasoning, but they have replaced notions of “useful,” “decision-making,” and “action” with doctrines such as “null hypothesis significance testing” and “correlation is not causation.” For example, a core method for drawing responsible conclusions about causal relationships by adjusting for “covariates” is hardly ever even mentioned in conventional statistics courses.

These Lessons in Statistical Thinking present the statistical ideas and methods behind decision-making to guide action.
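
Why adjusting for a covariate matters can be shown with a few lines of arithmetic. The counts below are made up for illustration, and stratification is only one simple way to adjust; the book may well use other techniques, such as regression.

```python
# Made-up counts: (severity, treated) -> (recovered, total)
counts = {
    ("mild", True): (18, 20),
    ("mild", False): (64, 80),
    ("severe", True): (40, 80),
    ("severe", False): (6, 20),
}


def naive_effect(counts):
    """Difference in recovery rates, ignoring the covariate."""
    def pooled(treated):
        rec = sum(r for (_, t), (r, n) in counts.items() if t == treated)
        tot = sum(n for (_, t), (r, n) in counts.items() if t == treated)
        return rec / tot
    return pooled(True) - pooled(False)


def adjusted_effect(counts):
    """Average the within-stratum differences, weighted by stratum size."""
    strata = sorted({s for s, _ in counts})
    grand_total = sum(n for _, n in counts.values())
    effect = 0.0
    for s in strata:
        r1, n1 = counts[(s, True)]
        r0, n0 = counts[(s, False)]
        weight = (n1 + n0) / grand_total
        effect += weight * (r1 / n1 - r0 / n0)
    return effect


print(round(naive_effect(counts), 2))     # -0.12
print(round(adjusted_effect(counts), 2))  # 0.15
```

In this toy data the treatment looks harmful overall but helpful within each severity group, because severe cases were treated far more often. Comparing like with like reverses the naive conclusion.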

Link: Direct Link

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #stats #statistics #math #maths

@data_science_weekly
Subscribe to my Substack!

All the new articles will be there and the first one is already available!