Artem Ryblov’s Data Science Weekly
@artemfisherman’s Data Science Weekly: Elevate your expertise with a standout data science resource each week, carefully chosen for depth and impact.

Long-form content: https://artemryblov.substack.com
Applied Geospatial Data Science with Python: Leverage geospatial data analysis and modeling to find unique solutions to environmental problems by David S. Jordan

Key Features
- Learn how to integrate spatial data and spatial thinking into traditional data science workflows
- Develop a spatial perspective and learn to avoid common pitfalls along the way
- Gain expertise through practical case studies applicable in a variety of industries with code samples that can be reproduced and expanded

Table of Contents
1. Introducing Geographic Information Systems and Geospatial Data Science
2. What Is Geospatial Data and Where Can I Find It?
3. Working with Geographic and Projected Coordinate Systems
4. Exploring Geospatial Data Science Packages
5. Exploratory Data Visualization
6. Hypothesis Testing and Spatial Randomness
7. Spatial Feature Engineering
8. Spatial Clustering and Regionalization
9. Developing Spatial Regression Models
10. Developing Solutions for Spatial Optimization Problems
11. Advanced Topics in Spatial Data Science

Links:
- Amazon
- Packt
- GitHub

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #datascience #geo #geospatial

@data_science_weekly
Introduction to Machine Learning (I2ML) by LMU Munich

This website offers an open and free introductory course on (supervised) machine learning. The course is designed to be as self-contained as possible and enables self-study through lecture videos, PDF slides, cheatsheets, quizzes, exercises (with solutions), and notebooks.

The extensive material can roughly be divided into:
- An introductory undergraduate part (chapters 1-10)
- A more advanced second part at MSc level (chapters 11-19)
- A third part, also at MSc level (chapters 20-23)

A key goal of the course is to teach the fundamental building blocks behind ML, instead of introducing “yet another algorithm with yet another name”. We discuss, compare, and contrast risk minimization, statistical parameter estimation, the Bayesian viewpoint, and information theory and demonstrate that all of these are equally valid entry points to ML. Developing the ability to take on and switch between these perspectives is a major goal of this course, and in our opinion not always ideally presented in other courses.

Link:
- Main Course Website

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #ml #machinelearning #supervised

@data_science_weekly
Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos

This textbook is intended to provide a comprehensive introduction to forecasting methods and to present enough information about each method for readers to be able to use them sensibly. The authors don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details.

The book is written for three audiences:
(1) people finding themselves doing forecasting in business when they may not have had any formal training in the area;
(2) undergraduate students studying business;
(3) MBA students doing a forecasting elective. The authors use it themselves for master's students and third-year undergraduate students at Monash University, Australia.

For most sections, the authors only assume that readers are familiar with introductory statistics and with high-school algebra. There are a couple of sections that also require knowledge of matrices, but these are flagged.

At the end of each chapter the authors provide a list of “further reading”. In general, these lists comprise suggested textbooks that provide a more advanced or detailed treatment of the subject. Where there is no suitable textbook, the authors suggest journal articles that provide more information.

Link: Book Website

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #forecasting #timeseries #ts

@data_science_weekly
The Cartoon Guide to Statistics by Larry Gonick, Woollcott Smith

The Cartoon Guide to Statistics covers all the central ideas of modern statistics: the summary and display of data, probability in gambling and medicine, random variables, Bernoulli Trials, the Central Limit Theorem, hypothesis testing, confidence interval estimation, and much more, all explained in simple, clear, and yes, funny illustrations. Never again will you order the Poisson Distribution in a French restaurant!

Links:
- Amazon
- Internet Archive

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #statistics #stats #probability

@data_science_weekly
Practitioners guide to MLOps: A framework for continuous delivery and automation of machine learning by Google Cloud

Across industries, DevOps and DataOps have been widely adopted as methodologies to improve quality and reduce the time to market of software engineering and data engineering initiatives. With the rapid growth in machine learning (ML) systems, similar approaches need to be developed in the context of ML engineering, which handle the unique complexities of the practical applications of ML. This is the domain of MLOps. MLOps is a set of standardized processes and technology capabilities for building, deploying, and operationalizing ML systems rapidly and reliably.

The document is in two parts. The first part, an overview of the MLOps lifecycle, is for all readers. It introduces MLOps processes and capabilities and why they’re important for successful adoption of ML-based systems.

The second part is a deep dive on the MLOps processes and capabilities. This part is for readers who want to understand the concrete details of tasks like running a continuous training pipeline, deploying a model, and monitoring predictive performance of an ML model.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #mlops

@data_science_weekly
CS324 - Large Language Models by Stanford University

The field of natural language processing (NLP) has been transformed by massive pre-trained language models. They form the basis of all state-of-the-art systems across a wide range of tasks and have shown an impressive ability to generate fluent text and perform few-shot learning. At the same time, these models are hard to understand and give rise to new ethical and scalability challenges. In this course, students will learn the fundamentals about the modeling, theory, ethics, and systems aspects of large language models, as well as gain hands-on experience working with them.

TABLE OF CONTENTS
- Introduction
- Capabilities
- Harms I
- Harms II
- Data
- Security
- Legality
- Modeling
- Training
- Parallelism
- Scaling laws
- Selective architectures
- Adaptation
- Environmental impact

Link: Course

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #nlp #llm #transformer

@data_science_weekly
Deep Learning with Python by François Chollet

Deep Learning with Python, Second Edition introduces the field of deep learning using Python and the powerful Keras library. In this revised and expanded new edition, Keras creator François Chollet offers insights for both novice and experienced machine learning practitioners. As you move through this book, you’ll build your understanding through intuitive explanations, crisp color illustrations, and clear examples. You’ll quickly pick up the skills you need to start developing deep-learning applications.

What's inside:
- Deep learning from first principles
- Image classification and image segmentation
- Time series forecasting
- Text classification and machine translation
- Text generation, neural style transfer, and image generation
- Printed in full color throughout

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #dl #deeplearning #keras

@data_science_weekly
Competitive Programmer’s Handbook by Antti Laaksonen

The purpose of this book is to give you a thorough introduction to competitive programming. It is assumed that you already know the basics of programming, but no previous background in competitive programming is needed.

The book is especially intended for students who want to learn algorithms and possibly participate in the International Olympiad in Informatics (IOI) or in the International Collegiate Programming Contest (ICPC). Of course, the book is also suitable for anybody else interested in competitive programming.

It takes a long time to become a good competitive programmer, but it is also an opportunity to learn a lot. You can be sure that you will get a good general understanding of algorithms if you spend time reading the book, solving problems and taking part in contests.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #leetcode #programming #competitiveprogramming

@data_science_weekly
The Querynomicon. An Introduction to SQL for Weary Data Scientists

Upon first encountering SQL after two decades of Fortran, C, Java, and Python, the author thought he had stumbled into hell. He quickly realized that was optimistic: after all, hell has rules.

The author has since realized that SQL does too, and that they are no more confusing or contradictory than those of most other programming languages. They only appear so because SQL draws on a tradition unfamiliar to those of us raised with derivatives of C. To quote Terry Pratchett, it is not mad, just differently sane.

Welcome, then, to a world in which the strange will become familiar, and the familiar, strange. Welcome, thrice welcome, to SQL.

Table of contents:

1. Introduction
2. Core Features
3. Tools
4. Advanced Features
5. Python
6. R
7. PostgreSQL
8. Conclusion

Link: Tutorial

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #sql

@data_science_weekly
How to Win a Kaggle Competition by Darek Kłeczek

Darek Kłeczek:
When I join a competition, I research winning solutions from past similar competitions. It takes a lot of time to read and digest them, but it's an incredible source of ideas and knowledge. But what if we could learn from all the competitions? We've been given a list of Kaggle writeups in this competition, but there are so many of them! If only we could find a way to extract some structured data and analyze it... Well, it turns out that large language models (LLMs) [1] can help us extract structured data from unstructured writeups.


In this essay, the author starts by providing a quick overview of the process he uses to collect data. He then presents several insights from analyzing the resulting datasets. The focus is to understand what the community has learned over the past 2 years of working and experimenting with Kaggle competitions. Finally, he mentions some ideas for future research.

Link: Kaggle

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #kaggle #competitions

@data_science_weekly
MACHINE LEARNING @ Vrije Universiteit Amsterdam

This page contains all public information about the course Machine Learning at the VU University Amsterdam.

They provide the following materials:
- Lecture slides and videos.
- Worksheets. These are very brief Jupyter notebooks to help you get the software installed and to show the basics. They introduce the libraries Numpy, Matplotlib, Pandas, Sklearn and Keras.
- Homework. The homework consists of small pen-and-paper exercises to help you test that you’ve really understood the more technical points of the lectures. Answers are provided.

If you are a registered student, please refer to the Canvas page instead. All material authored by Peter Bloem unless noted otherwise.

Link: Site

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #machinelearning #ml #dl #deeplearning

@data_science_weekly
Spark in Action by Jean-Georges Perrin

Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #spark #bigdata #sql

@data_science_weekly
ML and LLM system design: 500 case studies to learn from

How do companies like Netflix, Airbnb, and DoorDash apply AI to improve their products and processes? We put together a database of 500 case studies from 100+ companies that share practical ML use cases, including applications built with LLMs and Generative AI, and learnings from designing ML and LLM systems.

Navigation tips. You can play around with the database by filtering case studies by industry or ML use case. We added tags based on recurring themes. This is not a perfect or mutually exclusive division, but you can use the tags to quickly find:
- Generative AI use cases. Look for tags “generative AI” and “LLM” to find examples of real-world LLM applications.
- ML systems with different data types: computer vision (CV) or natural language processing (NLP).
- ML systems for specific use cases. The most popular are recommender systems, search and ranking, and fraud detection.
- We also labeled use cases where ML powers a specific user-facing "product feature": from grammatical error correction to generating outfit combinations.

Link: Site

Navigational hashtags: #armknowledgesharing #armsites
General hashtags: #mlsystemdesign #ml #systemdesign #llm

@data_science_weekly
Data Structures & Algorithms by Google

Familiarize yourself with common data structures and algorithms such as lists, trees, maps, graphs, Big-O analysis, and more!

Topics:
- Maps/Dictionaries
- Linked Lists
- Trees
- Stacks & Queues
- Heaps
- Graphs
- Runtime Analysis
- Searching & Sorting
- Recursion & DP

Link: Site

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #algorithms #leetcode #programming

@data_science_weekly
Feature Engineering A-Z by Emil Hvitfeldt

This book is written to be used as a reference guide to nearly all feature engineering methods you will encounter. This is reflected in the chapter structure. Any question a practitioner has should be answerable by looking at the index and finding the right chapter.

Each section tries to be as comprehensive as possible in the number of different methods and solutions it presents. A section on dimensionality reduction should list all the practical methods that could be used, as well as a comparison between the methods to help the reader decide what would be most appropriate. This does not mean that all methods are recommended. A number of these methods have few, narrow use cases. Methods that are deemed too domain-specific have been excluded from this book.

Each chapter will cover a specific method or small group of methods. This will include motivations and explanations for the method. Whenever possible each method will be accompanied by mathematical formulas and visualizations to illustrate the mechanics of the method. A small pros and cons list is provided for each method. Lastly, each section will include code snippets showcasing how to implement the methods. This is done in R and Python, using tidymodels and scikit-learn respectively. This book is a methods book first, and a coding book second.

Links:
- Site
- Repository

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ml #machinelearning #featureengineering

@data_science_weekly
The Little Book of Deep Learning by François Fleuret

Although the bulk of deep learning is not difficult to understand, it combines diverse components such as linear algebra, calculus, probabilities, optimization, signal processing, programming, algorithmics, and high-performance computing, making it complicated to learn.

Instead of trying to be exhaustive, this little book is limited to the background necessary to understand a few important models. This proved to be a popular approach, resulting in more than 500,000 downloads of the PDF file in the 12 months following its announcement on Twitter.

Link: Site

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #deeplearning #dl

@data_science_weekly
Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:
- Learn Python, SQL, Scala, or Java high-level Structured APIs
- Understand Spark operations and SQL Engine
- Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
- Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #bigdata #spark #pyspark

@data_science_weekly
Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang, and Ya Xu

Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests.

Based on practical experiences at companies that each run more than 20,000 controlled experiments a year, the authors share examples, pitfalls, and advice for students and industry professionals getting started with experiments, plus deeper dives into advanced topics for practitioners who want to improve the way they make data-driven decisions.

Learn how to:
- Use the scientific method to evaluate hypotheses using controlled experiments.
- Define key metrics and ideally an Overall Evaluation Criterion.
- Test for trustworthiness of the results and alert experimenters to violated assumptions.
- Build a scalable platform that lowers the marginal cost of experiments close to zero.
- Avoid pitfalls like carryover effects and Twyman's law.
- Understand how statistical issues play out in practice.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ab #statistics #abtests

@data_science_weekly
The Kaggle Book by Konrad Banachewicz and Luca Massaron

Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with an amazing community of data scientists, and gain valuable experience to help grow your career.

The first book of its kind, The Kaggle Book assembles in one place the techniques and skills you'll need for success in competitions, data science projects, and beyond. Two Kaggle Grandmasters walk you through modeling strategies you won't easily find elsewhere, and the knowledge they've accumulated along the way. As well as Kaggle-specific tips, you'll learn more general techniques for approaching tasks based on image, tabular, textual data, and reinforcement learning. You'll design better validation schemes and work more comfortably with different evaluation metrics.

Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ml #machinelearning #featureengineering #kaggle #metrics #validation #hyperparameters #tabular #cv #nlp

@data_science_weekly
Grokking Algorithms. An illustrated guide for programmers and other curious people by Aditya Y. Bhargava

Grokking Algorithms is a friendly take on this core computer science topic. In it, you'll learn how to apply common algorithms to the practical programming problems you face every day.

You'll start with tasks like sorting and searching. As you build up your skills, you'll tackle more complex problems like data compression and artificial intelligence. Each carefully presented example includes helpful diagrams and fully annotated code samples in Python.

By the end of this book, you will have mastered widely applicable algorithms as well as how and when to use them.

Table of Contents:
1. Introduction to algorithms
2. Selection sort
3. Recursion
4. Quicksort
5. Hash tables
6. Breadth-first search
7. Dijkstra's algorithm
8. Greedy algorithms
9. Dynamic programming
10. K-nearest neighbors

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #algorithms #datastructures #leetcode

@data_science_weekly