Data Science Jupyter Notebooks
8.25K subscribers
80 photos
22 videos
9 files
195 links
Explore the world of Data Science through Jupyter Notebooks: insights, tutorials, and tools to boost your data journey. Code, analyze, and visualize smarter with every post.
🎁⏳ These 6 steps make every future post on LLMs instantly clear and meaningful.

Learn exactly where Web Scraping, Tokenization, RLHF, Transformer Architectures, ONNX Optimization, Causal Language Modeling, Gradient Clipping, Adaptive Learning, Supervised Fine-Tuning, RLAIF, TensorRT Inference, and more fit into the LLM pipeline.

﹌﹌﹌﹌﹌﹌﹌﹌﹌

》 Building LLMs: The 6 Essential Steps

✸ 1️⃣ Data Collection (Web Scraping & Curation)

☆ Web Scraping: Gather data from books, research papers, Wikipedia, GitHub, Reddit, and more using Scrapy, BeautifulSoup, Selenium, and APIs.

☆ Filtering & Cleaning: Remove duplicates, spam, and broken HTML, and filter out biased, copyrighted, or inappropriate content.

☆ Dataset Structuring: Tokenize text using BPE or Unigram (for example via SentencePiece); add metadata like source, timestamp, and quality rating.
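
The filtering and cleaning step can be illustrated with a minimal sketch that uses only the Python standard library. It strips tags with a simple regular expression, unescapes HTML entities, and drops exact duplicates by hashing the normalized text. The function names `clean_text` and `deduplicate` are hypothetical, and a real pipeline would likely use a proper HTML parser and near-duplicate detection instead.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # crude tag matcher, good enough for a sketch
WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, unescape entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    unescaped = html.unescape(no_tags)
    return WS_RE.sub(" ", unescaped).strip()


def deduplicate(docs):
    """Keep the first occurrence of each document (case-insensitive, after cleaning)."""
    seen = set()
    unique = []
    for doc in docs:
        text = clean_text(doc)
        if not text:
            continue  # nothing left after cleaning, e.g. markup-only pages
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


raw_docs = [
    "<p>Hello &amp; welcome</p>",
    "hello   &amp; welcome",      # duplicate once cleaned and lowercased
    "<div><br></div>",            # empty once tags are stripped
    "A second document.",
]
cleaned = deduplicate(raw_docs)
print(cleaned)
```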

✸ 2️⃣ Preprocessing & Tokenization

☆ Tokenization: Convert text into numerical tokens using SentencePiece or GPT's BPE tokenizer.

☆ Data Formatting: Structure datasets into JSON, TFRecord, or Hugging Face formats; use sharding for parallel processing.
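
To make the BPE idea concrete, here is a toy sketch of how merges are learned: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a simplified illustration (whitespace-split words, no end-of-word marker, ties broken by first occurrence), not how SentencePiece or GPT's tokenizer is actually implemented; `learn_bpe` and its helpers are hypothetical names.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, vocab = learn_bpe("low low low lower lowest", 2)
print(merges)
print(vocab)
```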

✸ 3️⃣ Model Architecture & Pretraining

☆ Architecture Selection: Choose a Transformer-based model (GPT, T5, LLaMA, Falcon) and define parameter size (7B–175B).

☆ Compute & Infrastructure: Train on GPUs/TPUs (A100, H100, TPU v4/v5) with PyTorch, JAX, DeepSpeed, and Megatron-LM.

☆ Pretraining: Use Causal Language Modeling (CLM) with Cross-Entropy Loss, Gradient Checkpointing, and Parallelization (FSDP, ZeRO).

☆ Optimizations: Apply Mixed Precision (FP16/BF16), Gradient Clipping, and Adaptive Learning Rate Schedulers for efficiency.
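
Two of the ideas above fit in a few lines of plain Python: the causal language modeling loss (position t is scored on how well it predicts token t+1) and gradient clipping by global norm. This is a dependency-free sketch for intuition only; real training would use a framework such as PyTorch or JAX, and the function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(logits_per_position, token_ids):
    """Mean cross-entropy where the logits at position t predict token t+1."""
    targets = token_ids[1:]              # shift targets left by one
    preds = logits_per_position[:-1]     # the last position has no target
    losses = [-math.log(softmax(logits)[target])
              for logits, target in zip(preds, targets)]
    return sum(losses) / len(losses)


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]   # untrained model, vocab of 3
loss = causal_lm_loss(uniform_logits, tokens)         # equals ln(3) for uniform predictions
clipped = clip_by_global_norm([3.0, 4.0], 1.0)        # norm 5 is rescaled to norm 1
print(loss, clipped)
```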

✸ 4️⃣ Model Alignment (Fine-Tuning & RLHF)

☆ Supervised Fine-Tuning (SFT): Train on high-quality human-annotated datasets (InstructGPT, Alpaca, Dolly).

☆ Reinforcement Learning from Human Feedback (RLHF): Generate responses, have humans rank the outputs, train a Reward Model on those rankings, and refine the policy using Proximal Policy Optimization (PPO).

☆ Safety & Constitutional AI: Apply RLAIF, adversarial training, and bias filtering.
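
The reward-model step is often trained with a pairwise ranking loss: for each pair of responses, the preferred one should receive a higher score. Below is a small sketch of that loss, -log(sigmoid(r_chosen - r_rejected)), written in plain Python. The scores are made-up numbers standing in for a reward model's outputs, and `pairwise_reward_loss` is a hypothetical name.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical reward scores for (chosen, rejected) response pairs.
good_ranking = pairwise_reward_loss(2.0, 0.0)   # chosen scored higher: small loss
bad_ranking = pairwise_reward_loss(0.0, 2.0)    # chosen scored lower: large loss
tie = pairwise_reward_loss(1.0, 1.0)            # equal scores: loss is ln(2)
print(good_ranking, bad_ranking, tie)
```

Minimizing this loss pushes the reward model to score human-preferred responses above rejected ones; PPO then optimizes the policy against that learned reward.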

✸ 5️⃣ Deployment & Optimization

☆ Compression & Quantization: Reduce model size with GPTQ, AWQ, LLM.int8(), and Knowledge Distillation.

☆ API Serving & Scaling: Deploy with vLLM, Triton Inference Server, TensorRT, ONNX, and Ray Serve for efficient inference.

☆ Monitoring & Continuous Learning: Track performance, latency, and hallucinations.
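
As a rough intuition for quantization, here is a sketch of the simplest scheme, symmetric absmax int8 quantization of a weight vector. GPTQ, AWQ, and LLM.int8() are considerably more sophisticated; this only shows the basic trade of precision for size. The function names are hypothetical.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most about half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```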

✸ 6️⃣ Evaluation & Benchmarking

☆ Performance Testing: Validate using HumanEval, HELM, OpenAI Evals, MMLU, ARC, and MT-Bench.
≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣
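
Code benchmarks such as HumanEval are commonly reported with the pass@k metric: the probability that at least one of k sampled solutions passes the tests. A sketch of the usual unbiased estimator, given n samples of which c passed, is shown below; the function name and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes) from n samples with c passing."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results: 10 samples for one problem, 3 of them pass.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(p1, p5)
```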

https://t.iss.one/DataScienceM ⭐️
html-to-markdown

A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork of markdownify with a modernized codebase, strict type safety and support for Python 3.9+.

Features:
โญ๏ธ Full HTML5 Support: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
โญ๏ธ Enhanced Table Support: Advanced handling of merged cells with rowspan/colspan support for better table representation
โญ๏ธ Type Safety: Strict MyPy adherence with comprehensive type hints
Metadata Extraction: Automatic extraction of document metadata (title, meta tags) as comment headers
โญ๏ธ Streaming Support: Memory-efficient processing for large documents with progress callbacks
โญ๏ธ Highlight Support: Multiple styles for highlighted text (<mark> elements)
โญ๏ธ Task List Support: Converts HTML checkboxes to GitHub-compatible task list syntax

Installation
pip install html-to-markdown

Optional lxml Parser
For improved performance, you can install with the optional lxml parser:
pip install html-to-markdown[lxml]

The lxml parser offers:

⭐️ ~30% faster HTML parsing compared to the default html.parser
⭐️ Better handling of malformed HTML
⭐️ More robust parsing for complex documents

Quick Start
Convert HTML to Markdown with a single function call:
from html_to_markdown import convert_to_markdown

html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample Document</title>
<meta name="description" content="A sample HTML document">
</head>
<body>
<article>
<h1>Welcome</h1>
<p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
<p>Here's some <mark>highlighted text</mark> and a task list:</p>
<ul>
<li><input type="checkbox" checked> Completed task</li>
<li><input type="checkbox"> Pending task</li>
</ul>
</article>
</body>
</html>
"""

markdown = convert_to_markdown(html)
print(markdown)


Working with BeautifulSoup:

If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:
from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown

# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)


GitHub: https://github.com/Goldziher/html-to-markdown

https://t.iss.one/DataScienceN ⭐️
LangExtract

A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive visualization.

GitHub: https://github.com/google/langextract

https://t.iss.one/DataScience4
This channel is for programmers, coders, and software engineers.

0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages

✅ https://t.iss.one/addlist/8_rRW2scgfRhOTc0

✅ https://t.iss.one/Codeprogrammer
✅ Open-source alternative to Perplexity.

✅ Real-time web search with Firecrawl API
✅ Advanced answers with GPT-4o-mini
✅ Every sentence with reference and source
✅ Automatic stock display with TradingView


┌ 🔍 Fireplexity
├ Website
└ 🐱 GitHub repo

https://t.iss.one/DataScienceN 🌟
🧱 AI now generates worlds in the style of Minecraft: presenting the GameFactory model

Researchers trained the model on 70 hours of Minecraft gameplay and achieved impressive results:
GameFactory can create procedural game worlds, from volcanoes to cherry blossom forests, just like in the iconic simulator.

🔥 Want your own endless world? Just set the parameters.

🟠 Examples and code at the link: https://yujiwen.github.io/gamefactory/

🟠 GitHub: https://github.com/KwaiVGI/GameFactory

https://t.iss.one/DataScienceN 🌟
python-docx: Create and Modify Word Documents #python

python-docx is a Python library for reading, creating, and updating Microsoft Word 2007+ (.docx) files.

Installation
pip install python-docx

Example
>>> from docx import Document

>>> document = Document()
>>> document.add_paragraph("It was a dark and stormy night.")
<docx.text.paragraph.Paragraph object at 0x10f19e760>
>>> document.save("dark-and-stormy.docx")

>>> document = Document("dark-and-stormy.docx")
>>> document.paragraphs[0].text
'It was a dark and stormy night.'

https://t.iss.one/DataScienceN
Data scientists, this is for you: I dug up a LeetCode for data science.

DataLemur is a platform that collects real interview problems from Tesla, Facebook, Twitter, Microsoft, and other top companies.

Inside: practical tasks on SQL, statistics, Python, and ML. You can filter by difficulty level and company.

Top-notch for anyone preparing for Data Scientist / Data Analyst interviews. Get it here.

👉 https://t.iss.one/DataScienceN
🔥 Trending Repository: Deep-Learning-Roadmap

📝 Description: 📡 Organized Resources for Deep Learning Researchers and Developers

🔗 Repository URL: https://github.com/astorfi/Deep-Learning-Roadmap

🌐 Website: https://machinelearningmindset.com/deep-learning-resources/

📖 Readme: https://github.com/astorfi/Deep-Learning-Roadmap#readme

📊 Statistics:
🌟 Stars: 3.2K stars
👀 Watchers: 144
🍴 Forks: 314 forks

💻 Programming Languages: Python

🏷️ Related Topics:
#reinforcement_learning #deep_learning


==================================
🧠 By: https://t.iss.one/DataScienceM
โค1
🔥 Trending Repository: awesome-transformer-nlp

📝 Description: A curated list of NLP resources focused on Transformer networks, attention mechanism, GPT, BERT, ChatGPT, LLMs, and transfer learning.

🔗 Repository URL: https://github.com/cedrickchee/awesome-transformer-nlp

📖 Readme: https://github.com/cedrickchee/awesome-transformer-nlp#readme

📊 Statistics:
🌟 Stars: 1.1K stars
👀 Watchers: 41
🍴 Forks: 131 forks

💻 Programming Languages: Not available

🏷️ Related Topics:
#nlp #natural_language_processing #awesome #transformer #neural_networks #awesome_list #llama #transfer_learning #language_model #attention_mechanism #bert #gpt_2 #xlnet #pre_trained_language_models #gpt_3 #gpt_4 #chatgpt


==================================
🧠 By: https://t.iss.one/DataScienceM
🔥 Trending Repository: SemanticSegmentation_DL

📝 Description: Resources for semantic segmentation based on deep learning models

🔗 Repository URL: https://github.com/tangzhenyu/SemanticSegmentation_DL

📖 Readme: https://github.com/tangzhenyu/SemanticSegmentation_DL#readme

📊 Statistics:
🌟 Stars: 1.1K stars
👀 Watchers: 77
🍴 Forks: 315 forks

💻 Programming Languages: Jupyter Notebook - Python - Shell - sed

🏷️ Related Topics: Not available

==================================
🧠 By: https://t.iss.one/DataScienceM
🔥 Trending Repository: awesome-jetpack-compose-learning-resources

📝 Description: 👓 A continuously updated list of resources for learning Jetpack Compose for Android apps.

🔗 Repository URL: https://github.com/androiddevnotes/awesome-jetpack-compose-learning-resources

📖 Readme: https://github.com/androiddevnotes/awesome-jetpack-compose-learning-resources#readme

📊 Statistics:
🌟 Stars: 1.4K stars
👀 Watchers: 41
🍴 Forks: 140 forks

💻 Programming Languages: Kotlin

🏷️ Related Topics:
#android #kotlin #awesome #mvvm #android_architecture #compose #beginner_friendly #android_apps #hacktoberfest #coroutines_android #mvvm_android #android_jetpack #first_issue #jetpack_android #learn_android #jetpack_compose #hacktoberfest2020 #android_compose #awesome_android


==================================
🧠 By: https://t.iss.one/DataScienceM
โค1
🔥 Trending Repository: awesome-learning

📝 Description: A curated list of DevOps learning resources. Join the Slack channel to discuss more.

🔗 Repository URL: https://github.com/Lets-DevOps/awesome-learning

📖 Readme: https://github.com/Lets-DevOps/awesome-learning#readme

📊 Statistics:
🌟 Stars: 920 stars
👀 Watchers: 43
🍴 Forks: 310 forks

💻 Programming Languages: Not available

🏷️ Related Topics:
#infrastructure #learning #devops


==================================
🧠 By: https://t.iss.one/DataScienceN
🔥 Trending Repository: Machine-Learning-Tutorials

📝 Description: Machine learning and deep learning tutorials, articles, and other resources

🔗 Repository URL: https://github.com/ujjwalkarn/Machine-Learning-Tutorials

🌐 Website: https://ujjwalkarn.github.io/Machine-Learning-Tutorials

📖 Readme: https://github.com/ujjwalkarn/Machine-Learning-Tutorials#readme

📊 Statistics:
🌟 Stars: 16.6K stars
👀 Watchers: 797
🍴 Forks: 3.9K forks

💻 Programming Languages: Not available

🏷️ Related Topics:
#list #machine_learning #awesome #deep_neural_networks #deep_learning #neural_network #neural_networks #awesome_list #machinelearning #deeplearning #deep_learning_tutorial


==================================
🧠 By: https://t.iss.one/DataScienceN
โค2
🔥 Trending Repository: awesome-recursion-schemes

📝 Description: Resources for learning and using recursion schemes.

🔗 Repository URL: https://github.com/passy/awesome-recursion-schemes

📖 Readme: https://github.com/passy/awesome-recursion-schemes#readme

📊 Statistics:
🌟 Stars: 1.3K stars
👀 Watchers: 44
🍴 Forks: 56 forks

💻 Programming Languages: Not available

🏷️ Related Topics:
#awesome #recursion_schemes #catamorphisms


==================================
🧠 By: https://t.iss.one/DataScienceN
โค1
🔥 Trending Repository: awesome-deeplearning-resources

📝 Description: Deep learning and deep reinforcement learning research papers and some code

🔗 Repository URL: https://github.com/endymecy/awesome-deeplearning-resources

📖 Readme: https://github.com/endymecy/awesome-deeplearning-resources#readme

📊 Statistics:
🌟 Stars: 2.9K stars
👀 Watchers: 221
🍴 Forks: 666 forks

💻 Programming Languages: Not available

🏷️ Related Topics:
#nlp #video #reinforcement_learning #deep_learning #neural_network #code #paper #corpus #modelzoo


==================================
🧠 By: https://t.iss.one/DataScienceN
🔥 Trending Repository: Machine_Learning_Resources

📝 Description: 🐟🐟🐟 Machine learning interview review resources

🔗 Repository URL: https://github.com/wangyuGithub01/Machine_Learning_Resources

📖 Readme: https://github.com/wangyuGithub01/Machine_Learning_Resources#readme

📊 Statistics:
🌟 Stars: 1.2K stars
👀 Watchers: 10
🍴 Forks: 179 forks

💻 Programming Languages: Not available

🏷️ Related Topics: Not available

==================================
🧠 By: https://t.iss.one/DataScienceN
🔥 Trending Repository: Awesome-Meta-Learning

📝 Description: A curated list of Meta Learning papers, code, books, blogs, videos, datasets and other resources.

🔗 Repository URL: https://github.com/sudharsan13296/Awesome-Meta-Learning

📖 Readme: https://github.com/sudharsan13296/Awesome-Meta-Learning#readme

📊 Statistics:
🌟 Stars: 1.5K stars
👀 Watchers: 68
🍴 Forks: 298 forks

💻 Programming Languages: Not available

🏷️ Related Topics:
#one_shot_learning #zero_shot_learning #metalearning #few_shot_learning #deep_meta_learning #meta_reinforcement


==================================
🧠 By: https://t.iss.one/DataScienceN
🔥 Trending Repository: programming-math-science

📝 Description: This is a list of links to different freely available learning resources about computer programming, math, and science.

🔗 Repository URL: https://github.com/bobeff/programming-math-science

📖 Readme: https://github.com/bobeff/programming-math-science#readme

📊 Statistics:
🌟 Stars: 1.8K stars
👀 Watchers: 26
🍴 Forks: 129 forks

💻 Programming Languages: Not available

🏷️ Related Topics:
#science #programming #math #awesome_list


==================================
🧠 By: https://t.iss.one/DataScienceN