From Raw Data to Real Insights: Understanding the Journey of a Modern Data Pipeline

A data pipeline generally consists of a sequence of stages or components that transfer data from its origin to a destination for analysis and utilization.

Here's an overview of the common stages and components in a data pipeline.

1. Collect

- Purpose: Gather raw data from various sources. This data can be generated by applications, sensors, devices, databases, or user interactions.

- Components:

- Data Store: Holds operational data, often a database (e.g., relational databases, NoSQL stores).

- Data Stream: Handles real-time data feeds from sources such as IoT devices, transactional systems, or event logs.

- Application Data: Collects data directly from applications, APIs, or web services.
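As a rough illustration of the collection stage, the Python sketch below pulls application data from a REST endpoint and operational rows from a relational store. The endpoint URL, database file, table, and column names are illustrative assumptions, not references to any particular system.

```python
import json
import sqlite3
import urllib.request

# Hypothetical application API endpoint (placeholder, not a real service).
API_URL = "https://example.com/api/orders"

def collect_from_api(url: str) -> list[dict]:
    """Fetch raw JSON records produced by an application or web service."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

def collect_from_database(db_path: str) -> list[tuple]:
    """Read rows from an operational table; 'orders' is an assumed name."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT id, amount, created_at FROM orders"
        ).fetchall()

if __name__ == "__main__":
    api_records = collect_from_api(API_URL)
    db_records = collect_from_database("operational.db")
    print(f"Collected {len(api_records)} API records and {len(db_records)} DB rows")
```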

2. Ingest

- Purpose: Move collected data into the pipeline, transforming and consolidating it for further use.

- Components:

- Data Load: Transfers data from data stores and applications into the processing system.

- Event Queue: Manages the flow of data, particularly streaming data, using tools like Apache Kafka or AWS Kinesis.

- Outcome: Data enters the processing layer in a consistent, structured form, typically with standardized formats and timestamps.
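A minimal ingestion sketch, assuming a local Apache Kafka broker and the kafka-python client: events are time-stamped and pushed onto an event queue. The broker address and the topic name "raw-events" are placeholders.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client

# Connect to a local broker; the address and topic name are assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def ingest_event(event: dict) -> None:
    """Stamp an event with its ingestion time and push it onto the queue."""
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    producer.send("raw-events", value=event)

ingest_event({"user_id": 42, "action": "page_view"})
producer.flush()  # ensure buffered events actually reach the broker
```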

3. Store

- Purpose: Persist data so it can be easily accessed and processed.

- Components:

- Data Lake: A centralized storage repository for large amounts of structured, semi-structured, and unstructured data.

- Data Warehouse: Structured storage for processed data, optimized for querying and reporting.

- Lakehouse: Combines elements of data lakes and data warehouses to provide both raw and processed data storage.

- Outcome: Data is stored in various formats (raw, transformed, aggregated) and is accessible for compute and analysis.
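The sketch below illustrates this storage split with pandas (assuming pyarrow is installed for Parquet support): raw events land in a data-lake path partitioned by date, while an aggregated table goes to a warehouse-style location. The paths and column names are assumptions, not a fixed convention.

```python
import pandas as pd

# A small batch of ingested events; in practice this would come from the
# ingestion layer rather than being defined inline.
events = pd.DataFrame(
    {
        "user_id": [42, 43, 42],
        "action": ["page_view", "click", "purchase"],
        "event_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    }
)

# Data lake: keep raw data as Parquet files partitioned by date.
events.to_parquet("lake/raw/events", partition_cols=["event_date"])

# Data warehouse side: a processed, query-friendly aggregate table.
daily_actions = (
    events.groupby(["event_date", "action"]).size().reset_index(name="count")
)
daily_actions.to_parquet("warehouse/daily_actions.parquet")
```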

4. Compute

- Purpose: Process data to prepare it for analysis and use.

- Components:

- Batch Processing: Periodic processing of large datasets, using frameworks like Apache Spark or Hadoop.

- Stream Processing: Real-time processing of data streams, often using Apache Flink, Apache Kafka Streams, or AWS Kinesis Data Analytics.

- Outcome: Data is processed into usable forms, such as aggregated tables, machine learning features, or transformed datasets.
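For the batch-processing side, here is a small PySpark sketch that reads the raw events written above, aggregates them, and writes a processed table back. The input and output paths are assumed; a comparable job could be written for stream processing with Flink or Kafka Streams.

```python
from pyspark.sql import SparkSession, functions as F

# Batch processing with Apache Spark: read raw events from the lake,
# aggregate them, and write a processed table back. Paths are assumptions.
spark = SparkSession.builder.appName("daily-batch").getOrCreate()

raw_events = spark.read.parquet("lake/raw/events")

daily_metrics = (
    raw_events
    .groupBy("event_date", "action")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("user_id").alias("unique_users"),
    )
)

daily_metrics.write.mode("overwrite").parquet("warehouse/daily_metrics")
spark.stop()
```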

5. Consume

- Purpose: Deliver data insights and enable their use across various applications and user groups.

- Components:

- Data Science, Business Analysis, and ML
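Finally, a short sketch of consumption with pandas, assuming the batch job above produced a "warehouse/daily_metrics" Parquet table: one query supports business analysis, and the same table is pivoted into a feature matrix that a data science or ML workflow could build on.

```python
import pandas as pd

# Consume the processed table; assumes the batch job above wrote it as Parquet.
metrics = pd.read_parquet("warehouse/daily_metrics")

# Business analysis: top actions by total event count.
top_actions = (
    metrics.groupby("action")["event_count"].sum().sort_values(ascending=False)
)
print(top_actions.head())

# Data science / ML: pivot per-day action counts into a feature matrix
# suitable for downstream modeling.
features = metrics.pivot_table(
    index="event_date", columns="action", values="event_count", fill_value=0
)
print(features.head())
```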