
Defining 15 Common Data Engineering Concepts

In an ever-evolving technological world, an estimated 90% of the global data in existence was generated in the last two years, and roughly 2.5 quintillion bytes of data are created every day, necessitating reliable storage and data processing systems. Growing competition among internet service providers has driven prices down, bringing more people online and fuelling the surge in data collected. Data engineering as a discipline focuses on building the data infrastructure used to store, extract and transform that data. This article walks through core data engineering concepts and, in some instances, compares similar concepts applied in the field.

Batch vs Streaming ingestion
Data pipelines built by data engineers store historical data and can also support real-time analysis. Both ingestion techniques fall under Extract, Transform and Load (ETL) processing. Batch processing is an automated ETL technique in which large volumes of data are processed in batches or chunks. Using tools such as Apache Airflow, it is efficient where data does not need to be acted on immediately, as in data warehousing and periodic reporting. Stream processing systems handle data in real time as it arrives. Streaming sources, such as social media feeds and other live data sources, produce information that changes continuously. Using frameworks such as AWS Kinesis, a streaming system can scale to handle high data velocity.
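As a rough illustration, the Python sketch below contrasts the two approaches; the record format is made up for the example, and a simple generator stands in for a live feed such as Kinesis or Kafka.

```python
import time
from typing import Iterable, Iterator

# --- Batch ingestion: collect a full chunk, then process it in one go ---
def ingest_batch(records: Iterable[dict]) -> None:
    batch = list(records)              # materialise the whole batch
    # e.g. bulk-load into a warehouse table as a single scheduled job
    print(f"loaded {len(batch)} records in one batch")

# --- Streaming ingestion: handle each event as soon as it arrives ---
def event_stream() -> Iterator[dict]:
    """Stand-in for a live source such as a Kinesis or Kafka consumer."""
    for i in range(5):
        yield {"event_id": i, "ts": time.time()}
        time.sleep(0.1)                # events trickle in over time

def ingest_stream(stream: Iterator[dict]) -> None:
    for event in stream:               # process immediately, one event at a time
        print("processed event", event["event_id"])

if __name__ == "__main__":
    ingest_batch({"event_id": i} for i in range(100))
    ingest_stream(event_stream())
```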

Windowing in Streaming
Classified under streaming ingestion, windowing involves partitioning continuous data streams into smaller, manageable subsets for systematic processing. Types of windowing used in real-time analytics include the following (a small sketch of tumbling and sliding windows follows the list):

  1. Sliding windows – overlap and share information with other windows
  2. Tumbling windows – fixed-size, contiguous time intervals that produce non-overlapping data segments
  3. Session windows – variable in length; their size depends on a user’s period of engagement.
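A minimal Python sketch of tumbling and sliding windows, assuming events arrive as (timestamp, value) pairs with timestamps in seconds:

```python
from collections import defaultdict

def tumbling_windows(events, window_size_s):
    """Group (timestamp, value) events into fixed, non-overlapping windows."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size_s) * window_size_s   # bucket start
        windows[window_start].append(value)
    return dict(windows)

def sliding_windows(events, window_size_s, slide_s):
    """Assign each event to every overlapping window whose range covers it."""
    windows = defaultdict(list)
    for ts, value in events:
        # earliest window start (a multiple of the slide) that still contains this event
        start = ((ts - window_size_s) // slide_s + 1) * slide_s
        w = max(start, 0)
        while w <= ts:                  # window [w, w + window_size_s) contains ts
            windows[w].append(value)
            w += slide_s
    return dict(windows)

events = [(1, 10), (4, 20), (6, 30), (11, 40)]
print(tumbling_windows(events, 5))      # {0: [10, 20], 5: [30], 10: [40]}
print(sliding_windows(events, 5, 2))    # overlapping windows share events
```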
Change Data Capture (CDC)
CDC is a technique for tracking and documenting changes made to data. It is essentially a change log kept to maintain consistency while capturing modifications such as inserts, updates and deletes. Principles governing CDC include:
• Incremental updates – CDC centres on changed data only, minimising network bandwidth
• Log-based tracking – transaction logs are read to capture data changes
• Capture – the focus is on changes made through inserts, updates and deletes
• Idempotent processing – ensures duplicates do not affect data integrity
Fields relying on CDC include financial services, healthcare, logistics and supply chain, telecommunications and commerce.
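The sketch below shows the idea behind applying a CDC change log incrementally to keep a replica in sync; the event format is hypothetical and far simpler than what real CDC tools such as Debezium emit.

```python
# Hypothetical change log captured from a source table.
change_log = [
    {"op": "insert", "id": 1, "row": {"id": 1, "name": "Ada"}},
    {"op": "update", "id": 1, "row": {"id": 1, "name": "Ada Lovelace"}},
    {"op": "insert", "id": 2, "row": {"id": 2, "name": "Alan"}},
    {"op": "delete", "id": 2, "row": None},
]

replica: dict[int, dict] = {}   # stand-in for the downstream copy of the table

def apply_change(event: dict) -> None:
    """Incrementally apply one captured change instead of reloading the full table."""
    if event["op"] in ("insert", "update"):
        replica[event["id"]] = event["row"]      # upsert is naturally idempotent
    elif event["op"] == "delete":
        replica.pop(event["id"], None)           # deleting twice is also safe

for event in change_log:
    apply_change(event)

print(replica)   # {1: {'id': 1, 'name': 'Ada Lovelace'}}
```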
Idempotency
Listed above as a CDC principle, idempotency ensures that an API request produces the same result regardless of how many times it is repeated. Idempotent HTTP methods include GET, OPTIONS, PUT, HEAD, TRACE and DELETE. Idempotency improves a system’s error handling, consistency of outcomes, debugging, concurrency management and fault tolerance. Implementing idempotency keys, which are unique request identifiers, involves the following steps: generate unique keys, store and check keys, and implement key expiry.
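A minimal sketch of those three steps, using an in-memory store and a hypothetical payment handler; a production system would typically keep the keys in Redis or a database.

```python
import time
import uuid

# In-memory idempotency-key store: key -> (stored_at, result).
_processed: dict[str, tuple[float, dict]] = {}
KEY_TTL_SECONDS = 3600

def handle_payment(idempotency_key: str, payload: dict) -> dict:
    """Process a request at most once per key; repeats replay the stored result."""
    now = time.time()
    # Expire old keys so the store does not grow without bound.
    for key, (stored_at, _) in list(_processed.items()):
        if now - stored_at > KEY_TTL_SECONDS:
            del _processed[key]

    if idempotency_key in _processed:            # duplicate request: return saved result
        return _processed[idempotency_key][1]

    result = {"status": "charged", "amount": payload["amount"]}   # do the real work once
    _processed[idempotency_key] = (now, result)
    return result

key = str(uuid.uuid4())                           # client generates a unique key
first = handle_payment(key, {"amount": 100})
retry = handle_payment(key, {"amount": 100})      # network retry: same outcome, no double charge
assert first == retry
```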
Online Transaction Processing (OLTP) vs Online Analytical Processing (OLAP)
OLTP focuses on processing day-to-day transactions in real time, while OLAP is designed for complex data analysis and reporting. Separating the two workloads yields the following benefits: performance optimisation through efficient data processing, improved data quality with a reduced risk of errors, and enhanced decision-making with independent scaling of each system.
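As a rough illustration using Python’s built-in sqlite3 module (the table and data are made up), the first query below is OLTP-style, a short single-row transaction, while the second is OLAP-style, an aggregate scan for reporting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, day TEXT)")
conn.executemany(
    "INSERT INTO orders (customer, amount, day) VALUES (?, ?, ?)",
    [("alice", 30.0, "2024-01-01"), ("bob", 45.0, "2024-01-01"), ("alice", 20.0, "2024-01-02")],
)

# OLTP-style: a short transactional write and read touching a single row.
conn.execute("UPDATE orders SET amount = 35.0 WHERE id = 1")
print(conn.execute("SELECT * FROM orders WHERE id = 1").fetchone())

# OLAP-style: an aggregate scan over many rows for reporting.
print(conn.execute("SELECT day, SUM(amount) FROM orders GROUP BY day").fetchall())
```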
Columnar vs Row-Based Storage
In a columnar system, data is stored and organised by column, while in a row-based system, data is stored row by row. Benefits of a columnar system include highly compressible data, versatility across a wide variety of big data applications, and speed and efficiency, since the data is self-indexing and easier to find. Benefits of a row-based system include simpler data manipulation and efficiency for transactional workloads.
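A small Python illustration of the same records laid out both ways, showing why whole-record access suits the row layout while column scans (and compression of homogeneous values) suit the columnar layout:

```python
# The same three records laid out both ways.
row_store = [
    {"id": 1, "name": "Ada",  "amount": 30.0},
    {"id": 2, "name": "Alan", "amount": 45.0},
    {"id": 3, "name": "Mary", "amount": 20.0},
]
column_store = {
    "id":     [1, 2, 3],
    "name":   ["Ada", "Alan", "Mary"],
    "amount": [30.0, 45.0, 20.0],
}

# Transactional access (a whole record) is natural in the row layout:
print(row_store[1])

# Analytical access (one column across all records) is natural in the columnar layout,
# and a single homogeneous list also compresses better on disk:
print(sum(column_store["amount"]))
```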
Partitioning
For scalability, breaking data down not only helps the databases process it but also improves the efficiency of the tools used during data manipulation. Types of data partitioning include:
• Horizontal – data is split into rows, with each partition housing the same set of columns
• Vertical – data is split by columns, using a partition key column present in all tables to maintain a logical relationship
• Range – data is partitioned according to a range of values assigned to a specific table
• Hash – depends on a hash function applied to a partition key
• Composite – a blend of two partitioning techniques
• List – a set of discrete values determines the partitioning
Partitioning is applicable in machine learning pipelines, log management, OLAP operations and distributed databases.
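A minimal sketch of hash partitioning, using MD5 for a stable hash (Python’s built-in hash() is salted per process, so it is not suitable here); the partition count and keys are illustrative.

```python
import hashlib

NUM_PARTITIONS = 4

def hash_partition(partition_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition number using a stable hash."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

partitions: dict[int, list[dict]] = {i: [] for i in range(NUM_PARTITIONS)}
rows = [{"user_id": f"user-{i}", "value": i} for i in range(10)]

for row in rows:
    partitions[hash_partition(row["user_id"])].append(row)

for p, rows_in_p in partitions.items():
    print(p, [r["user_id"] for r in rows_in_p])   # rows spread across the partitions
```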
Extract Transform Load (ETL) vs Extract Load Transform (ELT)
ETL entails extracting data from distinct sources, transforming it into suitable, readable formats and loading it into a data storage system, while in an ELT process the data is loaded first and transformed afterwards. Notable differences between the two include the target storage, which for ELT is often a data lake holding unstructured data rather than a data warehouse, and data privacy compliance, which ETL makes simpler because transformations are applied before the data is loaded. Challenges faced when migrating from one architecture to the other include differences in logic and code, a change in data security parameters prompted by swapping the loading and transformation steps, and reconfiguring the data infrastructure.
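A compact sketch of the difference in ordering, with plain lists standing in for a warehouse and a data lake; the records and transformation are made up for the example.

```python
import json

raw_records = ['{"name": " Ada ", "amount": "30"}', '{"name": "Alan", "amount": "45"}']

def transform(record: dict) -> dict:
    """Clean and type the record (the T step)."""
    return {"name": record["name"].strip(), "amount": float(record["amount"])}

warehouse: list[dict] = []   # stand-in for a warehouse table
data_lake: list[str] = []    # stand-in for raw object storage

# ETL: transform in flight, load only the cleaned result.
for raw in raw_records:
    warehouse.append(transform(json.loads(raw)))

# ELT: load the raw payload first, transform later inside the target system.
data_lake.extend(raw_records)
transformed_later = [transform(json.loads(raw)) for raw in data_lake]

print(warehouse == transformed_later)   # True: same end result, different ordering
```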

CAP Theorem
The CAP theorem states that a distributed data system cannot simultaneously guarantee all three of the following properties: consistency (all data nodes share the same up-to-date view), availability (requests made to the system do not yield errors) and partition tolerance (the system remains operational despite failed communication between nodes). A system must choose only two, so trade-offs are made between:
  1. AP (Availability and Partition Tolerance)
  2. CP (Consistency and Partition Tolerance)
  3. CA (Consistency and Availability)
DAGs and Workflow Orchestration
Directed Acyclic Graphs (DAGs), often created with orchestrators such as Apache Airflow and Dagster, ensure tasks execute in the correct order and prevent cycles, keeping workflows efficient. Uses of DAGs in workflows include task scheduling, dependency management, monitoring and error handling. Advantages include better visibility, since a DAG paints a clear visual representation of the workflow, enhanced observability, and increased efficiency, since automated pipelines free a data engineer to allocate time and resources to other objectives.
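A minimal sketch of a three-task DAG written against Airflow 2.x; the task names and schedule are illustrative, and the >> operator is what encodes the dependencies.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")

def transform():
    print("cleaning and enriching")

def load():
    print("writing to the warehouse")

# Each task becomes a node in the DAG; >> defines the edges (dependencies),
# so transform only runs after extract succeeds, and load only after transform.
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```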
Retry Logic and Dead Letter Queues (DLQ)
Retry logic refers to strategies implemented in a system to ensure the reliability of software by automatically re-attempting failed operations. Retry logic encompasses a maximum retry count and a backoff strategy such as constant backoff, exponential backoff or jittered backoff.
A DLQ serves as a storage unit housing problematic messages, ensuring no message is lost and allowing future re-processing. Two common causes of messages being sent to a DLQ are erroneous message content and changes in the receiver’s system.
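A small sketch combining retry logic (exponential backoff with jitter, capped at a maximum number of attempts) with a dead letter queue, using an always-failing handler to show a message being dead-lettered; all names are illustrative.

```python
import random
import time

MAX_RETRIES = 3
dead_letter_queue: list[dict] = []   # failed messages parked here for later reprocessing

def process_with_retry(message: dict, handler) -> bool:
    """Retry a failing handler with exponential backoff and jitter, then dead-letter."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            handler(message)
            return True
        except Exception as exc:
            if attempt == MAX_RETRIES:
                dead_letter_queue.append({"message": message, "error": str(exc)})
                return False
            delay = (2 ** attempt) * 0.1          # exponential backoff: 0.2s, 0.4s, ...
            delay += random.uniform(0, 0.1)       # jitter avoids synchronized retries
            time.sleep(delay)

def flaky_handler(message: dict) -> None:
    raise ValueError("malformed payload")         # always fails, to demonstrate the DLQ

process_with_retry({"id": 42}, flaky_handler)
print(dead_letter_queue)
```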
Backfilling and Reprocessing
Backfilling describes re-running processing over historical data to fill in missing records or replace old ones with corrected values. Quality incidents and the presence of anomalies in data force data engineers to employ backfilling techniques, and its impact is felt most when it is applied to an ever-growing dataset. Examples of data backfilling include fixing a mistake in the data, filling in missing values, working with unstructured data and regenerating derived calculations.
Data reprocessing involves recalculating data based on existing information. It is triggered either manually or by a change in the rules or logic that drive the processing. The reprocessing effort depends on the following factors:
• The number of rules in the database
• The number of entities (for example, vehicles in a fleet dataset) in the database
• The data range to be reprocessed
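A minimal backfilling sketch, assuming a daily metrics table with a gap and a known-bad value; the recomputation function is a stand-in for re-running the original aggregation over raw history.

```python
from datetime import date, timedelta

# Daily metrics table with a missing day and a known-bad value for 2024-01-03.
daily_totals = {
    date(2024, 1, 1): 120.0,
    date(2024, 1, 3): -1.0,       # bad value from a past pipeline bug
    date(2024, 1, 4): 95.0,
}

def recompute_total(day: date) -> float:
    """Stand-in for re-running the original aggregation over raw historical data."""
    return 100.0 + day.day        # hypothetical recomputed result

def backfill(start: date, end: date) -> None:
    """Fill missing days and overwrite incorrect ones by reprocessing history."""
    day = start
    while day <= end:
        if day not in daily_totals or daily_totals[day] < 0:
            daily_totals[day] = recompute_total(day)
        day += timedelta(days=1)

backfill(date(2024, 1, 1), date(2024, 1, 4))
print(dict(sorted(daily_totals.items())))
```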

Time Travel and Data Versioning
Time travel enables organisations to conduct data audits or inspect changes over time by querying tables as they existed at earlier points in time. Data versioning, by contrast, focuses on tracking and managing changes to datasets over time. Unlike backfilling, data versioning can restore a dataset to a previous version, saving time, and it complements a CDC change log. Implementation techniques include the valid_from/valid_to metadata approach, full duplication and first-class versioning.
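A small sketch of the valid_from/valid_to metadata approach, where each change appends a new version row rather than overwriting the old one, so the table can be queried as of any point in time; the records are made up.

```python
from datetime import datetime

# Each change appends a new version row instead of overwriting the old one.
versions = [
    {"id": 1, "price": 10.0, "valid_from": datetime(2024, 1, 1), "valid_to": datetime(2024, 2, 1)},
    {"id": 1, "price": 12.0, "valid_from": datetime(2024, 2, 1), "valid_to": None},  # current row
]

def as_of(record_id: int, ts: datetime) -> dict | None:
    """Return the version of a record that was valid at the given point in time."""
    for row in versions:
        if row["id"] == record_id and row["valid_from"] <= ts and (
            row["valid_to"] is None or ts < row["valid_to"]
        ):
            return row
    return None

print(as_of(1, datetime(2024, 1, 15)))   # historical price 10.0
print(as_of(1, datetime(2024, 3, 1)))    # current price 12.0
```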

Data Governance
Data governance is a system of rules, policies and processes employed by an organisation to manage its data assets. It focuses on data security, availability and quality. A data governance framework involves numerous distinct teams addressing issues such as
• Data governance tools
• Organisation goals, roles and duties
• Data policies, processes and standards
• Auditing procedures
Distributed Processing Concepts
Distributed processing involves splitting computational tasks into smaller parts and analysing the data over multiple interconnected devices or nodes. Benefits include scalability, fault tolerance, efficient handling of large volumes of data and performance. Disadvantages include maintaining data consistency, network latency, ensuring data security and overall system complexity.
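A minimal sketch of the idea using Python’s multiprocessing module, with worker processes standing in for nodes: the work is split into partitions, counted in parallel (map) and the partial results merged (reduce).

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk: list[str]) -> Counter:
    """Map step: each worker counts words in its own partition of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox jumps"] * 1000
    # Split the work into partitions, one per worker process (stand-ins for nodes).
    chunks = [lines[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)   # map: run in parallel
    total = sum(partial_counts, Counter())               # reduce: merge partial results
    print(total.most_common(3))
```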
