As companies become more reliant on data to run and grow their business, and want to execute machine learning use cases that require a solid data infrastructure, the role of the data engineer is becoming increasingly important.
This hiring guide covers everything you need to know about data engineering: why it has become such a popular role, what a data engineer does, and how to hire your next data engineer successfully.
About Data Engineering
Data engineering is the process of developing and constructing large-scale data collection, storage, and analysis systems. It's a vast field that is applied in almost every industry. Data engineering teams gather and manage data on large scales, using their knowledge and the right technologies to ensure that the data is in a useful state by the time it reaches data scientists, analysts, and other consumers.
What data engineers do is build distributed systems that collect, handle, and convert raw data into usable information for data science, machine learning, and business intelligence experts to later use in various applications.
Data engineers design and build data pipelines that transform and transport large volumes of data into a highly usable format by the time it reaches the end users. These pipelines typically collect data from various sources and store it in a single data warehouse or data lake repository that represents it uniformly as a single source of information.
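To make the idea concrete, here is a toy sketch of such a pipeline in Python, using an in-memory SQLite database as a stand-in for the warehouse; the source records and table name are hypothetical:

```python
import sqlite3

# Hypothetical raw records from two sources (e.g. an app export and a CRM dump).
source_a = [{"id": 1, "email": "ann@example.com"}, {"id": 2, "email": "bob@example.com"}]
source_b = [{"id": 2, "email": "bob@example.com"}, {"id": 3, "email": "cara@example.com"}]

def etl(sources, conn):
    """Extract from each source, deduplicate on id, and load into one table."""
    seen, rows = set(), []
    for source in sources:                      # extract
        for record in source:
            if record["id"] not in seen:        # transform: deduplicate on id
                seen.add(record["id"])
                rows.append((record["id"], record["email"]))
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)   # load
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

conn = sqlite3.connect(":memory:")
print(etl([source_a, source_b], conn))  # 3  (one row per unique user)
```

Real pipelines swap the lists for API or database extracts and SQLite for a warehouse such as Redshift or BigQuery, but the extract/transform/load shape is the same.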
The above diagram illustrates the workflow of a data platform, the technologies commonly used in each step, and the scope of a data engineer's responsibilities. As you can see, a lot of work goes into data engineering before the data is consumed by BI analysts or data scientists. Studies have shown that a staggering 80% of the effort in data-driven projects goes into data engineering, i.e., getting the data ready to use, while only 20% goes into creating value from that data.
So it is no surprise that, according to the 2021 Stack Overflow survey, data engineers earn an average of $68,034 per year, which places the role in the upper part of the salary charts.
Why and when do you need to hire a Data Engineer?
You need to hire a data engineer if you are looking to build applications or data platforms (data warehouses, data lakes) that require you to retrieve and consolidate data coming from various sources.
This need typically arises either when you have a machine learning use case that requires vast amounts of data, or when you need a centralized repository that allows you to store all your structured and unstructured data, also known as a data lake or data warehouse.
Types of data engineers
Big Data-centric (data) engineer
A Big data engineer focuses on handling large datasets. The storage of the data is typically in distributed file systems or object storage systems rather than relational databases. To handle the amounts of data, a Big Data engineer uses data processing frameworks such as Spark, MapReduce, or Flink. Even though SQL is frequently used, most of the programming is done in languages such as Scala, Python, or Java, making the role more similar to a backend developer. A Big Data Engineer typically uses the ETL (Extract, Transform, Load) process and builds batch and streaming data pipelines which are typically orchestrated by tools such as Apache Airflow.
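As an illustration of the MapReduce model these frameworks are built on, here is a minimal word-count sketch in plain Python; the map and reduce phases mirror what Hadoop or Spark would distribute across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (word, 1) pairs, as a mapper would for each input record.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Group the pairs by key and sum the counts, as a reducer would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big pipelines", "data pipelines scale"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'pipelines': 2, 'scale': 1}
```

In a real cluster the framework shuffles the mapped pairs between machines before the reduce phase; here both phases simply run in one process.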
Database or Data Warehouse-centric data engineer
A database or Data Warehouse-centric data engineer focuses primarily on extracting and transforming structured data from relational databases. The process here typically involves a standard database management system, table-driven logic and methods, database servers, and stored procedures.
What does a data engineer do day-to-day?
The responsibilities and tasks of a data engineer typically include:
- Identifying and implementing re-designs of infrastructure for scalability
- Optimizing data delivery
- Assembling large sets of data
- Building infrastructure for data extraction and loading
- Creating analytical tools for the data pipeline and providing insight for operational efficiency
We have also asked Mehmet Ozan Ünal, a data engineer at Proxify, about the day-to-day job tasks that this position entails, and he stated:
“Data engineers usually create ETL pipelines, design schemas, and monitor and schedule pipelines. Another crucial responsibility is designing and formatting the data infrastructure for the company. A data engineer should build the link between data sources (for example SAP, IoT (Internet of Things) devices, and app data) and data consumers (data analysts, data scientists, business people, machine learning pipelines, and business intelligence and reporting systems).”

Mehmet Ozan Ünal
In a nutshell, what a data engineer does is:
- Developing and maintaining data platforms
- In-depth analysis of raw data
- Improving the quality and efficiency of all data
- Developing and testing architectures for extracting and transforming data
- Building data pipelines
- Building algorithms to process data efficiently
- Researching methods for data reliability
- Support in the development of analytical tools
Interviewing a Data Engineer
Essential technologies and programming languages for a data engineer
Mehmet lists the top technologies a data engineer must know:
- Programming languages: SQL and either Python, Scala, or Java
- Tools and systems: Kafka, Spark, Apache Airflow (for data pipeline orchestration), transactional databases (MySQL, PostgreSQL), and data formats (Parquet, Protobuf, Avro)
- Coding: version management (Git), algorithms, and data structures
- Containerization: CI/CD systems and Docker
- Cloud: Azure, GCP, or AWS
Top tools your data engineer should be familiar with:
There are specific tools that make data engineering more efficient. The main categories are listed below:
1. Data Warehouses
- Amazon Redshift: A cloud data warehouse for easy data setup and scaling.
- Google BigQuery: A cloud data warehouse well suited to smaller businesses that are starting out and want to scale.
- Snowflake: A SaaS that’s fully managed, providing one platform for multiple purposes like data engineering, data lakes, data warehousing, data app development, and more.
2. Data ingestion and extraction
- Apache Spark: An open-source analytics engine used for large-scale data processing. It is also available as a managed, web-based distribution called Databricks, created by the founders of Spark.
- Google Cloud Data Fusion: A web UI for building scalable data integration solutions that prepare and transform data, without the need to manage infrastructure.
- Azure Data Factory: Azure's serverless ETL (Extract, Transform, Load) service for creating data flows and data integrations.
3. Data transformation
- dbt (Data Build Tool): A tool for transforming data directly in the warehouse, expressing the whole transformation process as code.
4. Data Lake and Lakehouse
- Databricks: A unified, open platform for all data, used for both scheduled and interactive data analysis.
- Amazon S3: An object storage service offering scalability, performance, and security for data stored in object format.
- Google Cloud Storage: Google's service for storing objects and data in the Google Cloud.
- Azure Data Lake: Microsoft's public platform of products and services for big data analytics.
5. Workflow orchestrators
- Apache Airflow: An open-source WMS (workflow management system) tool for organizing, scheduling, and monitoring workflows.
- Luigi: A Python tool package for building, scheduling and orchestrating pipelines.
6. Event and stream processing
- Google Cloud Pub/Sub: A messaging service with quick, immediate alerts and notifications, enabling parallel processing and event-driven architectures.
- Apache Kafka: An event store and open-source platform for stream-processing actions.
Technical skills of a Data Engineer
A data engineer has to have these crucial technical skills:
- Data collecting—handling the volume of data, but also the variety and velocity.
- Coding—proficiency in programming is vital, so they need an excellent grasp of either Scala, Java, or Python, which are the most commonly used languages for data engineering systems and frameworks.
- Data transforming—the data engineer has to be well-versed in transforming, for example, cleansing (e.g. removing duplicates), joining data together, and aggregating the data.
- Data warehousing—knows how to divide the DW into tiers, and create fact tables by combining tables and aggregating them to make reporting more efficient.
- Data analyzing—knows how to draw insights from a dataset, especially when it comes to quality checks, e.g. distribution of data, duplicate checks, etc.
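The transforming skill above can be sketched in plain Python, with hypothetical order and customer records, showing cleansing (deduplication), joining, and aggregation in one pass:

```python
# Hypothetical raw input: orders with a duplicate row, plus a customer lookup.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 50.0},
    {"order_id": 1, "customer_id": 10, "amount": 50.0},  # duplicate row
    {"order_id": 2, "customer_id": 11, "amount": 30.0},
]
customers = {10: "Ann", 11: "Bob"}

def transform(orders, customers):
    # Cleansing: drop duplicate orders, keyed by order_id.
    unique = {o["order_id"]: o for o in orders}.values()
    # Joining + aggregating: attach the customer name and total per customer.
    totals = {}
    for order in unique:
        name = customers[order["customer_id"]]
        totals[name] = totals.get(name, 0) + order["amount"]
    return totals

print(transform(orders, customers))  # {'Ann': 50.0, 'Bob': 30.0}
```

In practice the same three steps are usually expressed in SQL or a dataframe library, but the logic is identical.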
Additionally, Mehmet said:
“A good data engineer has to have hands-on knowledge and experience with coding and data warehousing. Along with this, basic machine learning, basic data modeling, and Linux and shell scripting.”
Top interview questions (and answers) for assessing Data Engineers
One excellent approach to assess the technical abilities of the candidate is to ask Data engineering-specific questions that will help you separate the wheat from the chaff. To test and assess the skills and expertise of a data engineer and to find the best candidate, you might want to enquire about:
- Elaborate on Hadoop and its components.
Expected answer: Hadoop is an open-source framework used for practical and efficient storing and processing of large datasets. These datasets can range in size from gigabytes to petabytes, and Hadoop makes it simple to process such large volumes across clusters of machines.
There are four main components (one basic/common plus three core ones) and additional ecosystem components, as explained below.
- Hadoop Common—The set of standard, basic Hadoop libraries.
- Hadoop MapReduce—For the processing of large-scale data.
- Hadoop YARN—For resource management as well as task scheduling.
- HDFS—Hadoop Distributed File System.
Then, we also use:
- Hive and Pig—For data access
- Apache Flume, Chukwa, Sqoop—For data integration
- HBase—For data storage
- Avro and Thrift—For data serialization
- Drill and Apache Mahout—For data intelligence
- Oozie, Zookeeper, and Ambari—For data management, orchestration and monitoring
- What are Block & Block Scanner of HDFS?
Expected answer: Blocks are the smallest units of storage in HDFS; they are the result of Hadoop splitting enormous files into fixed-size chunks (128 MB by default).
The Block Scanner periodically verifies each block on a DataNode against its stored checksum to identify corruption, and it throttles itself so that scanning does not consume too much disk bandwidth.
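The checksum idea behind the Block Scanner can be shown in a toy Python sketch, with an unrealistically small block size for illustration:

```python
import hashlib

BLOCK_SIZE = 8  # HDFS defaults to 128 MB; tiny here so the example is readable

def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    # Split a file into fixed-size blocks, as HDFS does with large files.
    return [data[i:i + size] for i in range(0, len(data), size)]

def checksums(blocks):
    # Store one checksum per block at write time.
    return [hashlib.md5(b).hexdigest() for b in blocks]

data = b"hello data engineering"
blocks = split_blocks(data)
stored = checksums(blocks)

# A scanner later re-reads each block and compares against the stored checksum.
corrupted = list(blocks)
corrupted[1] = b"XXXXXXXX"  # simulate on-disk corruption of block 1
mismatches = [i for i, b in enumerate(corrupted)
              if hashlib.md5(b).hexdigest() != stored[i]]
print(mismatches)  # [1]  (block 1 fails its checksum)
```

HDFS additionally uses the mismatch to re-replicate the block from a healthy replica; this sketch only shows the detection step.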
- Explain what Reducer is and its main methods.
Expected answer: When we process data in Hadoop, Mapper is the first processing stage and Reducer is the second: it aggregates the intermediate key-value pairs that the mappers emit.
Reducer has three main methods:
- setup()—Called once at the start of the task; here we can handle parameters, set up caches, and configure the input.
- cleanup()—Called once at the end of the task; here we can remove temporary files.
- reduce()—Called once per key with the associated set of values; this is the essence of the Reducer.
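The reduce() semantics, one call per key over that key's values, can be sketched in plain Python:

```python
from itertools import groupby
from operator import itemgetter

def reduce_by_key(pairs, reduce_fn):
    # Hadoop sorts mapper output by key before handing it to the reducer;
    # groupby below relies on the same sorted-by-key invariant.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: reduce_fn(value for _, value in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key(pairs, sum))  # {'a': 4, 'b': 2}
```

Each key's group is passed to the reduce function exactly once, just as Hadoop calls reduce() once per key.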
- What is the usage of *args and **kwargs?
Expected answer: Both are Python conventions for defining functions that accept a variable number of arguments. *args collects any extra positional arguments into a tuple, while **kwargs collects any extra keyword arguments into a dictionary.
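A minimal example of both:

```python
def describe(*args, **kwargs):
    # *args gathers extra positional arguments into a tuple;
    # **kwargs gathers extra keyword arguments into a dict.
    return args, kwargs

args, kwargs = describe(1, 2, name="pipeline", retries=3)
print(args)    # (1, 2)
print(kwargs)  # {'name': 'pipeline', 'retries': 3}
```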
- Compare Star Schema and Snowflake Schema.
Expected answer: In data modeling, there are two types of schemas for designs, Star Schema and Snowflake Schema.
Star Schema denormalizes its dimensions, so values repeat within a single table. Writing queries is easy, without the need for many joins, and the schema is easy to set up and design. The trade-off is that redundant data is stored in the dimension tables, so Star Schema requires more storage space in exchange for speedy performance.
Snowflake Schema normalizes the data structure, with dimension hierarchies neatly split into separate tables. This schema is notably more complex to maintain than Star. Query writing is not as simple, because more joins are needed to link the extra tables. In return, redundant data is avoided, the dimension tables are normalized, and the storage space requirement is lower than with Star Schema (where redundancy is more significant, as above).
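A small SQLite sketch of a star schema, with hypothetical product and sales tables; note how a single join is enough for a category report, whereas a snowflake design would need an extra join to a separate category dimension table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Star schema: one fact table referencing a denormalized dimension table
# (category_name repeats per product instead of living in its own table).
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          product_name TEXT, category_name TEXT);
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'),
                               (2, 'Gadget', 'Hardware');
INSERT INTO fact_sales VALUES (1, 10.0), (2, 5.0), (1, 2.5);
""")
rows = conn.execute("""
    SELECT d.category_name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category_name
""").fetchall()
print(rows)  # [('Hardware', 17.5)]
```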
- Explain what Secondary NameNode is and mention its functions.
Expected answer: The Secondary NameNode receives the edit logs from the NameNode, which allows us to keep the size of the edit log bounded.
The functions of Secondary NameNode are:
- Checkpoint—We use this to confirm that we don’t have corrupted data.
- FsImage—We use this if we need to save a copy of the FsImage or EditLog file.
- Update—For automatically updating the FsImage and EditLog files, we use Update.
- NameNode crash—If the NameNode does fail and crash, we can recreate it using the FsImage.
- Can you list and elaborate on the various approaches to data validation?
Expected answer: We can check the following points listed below with the data validation approaches:
- Data type—A data type check verifies the accuracy of the data. For example, a field that accepts only number-based data could not take letter-based input.
- Range—If we need to check whether a predefined range contains input data, we do a Range check.
- Code—When we select a particular field from a values list, we must ensure that list is valid, and there is accurate formatting of that field, so we do this with a code check.
- Format—Many data types come with a predefined structure. With a format check, we verify that all dates throughout follow one proper format (DD/MM/YYYY or YYYY/MM/DD, for example).
- Uniqueness—With a uniqueness check, we ensure that there are no repetitive item entries of items that have to be unique by nature, such as email address or ID content.
- Consistency—With a consistency check, we check whether we rely on logic when we enter data, with a clear order and hierarchy of items. For example, the ‘date of production’ of something should be followed by a ‘release date’, not the other way around.
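Several of these checks can be sketched as one Python validation function; the record fields, bounds, and error messages here are hypothetical:

```python
import re

def validate(record, seen_emails):
    errors = []
    # Type + range check: quantity must be an int between 1 and 100.
    if not isinstance(record["quantity"], int) or not 1 <= record["quantity"] <= 100:
        errors.append("quantity out of range")
    # Format check: date must follow YYYY/MM/DD.
    if not re.fullmatch(r"\d{4}/\d{2}/\d{2}", record["date"]):
        errors.append("bad date format")
    # Uniqueness check: no repeated email addresses across records.
    if record["email"] in seen_emails:
        errors.append("duplicate email")
    seen_emails.add(record["email"])
    return errors

seen = set()
print(validate({"quantity": 5, "date": "2023/01/15", "email": "a@x.com"}, seen))  # []
print(validate({"quantity": 500, "date": "15-01-2023", "email": "a@x.com"}, seen))
# ['quantity out of range', 'bad date format', 'duplicate email']
```

In a production pipeline these rules typically live in a dedicated validation layer, but the per-record logic looks much like this.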
- What is DAG in Apache Spark?
Expected answer: DAG (Directed Acyclic Graph) is a set of vertices and edges in which each vertex represents an RDD and each edge an operation applied to it. The graph stores and presents every operation on the RDDs, and it is an excellent way to create and manage the flow of operations: it gives Spark a topological order for execution and a neat visual presentation of the RDDs.
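The topological ordering a DAG provides can be illustrated with Python's standard graphlib module and a hypothetical four-stage pipeline:

```python
from graphlib import TopologicalSorter

# A tiny DAG of RDD-style operations: each key maps to the stages it depends on.
dag = {
    "load": [],
    "filter": ["load"],
    "map": ["load"],
    "join": ["filter", "map"],
}
order = list(TopologicalSorter(dag).static_order())
print(order)  # a valid execution order: 'load' first, 'join' last
```

Spark builds a comparable graph from the transformations you declare and walks it in dependency order when an action runs.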
- What is the Spark “lazy evaluation”?
Expected answer: ‘Lazy evaluation’ in Spark means that no processing is executed until an action is called. Spark does little work while we declare transformations: it simply records them in a DAG (Directed Acyclic Graph). All transformations are lazy, so operations on an RDD do not start immediately; the DAG is executed only when the driver requests data through an action.
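Python generators give a close analogy: nothing runs until a result is actually requested, much like transformations waiting for an action:

```python
log = []

def transform(numbers):
    # A generator: nothing executes until a result is requested,
    # much like Spark transformations waiting for an action.
    for n in numbers:
        log.append(n)
        yield n * 2

pipeline = transform([1, 2, 3])  # the "transformation": only a plan so far
print(log)                       # []  (no work has been done yet)
result = list(pipeline)          # the "action": now the work runs
print(result)                    # [2, 4, 6]
print(log)                       # [1, 2, 3]
```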
- In what way is Spark different when compared to Hadoop MapReduce?
Expected answer: Spark is an open-source distributed processing system used for handling big data pipelines. Hadoop MapReduce is a framework for simplified writing of applications that process data at large scale. Comparing the two: MapReduce processes data on disk, while Spark processes data in memory and retains it there, which makes MapReduce considerably slower than Spark.
- What is the difference between left, right, and inner join?
Expected answer: Left, right, and inner joins are SQL keywords used to combine rows from two or more tables based on a common column between them.
- Inner join—Returns only the rows whose common-column values match in both tables.
- Left join—Returns all rows from the left table, together with the matching rows from the right table (NULL where there is no match).
- Right join—The mirror of the left join: it returns all rows from the right table, together with the matching rows from the left table.
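The difference is easy to demonstrate with SQLite (shown here for INNER and LEFT joins; a right join is simply the mirror case), using hypothetical employee and department tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept_id INTEGER);
CREATE TABLE departments (dept_id INTEGER, dept TEXT);
INSERT INTO employees VALUES ('Ann', 1), ('Bob', 2), ('Cara', NULL);
INSERT INTO departments VALUES (1, 'Data'), (3, 'Sales');
""")
inner = conn.execute(
    "SELECT e.name, d.dept FROM employees e "
    "JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name").fetchall()
left = conn.execute(
    "SELECT e.name, d.dept FROM employees e "
    "LEFT JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name").fetchall()
print(inner)  # [('Ann', 'Data')]  (only the matching row)
print(left)   # [('Ann', 'Data'), ('Bob', None), ('Cara', None)]  (all left rows)
```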
- Can you define data normalization?
Expected answer: Data normalization is the process of creating clean data: organizing data and defining a unified format across multiple fields and records. Normalization removes unstructured or duplicated data, leaving only data that is stored logically and actually used.
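A minimal Python sketch of such normalization, unifying email and date formats and dropping duplicates; the field names are hypothetical:

```python
def normalize(records):
    # Unify formats across fields: trim whitespace, lowercase emails,
    # rewrite dates to a single YYYY-MM-DD form, and drop exact duplicates.
    cleaned, seen = [], set()
    for r in records:
        row = (r["email"].strip().lower(),
               r["date"].replace("/", "-"))
        if row not in seen:
            seen.add(row)
            cleaned.append({"email": row[0], "date": row[1]})
    return cleaned

raw = [{"email": " Ann@X.com ", "date": "2023/01/15"},
       {"email": "ann@x.com", "date": "2023-01-15"}]
print(normalize(raw))  # [{'email': 'ann@x.com', 'date': '2023-01-15'}]
```

The two raw records differ only in formatting, so after normalization they collapse into a single clean record.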
Possible challenges during the hiring of a Data Engineer
There are always challenges when hiring a new person, depending on the vocation and job requirements.
One major challenge is that the scope of the data engineer role has become broad and often confusing. It should not be mixed up with the following related roles:
- Database Administrator (DBA) which is more focused on the creation and optimization of OLTP databases.
- Data Analysts who are typically more focused on driving business value by creating dashboards and building ad-hoc reports.
- Analytics Engineer which is similar to a Data Analyst’s role but with more of a software engineer’s skill-set (version control, CI/CD, use of Python/Scala/Java) and typically focuses on data warehousing SQL pipelines and optimization.
- Machine Learning Engineer who is skilled in deploying ML models built by Data Scientists into production. This requires a deeper understanding of statistics, algorithms, and math. Some Data Engineers possess this knowledge, but for a medium to large-sized data team it should be a role of its own.
It also needs to be emphasized that hiring managers and employers quite often offer data engineers a salary below market value. This challenge sits on the employer’s side of the table: it reflects an intense focus on short-term cost rather than on the long-term benefit of having a skilled data engineer in the company.
What distinguishes a great Data Engineer from a good one?
Selecting the right candidate that’s best for the Data Engineer role can be tricky, especially if at least two candidates have similar experience and expertise. However, one will always stand out, through in-depth knowledge, mastering technical skills, and a proactive, dynamic way of thinking.
The great Data Engineer:
Creates solutions that are easily maintained. For example, if manual data mapping is required to cleanse the data, does the developer hard-code the values or create a config file that can be easily updated?
Understands business needs and doesn’t over-engineer solutions. It is easy to fall into the trap of building more complex solutions than needed, for example building a near-real-time streaming pipeline when the data actually only needs to be refreshed daily.
Let’s not forget that being a team player is especially important, because a data engineer needs to communicate regularly with teammates in different roles (Data Scientist, Data Analyst, ML Engineer, etc.) as well as with other teams in the company.
The value of Data Engineering
Any business can benefit from data engineering because it enables companies to become more data-driven. It might sound vague, but data engineering is the foundation that makes data accurate and easy to consume, and that enables advanced analytics and machine learning use cases. As mentioned above, 80% of the effort in any data project is spent on data engineering.
In summary, Mehmet explains:
“Data engineers are responsible for designing the overall flow of data through the company and creating and automating data pipelines to implement this flow.”
With such an individual or team, a company can trust its data and know it is in good hands: the data engineers will collect, store, and process the data flawlessly, which is the first step to becoming a data-driven company.
Why should you consider data engineering?
All the most successful and agile companies nowadays are data-driven. Data engineering enables companies to grow quicker and base plenty of sales, business, and marketing strategies on proven data instead of assumptions. They can also lower their security risks and align the tasks of different team members for better productivity.
Especially for younger and fast-growing companies, data engineering can be the first step into gathering learnings and creating a proper pipeline that will speed up progress and skyrocket growth.
What is big data engineering?
Massive volumes of consumer, product, and operational data, often in the terabyte and petabyte ranges, are called big data. Big data analytics may be utilized to improve important business and operational processes, reduce compliance and regulatory risks, and generate new revenue sources.
An information technology (IT) expert who designs, builds, tests, and maintains complicated data processing systems that work with massive data volumes is known as a big data engineer. These specialists aggregate, cleanse, process, and enrich various types of data so that downstream data consumers, such as business analysts and data scientists, can extract information in a systematic manner. Hence, it is a little harder to find the right person for a big data engineer job.
Data science vs. data engineering
As we already mentioned, data engineers and data scientists often work with the same data but in different stages of “crunching” the information.
A data engineer creates, tests, and maintains data pipelines and architectures for usage by data scientists. They help the data scientist deliver correct metrics by doing the homework.
On the other hand, to solve business challenges, a data scientist cleans and analyzes data, solves queries, and gives measurements. Other similar job positions revolving around data are cloud computing experts, data architects, and computer science engineers.