What is Hadoop Big Data Testing?
Big Data means a vast collection of structured and unstructured data that is too large and complex to process with conventional database and software techniques. In many organizations the volume of data is enormous, it arrives at high speed, and it exceeds the current processing capacity; such collections cannot be processed efficiently by conventional computing techniques. Testing them therefore requires specialized tools, frameworks, and methods. Big Data testing covers the creation, storage, retrieval, and analysis of data that is significant in terms of its volume, variety, and velocity.
What is Hadoop and name its components?
When “Big Data” emerged as a problem, Hadoop evolved as a solution to it. Hadoop is a framework that provides us various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.
The main components of Hadoop are:
- Storage unit– HDFS (NameNode, DataNode)
- Processing framework– YARN (ResourceManager, NodeManager)
How do we validate Big Data?
In Hadoop, engineers validate the processing of the huge volumes of data handled by the Hadoop cluster and its supporting components. Big Data testing calls for highly skilled professionals, because the data must be handled quickly. Processing comes in three types: batch, real-time, and interactive.
What is Data Staging?
Data staging is the initial step of the validation process and involves process verification. Data from different sources, such as social media and RDBMS systems, is validated to ensure that accurate data is loaded into the system. The source data is then compared with the data uploaded into HDFS to confirm that both match. Finally, we validate that the correct data has been pulled and loaded into the right HDFS location. Many tools, e.g., Talend and Datameer, are commonly used for data staging validation.
What is Hadoop Map Reduce and how does it work?
The Hadoop MapReduce framework is used to process large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.
Taking word count as the classic example: during the map phase the words in each document are counted, and during the reduce phase those counts are aggregated per word across the entire collection. In the map phase, the input data is divided into splits that are analyzed by map tasks running in parallel across the Hadoop cluster. A sketch of the two phases follows.
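Below is a minimal word-count sketch using the standard Hadoop MapReduce Java API. The class names (`WordCount`, `TokenizerMapper`, `IntSumReducer`) are illustrative; the job wiring is shown later under the driver question.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```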
What is NameNode in Hadoop?
NameNode in Hadoop is the node where Hadoop stores all the file location information in HDFS (Hadoop Distributed File System). In other words, NameNode is the centerpiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster or multiple machines.
What is NodeManager?
NodeManager runs on slave machines and is responsible for launching the application’s containers (where applications execute their part), monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
What is JobTracker in Hadoop, and what actions does it perform?
In Hadoop, JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.
The JobTracker performs the following actions in Hadoop:
- Client applications submit jobs to the JobTracker
- The JobTracker communicates with the NameNode to determine the data location
- The JobTracker locates TaskTracker nodes with available slots at or near the data
- It submits the work to the chosen TaskTracker nodes
- When a task fails, the JobTracker is notified and decides how to proceed
- The JobTracker monitors the TaskTracker nodes
What is HDFS?
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows master and slave topology.
HDFS has two components (see the sketch after this list):
- NameNode: NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS like block location, replication factors, etc.
- DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.
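To illustrate what the NameNode's metadata contains, the sketch below uses the HDFS Java `FileSystem` API to print a file's replication factor and the DataNodes holding each block. It assumes a reachable HDFS cluster and a hypothetical file at `/data/sample.txt`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // connects to the configured NameNode

    // Hypothetical path used for illustration only.
    FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
    System.out.println("Replication factor: " + status.getReplication());

    // Each BlockLocation lists the DataNodes that hold a replica of that block.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on: " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}
```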
What is a heartbeat in HDFS?
A heartbeat is a signal sent periodically from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving these signals, it concludes that there is a problem with the DataNode or TaskTracker.
What happens when a data node fails?
When a DataNode fails:
- The JobTracker and NameNode detect the failure
- All tasks on the failed node are re-scheduled
- The NameNode replicates the user’s data to another node
What is Speculative Execution?
During speculative execution in Hadoop, a certain number of duplicate tasks are launched: multiple copies of the same map or reduce task can run on different slave nodes. In simple terms, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate task on another node. The copy that finishes first is accepted, and the remaining duplicates are killed.
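As a small illustration, speculative execution can be toggled per job through standard MapReduce configuration properties; the sketch below uses the Hadoop 2.x property names.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "speculative-demo");

    // Launch backup attempts for straggling map tasks, but not for reduce tasks.
    job.getConfiguration().setBoolean("mapreduce.map.speculative", true);
    job.getConfiguration().setBoolean("mapreduce.reduce.speculative", false);
  }
}
```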
What are the three modes in which Hadoop can run?
The three modes in which Hadoop can run are as follows:
- Standalone (local) mode: This is the default mode if we don’t configure anything. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This uses the local filesystem.
- Pseudo-distributed mode: A single-node Hadoop deployment is said to run in pseudo-distributed mode. In this mode, all the Hadoop services, both master and slave, are executed on a single compute node.
- Fully distributed mode: A Hadoop deployment in which the master and slave services run on separate nodes is said to be in fully distributed mode.
What are the main configuration parameters in a “MapReduce” program?
The main configuration parameters that users need to specify in the “MapReduce” framework are listed below (a minimal driver sketch follows the list):
- Job’s input locations in the distributed file system
- Job’s output location in the distributed file system
- The input format of data
- The output format of data
- Class containing the map function
- Class containing the reduce function
- JAR file containing the mapper, reducer and driver classes
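A minimal driver sketch wiring up these parameters is shown below. It assumes the word-count Mapper and Reducer classes sketched earlier, and the input/output paths are placeholders passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);               // JAR containing mapper, reducer, driver

    job.setMapperClass(WordCount.TokenizerMapper.class);    // class containing the map function
    job.setReducerClass(WordCount.IntSumReducer.class);     // class containing the reduce function

    job.setInputFormatClass(TextInputFormat.class);         // input format of the data
    job.setOutputFormatClass(TextOutputFormat.class);       // output format of the data

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // job's input location in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // job's output location in HDFS

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```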
What is “MapReduce” Validation?
MapReduce validation is the second phase of the Big Data testing process. At this stage the tester verifies the business logic on every single node and validates the data after it has been processed on all the nodes, determining that (a unit-test sketch follows this list):
1. MapReduce functions properly.
2. The data segregation rules are implemented.
3. Key-value pairs are created correctly.
4. The data is correct after MapReduce has completed.
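One possible way to exercise the map and reduce logic in isolation, before validating on a full cluster, is a unit-test framework such as Apache MRUnit (a now-retired Apache project). The sketch below assumes the word-count classes from the earlier example.

```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

  @Test
  public void mapperEmitsOnePerWord() throws Exception {
    // Expected outputs are listed in the order the mapper emits them.
    MapDriver.newMapDriver(new WordCount.TokenizerMapper())
        .withInput(new LongWritable(0), new Text("big data big"))
        .withOutput(new Text("big"), new IntWritable(1))
        .withOutput(new Text("data"), new IntWritable(1))
        .withOutput(new Text("big"), new IntWritable(1))
        .runTest();
  }

  @Test
  public void reducerSumsCounts() throws Exception {
    ReduceDriver.newReduceDriver(new WordCount.IntSumReducer())
        .withInput(new Text("big"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
        .withOutput(new Text("big"), new IntWritable(2))
        .runTest();
  }
}
```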
What is Performance Testing?
Performance testing covers the time taken to complete a job, memory utilization, data throughput, and similar system metrics. Failover testing, in turn, aims to confirm that data is processed seamlessly in case of a DataNode failure. Performance testing of Big Data primarily consists of two functions: the first is data ingestion and the second is data processing.
What is a “Combiner”?
A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
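In the word-count example sketched earlier, the combiner can simply reuse the reducer, since summing counts is associative and commutative. A minimal wiring sketch (assuming the earlier WordCountDriver and WordCount classes):

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
  // Reuse the sum reducer as a local combiner: each mapper pre-aggregates its
  // own output, shrinking the data shuffled to the reducers without changing
  // the final result. Call this in the driver, after setReducerClass.
  static void addCombiner(Job job) {
    job.setCombinerClass(WordCount.IntSumReducer.class);
  }
}
```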
What are the differences between an RDBMS and Hadoop?
RDBMS | Hadoop |
---|---|
RDBMS is a relational database management system | Hadoop is a node-based flat structure |
It is used for OLTP processing | Hadoop is used for analytical and Big Data processing |
In RDBMS, the database cluster uses the same data files stored in shared storage | In Hadoop, data can be stored independently on each processing node |
You need to preprocess data before storing it | You don’t need to preprocess data before storing it |
What are the data components used by Hadoop?
Data components used by Hadoop are:
- Pig
- Hive
How will you write a custom partitioner?
To write a custom partitioner for a Hadoop job, follow these steps (a sketch follows the list):
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper (driver) that runs the MapReduce job, add the custom partitioner either by calling the setPartitionerClass method or by supplying it as a configuration property
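The sketch below illustrates those steps with a hypothetical partitioner that routes keys whose first character sorts at or before “m” to partition 0 and spreads the rest by hash; the class name and routing rule are made up for illustration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 1: extend the Partitioner class with the job's map output key/value types.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {

  // Step 2: override getPartition to decide which reducer receives each key.
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString().toLowerCase();
    // Keys whose first character sorts at or before 'm' go to partition 0;
    // everything else is spread over the remaining partitions by hash.
    if (numPartitions == 1 || (!k.isEmpty() && k.charAt(0) <= 'm')) {
      return 0;
    }
    return (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1) + 1;
  }
}

// Step 3, in the driver that configures the job:
//   job.setPartitionerClass(AlphabetPartitioner.class);
//   job.setNumReduceTasks(4);   // partitions only matter with more than one reducer
```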
List out Hadoop’s three configuration files?
The three configuration files are listed below; a sketch of how a client program loads them follows the list:
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
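These XML files are read from the classpath by Hadoop’s Configuration class. The sketch below is one quick way for a client to confirm which values it picks up; the property names shown (fs.defaultFS, dfs.replication) are standard, and explicitly adding hdfs-site.xml and mapred-site.xml is only needed when they are not already loaded by the client in use.

```java
import org.apache.hadoop.conf.Configuration;

public class ConfigCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();  // reads core-default.xml, then core-site.xml
    conf.addResource("hdfs-site.xml");         // HDFS settings (e.g. dfs.replication)
    conf.addResource("mapred-site.xml");       // MapReduce settings

    System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
    System.out.println("dfs.replication = " + conf.get("dfs.replication", "3 (default)"));
  }
}
```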
What is a Task Tracker in Hadoop?
A TaskTracker in Hadoop is a slave node daemon in the cluster that accepts tasks from the JobTracker. It also sends heartbeat messages to the JobTracker every few seconds to confirm that it is still alive.
What are the general approaches in Performance Testing?
Performance testing of a Big Data application involves validating large volumes of structured and unstructured data, which requires a specific testing approach:
1. Setting up the application
2. Designing and identifying the task
3. Organizing the individual clients
4. Executing and analyzing the workload
5. Optimizing the installation setup
6. Tuning the components and deploying the system
What are the challenges in Performance Testing?
Following are some of the challenges faced while validating Big Data:
- Diverse technologies: No single tool supports a developer from start to finish; for example, NoSQL tools do not validate message queues.
- Scripting: A high level of scripting skill is required to design test cases.
- Environment: A specialized test environment is needed because of the size of the data.
- Monitoring: Solutions that can monitor the entire testing environment are limited.
- Diagnostic solutions: Customized solutions are needed to identify and remove bottlenecks and improve performance.
Name some of the tools for Big Data Testing.
Tools are available for the following categories of Big Data testing:
1. Big Data Testing
2. ETL Testing & Data Warehouse
3. Testing of Data Migration
4. Enterprise Application / Data Interface Testing
5. Database Upgrade Testing
What is Query Surge? And explain the architecture of Query Surge
Query Surge is one of the solutions for Big Data testing. It ensures data quality through a shared data testing method that detects bad data during testing and provides a clear view of the health of the data. It makes sure that the data extracted from the sources stays intact on the target by examining and pinpointing differences in the Big Data wherever necessary.
Query Surge Architecture consists of the following components:
1. Tomcat – The Query Surge Application Server
2. The Query Surge Database (MySQL)
3. Query Surge Agents – At least one has to be deployed
4. Query Surge Execution API, which is optional.