What is Hive?
Hive is an open-source project under the Apache Software Foundation. It is a data warehouse software system built on top of Hadoop that manages and queries vast amounts of structured data using HiveQL, a language similar to SQL.
What is the definition of Hive? What is the present version of Hive? Explain ACID transactions in Hive.
Hive is an open-source data warehouse system used for analyzing and querying large datasets with a SQL-like language. At the time of writing, the current version of Hive is 0.13.1. Hive supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions, which are provided at the row level. Hive supports ACID transactions through the following operations:
- Insert
- Delete
- Update
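As a sketch, these operations require a transactional table, which in current Hive versions must be bucketed and stored as ORC; the table and column names below are hypothetical:

```sql
-- Hypothetical example: ACID operations need a bucketed, ORC-backed table
-- with the 'transactional' property set (names are illustrative only).
CREATE TABLE employee_txn (id INT, name STRING, salary INT)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE employee_txn VALUES (1, 'Asha', 50000);
UPDATE employee_txn SET salary = 55000 WHERE id = 1;
DELETE FROM employee_txn WHERE id = 1;
```

Note that UPDATE and DELETE also require the ACID-related session settings (such as hive.support.concurrency) to be enabled on the cluster.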
Differentiate between Pig and Hive.
| Criteria | Apache Pig | Apache Hive |
| --- | --- | --- |
| Nature | Procedural data flow language | Declarative SQL-like language |
| Application | Used for programming | Used for report creation |
| Used by | Researchers and programmers | Mainly data analysts |
| Operates on | The client side of a cluster | The server side of a cluster |
| Accessing raw data | Not as fast as HiveQL | Faster with built-in features |
| Schema or data type | Always defined in the script itself | Stored in the local database |
| Ease of learning | Takes a little extra time and effort to master | Easy to learn for those with a database background |
What are the three different modes in which Hive can be run?
- Local mode
- Distributed mode
- Pseudo-distributed mode
What kind of applications is supported by Apache Hive?
Hive supports client applications written in:
- Java
- PHP
- Python
- C++
- Ruby
What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?
Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS have put restrictions on what Hive can do. However, Hive is most suitable for data warehouse applications because it:
- Analyzes relatively static data
- Does not require fast response times
- Does not make rapid changes in data
Although Hive doesn’t provide fundamental features required for Online Transaction Processing (OLTP), it is suitable for data warehouse applications in large datasets. There are two types of tables in Hive:
- Managed tables
- External tables
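The practical difference shows up when a table is dropped: Hive deletes the data of a managed table but leaves the data of an external table in place. A minimal sketch (the table names and HDFS path are illustrative):

```sql
-- Managed table: Hive owns the data; DROP TABLE removes both
-- the metadata and the underlying files in the warehouse directory.
CREATE TABLE managed_logs (line STRING);

-- External table: Hive only tracks the metadata; DROP TABLE leaves
-- the files at the given HDFS location intact (path is hypothetical).
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/user/data/logs';
```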
What are the components used in Hive Query Processor?
Following are the components of a Hive Query Processor:
- Parse and Semantic Analysis (ql/parse)
- Metadata Layer (ql/metadata)
- Type Interfaces (ql/typeinfo)
- Sessions (ql/session)
- Map/Reduce Execution Engine (ql/exec)
- Plan Components (ql/plan)
- Hive Function Framework (ql/udf)
- Tools (ql/tools)
- Optimizer (ql/optimizer)
What are Buckets in Hive?
Buckets in Hive are used in segregating Hive table data into multiple files or directories. They are used for efficient querying.
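A minimal sketch of declaring buckets, assuming a hypothetical orders table bucketed on customer_id:

```sql
-- Illustrative: rows are distributed into 8 bucket files based on
-- the hash of customer_id (table and column names are hypothetical).
CREATE TABLE orders (order_id INT, customer_id INT, amount DOUBLE)
CLUSTERED BY (customer_id) INTO 8 BUCKETS;
```

In older Hive versions, hive.enforce.bucketing must be set to true before inserting so that the declared number of buckets is actually honored.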
How to skip header rows from a table in Hive?
Imagine that header records in a table are as follows:
System=…
Version=…
Sub-version=…
Suppose we do not want to include the above three header lines in our Hive query. To skip the header lines from our table in Hive, we set a table property:
CREATE EXTERNAL TABLE employee (
name STRING,
job STRING,
dob STRING,
id INT,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES ("skip.header.line.count"="3");
What is the precedence order of Hive configuration?
Hive uses the following precedence hierarchy for setting properties (highest precedence first):
- The SET command in Hive
- The command-line --hiveconf option
- hive-site.xml
- hive-default.xml
- hadoop-site.xml
- hadoop-default.xml
Explain the functionality of ObjectInspector.
ObjectInspector helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory, such as:
- An instance of a Java class
- A standard Java object
- A lazily initialized object
ObjectInspector tells the structure of the object and also the ways to access the internal fields inside the object.
Differentiate between Hive and HBase.
| Hive | HBase |
| --- | --- |
| Enables most SQL queries | Does not allow SQL queries |
| Operations do not run in real time | Operations run in real time |
| A data warehouse framework | A NoSQL database |
| Runs on top of MapReduce | Runs on top of HDFS |
Mention various date types supported by Hive.
Hive supports the TIMESTAMP data type, which stores dates in the java.sql.Timestamp format.
Three collection data types in Hive are:
- Arrays
- Maps
- Structs
What is the default database provided by Apache Hive for metastore?
By default, Hive provides an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.
Is it possible to change the default location of a managed table?
Yes, it is possible to change the default location of a managed table. It can be achieved by using the clause LOCATION '<hdfs_path>'.
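A minimal sketch, with a hypothetical table name and HDFS path:

```sql
-- The data for this managed table lives at the given path instead of
-- the default warehouse directory (name and path are illustrative).
CREATE TABLE sales (id INT, amount DOUBLE)
LOCATION '/user/custom/warehouse/sales';
```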
What is a partition in Hive?
Hive organizes tables into partitions to group similar data together based on a column or partition key. Each table can have one or more partition keys that identify a particular partition. Physically, a partition is nothing but a sub-directory in the table directory.
Why do we perform partitioning in Hive?
Partitioning provides granularity in a Hive table and therefore, reduces the query latency by scanning only relevant partitioned data instead of the whole data set.
For example, we can partition the transaction log of an e-commerce website by month: January, February, and so on. Any analytics regarding a particular month, say January, will then scan only the January partition (sub-directory) instead of the whole table data.
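The month example above can be sketched as follows (table and column names are hypothetical):

```sql
-- Each distinct txn_month value becomes its own sub-directory
-- under the table directory (names are illustrative).
CREATE TABLE transaction_log (txn_id INT, amount DOUBLE)
PARTITIONED BY (txn_month STRING);

-- A filter on the partition key scans only that sub-directory:
SELECT * FROM transaction_log WHERE txn_month = 'Jan';
```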
What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it?
By default, the maximum number of dynamic partitions that can be created by a mapper or reducer is set to 100. It can be changed by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = <value>
Note: You can set the total number of dynamic partitions that can be created by one statement by using: SET hive.exec.max.dynamic.partitions = <value>
Why do we need buckets?
There are two main reasons for performing bucketing to a partition:
- A map-side join requires the data belonging to a unique join key to be present in the same partition. In cases where the partition key differs from the join key, you can still perform a map-side join by bucketing the table on the join key.
- Bucketing makes the sampling process more efficient and therefore, allows us to decrease the query time.
What is indexing and why do we need it?
One of the Hive query optimization methods is Hive index. Hive index is used to speed up the access of a column or set of columns in a Hive database because with the use of index the database system does not need to read all rows in the table to find the data that one has selected.
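A sketch of the index DDL available in older Hive versions (indexes were later removed in Hive 3; the table, column, and index names here are hypothetical):

```sql
-- Build a compact index on the salary column so queries filtering
-- on salary need not read every row (names are illustrative).
CREATE INDEX salary_idx
ON TABLE employee (salary)
AS 'COMPACT' WITH DEFERRED REBUILD;

-- With DEFERRED REBUILD, the index is only populated when rebuilt:
ALTER INDEX salary_idx ON employee REBUILD;
```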
What is dynamic partitioning and when is it used?
In dynamic partitioning, the values of the partition columns are known only at runtime, that is, while the data is being loaded into the Hive table.
Usage:
- While loading data from an existing non-partitioned table, in order to improve sampling and thus decrease query latency.
- While not all values of the partitions are known beforehand, since finding these partition values manually in a huge dataset is a tedious task.
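A minimal sketch of a dynamic partition insert, assuming hypothetical source and target tables:

```sql
-- Enable dynamic partitioning; nonstrict mode allows all partition
-- columns to be determined dynamically (names are illustrative).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hive creates one partition per distinct txn_month value found in
-- the source data; the partition column comes last in the SELECT.
INSERT OVERWRITE TABLE transaction_log PARTITION (txn_month)
SELECT txn_id, amount, txn_month FROM raw_transactions;
```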
How Hive distributes the rows into buckets?
Hive determines the bucket number for a row using the formula: hash_function(bucketing_column) modulo num_of_buckets. The hash_function depends on the column's data type; for an integer column, it is simply:
hash_function(int_type_column) = value of int_type_column
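The formula can be reproduced with Hive's built-in hash and pmod functions; the table and column names below are hypothetical:

```sql
-- Compute the bucket each row would land in for a table bucketed
-- into 4 buckets on user_id. For an INT column, hash(user_id)
-- is just the value itself (names are illustrative).
SELECT user_id,
       pmod(hash(user_id), 4) AS bucket_number
FROM   transactions;
```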
What is the use of HCatalog?
HCatalog is used to share data structures with external systems. It offers access to the Hive metastore to users of other tools on Hadoop so that they can read and write data to Hive's data warehouse.
What is ObjectInspector functionality?
ObjectInspector is used to analyze the structure of individual columns and the internal structure of row objects. It provides a uniform way to access complex objects that can be stored in multiple formats in memory.
Explain about the different types of join in Hive.
There are 4 types of joins in Hive:
- JOIN: Very similar to an Inner Join in SQL; it returns the rows that satisfy the join condition in both tables.
- FULL OUTER JOIN: Combines the records of both the left and the right tables, returning all rows from both sides whether or not the join condition is fulfilled.
- LEFT OUTER JOIN: All the rows from the left table are returned, even if there are no matches in the right table.
- RIGHT OUTER JOIN: All the rows from the right table are returned, even if there are no matches in the left table.
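A minimal LEFT OUTER JOIN sketch, with hypothetical customers and orders tables:

```sql
-- Every customer row is returned; the order columns are NULL for
-- customers with no matching order (names are illustrative).
SELECT c.name, o.order_id
FROM customers c
LEFT OUTER JOIN orders o
ON (c.id = o.customer_id);
```

Note that Hive joins traditionally require equality conditions in the ON clause.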