How to Install Apache Hive

1. Hive background

In most MapReduce scenarios, the data being analyzed is structured, and SQL statements are more convenient to work with and cheaper to develop in. Writing raw MapReduce programs instead significantly increases both the labor cost and the cost of writing code.

2. What is Hive?

Hive is a data warehouse tool. It exposes HQL (a dialect of SQL) for programming; the underlying data is stored on HDFS, and HQL statements are converted into MapReduce programs for execution.

Hive is a tool built on top of Hadoop that provides HQL programming. It gives us a new way of operating Hadoop.

In effect, Hive is just another form of Hadoop client.

3. Hive architecture

1) User interface layer

Users operate Hive directly through the command line, JDBC, or the web UI.

2) Thrift layer

Provides a cross-language service; it is used only when Hive is accessed via JDBC.

3) Metadata layer

Stores the metadata for the data kept in Hive.

① The data in a table is stored on HDFS, under a specific HDFS path.

② Metadata: data that describes the raw data (the data in the tables); in particular, it records the mapping between each table and its HDFS path.

For Hive, metadata is structured data and is not very large, so it is stored in a traditional relational database. By default, Hive stores its metadata in the embedded Derby database; in production it is stored in MySQL.

4) Driver layer, the core of Hive, which includes:

① Interpreter: parses HQL into an abstract syntax tree

② Compiler: compiles HQL into a MapReduce program

③ Optimizer: optimizes the output of the previous step

④ Executor: submits the final, optimized plan for execution
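The driver pipeline above can be observed from the Hive CLI with EXPLAIN, which prints the plan that the compiler and optimizer produce for a query. A minimal sketch (the table name `t` is a hypothetical example; it requires a running Hive installation):

```shell
# EXPLAIN shows the stages (MapReduce jobs) the driver generates for a query.
# Table `t` is a hypothetical example table.
hive -e "EXPLAIN SELECT count(*) FROM t;"
```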

4. Installing Hive with Derby (the default) as the metadata database

1) Preparation

Hive runs on top of Hadoop, so you need to make sure Hadoop is available.

Install the JDK, set up the Hadoop environment, and configure the JDK and Hadoop environment variables.
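A quick way to sanity-check the prerequisites is to verify that the relevant commands are on the PATH (a minimal sketch; it only reports presence, it does not validate versions):

```shell
#!/bin/sh
# Report whether the JDK and Hadoop prerequisites are reachable on the PATH.
for cmd in java hadoop jps; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: NOT FOUND"
  fi
done
```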

2) How many nodes does Hive need?

Hive behaves like a client, so a single node is enough.

3) Installation

Use Derby as the metadata database.

Upload the installation package.

Unzip it.

Add Hive's bin directory to the environment variables.

(If you start Hive directly at this point, it reports an error because the metadata database has not been initialized.)

Initialize Hive's metadata database: schematool -dbType derby -initSchema (after initialization, two new items appear: derby.log and the metastore_db directory)

metastore_db: the directory that stores the metadata database's data

derby.log: the log file of the Derby database

Start: hive

If you change to a different directory and start Hive from there, it still reports an error (metadata initialization error).

This is because Derby generates its log and data directory in whatever directory the initialization is run from. Once you change directories and start Hive elsewhere, it can no longer access the files generated by the initialization.
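Put together, the Derby-based installation looks roughly like this (a sketch; the Hive version number and the /opt install path are assumptions, adjust them to your environment):

```shell
# Sketch of the Derby-based install; version and paths are assumptions.
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt   # unzip the uploaded package
export HIVE_HOME=/opt/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin                 # configure the hive/bin environment variable

cd $HIVE_HOME                            # Derby writes metastore_db and derby.log here
schematool -dbType derby -initSchema     # initialize the metadata database
hive                                     # start Hive from this same directory
```

Note that hive must be started from the same directory where schematool was run, because Derby's files live in the initialization directory.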

In production, the metadata database is generally set to MySQL.

5. Installing Hive with MySQL as the metadata database

1) Preparation

Hive runs on top of Hadoop, so you need to make sure Hadoop is available.

Install the JDK, set up the Hadoop environment, and configure the JDK and Hadoop environment variables.

2) Installation

Use MySQL as the metadata database.

① Install MySQL

② Unpack and configure Hive

Upload the installation package

Unzip

Configure environment variables

Create the Hive configuration file conf/hive-site.xml
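A minimal conf/hive-site.xml pointing the metastore at MySQL might look like the following (the host, database name hive_meta, user, and password are placeholders; the property names are Hive's standard JDO connection settings):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- JDBC URL of the MySQL metadata database; host and db name are placeholders -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
  </property>
  <!-- MySQL JDBC driver class -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- MySQL credentials; placeholders -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
</configuration>
```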

③ Add the MySQL driver package

Place the MySQL JDBC driver jar in the lib directory of the Hive installation directory

④ Initialize the metadata database

Initialize the metadata database: schematool -dbType mysql -initSchema
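The MySQL-based steps ③ and ④ can be sketched as follows (the driver jar name and version are assumptions; run after the environment variables from ② are in place):

```shell
# Sketch: wire Hive to MySQL and initialize the schema.
# The driver jar name/version is an assumption.
cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib/   # ③ put the JDBC driver on Hive's classpath
schematool -dbType mysql -initSchema                 # ④ initialize the metadata database in MySQL
hive                                                 # start Hive; with MySQL it can run from any directory
```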