This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.
“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”
This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase 0.94.6.1.
Grab some data and register it in Hive
We’ll need some data to work with. For this purpose, grab some traffic stats from Wikipedia. Once we have some data, copy it up to HDFS.
    $ mkdir pagecounts ; cd pagecounts
    $ for x in {0..9} ; do wget "http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-0${x}0000.gz" ; done
    $ hadoop fs -copyFromLocal $(pwd) ./
For reference, this is what the data looks like.
    $ zcat pagecounts-20081001-000000.gz | head -n5
    aa.b Special:Statistics 1 837
    aa Main_Page 4 41431
    aa Special:ListUsers 1 5555
    aa Special:Listusers 1 1052
    aa Special:PrefixIndex/Comparison_of_Guaze%27s_Law_and_Coulomb%27s_Law 1 4332
As I understand it, each record is a count of page views of a specific page on Wikipedia. The first column is the project code (the language, plus a suffix for sister projects, as in aa.b), second is the page name, third is the number of page views, and fourth is the size of the page in bytes. Each file contains an hour’s worth of aggregated data. None of the above pages were particularly popular that hour.
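To make the four-column layout concrete, here is a minimal shell sketch that reads the five sample records shown above and totals the page views per project code. The function names sample_records and views_per_project are hypothetical helpers introduced only for this illustration; they are not part of the original scripts.

```shell
# Sample records copied from the head output above; fields are space-delimited.
sample_records() {
  cat <<'EOF'
aa.b Special:Statistics 1 837
aa Main_Page 4 41431
aa Special:ListUsers 1 5555
aa Special:Listusers 1 1052
aa Special:PrefixIndex/Comparison_of_Guaze%27s_Law_and_Coulomb%27s_Law 1 4332
EOF
}

# Sum the third column (page views) per first column (project code).
views_per_project() {
  awk '{ views[$1] += $3 } END { for (p in views) print p, views[p] }' | LC_ALL=C sort
}

sample_records | views_per_project
```

The same awk expression should work on the real files if you feed it the output of zcat instead of the sample.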
Now that we have data and understand its raw schema, create a Hive table over it. To do that, we’ll use a DDL script that looks like this.
    $ cat 00_pagecounts.ddl
    -- define an external table over raw pagecounts data
    CREATE EXTERNAL TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
    ROW FORMAT
      DELIMITED FIELDS TERMINATED BY ' '
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/ndimiduk/pagecounts';
Run the script to register our dataset with Hive.
    $ hive -f 00_pagecounts.ddl
    OK
    Time taken: 2.268 seconds
Verify that the schema mapping works by calculating a simple statistic over the dataset.
    $ hive -e "SELECT count(*) FROM pagecounts;"
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ...
    OK
    36668549
    Time taken: 25.31 seconds, Fetched: 1 row(s)
Hive says the 10 files we downloaded contain just over 36.6 million records. Let’s confirm things are working as expected by getting a second opinion. This isn’t that much data, so confirm on the command line.
    $ zcat * | wc -l
    36668549
The record counts match up — excellent.
Transform the schema for HBase
The next step is to transform the raw data into a schema that makes sense for HBase. In our case, we’ll create a schema that allows us to calculate aggregate summaries of pages according to their titles. To do this, we want all the data for a single page grouped together. We’ll manage that by creating a Hive view that represents our target HBase schema. Here’s the DDL.
    $ cat 01_pgc.ddl
    -- create a view, building a custom hbase rowkey
    -- (backslashes in the regex are doubled because Hive unescapes string literals)
    CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
    SELECT concat_ws('/',
             projectcode,
             concat_ws('/',
               pagename,
               regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
           pageviews, bytes
    FROM pagecounts;
The SELECT statement uses Hive to build a compound rowkey for HBase. It concatenates the project code, page name, and the date and hour of the source file, joined by the '/' character. A handy trick: it uses a simple regex to extract that timestamp from the source file names. Run it now.
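If you want to see what the rowkey looks like before involving Hive, the construction can be approximated in plain shell. This is only a sketch of the same idea, not the view itself: extract_stamp and build_rowkey are hypothetical helper names, the sed expression stands in for regexp_extract, and the hdfs://namenode prefix in the usage line is an assumed example of the full path that INPUT__FILE__NAME would carry.

```shell
# Extract the "YYYYMMDD-HHMMSS" stamp from a pagecounts file name,
# mirroring the capture group used in the view's regexp_extract call.
# Prints nothing when the name does not match.
extract_stamp() {
  printf '%s\n' "$1" | sed -n -E 's/.*pagecounts-([0-9]{8}-[0-9]{6})\..*$/\1/p'
}

# Join project code, page name, and stamp with '/', as concat_ws does.
build_rowkey() {
  printf '%s/%s/%s\n' "$1" "$2" "$(extract_stamp "$3")"
}

# Assumed example path; the host name is made up for illustration.
build_rowkey en qemu hdfs://namenode/user/ndimiduk/pagecounts/pagecounts-20081001-090000.gz
```

Putting the project code first keeps all rows for one project, and then one page, adjacent in HBase’s sorted key space, with the hourly samples for a page following each other in time order.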
    $ hive -f 01_pgc.ddl
    OK
    Time taken: 2.712 seconds
This is just a view, so the SELECT statement won’t be evaluated until we query it for data. Registering it with Hive doesn’t actually process any data. Again, make sure it works by querying Hive for a subset of the data.
    $ hive -e "SELECT * FROM pgc WHERE rowkey LIKE 'en/q%' LIMIT 10;"
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    ...
    OK
    en/q:Special:Search/Blues/20081001-090000 1 1168
    en/q:Special:Search/rock/20081001-090000 1 985
    en/qadam_rasul/20081001-090000 1 1108
    en/qarqay/20081001-090000 1 933
    en/qemu/20081001-090000 1 1144
    en/qian_lin/20081001-090000 1 918
    en/qiang_(spear)/20081001-090000 1 973
    en/qin_dynasty/20081001-090000 1 1120
    en/qinghe_special_steel_corporation_disaster/20081001-090000 1 963
    en/qmail/20081001-090000 1 1146
    Time taken: 40.382 seconds, Fetched: 10 row(s)
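One thing the output above makes visible is that page names can themselves contain the '/' separator (q:Special:Search/Blues, for example), so a consumer that needs the parts back cannot simply split on every slash. A minimal sketch of one way to handle that, assuming the project code never contains a slash and the timestamp is always the last segment; the three function names are hypothetical.

```shell
# Project code: everything before the first '/'.
rowkey_project() { printf '%s\n' "${1%%/*}"; }

# Timestamp: everything after the last '/'.
rowkey_stamp() { printf '%s\n' "${1##*/}"; }

# Page name: whatever remains in the middle; it may itself contain '/'.
rowkey_page() {
  rest="${1#*/}"               # drop the leading project code
  printf '%s\n' "${rest%/*}"   # drop the trailing timestamp
}

key='en/q:Special:Search/Blues/20081001-090000'
rowkey_project "$key"
rowkey_page "$key"
rowkey_stamp "$key"
```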
Register the HBase table
Now that we have a dataset in Hive, it’s time to introduce HBase. The first step is to register our HBase table in Hive so that we can interact with it using Hive queries. That means another DDL statement. Here’s what it looks like.
    $ cat 02_pagecounts_hbase.ddl
    -- create a table in hbase to host the view
    CREATE TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
    TBLPROPERTIES ('hbase.table.name' = 'pagecounts');
This statement tells Hive to create an HBase table named pagecounts with the single column family f. It registers that HBase table in the Hive metastore by the name pagecounts_hbase with 3 columns: rowkey, pageviews, and bytes. The SerDe property hbase.columns.mapping makes the association from Hive column to HBase column. It says the Hive column rowkey is mapped to the HBase table’s rowkey, the Hive column pageviews to the HBase column f:c1, and bytes to the HBase column f:c2. To keep the example simple, we have Hive treat all these columns as the STRING type.
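The mapping is purely positional: the first entry in hbase.columns.mapping belongs to the first Hive column, and so on. The following sketch spells that pairing out in shell. It is an illustration of the correspondence only, not how Hive parses the property, and show_mapping is a hypothetical name.

```shell
# Pair each Hive column with its HBase target, position by position.
# $1: comma-separated Hive columns
# $2: the hbase.columns.mapping value
show_mapping() {
  hive_cols=$1
  hbase_cols=$2
  while [ -n "$hive_cols" ]; do
    hive_col=${hive_cols%%,*}
    hbase_col=${hbase_cols%%,*}
    if [ "$hbase_col" = ":key" ]; then
      echo "$hive_col -> HBase rowkey"
    else
      echo "$hive_col -> family ${hbase_col%%:*}, qualifier ${hbase_col#*:}"
    fi
    # Advance to the next entry, or stop when the list is exhausted.
    case $hive_cols in
      *,*) hive_cols=${hive_cols#*,}; hbase_cols=${hbase_cols#*,} ;;
      *)   hive_cols= ;;
    esac
  done
}

show_mapping 'rowkey,pageviews,bytes' ':key,f:c1,f:c2'
```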
In order to use the HBase library, we need to make the HBase jars and configuration available to the local Hive process (at least until HIVE-5518 is resolved). Do that by specifying a value for the HADOOP_CLASSPATH environment variable before executing the statement.
    $ export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar:/usr/lib/zookeeper/zookeeper.jar
    $ hive -f 02_pagecounts_hbase.ddl
    OK
    Time taken: 4.399 seconds
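The jar names in that export are specific to my HDP-1.3.2 install, so they will likely differ on your cluster. A small sanity check, sketched here under the assumption that the classpath holds plain paths with no wildcards, can catch a mistyped entry before Hive does; missing_classpath_entries is a hypothetical helper.

```shell
# Print every entry of a colon-separated classpath that does not exist on disk.
missing_classpath_entries() {
  printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
    [ -n "$entry" ] || continue                 # skip empty segments
    [ -e "$entry" ] || printf '%s\n' "$entry"   # report what is missing
  done
}

# Prints nothing when every entry exists (or when the variable is unset).
missing_classpath_entries "${HADOOP_CLASSPATH:-}"
```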
Populate the HBase table
Now it’s time to write data to HBase. This is done using a regular Hive INSERT statement, sourcing data from the view with SELECT. There’s one more bit of administration we need to take care of though. This INSERT statement will run a MapReduce job that writes data to HBase. That means we need to tell Hive to ship the HBase jars and dependencies with the job.
Note that this is a separate step from the classpath modification we did previously. Normally you can do this with an export statement from the shell, the same way we specified the HADOOP_CLASSPATH. However, there’s a bug in HDP-1.3 that requires me to use Hive’s SET statement in the script instead.
    $ cat 03_populate_hbase.hql
    -- ensure hbase dependency jars are shipped with the MR job
    -- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
    SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;
    -- populate our hbase table
    FROM pgc INSERT INTO TABLE pagecounts_hbase SELECT pgc.* WHERE rowkey LIKE 'en/q%' LIMIT 10;
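That SET line is long and easy to mistype. If you need to adapt it to your own jar versions, a helper along these lines can assemble the value from a list of absolute local paths. This is a sketch: aux_jars_path is a hypothetical name, and it only produces the comma-separated file:// form used in the script above.

```shell
# Turn absolute local paths into the comma-separated file:// list
# used as the value of hive.aux.jars.path in the script above.
aux_jars_path() {
  result=
  for path in "$@"; do
    result="${result:+$result,}file://$path"
  done
  printf '%s\n' "$result"
}

aux_jars_path /etc/hbase/conf/hbase-site.xml /usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar
```

Paste the printed value after "SET hive.aux.jars.path =" in the script, followed by a semicolon.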
Note there’s a big ugly bug in Hive 0.12.0 which means this doesn’t work with that version. Never fear though, we have a patch in progress. Follow along at HIVE-5515.
If you choose to use a different method for setting Hive’s auxpath, be advised that it’s a tricky process: depending on how you specify it (HIVE_AUX_JARS_PATH, --auxpath), Hive will interpret the argument differently. HIVE-2349 seeks to remedy this unfortunate state of affairs.
    $ hive -f 03_populate_hbase.hql
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    ...
    OK
    Time taken: 40.296 seconds
Be advised also that this step is currently broken on secured HBase deployments. Follow along at HIVE-5523 if that’s of interest to you.
Query data from HBase-land
40 seconds later, you now have data in HBase. Let’s have a look using the HBase shell.
    $ echo "scan 'pagecounts'" | hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 0.94.6.1.3.2.0-111, r410a7a1c151ca953553eae68aa84e2a9f0d6e4ca, Mon Aug 19 19:00:12 PDT 2013
    scan 'pagecounts'
    ROW COLUMN+CELL
    en/q:Pan%27s_Labyrinth/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/q:Pan%27s_Labyrinth/20081001-080000 column=f:c2, timestamp=1381534232485, value=1153
    en/q:Special:Search/Jazz/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/q:Special:Search/Jazz/20081001-080000 column=f:c2, timestamp=1381534232485, value=980
    en/q:Special:Search/peinture/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/q:Special:Search/peinture/20081001-080000 column=f:c2, timestamp=1381534232485, value=989
    en/q:Special:Search/rock/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/q:Special:Search/rock/20081001-080000 column=f:c2, timestamp=1381534232485, value=980
    en/qadi/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/qadi/20081001-080000 column=f:c2, timestamp=1381534232485, value=1112
    en/qalawun%20complex/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/qalawun%20complex/20081001-080000 column=f:c2, timestamp=1381534232485, value=942
    en/qalawun/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/qalawun/20081001-080000 column=f:c2, timestamp=1381534232485, value=929
    en/qari'/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/qari'/20081001-080000 column=f:c2, timestamp=1381534232485, value=929
    en/qasvin/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/qasvin/20081001-080000 column=f:c2, timestamp=1381534232485, value=921
    en/qemu/20081001-080000 column=f:c1, timestamp=1381534232485, value=1
    en/qemu/20081001-080000 column=f:c2, timestamp=1381534232485, value=1157
    10 row(s) in 0.4960 seconds
Here we have 10 rows with two columns each containing the data loaded using Hive. It’s now accessible in your online world using HBase. For example, perhaps you receive an updated data file and have a corrected value for one of the stats. You can update the record in HBase with a regular PUT command.
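As a sketch of what such a correction could look like, the helper below prints an HBase shell put command for one cell. The function name hbase_put_cmd and the corrected value 1158 are made up for illustration; the table, rowkey, and column come from the scan output above. Note that it does no quoting, so a rowkey containing a single quote (en/qari' above, for instance) would need escaping first.

```shell
# Emit an HBase shell command of the form:
#   put 'table', 'row', 'family:qualifier', 'value'
# No escaping is done, so arguments must not contain single quotes.
hbase_put_cmd() {
  printf "put '%s', '%s', '%s', '%s'\n" "$1" "$2" "$3" "$4"
}

# Hypothetical correction: set the bytes column (f:c2) of one row to 1158.
hbase_put_cmd pagecounts en/qemu/20081001-080000 f:c2 1158
```

To apply it, pipe the output into hbase shell the same way the scan was run above.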
Verify data from Hive
The HBase table remains available to your Hive world; Hive’s HBaseStorageHandler works both ways, after all.
Note that this command expects that HADOOP_CLASSPATH is still set, and HIVE_AUX_JARS_PATH as well if your query is complex.
    $ hive -e "SELECT * from pagecounts_hbase;"
    OK
    en/q:Pan%27s_Labyrinth/20081001-080000 1 1153
    en/q:Special:Search/Jazz/20081001-080000 1 980
    en/q:Special:Search/peinture/20081001-080000 1 989
    en/q:Special:Search/rock/20081001-080000 1 980
    en/qadi/20081001-080000 1 1112
    en/qalawun%20complex/20081001-080000 1 942
    en/qalawun/20081001-080000 1 929
    en/qari'/20081001-080000 1 929
    en/qasvin/20081001-080000 1 921
    en/qemu/20081001-080000 1 1157
    Time taken: 2.554 seconds, Fetched: 10 row(s)
Continue using Hive for analysis
Since the HBase table is accessible from Hive, you can continue to use Hive for your ETL processing with MapReduce. Keep in mind that the auxpath considerations apply here too, so I’ve scripted out the query instead of just running it directly at the command line.
    $ cat 04_query_hbase.hql
    -- ensure hbase dependency jars are shipped with the MR job
    -- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
    SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;
    -- query hive data
    SELECT count(*) from pagecounts_hbase;
Run it the same way we did the others.
    $ hive -f 04_query_hbase.hql
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    ...
    OK
    10
    Time taken: 19.473 seconds, Fetched: 1 row(s)
There you have it: a hands-on, end-to-end demonstration of interacting with HBase from Hive. You can learn more about the nitty-gritty details in Enis’s deck on the topic, or see the presentation he and Ashutosh gave at HBaseCon. If you’re inclined to make the intersection of these technologies work better (faster, stronger), I encourage you to pick up any of the JIRA issues mentioned in this post or the previous one.
Happy hacking!
The post Using Hive to interact with HBase, Part 2 appeared first on Hortonworks.