Jeff Holoman's Blog

Using Hadoop / HBase with APEX

Posted by jholoman on August 21, 2012

This post is a cross-post from my company website.

Increasingly we are getting requests from our customers surrounding Big Data. In this post we’ll explain in detail one easy way to integrate Big Data technologies with a more traditional Oracle stack.

Recently BIAS was approached by one of our large customers with an interesting problem. The specifics are abstracted, but essentially they conduct a huge number of transactions daily across many thousands of locations, using a Point-of-Sale style system. Additionally, digital images are captured and associated with each transaction.

These images are useful in certain situations, but are not critical for the day-to-day operations of the business. The images are small (about 50 KB), consistent in size, and are sent to a centralized location via XML after being base64-encoded.
The images come over the wire like this:

<transaction id="GUID" timestamp="timestamp">
  <image1>base_64 encoded jpg</image1>
  <image2>base_64 encoded jpg</image2>
</transaction>
<transaction id="GUID" timestamp="timestamp">
  <image1>base_64 encoded jpg</image1>
  <image2>base_64 encoded jpg</image2>
</transaction>

Our client wanted to explore the possibility of not storing the BLOBs directly in the database, but still wanted efficient random-access on-demand. They are already running Cloudera’s Distribution including Apache Hadoop (CDH3) and asked if we could help develop a solution to process and display the images quickly. We put together this very basic POC over a weekend based on HBase as a potential solution, and another way to utilize their Hadoop cluster.

As it was the weekend, access to sample data was not available, so I took some license and generated it myself. I started with two images of approximately 50 KB each and used those two images for all of the images in this POC.

Here's a short Python script to generate the XML file. It was useful to be able to move the number of transactions up and down to test the MapReduce jobs that I was running.


import datetime
import uuid

# Read and base64-encode the two sample images (Python 2)
a = open("pic_to_shrink.jpg", "rb").read().encode("base64")
b = open("pic_to_shrink2.jpg", "rb").read().encode("base64")

d = '<?xml version="1.0" encoding="UTF-8" ?>\n'
d = d + '<transactions>\n'
i = 0
while i < 5000:
    c = uuid.uuid4()                    # new transaction id each pass
    e = datetime.datetime.now()         # timestamp for this transaction
    d = d + '<transaction id="%s" timestamp="%s">\n' % (c, e)
    d = d + '<image1>' + a + '</image1>'
    d = d + '<image2>' + b + '</image2>'
    d = d + '</transaction>\n'
    i = i + 1
d = d + '</transactions>'

f = open("transactions.xml", 'w')
f.write(d)
f.close()

This produces XML like the following (content of the <imageN> tags trimmed):

<?xml version="1.0" encoding="UTF-8" ?>
<transaction id="c59d6ca3-d9a8-4e27-9ca8-274d3aa21fc3" timestamp="2012-07-26 18:05:06.875927">

This script generates the data in XML, which is not ideal for storage. In a production application we would likely use native Java MapReduce to handle the processing, but it was more convenient for the POC to use a combination of Hadoop Streaming, Python, and Pig.

A look at the Hadoop wiki shows that Hadoop Streaming's StreamXmlRecordReader can be configured with begin and end tags to determine which XML records are sent to the map tasks.

I found a nice summary by David Hill on processing XML in Python with ElementTree, and the mapper script below borrows heavily from it.

#!/usr/bin/env python
import sys
import cStringIO
import elementtree.ElementTree as ET

if __name__ == '__main__':
    in_it = False
    Sbuffer = None

    for line in sys.stdin:
        line = line.strip()
        if line.find("<transaction id=") != -1:
            # start of a transaction: open a StringIO buffer for parsing with ElementTree
            in_it = True
            Sbuffer = cStringIO.StringIO()
            Sbuffer.write(line)
        elif line.find("</transaction>") != -1:
            # end of the transaction: parse the buffered record and emit one tab-delimited line
            if in_it:
                Sbuffer.write(line)
                value = Sbuffer.getvalue()
                xml_root = ET.fromstring(value)
                id = xml_root.get("id")
                ts = xml_root.get("timestamp")
                image1 = xml_root.find('image1').text
                image2 = xml_root.find('image2').text
                print '%s\t%s\t%s\t%s' % (id, ts, image1, image2)
                Sbuffer.close()
                in_it = False
        elif in_it:
            # inside a transaction: keep buffering lines
            Sbuffer.write(line)

In my testing I was getting duplicate keys sent to the reducer, so I wanted to make sure that I was only emitting one line per transaction id that was in the original file.
I have this in the queue to research a bit more; I suspect it has to do with the StreamXmlRecordReader implementation, as there's been some other buzz about this out on the 'net.

#!/usr/bin/env python
import sys

if __name__ == '__main__':
    prev_key = ''
    for line in sys.stdin:
        current_line = line.strip()
        # just grab the key (the first tab-delimited field)
        current_key = current_line.split('\t')[0]
        # compare the keys and only emit the first line seen for each transaction id
        if current_key != prev_key:
            prev_key = current_key
            print current_line

Next, execute the Streaming job. Here the mapper and reducer scripts above are assumed to be saved as mapper.py and reducer.py:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-mapper ./mapper.py \
-file ./mapper.py \
-reducer ./reducer.py \
-file ./reducer.py \
-inputreader "StreamXmlRecordReader,begin=<transaction id,end=</transaction>" \
-input blog/transactions.xml \
-output blog/output

Now the data has been parsed out of XML and contains one tab-delimited record per line.
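For illustration, a single output record looks roughly like this, with the four fields tab-separated and the two base64 image columns truncated (the id and timestamp here are the ones from the sample XML above):

c59d6ca3-d9a8-4e27-9ca8-274d3aa21fc3    2012-07-26 18:05:06.875927    /9j/4AAQSkZJRg...    /9j/4AAQSkZJRg...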

At this point it’s probably worthwhile to take a moment and discuss HBase and Pig.
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s BigTable. At its core it’s a Map, like an associative array or JavaScript object, and the map is indexed by a row key, column key, and a timestamp.
The row keys are kept in sorted order and it’s built on a distributed filesystem, HDFS. Additionally HBase provides a great REST interface called Stargate that will be utilized here.
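As a rough mental model, the images table used below can be pictured as a sorted map of maps, keyed by row key, then column, then timestamp. This is just a conceptual sketch in Python, not the HBase client API; the row key and values are borrowed from later in this post:

# Conceptual sketch only: HBase as nested maps
images_table = {
    "363b8d19-d6df-46d8-beb5-321ba40b63f5": {                        # row key (rows kept sorted)
        "image:ts":     {1343353321129: "2012-07-26 18:05:07.284005"},
        "image:image1": {1343353321129: "<raw jpg bytes>"},
        "image:image2": {1343353321129: "<raw jpg bytes>"},
    },
}

# A read is just a series of lookups; HBase returns the newest version by default
row = images_table["363b8d19-d6df-46d8-beb5-321ba40b63f5"]
latest = max(row["image:ts"])
print row["image:ts"][latest]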

A simple table with one column family can be created easily:

hbase> create 'images', {NAME => 'image'}

Pig is a higher-level abstraction platform for writing jobs against Hadoop. Pig provides a scripting language that makes data analysis and processing easier. In this case, I want to read the results of the last MapReduce job and load the data into HBase. Also, because the HBase REST server base64-encodes values on the way out, I want to store the original decoded bytes in the HBase table rather than the base64 strings from the XML. To that end, I wrote a simple user-defined function (UDF) to decode from base64.

package myudfs;

import java.io.IOException;
import org.apache.commons.codec.binary.Base64;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class DECODE64 extends EvalFunc<DataByteArray> {
    public DataByteArray exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            byte[] decoded = Base64.decodeBase64(str);
            return new DataByteArray(decoded);
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}

And the Pig script:

REGISTER myudfs.jar;
data = LOAD '/user/jholoman/blog/output/part-00000' using PigStorage() as (id:chararray, ts:chararray, image1:chararray, image2:chararray);

A = FOREACH data generate id, ts, myudfs.DECODE64(image1), myudfs.DECODE64(image2);
store A into 'hbase://images' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('image:ts image:image1 image:image2');

$ pig image_load_to_hbase.pig

Verification that the data has been successfully loaded:

hbase(main):001:0> count 'images'
Current count: 1000, row: 363b8d19-d6df-46d8-beb5-321ba40b63f5                                                                        
Current count: 2000, row: 6aad1a37-0a4a-4f66-a8d8-115a24ac53ee                                                                        
Current count: 3000, row: 9c6605ef-4ec9-4961-9e46-6cabdab3d7ab                                                                        
Current count: 4000, row: ce2493ac-7a16-44f0-838a-fd474e38918c                                                                        
Current count: 5000, row: fffaebe5-2268-4961-8524-f34df95062c5                                                                        
5000 row(s) in 6.4160 seconds
hbase(main):002:0> scan 'images', {STARTROW => '363b8d19-d6df-46d8-beb5-321ba40b63f5', COLUMNS => 'image:ts', LIMIT => 1}
ROW                                COLUMN+CELL                                                                                        
 363b8d19-d6df-46d8-beb5-321ba40b6 column=image:ts, timestamp=1343353321129, value=2012-07-26 18:05:07.284005                         
1 row(s) in 0.0470 seconds

Starting the HBase REST Service is easy:

$ /usr/lib/hbase/bin/hbase-daemon.sh start rest -p 8000

And works as expected:

curl -H "Accept: text/xml" http://localhost:8000/images/363b8d19-d6df-46d8-beb5-321ba40b63f5/image:ts<?xml version="1.0" encoding="UTF-8" standalone="yes"?><CellSet><Row key="MzYzYjhkMTktZDZkZi00NmQ4LWJlYjUtMzIxYmE0MGI2M2Y1"><Cell timestamp="1343353321129" column="aW1hZ2U6dHM=">MjAxMi0wNy0yNiAxODowNTowNy4yODQwMDU=</Cell></Row></CellSet>

We previously demonstrated the Oracle Big Data Connectors and used an APEX environment to show off the Direct Connector for HDFS. It made sense to use that environment to demonstrate displaying the images from HBase.
First, the on-demand process:

declare
  l_clob           clob;
  l_buffer         varchar2(32767);
  l_amount         number;
  l_offset         number;
  l_rowkey         varchar2(1000);
begin
  l_rowkey := apex_application.g_x01;
  l_clob := apex_web_service.make_rest_request(
              p_url => 'http://hbase_host:8000/images/'||l_rowkey,
              p_http_method => 'GET');
  l_amount := 32000;
  l_offset := 1;
  loop
    dbms_lob.read( l_clob, l_amount, l_offset, l_buffer );
    htp.prn(l_buffer);  -- stream the response back to the page
    l_offset := l_offset + l_amount;
    l_amount := 32000;
  end loop;
exception
  when no_data_found then
    null;  -- raised by dbms_lob.read at the end of the CLOB
end;

And the code to call it:

function getHBaseImages(pRowKey){
  var image1;
  var image2;
  var ts;
  var get = new htmldb_Get(null, $v('pFlowId'), 'APPLICATION_PROCESS=getHBaseImages', $v('pFlowStepId'));

  get.addParam('x01', pRowKey);
  gReturn = get.get();
  get = null;
  gXML = apex.jQuery.parseXML(gReturn);
  // each Cell element holds one column name/value pair, both base64-encoded
  var vals = apex.jQuery(gXML).find("Cell");
  $(vals).each(function(i,o) {
    var returnCell = atob($(this).attr("column"));
    switch (returnCell) {
      case 'image:image1':
        image1 = $(this).text();
        break;
      case 'image:image2':
        image2 = $(this).text();
        break;
      case 'image:ts':
        ts = atob($(this).text());
        break;
    }
  });

  $('.image1').attr('src', 'data:image/jpg;base64,'+image1);
  $('.image2').attr('src', 'data:image/jpg;base64,'+image2);
  $('.ts').text(ts);  // span class assumed for the decoded timestamp
}

As mentioned previously, the HBase REST service base64-encodes all the values. For the images this is no problem, as the browser can render base64-encoded images directly via a data URI. The code above calls HBase, parses the resulting XML, and sets the source of two image elements on the page and a span for the timestamp. From here, setting up a link on the table is trivial, and on click the images are pulled from HBase.

Check this out in action!

In this post we covered one potential scenario for integrating Big Data technologies with Oracle. It didn't take very long to put this POC together, and it demonstrates one technique for accessing data in Hadoop/HBase. Stay tuned for more Big Data examples…

Posted in APEX, Big Data, Hadoop, HBase | 3 Comments »

Using images in APEX Interactive Reports

Posted by jholoman on March 9, 2012

With all of the features of interactive reports, sometimes the data itself could use a little UI improvement.

For example, if a user is scanning this report, it can be difficult to determine quickly if the product is available or not.

Adding in visual elements to the report will improve the overall UX.

Read the rest of this entry »

Posted in APEX, Oracle | 2 Comments »

Apex 4.1 Page Items to Submit

Posted by jholoman on February 26, 2012

I really haven’t taken the time to blog at all for quite some time. I made a resolution to do so more often, so here goes.

As I was setting up a quick example today, I noticed a feature in the report regions: the Page Items to Submit section beneath the Region Source:


I admittedly haven't been through all the new features, so I checked the documentation:

The new attribute “Page Items to Submit” has been added to regions of type IR, Classic Report, Tabular Form and plug-ins having that feature enabled. The values of the specified page items will be submitted with a “Partial Page Refresh” request to set them in session state, so that they can be used in the WHERE clause of the regions SQL statement. A “Partial Page Refresh” can be triggered with the “Refresh” dynamic action, by changing the sorting or by paginating.

Very cool. If you are doing refreshes with Ajax in your application, which I end up doing pretty frequently, you end up making calls to set the session state of an item, like this:

var get = new htmldb_Get(null, $v('pFlowId'),'APPLICATION_PROCESS=dummy',0);
get.add('P24_DEPTNO', $v('P24_DEPTNO'));
gReturn = get.get();
get = null;

Or perhaps you’ve been including a dynamic action of type Execute PL/SQL code that uses the existing Page Items to Submit functionality to save the value in session state:

Using this method, with a Change event attached to P24_DEPTNO, you’ll see that 2 calls are fired when deptno is changed:

The first is the call to set the value, then another to pull the report.

But with the new option we can consolidate these actions into one call, set the Page Items to Submit value to P24_DEPTNO in the report Region, and have just a single dynamic action to refresh the report:

And here’s the link section in the IR above:

DEMO here:

Posted in Uncategorized | 3 Comments »

Google Wave

Posted by jholoman on May 30, 2009

There’s been a lot of buzz about the Google Wave unveiling. Here’s the entire presentation.

Pretty cool. I'm looking forward to seeing this develop and perhaps finding ways to integrate with APEX.

Posted in Uncategorized | Leave a Comment »

Adding Time to jQuery Datepicker

Posted by jholoman on May 15, 2009

A number of folks have blogged about using the jQuery datepicker in APEX. One of the redeeming qualities of the APEX datepicker widget is its support for a time component, which the jQuery datepicker lacks. I'm sure most of the time the ability to pick just the date is enough. There are, however, times when you need finer granularity. There are some separate timepicker plugins, but having the date/time tied to one element is more natural in relation to the standard APEX datepicker. I'm posting to share one solution, but mostly to get other ideas on the best way to handle this. I like the functionality and ease of implementation of the jQuery datepicker, but I'm not crazy about the work-around shown here to handle the time.

Check out this small demo in action here if you don’t want to read the whole post.

The jQuery docs show that the format string supports '...' for literal text.

This allows the opportunity to append a time component to the standard format mask. The solution presented here requires the user to manually edit the time component, so to accommodate this, a database function is created to validate the date. This could of course be done just using JavaScript, but I think using the database to do the validation is easier.

create or replace function is_date(p_date in varchar2, p_format_mask_in in varchar2)
return boolean
is
  l_date date;
  l_format_mask varchar2(50);
begin
  l_format_mask := 'fx'||p_format_mask_in;
  l_date := to_date(p_date, l_format_mask);
  return true;
exception when others then
  return false;
end;

An on-demand application process, VALIDATE_DATE, is created to call the validation function, along with a corresponding JavaScript function to call it. I kept is_date as a separate database function instead of rolling the functionality into the application process in case I had need for it elsewhere.

declare
 l_return number(1);
 l_date varchar2(50);
begin
 l_date := wwv_flow.g_x01;
 if (is_date(l_date, v('PICK_DATE_FORMAT_MASK'))) then
   l_return := 1;
 else
   l_return := 0;
 end if;
 htp.prn(l_return);
end;


function validateDate (pDate, pOraFormatMask) {
 var get = new htmldb_Get(null, $v('pFlowId'), 'APPLICATION_PROCESS=VALIDATE_DATE', 0);
 var lDate = $(pDate).val();
 get.addParam('x01', lDate);
 var gReturn = get.get();
 get = null;
 if (gReturn == 0) {
   alert('Invalid Date Format Entered.\n Dates should be in '+ pOraFormatMask +' format.');
 }
}
We’ll need two application items and corresponding computations to hold the different format masks, PICK_DATE_FORMAT_MASK and JS_DATE_FORMAT_MASK.

The format mask I want to use for Oracle is 'YYYY-MM-DD HH24:MI'; the jQuery datepicker mask that matches is 'yy-mm-dd', but it has no time component.

I then create a hidden item on page zero, P0_JS_SYSDATE, with a source value of:

to_char(sysdate, '&PICK_DATE_FORMAT_MASK.');

This will allow me to grab the time component easily for the datepicker.

All that's left now is to call the function that creates the datepicker:

function dpWithTime(pOraFormatMask, pDpFormatMask, pSysdate) {
   // grab the time portion of P0_JS_SYSDATE, e.g. " 14:30"
   var timeFormat = " " + pSysdate.split(" ")[1];
   // selector assumed: the original targets the datepicker items via a fieldset selector
   $("fieldset input.datepicker").datepicker({
        dateFormat: pDpFormatMask + timeFormat,
        showOn: "button",
        buttonImage: "/i/asfdcldr.gif",
        buttonImageOnly: true
        }).change(function(event) {validateDate(this, pOraFormatMask)});
}

In the html header for the page, or the page template:

<script type="text/javascript">
var oraFormatMask = "&PICK_DATE_FORMAT_MASK.";
var jsFormatMask = "&JS_DATE_FORMAT_MASK.";
$(function() {
  // wait for the page items to render before reading P0_JS_SYSDATE
  var l_sysdate = $('#P0_JS_SYSDATE').val();
  dpWithTime(oraFormatMask, jsFormatMask, l_sysdate);
});
</script>

Again the example is here.

Let me know if you have any other ideas. Like I said at the start, I'm not thrilled with this approach, but if the rest of your app is already using the jQuery datepicker and you need one with a time component, maybe this will work. Thanks to Tyler Muth for his earlier blog/sample app using the jQuery datepicker, and for the bit about using the fieldset selector, which he blogged about here.

Posted in APEX, Oracle | Tagged: , , , | 3 Comments »

Interesting Video

Posted by jholoman on April 9, 2009

I’m not one of those people who spends an inordinate amount of time on YouTube, but my Mom sent me a video that I thought was pretty interesting.

One thing it really underscores is that staying current is very important, and as “other stuff” comes up in life, I’m making it a personal goal to continue learning…

Posted in Uncategorized | Leave a Comment »

Arrays over DB Link

Posted by jholoman on April 8, 2009

I recently had a problem come up where we needed to return a result set via database link. I was aware that the data types allowed over a DB link are somewhat restricted: no ref cursors and no object types (including XMLType), to name a few.

I wasn’t aware that you could use PL/SQL Associative arrays (index-by tables) via db link. Here’s a small example:

On the remote db:

SQL> create or replace package aa_test
  as
    type my_emp_type is table of emp%rowtype index by pls_integer;
    function get_emp(p_empno in number) return my_emp_type;
  end aa_test;
  /

Package created.

SQL> create or replace package body aa_test
  as
    function get_emp(p_empno in number) return my_emp_type
    is
      l_emp_rec my_emp_type;
    begin
      select * bulk collect into l_emp_rec from emp where empno = p_empno;
      return l_emp_rec;
    exception when no_data_found then
      return l_emp_rec;
    end get_emp;
  end aa_test;
  /

Package body created.

And Locally:

SQL> declare
      l_local_emp_rec aa_test.my_emp_type@holoman;
     begin
      l_local_emp_rec := aa_test.get_emp@holoman(7654);
     end;
     /

PL/SQL procedure successfully completed.

Always nice to learn something new.

Posted in Oracle, PL/SQL | Leave a Comment »

Show / Hide Report Columns with jQuery

Posted by jholoman on April 7, 2009

I have been a beneficiary of others' work regarding the hide/show columns functionality that I believe Carl first implemented here:

Vikas and Denes also have examples. The one thing that always kind of bothered me about the example was that in order to save the user's preference, you needed to click a button. This doesn't seem very intuitive to me, so I decided to revamp the example a bit and save the preference in a cookie.

You can see this in action here:

I rewrote the code with jQuery and strayed a bit from the original implementation, but I think it works OK.

Posted in APEX | Tagged: , , | Leave a Comment »

Franklin Street

Posted by jholoman on April 7, 2009

Last night my brother and a bunch of other friends made the trip to Chapel Hill to watch the Tarheels crush Michigan St. I couldn’t make it up there, but it looked crazy:

Posted in Personal | Leave a Comment »