Hadoop as a service

It’s been a fun year learning new stuff, and along the way Andy Piper helped out with a bite sized architectural debate while I was experimenting with a Hadoop service on Bluemix. Having a short lived/disposable memory I thought it would be worth posting the discussion here for future reference…

‏@jtonline: Still pondering how a hadoop buildpack might compare to a hadoop service

@andypiper: @jtonline why would you want a buildpack for Hadoop – surely data store = service (broadly) not runtime. #cloudfoundry

@jtonline: @andypiper hmm, maybe, but you want to bring the processing to the data don’t you? Currently seems like services will hold big data in silos

@jtonline: @andypiper for example, I might want to use one of the address verification services from my map reduce job. I’m probably missing something.

@andypiper: @jtonline multiple services can be bound to multiple apps. And you can call jobs in those services from those apps.

@andypiper: @jtonline PivotalHD ships as a service in PivotalCF – obviously you may need data access libraries in the buildpack for the app.

@jtonline: @andypiper not convinced hadoop is just a data store. Do I need apps on runtime to kick off oozie jobs with details of other services?

@andypiper: @jtonline the runtime/service debate on CF has been a long one but I think fairly clean/clear. I’d see Hadoop as a shared resource.

@andypiper: @jtonline bear in mind buildpack -> droplet -> runnable containerised app instance.

@jtonline: @andypiper agreed. Maybe what I’m missing is an easy way to wire services together?

@andypiper: @jtonline yeah maybe – you end up with apps acting as service coordinators I guess.

@andypiper: @jtonline coupled with the fact that apps are intentionally short-lived and best stateless… interesting architectural debate :-)

@andypiper: @jtonline (for “short-lived” read “disposable” my bad)

It should be an interesting 2015.

BigInsights Quicker Start

I’ve been taking a break from Liberty and JAX-RS recently to start tinkering with IBM’s BigInsights Hadoop distribution. To make things easier/more interesting my first attempts were using the Analytics for Hadoop service on BlueMix. In case it helps anyone, here’s what I ended up with before needing to install BigInsights myself:

And this is the script I used to upload data in the video (unfortunately I didn’t have any luck using the HttpFS API):



curl -iv –user ${BIUSER}:${BIPASSWORD} –insecure -X POST “${BIURI}/user/${BIUSER}/sample-data?isfile=false”

curl -iv –user ${BIUSER}:${BIPASSWORD} –insecure -X POST “${BIURI}/user/${BIUSER}/sample-data/orgdata.unl” –header “Content-Type:application/octet-stream” –header “Transfer-Encoding:chunked” -T “orgdata.unl”

curl -iv –user ${BIUSER}:${BIPASSWORD} –insecure -X POST “${BIURI}/user/${BIUSER}/sample-data/persondata.unl” –header “Content-Type:application/octet-stream” –header “Transfer-Encoding:chunked” -T “persondata.unl”


  • since recording the demo Bluemix has added a United Kingdom region, however it looks like the Analytics for Hadoop service is currently only available on the US South region.
  • there is also now a BigInsights service on Bluemix which allows you to provision multi-node Hadoop clusters.