Separate forest for Hadoop deployment?

  • Last Post 26 May 2016
Ravi.Sabharanjak posted this 25 May 2016

Hello all,
I am wondering if anyone has first-hand experience with deploying a Hadoop cluster that uses AD. According to our Hadoop folks, it is "industry standard practice" to deploy the Hadoop cluster into a separate forest.
How true / valid is this?

gkirkpatrick posted this 25 May 2016

My understanding is that Hadoop doesn’t actually work with AD, but instead with MIT Kerberos. You have to set up a separate KDC and realm for Hadoop and then create an external trust relationship to your AD domain.


That’s just what I read when I set up Hadoop… I didn’t go through the exercise myself.





Ravi.Sabharanjak posted this 25 May 2016

It does seem to work (we have a separate AD forest here for the deployment).


gkirkpatrick posted this 25 May 2016

That’s good to know.


I can’t imagine why a separate forest would be necessary though.





joe posted this 25 May 2016

FWIW, ours is set up as Gil suggested. It is an MIT Kerb realm and an external trust but there is no actual additional AD forest.
So much of the Hadoop stack feels like I'm being dragged back a decade in terms of infrastructure design for identity. I try to spend all my time on OAuth these days and someone really wants to set up an MIT Kerb trust? For real?
Joe K.


gkirkpatrick posted this 25 May 2016

IKR? It’s kind of embarrassing, especially for something as ostensibly modern as Hadoop.





Techman06 posted this 26 May 2016

We set ours up using AD, in our forest, but it requires creating a lot of SPNs and keytabs, which I was less than enthused about. It can be done, albeit with some compromises.





Ravi.Sabharanjak posted this 26 May 2016

Is your setup a separate forest, or the same one that you use for users, desktops etc?
The SPNs and keytabs can be delegated + automated, but did you run into any reasons to justify a separate forest such as performance, security etc?


Parzival posted this 26 May 2016


There are two options you can use with most implementations. The first is to run a KDC on the head node with a Kerberos realm trust to the corporate AD. The advantage is that the cluster itself takes care of adding/removing nodes and of making sure the keytab files are created and distributed to the worker nodes.

The second is to integrate directly with AD (a regular Linux domain join), but then you have to take care of the keytab files yourself.
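To give a feel for the bookkeeping this implies, here is a minimal sketch that enumerates the service principals a direct AD integration would need, one per service per node, using the usual `service/host.fqdn@REALM` naming. The hostnames, the realm, and the list of services are made-up examples, not taken from any particular distribution; creating the accounts and exporting the keytabs is a separate step that is not shown.

```python
from typing import Iterable, List


def service_principals(hosts: Iterable[str],
                       services: Iterable[str],
                       realm: str) -> List[str]:
    """Return one Kerberos principal name per (host, service) pair.

    Hostnames are lowercased and the realm is uppercased, following
    the common Kerberos convention.
    """
    services = list(services)  # allow a generator to be reused per host
    return [
        f"{svc}/{host.lower()}@{realm.upper()}"
        for host in hosts
        for svc in services
    ]


if __name__ == "__main__":
    # Hypothetical worker names and realm, for illustration only.
    hosts = [f"worker{i:03d}.hadoop.example.com" for i in range(1, 4)]
    for principal in service_principals(hosts, ["hdfs", "yarn", "HTTP"],
                                        "EXAMPLE.COM"):
        print(principal)
```

With three services per node, a 400-node cluster already means 1,200 principals and keytab entries to create, distribute, and eventually clean up, which is why this part usually gets scripted.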

According to the Cloudera documentation for AD integration, they only advise using separate domain controllers for the clusters (if they grow very big) because of the sheer load on the KDCs. They want to make sure that cluster performance does not degrade due to latency in authentication (it apparently uses that a lot):


As your cluster grows, so will the volume of Authentication Service (AS) and Ticket Granting Service (TGS) interaction between the services on each cluster server. Consider evaluating the volume of this interaction against the Active Directory domain controllers you have configured for the cluster before rolling this feature out to a production environment. If cluster performance suffers, over time it might become necessary to dedicate a set of AD domain controllers to larger deployments.
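As a very rough way to start that evaluation, the sketch below estimates only the steady-state AS load from services re-logging in from their keytabs. The model (every principal logs in once per interval, evenly spread) and all the numbers are my own simplifying assumptions, not something from the Cloudera documentation. TGS traffic comes on top of this and depends on the workload, so measure against real domain controllers before trusting any such figure.

```python
def as_requests_per_second(nodes: int,
                           principals_per_node: int,
                           relogin_interval_s: float) -> float:
    """Estimate average AS requests per second hitting the KDCs.

    Assumes each service principal on each node logs in from its
    keytab once per relogin_interval_s, evenly spread over time.
    Bursts (e.g. a whole cluster restarting) are not modelled.
    """
    if relogin_interval_s <= 0:
        raise ValueError("relogin_interval_s must be positive")
    return nodes * principals_per_node / relogin_interval_s


if __name__ == "__main__":
    # Hypothetical figures: 400 nodes, 3 principals each, hourly relogin.
    print(f"{as_requests_per_second(400, 3, 3600):.2f} AS requests/s")
```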

Next is the actual separate forest. This is not so much a technical issue as an administrative boundary: usually the administrators taking care of the HPC/Hadoop/etc. clusters are not the ones taking care of the backend "office" systems. As such, they say that using a separate AD forest allows the HPC team to be more effective at resolving issues and at adding/removing nodes:


Troubleshooting the cluster's operations, especially for Kerberos-enabled services, will need to include AD administration resources. Evaluate your organizational processes for engaging the AD administration team, and how to escalate in case a cluster outage occurs due to issues with Kerberos authentication against AD services. In some situations it might be necessary to enable Kerberos event logging to address desktop and KDC issues within Windows environments.

And looking at some customers, they scale these worker nodes up and down pretty fast. It is not uncommon to get a request to bring up 400 worker nodes within a few hours.

Or you go the full way with AD and avoid generating the keytab files yourself by using a third-party plugin.

Hortonworks actually has a pretty good overview of the Pro/Cons for using the KDC service on the cluster, or using Active Directory: