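Suppose a Hadoop cluster has 2 racks, rck1 and rck2, with 5 nodes in each rack. How does the Namenode know that node 1 belongs to rack 1 and node 3 belongs to rack 2? In Hadoop, how does the Namenode obtain the rack details and the datanodes that belong to each rack?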
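You have to configure the system to specify how rack information is determined. For example, this Cloudera link tells you how to configure racks for hosts in Cloudera Manager.
Alternatively, this Apache link explains how to specify this information, via the configuration files, in either a Java class or an external script.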
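The topology is typically of the form /myrack/myhost, although you can use a deeper hierarchy. They have the following example in Python, which assumes each rack has its own /24 subnet and therefore extracts the first three bytes of the IP address to use as the rack number. You could take a similar approach if you can assign the node IP addresses accordingly, or write your own script that determines the rack from the IP address or from other information available on each node. Even a simple hard-coded mapping between hostnames and racks would work for your example, since it has relatively few nodes.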
#!/usr/bin/env python3
# this script makes assumptions about the physical environment.
#  1) each rack is its own layer 3 network with a /24 subnet, which
#     could be typical where each rack has its own
#     switch with uplinks to a central core router.
#
#             +-----------+
#             |core router|
#             +-----------+
#            /             \
#   +-----------+        +-----------+
#   |rack switch|        |rack switch|
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#
#  2) topology script gets list of IP's as input, calculates network address,
#     and prints '/network_address' (one line per IP).

import sys

import netaddr  # third-party package, must be installed separately

sys.argv.pop(0)  # discard name of topology script from argv list as we just want IP addresses

netmask = '255.255.255.0'  # set netmask to what's being used in your environment. The example uses a /24

for ip in sys.argv:  # loop over list of datanode IP's
    address = '{0}/{1}'.format(ip, netmask)  # format address string so it looks like 'ip/netmask' to make netaddr work
    try:
        network_address = netaddr.IPNetwork(address).network  # calculate and print network address
        print("/{0}".format(network_address))
    except Exception:
        print("/rack-unknown")  # print catch-all value if unable to calculate network address
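
For a cluster as small as the one in the question, the hard-coded mapping mentioned above could look roughly like the sketch below. The host names (node1, node3, and so on), the choice of which nodes sit in which rack, and the helper name resolve_rack are assumptions made up for illustration; the keys have to match whatever the Namenode actually passes to the script, which is often IP addresses rather than host names. The script would be registered through the net.topology.script.file.name property in the Hadoop configuration, and it has to print one rack path per argument.

```python
#!/usr/bin/env python3
"""Hypothetical topology script: hard-coded host-to-rack mapping."""
import sys

# Assumed host names and rack assignments, for illustration only.
# Replace the keys with the host names or IP addresses your Namenode passes in.
RACK_BY_HOST = {
    "node1": "/rck1",
    "node2": "/rck1",
    "node3": "/rck2",
    "node4": "/rck2",
}

# Fallback for hosts that are not listed above.
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host, or the fallback if it is unknown."""
    return RACK_BY_HOST.get(host, DEFAULT_RACK)


def main(argv):
    # Hadoop may pass several hosts in one call; print one rack per host,
    # in the same order as the arguments.
    for host in argv[1:]:
        print(resolve_rack(host))


if __name__ == "__main__":
    main(sys.argv)
```

Running the script by hand with a few host names as arguments is a quick way to check the mapping before pointing Hadoop at it. A dictionary is enough here because there are only ten nodes; with a larger cluster, deriving the rack from the subnet as in the Apache example is easier to maintain.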