【原创】大叔问题定位分享,13HBase Region频繁下线

问题现象:hive执行sql报错

select count(*) from test_hive_table;

报错

Error: java.io.IOException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:

Wed May 16 10:22:17 CST 2018, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68270: row \'6495922803d09\' on table \'test_hbase_table\' at region=test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650., hostname=server121,16020,1526398855988, seqNum=47468

直接原因一:region状态

这个hive表是外部表,底层数据存放在hbase中,出现问题的region状态:

2018-05-16 03:18:16 在RegionServer server121 上线;

2018-05-16 06:44:13 RegionServer server121 挂了,region下线;(有15个小时处于下线状态)

2018-05-16 21:25:33 在RegionServer server92 上线;

2018-05-17 04:10:44 RegionServer server92 挂了,region下线;(有10个小时处于下线状态)

2018-05-17 14:27:33 在RegionServer server114 上线;

2018-05-16 10:22:17 查询报错的时间正好处于region下线状态,所以报错的直接原因是hbase有一个region当时不可用,问题转移到Hbase,以下为详细hbase日志:

HMaster server132日志:

--------------------------------------

2018-05-16 03:18:12,838 INFO [ProcedureExecutor-15] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526370112734, server=server76,16020,1526343293931} to {

6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526411892838, server=server76,16020,1526343293931}

2018-05-16 03:18:15,861 INFO [server132,16000,1525800676562-GeneralBulkAssigner-9] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526411895787, server=server76,16020,1526343293931} to {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526411895861, server=server121,16020,1526398855988}

2018-05-16 03:18:16,311 INFO [AM.ZK.Worker-pool2-t9606] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526411895861, server=server121,16020,1526398855988}

to {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526411896311, server=server121,16020,1526398855988}

2018-05-16 03:18:16,445 INFO [AM.ZK.Worker-pool2-t9610] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526411896311, server=server121,16020,1526398855988} to {

6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526411896445, server=server121,16020,1526398855988}

2018-05-16 03:18:16,480 INFO [AM.ZK.Worker-pool2-t9618] master.RegionStates: Offlined 6060407a4aa5e23c8df09614dd3fe650 from server76,16020,1526343293931

2018-05-16 14:21:44,565 INFO [B.defaultRpcServer.handler=183,queue=3,port=16000] master.MasterRpcServices: Client=hadoop//client assign test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.

2018-05-16 14:21:44,565 INFO [B.defaultRpcServer.handler=183,queue=3,port=16000] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526411896480, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704565, server=server121,16020,1526398855988}

2018-05-16 14:21:44,566 INFO [B.defaultRpcServer.handler=183,queue=3,port=16000] master.AssignmentManager: Skip assigning test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650., it

is on a dead but not processed yet server: server121,16020,1526398855988

2018-05-16 21:25:32,030 INFO [ProcedureExecutor-9] master.RegionStates: Transitioning {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704566, server=server121,16020,1526398855988} will be handled by ServerCrashProcedure for server121,16020,1526398855988

2018-05-16 21:25:32,489 INFO [ProcedureExecutor-9] procedure.ServerCrashProcedure: Reassigning region {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704566, server=server121,16020,1526398855988} and clearing zknode if exists

2018-05-16 21:25:32,938 INFO [server132,16000,1525800676562-GeneralBulkAssigner-9] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526477132893, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526477132938, server=server92,16020,1526473170596}

2018-05-16 21:25:33,078 INFO [AM.ZK.Worker-pool2-t10955] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526477132938, server=server92,16020,1526473170596}

to {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526477133078, server=server92,1526473170596}

2018-05-16 21:25:33,895 INFO [AM.ZK.Worker-pool2-t10956] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526477133078, server=server92,16020,1526473170596} to {

6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526477133895, server=server92,16020,1526473170596}

2018-05-16 21:25:33,940 INFO [AM.ZK.Worker-pool2-t10956] master.RegionStates: Offlined 6060407a4aa5e23c8df09614dd3fe650 from server121,16020,1526398855988

RegionServer server121日志:

--------------------------------------

2018-05-16 03:18:16,390 INFO [RS_OPEN_REGION-server121:16020-0] regionserver.HRegion: Onlined 6060407a4aa5e23c8df09614dd3fe650; next sequenceid=47468

2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller

2018-05-16 06:44:13,218 INFO [StoreCloserThread-test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.-1] regionserver.HStore: Closed cf_p

2018-05-16 06:44:13,218 INFO [RS_CLOSE_REGION-server121:16020-0] regionserver.HRegion: Closed test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.

RegionServer server92日志:

--------------------------------------

2018-05-16 21:25:33,837 INFO [RS_OPEN_REGION-server92:16020-1] regionserver.HRegion: Onlined 6060407a4aa5e23c8df09614dd3fe650; next sequenceid=48261

2018-05-17 04:10:44,590 FATAL [regionserver/server92/server92:16020.logRoller] regionserver.HRegionServer: ABORTING region server server92,16020,1526473170596: Failed log close in log roller

2018-05-17 04:10:44,716 INFO [StoreCloserThread-test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.-1] regionserver.HStore: Closed cf_p

2018-05-17 04:10:44,716 INFO [RS_CLOSE_REGION-server92:16020-1] regionserver.HRegion: Closed test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.

直接原因二:RegionServer进程频繁退出

另外最近RegionServer非常不稳定,以RegionServer server121为例,最近几天进程频繁退出

RegionServer退出日志:

--------------------------------------

2018-05-15 00:09:26,412 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526299719028: Failed log close in log roller

2018-05-15 08:07:39,293 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526314169508: Failed log close in log roller

2018-05-15 12:58:12,082 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526342866588: Failed log close in log roller

2018-05-15 21:40:19,032 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526360301267: Failed log close in log roller

2018-05-15 23:40:47,297 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526391631046: Failed log close in log roller

2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller

2018-05-16 16:16:13,501 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526449759526: Failed log close in log roller

2018-05-16 21:54:02,776 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526458576939: Failed log close in log roller

2018-05-17 03:27:20,017 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526478847956: Failed log close in log roller

2018-05-17 16:03:16,978 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526498846730: Failed log close in log roller

RegionServer启动日志:

--------------------------------------

Tue May 15 00:09:28 CST 2018 Starting regionserver on server121

Tue May 15 08:07:45 CST 2018 Starting regionserver on server121

Tue May 15 12:58:20 CST 2018 Starting regionserver on server121

Tue May 15 21:40:29 CST 2018 Starting regionserver on server121

Tue May 15 23:40:54 CST 2018 Starting regionserver on server121

Wed May 16 06:44:20 CST 2018 Starting regionserver on server121

Wed May 16 13:49:18 CST 2018 Starting regionserver on server121

Wed May 16 16:16:15 CST 2018 Starting regionserver on server121

Wed May 16 21:54:06 CST 2018 Starting regionserver on server121

Thu May 17 03:27:25 CST 2018 Starting regionserver on server121

Thu May 17 16:03:21 CST 2018 Starting regionserver on server121

问题一:为什么Hbase的RegionServer会频繁退出?

2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller

org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988/server121%2C16020%2C1526398855988.default.1526421652184, unflushedEntries=133

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:907)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:709)

at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)

at java.lang.Thread.run(Thread.java:745)

Caused by: org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=10739149, requesting roll of WAL

at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.checkIfSyncFailed(FSHLog.java:1650)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1669)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:858)

... 3 more

Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=10739149, requesting roll of WAL

at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1970)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1814)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1724)

at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

... 1 more

Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[server82:50010], original=[server82:50010]).

The current failed datanode replacement policy is ALWAYS, and a client may configure this via \'dfs.client.block.write.replace-datanode-on-failure.policy\' in its configuration.

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)

这里报错是因为获取新的datanode失败,HDFS代码解析详见 https://www.cnblogs.com/barneywill/p/10114504.html

大概过程:由于机架过多,maxNodesPerRack计算结果为1(计算过程见问题二),即每个机架最多只能分配一个datanode,然后整个hdfs集群只有一个机架可用,最后numOfAvailableNodes计算结果为0,然后会抛出NotEnoughReplicasException

问题二:为什么出问题的HDFS文件只有一个副本?

hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988/server121%2C16020%2C1526398855988.default.1526421652184

Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[server82:50010], original=[server82:50010]).

HDFS代码解析详见 https://www.cnblogs.com/barneywill/p/10114504.html

注释:

int maxNodesPerRack = (totalNumOfReplicas-1)/numOfRacks + 2;

结果:maxNodesPerRack = (2-1)/5+2 = 2;

if (maxNodesPerRack == totalNumOfReplicas) {

maxNodesPerRack--;

}

结果:maxNodesPerRack = 1;

由于机架数量过多,导致一个机架最多只能分配一个datanode,结合目前的HDFS机架结构,只有一个机架可用,所以只有1个副本

问题三:为什么Region从OFFLINE到ONLINE需要很长时间?

以出问题的Region从2018-05-16 06:44:13下线到2018-05-16 21:25:33上线过程来看

Hmaster日志:

2018-05-16 06:44:13,437 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [server121,16020,1526398855988]

2018-05-16 06:44:13,633 INFO [ProcedureExecutor-2] procedure.ServerCrashProcedure: Start processing crashed server121,16020,1526398855988

2018-05-16 06:44:13,895 INFO [ProcedureExecutor-8] master.SplitLogManager: dead splitlog workers [server121,16020,1526398855988]

2018-05-16 06:44:13,896 INFO [ProcedureExecutor-8] master.SplitLogManager: Started splitting 3 logs in [hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting] for [server121,16020,1526398855988]

2018-05-16 06:44:13,911 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998 acquired by server94,16020,1526360475758

2018-05-16 06:44:13,919 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526424253042 acquired by server124,16020,1526398058280

2018-05-16 06:44:13,919 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526421652184 acquired by server119,16020,1526383578672

region server挂掉之后master开始server crash处理,首先尝试split log,这里一共有3个log,其中2个log很快处理完:

2018-05-16 06:44:13,964 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%

252C16020%252C1526398855988.default.1526424253042

2018-05-16 06:46:01,875 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%

252C16020%252C1526398855988.default.1526421652184

但是有1个一直在处理并且反复报错:

2018-05-16 07:56:32,848 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 09:08:56,761 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 10:21:29,120 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 11:47:19,093 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 12:59:40,008 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 14:12:18,594 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 15:24:46,758 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 16:37:17,790 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 18:06:36,864 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 19:18:50,573 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 20:37:30,438 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 21:17:56,062 WARN [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

发现该log文件的split过程从6点44到21点多一直处于报错重试过程,具体报错看第一次split过程日志:

2018-05-16 06:44:13,911 INFO [SplitLogWorker-server94:16020] coordination.ZkSplitLogWorkerCoordination: worker server94,16020,1526360475758 acquired task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 06:44:13,932 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] wal.WALSplitter: Splitting wal: hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, length=83

2018-05-16 06:44:13,932 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] wal.WALSplitter: DistributedLogReplay = false

2018-05-16 06:44:13,932 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Recover lease on dfs file hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998

2018-05-16 06:44:13,934 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=0 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 1ms

2018-05-16 06:44:17,935 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=1 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 4002ms

2018-05-16 06:45:22,051 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=2 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 68118ms

2018-05-16 06:47:30,834 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=3 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 196901ms

2018-05-16 06:50:43,622 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=4 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 389689ms

2018-05-16 06:55:00,314 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=5 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 646381ms

2018-05-16 07:00:20,600 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=6 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 966667ms

2018-05-16 07:01:09,660 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

2018-05-16 07:01:59,868 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:28464, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

2018-05-16 07:02:56,853 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:30766, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

2018-05-16 07:03:56,370 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:33214, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

发现是读hdfs文件超时,具体报错:

2018-05-16 07:01:09,660 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C

1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)

at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)

at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:819)

at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:697)

at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:354)

at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)

at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)

at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)

at java.io.DataInputStream.read(DataInputStream.java:100)

at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:299)

at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)

at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:853)

at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:777)

at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:298)

at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:236)

at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)

at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)

at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

2018-05-16 07:01:09,661 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Could not obtain BP-436366437-server131-1493982655699:blk_1361314257_288238951 from any node: java.io.IOException: No live

nodes contain block BP-436366437-server131-1493982655699:blk_1361314257_288238951 after checking nodes = [server127:50010, server28:50010], ignoredNodes = null No live nodes contain current block Block

locations: server127:50010 server28:50010 Dead nodes: server127:50010 server28:50010. Will get new block locations from namenode and retry...

2018-05-16 07:01:09,661 WARN [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 1196.0361088580376 msec.

2018-05-16 07:01:14,861 INFO [RS_LOG_REPLAY_OPS-server94:16020-0] ipc.Client: Retrying connect to server: server127/server127:50020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

直到2018-05-16 21:25:31终于split完成

2018-05-16 21:25:31,858 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998 entered state: DONE server76,16020,1526424280100

2018-05-16 21:25:31,868 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 to hdfs://hdfs_name/hbase/oldWALs/server121%2C16020%2C1526398855988.default.1526418989998

2018-05-16 21:25:31,869 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 21:25:31,870 INFO [ProcedureExecutor-11] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting] in 455631ms

然后有很多region被assign,包括出问题的region,这个region在RegionServer server92 上线,详见之前的日志,assign过程代码详见:

org.apache.hadoop.hbase.master.AssignmentManager.assign

问题总结

归根结底,所有的问题都是由于HDFS机架配置导致的,有的服务器配置机架信息,有的服务器没有配置机架信息,导致机架数量特别多,而且机架中的服务器分配不均,由hdfs分配策略可计算出这种情况下一个机架最多只能分配一个datanode,同时大量的机架上的datanode剩余空间为0,所以很多文件只有1个副本,这样就造成单点,一旦有datanode节点状态异常,就可能导致大量文件不可用,进而导致一系列的其他问题,比如上边的hbase region长时间处于下线状态。

问题发现于hive,直接原因是hbase,根本原因是hdfs,本质是因为rack配置不当;

hdfs中机架相关配置:

<property>

<name>net.topology.script.file.name</name>

<value>/export/App/hadoop-2.6.1/etc/hadoop/rack.py</value>

</property>