事故表现
跨服活动在比赛开启的时候服务进程(Go语言开发)异常宕机,重启进程之后比赛正常进行。该服务已经运行多年,第一次出现此问题。
问题分析
进程宕机时收集到的堆栈如下:
|
|
首先, 宕机的信息显示panic的原因是
invalid memory address or nil pointer dereference
,使用空指针导致异常宕机的想法立刻浮现,通过查看go语言源码:1234// insertValue is a convenience wrapper for insert(&Element{Value: v}, at).func (l *List) insertValue(v interface{}, at *Element) *Element {return l.insert(&Element{Value: v}, at)}宕机信息中container/list/list.go:105对应源码
return l.insert(&Element{Value: v}, at)
,如果是因为使用了空指针,根据代码不难确定只有List为空才会发生宕机。按照List为空这个思路查下去,却发现List在使用之前明确的被初始化过,不可能为Nil。习惯害死人,即使明确List不可能为Nil,还是固执地认为可能存在隐藏问题导致List为Nil,因为这个惯性思维耽误了不少时间。
期间也发现了一个不确定的信息是:panic堆栈(goroutine 743264)与堆栈(goroutine 743272,如下 )相似且Room相同(0xc2084b63c0):12345goroutine 743272 [select]:main.(*Room).doSort(0xc2084b63c0)/home/sf/M_00009/server/tags/release_4.2.0/src/master/room.go:214 +0x5b6created by main.(*Room).Run/home/sf/M_00009/server/tags/release_4.2.0/src/master/room.go:105 +0x2b5但是这个信息没有引起重视,完全迷失在List被重置为Nil的幻觉中。。。
然后,就在百思不得其解时,查看了一下进程宕机时段的日志,摘要如下:
123456789101112131415161718192021222324252627282930[2018/03/25 20:30:01.930863] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.931084] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 1 top1 boot room[2018/03/25 20:30:01.931115] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.931472] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 2 top1 boot room[2018/03/25 20:30:01.931528] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.931763] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 3 top1 boot room[2018/03/25 20:30:01.931794] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.932042] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 4 top1 boot room[2018/03/25 20:30:01.932077] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.932479] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 5 top1 boot room[2018/03/25 20:30:01.932505] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.932974] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 6 top1 boot room[2018/03/25 20:30:00.579081] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3[2018/03/25 20:30:00.580716] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3[2018/03/25 20:30:00.582094] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3[2018/03/25 20:30:00.583440] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3[2018/03/25 20:30:00.584705] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3[2018/03/25 20:30:00.585861] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3[2018/03/25 20:30:01.577092] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.577349] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 1 top1 boot room[2018/03/25 20:30:01.577424] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.577717] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 2 top1 boot room[2018/03/25 20:30:01.577761] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.578035] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 3 top1 boot room[2018/03/25 20:30:01.578185] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.578475] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 4 top1 boot room[2018/03/25 20:30:01.578561] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.578803] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 5 top1 boot room[2018/03/25 20:30:01.578834] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4[2018/03/25 20:30:01.579156] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 6 top1 boot room起先因为代码是四年前写的(对代码记忆已经模糊),日志中并未发现错误信息(都是INFO类型),反而更坚定了List为Nil的假设。但是仔细查看日志发现:Room在极短的时间被boot 两次,进而发现进程的时间居然被重置过:
12[2018/03/25 20:30:01.932974] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 6 top1 boot room[2018/03/25 20:30:00.579081] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3下一条日志的时间(20:30:00.579081)早于上面一条日志(20:30:01.932974)(日志同步写入,正常不可能出现时间倒转问题),豁然开朗!!!
进程时间重置导致比赛再次被启动,每次启动都会创建一个相应的Goruntine,List被两个Goruntine同时访问产生了数据竞争问题,进而引发宕机。
这也解释了上文中panic日志有两个相似的堆栈,一切也都能解释的通了。到此,我们知道引发这一系列问题的真实原因是时间被重置。那又是谁修改了进程时间?
1*/6 * * * * /usr/sbin/ntpdate ntp-jh.uuzu.com >> /var/log/uptime.log 2>&1 || /usr/sbin/ntpdate ntp-yd.uuzu.com >> /var/log/uptime.log 2>&1;/sbin/hwclock -wcrontab中包含上面这段操作,用来同步服务器时间。查看uptime.log中包含
25 Mar 20:30:00 ntpdate[12587]: step time server 10.0.0.6 offset -2.353620 sec
,于是乎就明朗了,游戏进程中使用的时间是wall clock
而不是monotonic clock
,服务器脚本定时使用NTP机制同步系统时间进而影响游戏时间。
结论
- 惯性思维害人;
- 查BUG就像破案,蛛丝马迹,认真仔细,全面分析,冷静思考;
防范措施
- 代码中增加判断步骤,防止时间重置引发多次启动;
- 游戏中时间wall clock考虑改为monotonic clock;
- NTP脚本执行时间错开活动开启时间(治标不治本)。