案例分析 - 2018/03/25跨服活动异常导致进程宕机

事故表现

跨服活动在比赛开启的时候服务进程(Go语言开发)异常宕机,重启进程之后比赛正常进行。该服务已经运行多年,第一次出现此问题。

问题分析

进程宕机时收集到的堆栈如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x8 pc=0x4b6901]
goroutine 743264 [running]:
runtime.panic(0x819240, 0xb37813)
/usr/local/go/src/pkg/runtime/panic.c:279 +0xf5
container/list.(*List).insertValue(0xc20847efc0, 0x86ee40, 0xc208f90cc0, 0xc20b0de660, 0x7fc1581e3c50)
/usr/local/go/src/pkg/container/list/list.go:105 +0x71
container/list.(*List).PushBack(0xc20847efc0, 0x86ee40, 0xc208f90cc0, 0x1)
/usr/local/go/src/pkg/container/list/list.go:139 +0x52
main.(*Room).Sort(0xc2084b63c0, 0xc208f90cc0)
/home/sf/M_00009/server/tags/release_4.2.0/src/master/room.go:308 +0x8bb
main.(*Room).doSort(0xc2084b63c0)
/home/sf/M_00009/server/tags/release_4.2.0/src/master/room.go:243 +0x58d
created by main.(*Room).Run
/home/sf/M_00009/server/tags/release_4.2.0/src/master/room.go:105 +0x2b5
  • 首先, 宕机的信息显示panic的原因是invalid memory address or nil pointer dereference使用空指针导致异常宕机的想法立刻浮现,通过查看go语言源码:

    1
    2
    3
    4
    // insertValue is a convenience wrapper for insert(&Element{Value: v}, at).
    func (l *List) insertValue(v interface{}, at *Element) *Element {
    return l.insert(&Element{Value: v}, at)
    }

    宕机信息中container/list/list.go:105对应源码return l.insert(&Element{Value: v}, at),如果是因为使用了空指针,根据代码不难确定只有List为空才会发生宕机。按照List为空这个思路查下去,却发现List在使用之前明确的被初始化过,不可能为Nil习惯害死人,即使明确List不可能为Nil,还是固执地认为可能存在隐藏问题导致List为Nil,因为这个惯性思维耽误了不少时间。期间也发现了一个不确定的信息是:panic堆栈(goroutine 743264)与堆栈(goroutine 743272,如下 )相似且Room相同(0xc2084b63c0):

    1
    2
    3
    4
    5
    goroutine 743272 [select]:
    main.(*Room).doSort(0xc2084b63c0)
    /home/sf/M_00009/server/tags/release_4.2.0/src/master/room.go:214 +0x5b6
    created by main.(*Room).Run
    /home/sf/M_00009/server/tags/release_4.2.0/src/master/room.go:105 +0x2b5

    但是这个信息没有引起重视,完全迷失在List被重置为Nil的幻觉中。。。

  • 然后,就在百思不得其解时,查看了一下进程宕机时段的日志,摘要如下:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    [2018/03/25 20:30:01.930863] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.931084] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 1 top1 boot room
    [2018/03/25 20:30:01.931115] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.931472] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 2 top1 boot room
    [2018/03/25 20:30:01.931528] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.931763] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 3 top1 boot room
    [2018/03/25 20:30:01.931794] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.932042] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 4 top1 boot room
    [2018/03/25 20:30:01.932077] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.932479] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 5 top1 boot room
    [2018/03/25 20:30:01.932505] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.932974] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 6 top1 boot room
    [2018/03/25 20:30:00.579081] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3
    [2018/03/25 20:30:00.580716] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3
    [2018/03/25 20:30:00.582094] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3
    [2018/03/25 20:30:00.583440] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3
    [2018/03/25 20:30:00.584705] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3
    [2018/03/25 20:30:00.585861] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3
    [2018/03/25 20:30:01.577092] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.577349] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 1 top1 boot room
    [2018/03/25 20:30:01.577424] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.577717] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 2 top1 boot room
    [2018/03/25 20:30:01.577761] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.578035] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 3 top1 boot room
    [2018/03/25 20:30:01.578185] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.578475] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 4 top1 boot room
    [2018/03/25 20:30:01.578561] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.578803] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 5 top1 boot room
    [2018/03/25 20:30:01.578834] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 4
    [2018/03/25 20:30:01.579156] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 6 top1 boot room

    起先因为代码是四年前写的(对代码记忆已经模糊),日志中并未发现错误信息(都是INFO类型),反而更坚定了List为Nil的假设。但是仔细查看日志发现:Room在极短的时间被boot 两次,进而发现进程的时间居然被重置过:

    1
    2
    [2018/03/25 20:30:01.932974] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:338) CrossPK: 6 top1 boot room
    [2018/03/25 20:30:00.579081] [INFO] (/home/sf/M_00009/server/tags/release_4.2.0/src/master/state.go:386) CrossPK: save type: 3

    下一条日志的时间(20:30:00.579081)早于上面一条日志(20:30:01.932974)(日志同步写入,正常不可能出现时间倒转问题),豁然开朗!!!进程时间重置导致比赛再次被启动,每次启动都会创建一个相应的Goruntine,List被两个Goruntine同时访问产生了数据竞争问题,进而引发宕机。这也解释了上文中panic日志有两个相似的堆栈,一切也都能解释的通了。

  • 到此,我们知道引发这一系列问题的真实原因是时间被重置。那又是谁修改了进程时间?

    1
    */6 * * * * /usr/sbin/ntpdate ntp-jh.uuzu.com >> /var/log/uptime.log 2>&1 || /usr/sbin/ntpdate ntp-yd.uuzu.com >> /var/log/uptime.log 2>&1;/sbin/hwclock -w

    crontab中包含上面这段操作,用来同步服务器时间。查看uptime.log中包含25 Mar 20:30:00 ntpdate[12587]: step time server 10.0.0.6 offset -2.353620 sec,于是乎就明朗了,游戏进程中使用的时间是wall clock而不是monotonic clock,服务器脚本定时使用NTP机制同步系统时间进而影响游戏时间。

结论

  1. 惯性思维害人;
  2. 查BUG就像破案,蛛丝马迹,认真仔细,全面分析,冷静思考;

防范措施

  1. 代码中增加判断步骤,防止时间重置引发多次启动;
  2. 游戏中时间wall clock考虑改为monotonic clock;
  3. NTP脚本执行时间错开活动开启时间(治标不治本)。