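Bug #4277
[Rel_3.1.3_Pre1T6_E500_128UE] With an uplink-heavy slot ratio and concurrent uplink and downlink UDP traffic, heavy downlink injection (700M) makes the session service fail to start and the DU hangs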
0%
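Description
After 128 UEs attach, run uplink and downlink UDP traffic, with 600M injected on the uplink and 700M on the downlink across the whole site. About 10 s into the traffic, the base station DU hangs. Cause: the session service fails to start, which crashes the related threads of the DU process.
Files
History
Updated by 惠 帅帅 about 2 months ago
Today I tried to reproduce the original scenario (1D3U, 700M downlink + 600M uplink). The crash during session creation did not occur.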
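Instead, the DU crashed in the uplink scheduling indication handler ysUlm_FAPI_HdlUlschInd. According to 立伟's analysis, the likely cause is memory corruption (the memory was overwritten). The reproduction needs to be repeated with the MAC_COMMON module log enabled.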
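Note: output seen just before the coredump:
1) The counters show that the queue of the thread handling EVT_NRUP_EGTPU_DATA_INDICATION messages momentarily builds up a large backlog, so memory cannot be released quickly enough.
2) The udpDlMsgQ1 queue is full.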
alloc_nrup_dat_ind_succ1 = 37913939, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 37913939!
alloc_nrup_dat_ind_succ1 = 37924159, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 37924113!
alloc_nrup_dat_ind_succ1 = 38007369, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 37999934!
Buffer is Full 1, size=40960, write=5846, read=5847, n_write=41006806, n_read=40965847, nWriteFail=0, nReadFail=0, rBuf=0xe7a5d8
udpDlMsgQ1 is Full, size=40960, rBuf=0xe7a5d8
alloc_nrup_dat_ind_succ1 = 38106380, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 38065527!
./run.sh: line 14: 2391 Segmentation fault (core dumped) ./gnb_du
3) Coredump analysis:
#0 0x00000000008b3748 in ysUlm_FAPI_HdlUlschInd (cellCb=cellCb@entry=0x7da54d6430, pRx_UlschIndMsg=pRx_UlschIndMsg@entry=0x68dda60 <gFapiCPlan+2294280>,
pool_id=pool_id@entry=0 '\000') at /home/xss/du_push/ran/nr_hl_du/src/5gnrclms/ys_fapi_task.c:2305
(gdb) p pUlPduInfo
$1 = (L1RxDataPduInfo_t *) 0x68ddab8 <gFapiCPlan+2294368>
(gdb) p *(L1RxDataPduInfo_t *) 0x68ddab8
$2 = {
Handle = 70780895,
RNTI = 1672,
HarqID = 66 'B',
rev = 3 '\003',
numTLV = 4129425091,
TLVs = 0x68ddac4 <gFapiCPlan+2294380>
}
Judging by the invalid RNTI value, this block of memory has probably been overwritten.
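The plausibility reasoning applied to the gdb dump can be written down as a small check. The following is only a sketch: the limits `MAX_HARQ_PROCESSES` and `MAX_TLVS_PER_PDU` and the `active_rntis` set are assumptions made for illustration and are not taken from the DU source. Only the field values come from the dump above.

```python
from dataclasses import dataclass

# Assumed limits for this sketch, not taken from the DU code.
MAX_HARQ_PROCESSES = 16   # assumption: at most 16 HARQ processes per UE
MAX_TLVS_PER_PDU = 16     # hypothetical upper bound on TLVs in one PDU


@dataclass
class RxDataPduInfo:
    """Python mirror of the fields gdb printed for L1RxDataPduInfo_t."""
    handle: int
    rnti: int
    harq_id: int
    num_tlv: int


def implausible_fields(pdu, active_rntis):
    """Return the names of fields whose values look corrupted."""
    bad = []
    if pdu.rnti not in active_rntis:
        bad.append("RNTI")
    if pdu.harq_id >= MAX_HARQ_PROCESSES:
        bad.append("HarqID")
    if pdu.num_tlv > MAX_TLVS_PER_PDU:
        bad.append("numTLV")
    return bad


# Values printed by gdb in the coredump above.
dumped = RxDataPduInfo(handle=70780895, rnti=1672, harq_id=66,
                       num_tlv=4129425091)
```

Under these assumed limits, HarqID and numTLV are flagged no matter which RNTIs are attached, which is consistent with the whole structure having been overwritten rather than a single field being wrong.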
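Updated by 惠 帅帅 about 2 months ago
With the MAC_COMMON log enabled, memory allocation starts failing soon after the UEs attach and traffic begins. The MAC colleagues are analyzing this.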
alloc_nrup_dat_ind_succ1 = 135497272, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 135497272!
alloc_nrup_dat_ind_succ1 = 135497281, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 135497281!
alloc_nrup_dat_ind_succ1 = 135497288, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 135497288!
alloc_nrup_dat_ind_succ1 = 135497289, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 135497289!
alloc_nrup_dat_ind_succ1 = 135512522, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 135512522!
alloc_nrup_dat_ind_succ1 = 135626343, alloc_nrup_dat_ind_succ2 = 0, delete_nrup_dat_ind_succ = 135595938!
Buffer is Full 1, size=40960, write=26592, read=26593, n_write=148056032, n_read=148015073, nWriteFail=0, nReadFail=0, rBuf=0xe7d5d8
udpDlMsgQ1 is Full, size=40960, rBuf=0xe7d5d8
SPstWTsk failed get memory event172 size88 listValidBktSet0
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
allocator::new failed ,size108,ret1
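So far, it is confirmed that EVT_NRUP_EGTPU_DATA_INDICATION message handling momentarily builds up a large backlog in the thread queue, and allocations of size 108 can no longer be satisfied.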
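The backlog can be read directly off the counter lines. The sketch below assumes that `alloc_nrup_dat_ind_succ1` and `alloc_nrup_dat_ind_succ2` count successful allocations and that `delete_nrup_dat_ind_succ` counts frees, so their difference is the number of messages still holding memory. That interpretation is inferred from the counter names, not confirmed from the code. The sample lines are copied from the log above.

```python
import re

COUNTER_RE = re.compile(
    r"alloc_nrup_dat_ind_succ1 = (\d+), alloc_nrup_dat_ind_succ2 = (\d+), "
    r"delete_nrup_dat_ind_succ = (\d+)"
)
RING_RE = re.compile(
    r"size=(\d+), write=(\d+), read=(\d+), n_write=(\d+), n_read=(\d+)"
)


def backlog(line):
    """Messages allocated but not yet freed, or None for other lines."""
    m = COUNTER_RE.search(line)
    if m is None:
        return None
    alloc1, alloc2, deleted = map(int, m.groups())
    return alloc1 + alloc2 - deleted


def ring_occupancy(line):
    """(entries queued, ring size) for a 'Buffer is Full' line, else None."""
    m = RING_RE.search(line)
    if m is None:
        return None
    size, _write, _read, n_write, n_read = map(int, m.groups())
    return n_write - n_read, size


SAMPLE_LINES = [
    "alloc_nrup_dat_ind_succ1 = 135512522, alloc_nrup_dat_ind_succ2 = 0, "
    "delete_nrup_dat_ind_succ = 135512522!",
    "alloc_nrup_dat_ind_succ1 = 135626343, alloc_nrup_dat_ind_succ2 = 0, "
    "delete_nrup_dat_ind_succ = 135595938!",
    "Buffer is Full 1, size=40960, write=26592, read=26593, "
    "n_write=148056032, n_read=148015073, nWriteFail=0, nReadFail=0, "
    "rBuf=0xe7d5d8",
]

for line in SAMPLE_LINES:
    print(backlog(line), ring_occupancy(line))
```

On these samples the backlog jumps from 0 to 30405 outstanding messages between two consecutive counter lines, and the ring holds 40959 of 40960 entries, which would be consistent with a ring buffer that keeps one slot free to distinguish full from empty.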
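Updated by 惠 帅帅 9 days ago
- Status changed from In Progress to Review
- Assignee changed from 惠 帅帅 to 周 立伟
The resolution is the same as described in ticket 4380:
Background:
On the E500 environment, the DU service occasionally crashes while writing to the database through sysrepo.
Preliminary analysis attributes this to lock contention when the DU service and GNB_AGENT access the database through sysrepo at the same time.
Solution:
sysrepo database writes should, as far as possible, be performed by a single module on a single thread to avoid lock contention.
Database writes requested by the DU service should be delegated to the DU_AGENT module.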
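Changes: the code is merged under ticket "#4483 - DU and AGENT UDP communication interface rework"
1) Support modifying a single xpath parameter in one action
2) Support modifying multiple xpath parameters in one action
3) Support the service modifying the xpath itself when the AGENT is abnormal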
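The single-writer idea behind this change can be sketched as follows. This is an illustration only, not the merged code: a plain dictionary stands in for the sysrepo datastore, an in-process queue stands in for the DU-to-AGENT UDP interface, and the class name and xpaths are hypothetical. Each request carries one or more xpath/value pairs, matching items 1) and 2). The fallback in item 3) is not covered.

```python
import queue
import threading


class SingleWriter:
    """Serialises all datastore writes on one worker thread."""

    def __init__(self):
        self.store = {}                  # stand-in for the sysrepo datastore
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, changes):
        """Queue one action; `changes` maps xpath -> value (one or many)."""
        done = threading.Event()
        self._requests.put((dict(changes), done))
        return done

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:             # shutdown marker
                break
            changes, done = item
            self.store.update(changes)   # only this thread writes the store
            done.set()

    def close(self):
        self._requests.put(None)
        self._worker.join()


# Hypothetical xpaths, for illustration only.
writer = SingleWriter()
writer.submit({"/du/cell[id='1']/state": "active"}).wait()
writer.submit({
    "/du/cell[id='1']/dl-bandwidth": "100",
    "/du/cell[id='1']/ul-bandwidth": "100",
}).wait()
writer.close()
```

Callers never touch the store themselves. They hand the change to the writer and wait on the returned event, so any number of requesting threads produces exactly one writing thread and no write-side lock contention.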
Test case:
Rebuilt the cell 5 times in a row; the cell was set up successfully every time.