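One of our compute nodes slowed down again, and when I looked, the messages log was going crazy. We are running large-memory jobs, but only that one node is behaving like this. We are on CentOS 7.5. It looks like an IB problem, but searching only turns up similar keywords, not the same error message. How should I go about tracking this down?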
Feb 25 22:42:14 compute-01-09 kernel: kworker/12:0: page allocation failure: order:10, mode:0x80d0
Feb 25 22:42:14 compute-01-09 kernel: CPU: 12 PID: 70801 Comm: kworker/12:0 Kdump: loaded Not tainted 3.10.0-862.el7.x86_64 #1
Feb 25 22:42:14 compute-01-09 kernel: Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.5.6 08/17/2018
Feb 25 22:42:14 compute-01-09 kernel: Workqueue: events xprt_rdma_connect_worker [rpcrdma]
Feb 25 22:42:14 compute-01-09 kernel: Call Trace:
Feb 25 22:42:14 compute-01-09 kernel: [] dump_stack+0x19/0x1b
Feb 25 22:42:14 compute-01-09 kernel: [] warn_alloc_failed+0x110/0x180
Feb 25 22:42:14 compute-01-09 kernel: [] ? drain_pages+0xb0/0xb0
Feb 25 22:42:14 compute-01-09 kernel: [] __alloc_pages_nodemask+0x9b4/0xbb0
Feb 25 22:42:14 compute-01-09 kernel: [] dma_generic_alloc_coherent+0x8f/0x140
Feb 25 22:42:14 compute-01-09 kernel: [] x86_swiotlb_alloc_coherent+0x21/0x50
Feb 25 22:42:14 compute-01-09 kernel: [] mlx5_dma_zalloc_coherent_node+0xb4/0x110 [mlx5_core]
Feb 25 22:42:14 compute-01-09 kernel: [] mlx5_buf_alloc_node+0x4d/0xc0 [mlx5_core]
Feb 25 22:42:14 compute-01-09 kernel: [] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
Feb 25 22:42:14 compute-01-09 kernel: [] create_kernel_qp.isra.62+0x42e/0x72c [mlx5_ib]
Feb 25 22:42:14 compute-01-09 kernel: [] create_qp_common+0x67d/0x13c0 [mlx5_ib]
Feb 25 22:42:14 compute-01-09 kernel: [] ? internal_add_timer+0x70/0x70
Feb 25 22:42:14 compute-01-09 kernel: [] ? kmem_cache_alloc_trace+0x1d6/0x200
Feb 25 22:42:14 compute-01-09 kernel: [] mlx5_ib_create_qp+0x10b/0x4d0 [mlx5_ib]
Feb 25 22:42:14 compute-01-09 kernel: [] ? list_del+0xd/0x30
Feb 25 22:42:14 compute-01-09 kernel: [] ? wait_for_completion_interruptible_timeout+0x131/0x170
Feb 25 22:42:14 compute-01-09 kernel: [] ib_create_qp+0x7f/0x330 [ib_core]
Feb 25 22:42:14 compute-01-09 kernel: [] rdma_create_qp+0x34/0xb0 [rdma_cm]
Feb 25 22:42:14 compute-01-09 kernel: [] rpcrdma_ep_connect+0x183/0x3e0 [rpcrdma]
Feb 25 22:42:14 compute-01-09 kernel: [] xprt_rdma_connect_worker+0x3c/0xc0 [rpcrdma]
Feb 25 22:42:14 compute-01-09 kernel: [] process_one_work+0x17f/0x440
Feb 25 22:42:14 compute-01-09 kernel: [] worker_thread+0x126/0x3c0
Feb 25 22:42:14 compute-01-09 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Feb 25 22:42:14 compute-01-09 kernel: [] kthread+0xd1/0xe0
Feb 25 22:42:14 compute-01-09 kernel: [] ? insert_kthread_work+0x40/0x40
Feb 25 22:42:14 compute-01-09 kernel: [] ret_from_fork_nospec_begin+0xe/0x21
Feb 25 22:42:14 compute-01-09 kernel: [] ? insert_kthread_work+0x40/0x40
Feb 25 22:42:14 compute-01-09 kernel: Mem-Info:
Feb 25 22:42:14 compute-01-09 kernel: active_anon:11826266 inactive_anon:1659285 isolated_anon:0#012 active_file:24938058 inactive_file:25116811 isolated_file:0#012 unevictable:0 dirty:134 writeback:865547 unstable:1566927#012 slab_reclaimable:379760 slab_unreclaimable:172249#012 mapped:26843 shmem:43502 pagetables:30261 bounce:0#012 free:1077590 free_pcp:8 free_cma:0
Feb 25 22:42:14 compute-01-09 kernel: Node 0 DMA free:15864kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb 25 22:42:14 compute-01-09 kernel: lowmem_reserve[]: 0 1481 31659 31659
Still, to troubleshoot, you could also try re-checking the cable connections, then shutting the system down, cutting the power, and starting the system back up.
Just in case: would it be possible to pull the bad node and test it for memory/CPU faults?
Beyond that, if the node has a boot HDD or SSD, checking it for bad sectors is probably another way to find out what is faulty.
On CentOS you can get the page size with the pagesize command (or getconf PAGESIZE) or with functions such as getpagesize() and sysconf(_SC_PAGESIZE), and programs are supposed to be written to match that page size when they allocate and free memory.
Running a program that is not written that way will not necessarily bring the system down, but memory I/O efficiency drops. <- Most Linux developers follow this anyway, I'd assume...
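For anyone who wants to check the page size from a script rather than from C, here is a minimal Python sketch. The helper names `page_size` and `round_up_to_pages` are made up for illustration; only `os.sysconf("SC_PAGE_SIZE")` is a standard call.

```python
import os


def page_size() -> int:
    """Return the kernel's base page size in bytes (same value as getpagesize())."""
    return os.sysconf("SC_PAGE_SIZE")


def round_up_to_pages(nbytes: int, page: int) -> int:
    """Round a byte count up to the next multiple of the page size."""
    return -(-nbytes // page) * page  # ceiling division, then scale back to bytes


if __name__ == "__main__":
    page = page_size()
    print("page size:", page, "bytes")
    print("a 10000-byte buffer occupies", round_up_to_pages(10000, page), "bytes of pages")
```

Sizing buffers as whole pages like this is the kind of "matching the page size" the post above is talking about.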
Sorry, that digression(?) ran long~
In short, page allocation failure is a memory allocation issue, so try running a memory diagnostic program (something like memtest86) once.
Also, once in a while, when the memory controller inside the CPU is flaky, this can show up under heavy load even though the memory itself is fine.
Just something to keep in mind~
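To put a number on the "memory allocation issue": the first log line says order:10, and an order-n request asks the kernel for 2^n physically contiguous pages. A small sketch of that arithmetic, assuming the usual 4 KiB x86_64 base page (the function names are hypothetical):

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages, the x86_64 default


def parse_alloc_failure(line: str):
    """Extract (order, mode) from a 'page allocation failure' log line, or None."""
    m = re.search(r"page allocation failure: order:(\d+), mode:(0x[0-9a-fA-F]+)", line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16)


def contiguous_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size of one order-n block: 2**order pages, physically contiguous."""
    return (1 << order) * page_size


LINE = ("Feb 25 22:42:14 compute-01-09 kernel: kworker/12:0: "
        "page allocation failure: order:10, mode:0x80d0")

if __name__ == "__main__":
    order, mode = parse_alloc_failure(LINE)
    print("order", order, "->", contiguous_bytes(order) // 1024, "KiB contiguous")
```

Under that assumption the mlx5 driver was asking for a single 4 MiB contiguous block. A request that large can fail on a fragmented node even when plenty of memory is free in smaller pieces, so a failure like this does not by itself point to bad DIMMs.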
If the CPU is Intel, run this: https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool
(Unfortunately there does not seem to be a Linux version... )
And install mcelog and check whether anything has shown up in its log... that sort of thing ^^
It is also worth running something like High Performance Linpack as hard as you can.
Intel Linpack for Linux is here... https://software.intel.com/en-us/mkl-linux-developer-guide-intel-optimized-linpack-benchmark-for-linux
When you run it, you have to grab (allocate) a huge amount of memory. Running it with the defaults is a waste of effort, lol.
There are probably versions up to 2.XX..
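On "grab a huge amount of memory": Linpack factors an N x N matrix of 8-byte doubles, so the matrix alone takes 8*N^2 bytes. A rough sketch for picking N follows. The 80% fraction is a common rule of thumb, not something from this thread, and the block size `nb=192` and the 256 GiB figure are purely illustrative assumptions.

```python
import math


def hpl_problem_size(mem_bytes: int, fraction: float = 0.8, nb: int = 192) -> int:
    """Largest N (a multiple of nb) whose N x N double matrix fits in
    `fraction` of `mem_bytes`. Both defaults are illustrative assumptions."""
    budget = int(mem_bytes * fraction)   # bytes we allow the matrix to use
    n = math.isqrt(budget // 8)          # 8 bytes per double-precision element
    return (n // nb) * nb                # round down to a whole number of blocks


if __name__ == "__main__":
    mem = 256 * 2**30                    # hypothetical node with 256 GiB of RAM
    n = hpl_problem_size(mem)
    print("N =", n, "-> matrix uses", round(8 * n * n / 2**30, 1), "GiB")
```

Use the node's real memory size in place of the hypothetical one; a problem size far below this leaves most of the DIMMs untouched, which is why the default run tells you little.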
# echo 1 > /proc/sys/vm/compact_memory
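To see whether compaction actually helps, you can compare /proc/buddyinfo before and after the command above. Each row of that file lists, per zone, the number of free blocks at order 0, 1, 2, and so on. A minimal parsing sketch; the SAMPLE values are invented for illustration and are not from the failing node, and the helper names are hypothetical:

```python
# Invented sample in /proc/buddyinfo format; NOT taken from the node in question.
SAMPLE = """\
Node 0, zone      DMA      1      0      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal   5120   2048    512     64      8      1      0      0      0      0      0
"""


def parse_buddyinfo(text: str):
    """Map (node, zone) -> list of free-block counts, where list index == order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order: int) -> int:
    """Number of free blocks large enough to satisfy a request of this order."""
    return sum(counts[order:])


def read_buddyinfo(path: str = "/proc/buddyinfo"):
    """Parse the live file on a Linux node."""
    with open(path) as f:
        return parse_buddyinfo(f.read())


if __name__ == "__main__":
    for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
        print("node", node, zone, "free order>=10 blocks:",
              free_blocks_at_or_above(counts, 10))
```

On the real node, call `read_buddyinfo()` instead of parsing SAMPLE. If the larger zones show zero blocks at order 10 before compaction and a few afterwards, fragmentation is a likely suspect.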
It would help if you could post that node's meminfo.
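Until /proc/meminfo is posted, the Mem-Info line already in the log gives the same kind of counters, in pages. A sketch that converts them, again assuming 4 KiB pages (the string below is copied from the log in the question, including the `#012` escapes that syslog uses for newlines):

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages

# Copied from the Mem-Info line of the posted log.
MEM_INFO = (
    "active_anon:11826266 inactive_anon:1659285 isolated_anon:0#012 "
    "active_file:24938058 inactive_file:25116811 isolated_file:0#012 "
    "unevictable:0 dirty:134 writeback:865547 unstable:1566927#012 "
    "slab_reclaimable:379760 slab_unreclaimable:172249#012 "
    "mapped:26843 shmem:43502 pagetables:30261 bounce:0#012 "
    "free:1077590 free_pcp:8 free_cma:0"
)


def parse_mem_info(text: str) -> dict:
    """Turn 'name:count' pairs into a dict of page counts."""
    return {k: int(v) for k, v in re.findall(r"([a-z_]+):(\d+)", text)}


def pages_to_gib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


if __name__ == "__main__":
    info = parse_mem_info(MEM_INFO)
    file_cache = info["active_file"] + info["inactive_file"]
    anon = info["active_anon"] + info["inactive_anon"]
    for label, pages in (("free", info["free"]), ("file cache", file_cache), ("anon", anon)):
        print(f"{label:>10}: {pages_to_gib(pages):8.1f} GiB")
```

Under that assumption the node had roughly 4 GiB free and roughly 190 GiB sitting in page cache when the allocation failed. That would be consistent with fragmentation rather than the node being flat out of memory, though the log alone does not prove it.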
The way I see it, it is roughly one of two things: 1) memory is running short somewhere, on the local node or a remote node, or 2) there is a memory fault somewhere on a remote node.
When you do RDMA over IB and the remote node is short on memory, you get messages similar to the ones you are seeing now rather than an OOM.
In particular, numbers like the "Node 0 DMA free:15864kB min:12kB low:12kB high:16..." at the end should not be possible.
But I see the server is AMD EPYC. The issue could be on that side.
I also had a rough time testing IB on EPYC last year.
Oh, and someone may have fiddled with the config ^^
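Normally, when main memory (MM) runs short, things can go to swap, right? But when memory is being grabbed for RDMA and main memory is short, is going to swap really the right behavior?
There is a setting for this too, but I cannot remember it.
It probably works differently in each kernel version. What I know is RDMA from 15 years ago...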
This can happen if, for example, you are running a simulation and set the grid far too large. These days that hardly ever happens, though...
With my limited knowledge, I figured the OOM killer should have fired and wondered whether the kernel was failing to do its job, but it turns out this works differently.