Degraded processing speed when using two H100 80GB GPUs in parallel

   Views 1819   Recommended 0

Hello.

We are willing to pay to get the issue below resolved.

If you are able to solve this problem, please contact us.

01*-****-****

To briefly describe the situation:

We installed two H100 80GB GPUs in an HPE DL380a Gen11 (a 2U chassis that can take up to four GPUs).

Running one card at a time, performance is fine, but running both at once makes processing very slow.

Since the customer suspected faulty hardware, we also tested on a different DL380 Gen11 and on an ASUS server with 4th-generation CPUs in addition to the DL380a Gen11, but the results were the same everywhere.

It is not a GPU fault. We have supplied many GPU servers, and normally the customers who run the GPUs fix this kind of issue themselves on the software side...

For now we are going to test with a bridge (NVLink) connected, but we do not expect the results to be good.
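In the meantime, here is a quick check that can be run from Python alongside the bridge test (a minimal sketch; the exact bandwidth figure depends on slot wiring and driver): whether the two cards can reach each other peer-to-peer at all, and roughly how fast a plain copy from GPU 0 to GPU 1 goes.

import time
import torch

# Can GPU 0 reach GPU 1 directly (peer-to-peer), or does traffic
# have to bounce through host memory?
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

# Rough GPU 0 -> GPU 1 copy bandwidth using a 1 GiB fp16 tensor.
x = torch.empty(512 * 1024 * 1024, dtype=torch.float16, device="cuda:0")
torch.cuda.synchronize(0)
t0 = time.time()
for _ in range(10):
    y = x.to("cuda:1")
torch.cuda.synchronize(1)
print(f"~{10 * x.numel() * 2 / (time.time() - t0) / 1e9:.1f} GB/s")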


The tests we have run so far are summarized below; if you think you can solve this, please contact us.


1. OS: Ubuntu 22.04.2
2. CUDA version: 12.2
3. H100 driver version: 535.129.03
4. Runtime: Python 3.11.x
5. Measured response times:
   - H100 80GB x2, Llama-3-8B: 2.4 s
   - H100 80GB x1, Llama-3-8B: 0.5 s
   - Comparison GPU, A100 80GB:
     - A100 80GB x2, Llama-3-70B: 2.7 s
     - A100 80GB x1, Llama-3-8B: 1.2 s

Running a single card, the H100 is about twice as fast as the A100; but running two cards together, the A100 comes in at 2.7 s while the H100 comes in at 2.4 s.
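For reference, the loading code below uses device_map="auto", which, as far as we understand it, splits the model's layers between the two cards so that a single request runs through them one after the other, with activations crossing between GPUs at the split point; two cards therefore do not speed up a single request, and a slow inter-GPU path can make it slower than one card. A minimal sketch to see where the layers were placed (model is the object created in the test code below; hf_device_map is the placement transformers records):

from collections import Counter

# After AutoModelForCausalLM.from_pretrained(..., device_map="auto"),
# print which module landed on which device.
print(model.hf_device_map)
print(Counter(model.hf_device_map.values()))  # module count per device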

The test code is below.

 

from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, TextStreamer
import torch
from threading import Thread
import gradio as gr
import time
#import accelerate_speedup

torch.manual_seed(42)

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
#model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

#max_memory_mapping = {0: "80GB", 1: "80GB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #device_map="balanced_low_0",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
    #max_memory=max_memory_mapping
).eval()

# stop generation at either the model's EOS token or Llama-3's <|eot_id|>
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

### base inference
def chat(question):
    messages = [
        #{"role": "system", "content": "You are AI chatbot. You are honest, do not harm others, and help users."},
        {"role": "system", "content": "Please try to provide useful, helpful answers."},
        {"role": "user", "content": question},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device, non_blocking=True)

    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.05,
        top_p=0.95,
    )
    response = outputs[0][input_ids.shape[-1]:]
    #print(tokenizer.decode(response, skip_special_tokens=True))
    return tokenizer.decode(response, skip_special_tokens=True)

# time 100 single-prompt calls and report the average wall time
response_times = []
for _ in range(100):
    start_time = time.time()
    #tmp = chat('hello.')
    tmp = chat('hello!')
    #tmp = chat('Testing. Please answer in 10,000 characters.')
    end_time = time.time()
    print((end_time - start_time))
    response_times.append(end_time - start_time)

print(f"Average Response Time: {sum(response_times) / len(response_times):.2f} seconds")
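One caveat about the harness above: with do_sample=True the length of the reply can vary from call to call, so raw wall time mixes generation speed with reply length. A variant that normalizes by the number of generated tokens (same tokenizer, model, and terminators as above) may give a steadier comparison:

import time
import torch

def timed_chat(question):
    # returns generation speed in new tokens per second
    messages = [{"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    torch.cuda.synchronize()
    t0 = time.time()
    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.05,
        top_p=0.95,
    )
    torch.cuda.synchronize()
    new_tokens = outputs.shape[-1] - input_ids.shape[-1]
    return new_tokens / (time.time() - t0)

print(f"{sum(timed_chat('hello!') for _ in range(10)) / 10:.1f} tokens/s")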


술이 06-11
I wonder if both cards ended up plugged into lanes from the same CPU and got split down to x8...
I think the key question is whether each card sits in a PCIe slot wired to its own CPU.
I don't have an environment to test the code above, and it's been a long time since I did any deep learning, so this is just a gut feeling. (A quick way to check the topology is sketched below.)
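The topology matrix shows this directly (sketched in Python only to match the post; it just shells out to nvidia-smi):

import subprocess

# Prints the GPU/PCIe topology matrix. In the legend, NV# means NVLink,
# PIX/PXB mean the two cards share a PCIe host bridge, and SYS means the
# traffic crosses the CPU-to-CPU interconnect, which is the slow path.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)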

In my experience, changing a handful of parameter settings purely by commenting lines in and out leads to mistakes very easily.
And when everything runs in one for loop, leftover values from earlier variables often cause errors too.

Looking at the code above, it also appears the device_map and max_memory parts are being tested by switching comments around.

If you need to evaluate performance to the second under various conditions, it would be better to make four files, each with only the relevant setting precisely changed, and verify with those (a sketch of this idea follows below).

Also, if most results come back within 3 seconds, that is too short a window to observe the GPU state with nvidia-smi and the like, so it would be good to set the test up to run for a longer time.
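Roughly what I mean, as an untested sketch (the argument names are my own invention): one fresh process per configuration, with the setting passed on the command line instead of toggled in comments.

import argparse
import torch
from transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument("--device-map", default="auto",
                    choices=["auto", "balanced_low_0", "cuda:0", "cuda:1"])
parser.add_argument("--model", default="meta-llama/Meta-Llama-3-8B-Instruct")
args = parser.parse_args()

# A fresh process per run: no leftover state between configurations,
# and nothing to comment in or out inside the script.
model = AutoModelForCausalLM.from_pretrained(
    args.model, torch_dtype=torch.bfloat16, device_map=args.device_map
)
print(getattr(model, "hf_device_map", args.device_map))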
검은콩 06-12
Try testing with vLLM.
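For example (a minimal sketch from memory; with tensor parallelism both cards work on every layer of each request together, instead of stacking the layers one card after the other):

from vllm import LLM, SamplingParams

# tensor_parallel_size=2 splits each layer across both H100s.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=2)
params = SamplingParams(temperature=0.05, top_p=0.95, max_tokens=1024)
print(llm.generate(["hello!"], params)[0].outputs[0].text)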

