Hello.
We are willing to pay a fee to have the issue below resolved.
If you are able to solve this problem, please contact us.
01*-****-****
To briefly describe the situation:
We installed 2x H100 80GB GPUs in an HPE DL380a Gen11 (a 2U system that can nevertheless hold up to 4 GPUs).
When we run the GPUs one at a time, performance is fine, but when we run two at the same time, processing is very slow.
Because the customer suspected the hardware, we also tested on servers other than the DL380a Gen11 (another DL380 Gen11 and an ASUS server with 4th-generation CPUs), but the results were all the same.
It is not a GPU fault. We have delivered many GPU servers, and usually the customers who use the GPUs fix this kind of issue themselves on the software side...
For now we plan to connect the cards with a bridge and test again, but we do not expect a different result.
Our tests so far are summarized below. If you think you can resolve this, please contact us.
- 1. OS: Ubuntu 22.04.2
- 2. CUDA Version: 12.2
- 3. H100 Driver Version: 535.129.03
- 4. Program in use: Python 3.11.X
- 5. Processing speed
  - H100 80G * 2EA: Llama-3-8B processing time: 2.4 seconds
  - H100 80G * 1EA: Llama-3-8B processing time: 0.5 seconds
  - Comparison group: A100 80G GPU
  - A100 80G * 2EA: Llama-3-70B processing time: 2.7 seconds
  - A100 80G * 1EA: Llama-3-8B processing time: 1.2 seconds
- When running on a single GPU, the H100 is about twice as fast as the A100, but when running on two GPUs together, the A100 comes in at 2.7 seconds and the H100 at 2.4 seconds.
The test code is as follows.
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, TextStreamer
import torch
from threading import Thread
import gradio as gr
import time
#import accelerate_speedup

torch.manual_seed(42)

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
#model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
#max_memory_mapping = {0: "80GB", 1: "80GB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the model across all visible GPUs
    #device_map="balanced_low_0",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    #max_memory=max_memory_mapping
).eval()

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]


### base inference
def chat(question):
    messages = [
        #{"role": "system", "content": "You are AI chatbot. You are honest, do not harm others, and help users."},
        {"role": "system", "content": "Please try to provide useful, helpful answers."},
        {"role": "user", "content": question},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device, non_blocking=True)

    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.05,
        top_p=0.95,
    )
    response = outputs[0][input_ids.shape[-1]:]
    #print(tokenizer.decode(response, skip_special_tokens=True))
    return tokenizer.decode(response, skip_special_tokens=True)


# measure end-to-end latency of 100 short chat requests
response_times = []
for _ in range(100):
    start_time = time.time()
    #tmp = chat('hello.')
    tmp = chat('hello!')
    #tmp = chat('Testing. Please answer in 10,000 characters.')
    end_time = time.time()
    print((end_time - start_time))
    response_times.append(end_time - start_time)

print(f"Average Response Time: {sum(response_times) / len(response_times):.2f} seconds")
I think the key question is whether each GPU is installed in a PCI Express slot assigned to a different CPU.
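A quick way to check that is the topology matrix from the driver tools. The sketch below simply shells out to nvidia-smi (assuming it is on the PATH); everything else is standard library:

import subprocess

# Print the GPU interconnect / NUMA topology matrix. For each pair of GPUs
# it shows whether traffic stays under one PCIe root complex (PIX/PXB/PHB),
# has to cross the CPU-to-CPU link (SYS), or uses NVLink (NV#), and lists
# the CPU/NUMA affinity of each card.
topo = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
print(topo.stdout)

If the two H100s report SYS for each other, every transfer between them has to cross the inter-socket link, which would fit the symptom you describe.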
In my experience, changing several parameter settings only by toggling comments often leads to mistakes.
Running things in a for loop also frequently causes errors because values left over in variables from the previous iteration are still around.
Looking at the code above, it appears the device_map and max_memory parts are likewise being tested by toggling comments.
If you are in a situation where you have to evaluate performance down to the second under various conditions, I think it would be better to create 4 files, each with only the relevant setting changed, and verify with those.
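As a variation on the same idea, you could also keep one file and pass the setting in from the command line, so nothing depends on which lines happen to be commented out. A minimal sketch (the script name and argument names here are my own assumptions, not part of your code):

import argparse
import torch
from transformers import AutoModelForCausalLM

# Hypothetical wrapper: choose the placement strategy per run instead of
# editing comments, so every result is tied to an explicit configuration.
parser = argparse.ArgumentParser()
parser.add_argument("--model-id", default="meta-llama/Meta-Llama-3-70B-Instruct")
parser.add_argument("--device-map", default="auto",
                    choices=["auto", "balanced", "balanced_low_0", "sequential"])
args = parser.parse_args()

print(f"Loading {args.model_id} with device_map={args.device_map}")
model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    torch_dtype=torch.bfloat16,
    device_map=args.device_map,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
).eval()
# ... the rest of the benchmark (chat(), timing loop) stays the same as above.

Each condition then becomes, for example, python bench_llama.py --device-map balanced_low_0 (bench_llama.py being a hypothetical file name), and the printed line records exactly what was measured.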
Since most results come back within 3 seconds, the window is too short to observe the GPU state with nvidia-smi and similar tools, so I think it would be good to make the test run for a longer time.
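One simple way to stretch each run out with the script above is to force a minimum generation length, so a single request takes long enough to watch; the 1024/2048 values below are placeholders. While it runs, nvidia-smi -l 1 in a second terminal shows what both cards are doing.

# Drop-in replacement for the generate() call in the script above:
# min_new_tokens keeps the model generating even for a short prompt,
# so each request is long enough to observe in nvidia-smi.
outputs = model.generate(
    input_ids,
    min_new_tokens=1024,   # placeholder; raise it if runs are still too short
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.05,
    top_p=0.95,
)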