I am sending this email because I have a question about our deep learning GPU setup.
- Equipment: DL380a Gen11
- OS: Ubuntu 22.04
- Python: 3.11.9
After updating from driver version 535.129.03 / CUDA 12.2 using the CUDA Toolkit:
- NVIDIA driver version: 555.42.02
- CUDA version: 12.5
- H100 80GB x 2EA
  - Llama-3-8B processing time: 2.4 seconds
- H100 80GB x 1EA
  - Llama-3-8B processing time: 0.5 seconds
The test code is as follows.
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, TextStreamer
import torch
from threading import Thread
import gradio as gr
import time
#import accelerate_speedup

torch.manual_seed(42)

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
#model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

#max_memory_mapping = {0: "80GB", 1: "80GB"}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #device_map="balanced_low_0",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    #max_memory=max_memory_mapping
).eval()

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

### base inference
def chat(question):
    messages = [
        #{"role": "system", "content": "You are AI chatbot. You are honest, do not harm others, and help users."},
        {"role": "system", "content": "Please try to provide useful, helpful answers."},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device, non_blocking=True)
    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.05,
        top_p=0.95,
    )
    response = outputs[0][input_ids.shape[-1]:]
    #print(tokenizer.decode(response, skip_special_tokens=True))
    return tokenizer.decode(response, skip_special_tokens=True)

response_times = []
for _ in range(100):
    start_time = time.time()
    #tmp = chat('hello.')
    tmp = chat('hello!')
    #tmp = chat('Testing. Please answer in 10,000 characters.')
    end_time = time.time()
    print((end_time - start_time))
    response_times.append(end_time - start_time)

print(f"Average Response Time: {sum(response_times) / len(response_times):.2f} seconds")
That is an abnormally long delay.
It does not look like a memory problem.
How are the GPUs attached over PCIe?
PCIe 5.0 x16 only supports up to about 64 GB/s, so for now that looks like the bottleneck.
Can't you give each card two PCIe 5.0 x16 links? That would get you up to 128 GB/s.
For now, the only real alternative seems to be installing NVLink.
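A quick way to check how the cards are attached is nvidia-smi topo -m, which shows whether they talk over PCIe or NVLink. The rough sketch below (the payload size and iteration count are arbitrary, not from this thread) just times a direct GPU0-to-GPU1 copy; over PCIe 5.0 x16 the number should come out well below 64 GB/s, and several times higher over NVLink.

import time
import torch

# Rough GPU0 -> GPU1 copy-bandwidth check (payload size and iteration count are arbitrary).
assert torch.cuda.device_count() >= 2

n_bytes = 4 * 1024**3                                   # ~4 GiB payload
src = torch.empty(n_bytes // 2, dtype=torch.float16, device="cuda:0")
dst = torch.empty(n_bytes // 2, dtype=torch.float16, device="cuda:1")

dst.copy_(src)                                          # warm-up copy
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

iters = 10
start = time.time()
for _ in range(iters):
    dst.copy_(src)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.time() - start

print(f"GPU0 -> GPU1: {n_bytes * iters / elapsed / 1e9:.1f} GB/s")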
Try reducing that #max_memory_mapping = {0: "80GB", 1: "80GB"} setting a bit,
to something like {0: "30GB", 1: "30GB"} to better match the PCIe bandwidth.
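For reference, max_memory is the argument already commented out in the test code; a minimal sketch of plugging the suggested values in (the 30GB caps are just the numbers from the reply above, and note that anything device_map="auto" can no longer fit on the GPUs gets placed on CPU, which is usually slower):

import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # the 8B variant from the test code

# Cap how much of each card device_map="auto" may fill (values from the reply above;
# the combined cap still has to hold the whole model, or layers spill to CPU).
max_memory_mapping = {0: "30GB", 1: "30GB"}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory_mapping,
    low_cpu_mem_usage=True,
).eval()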
The code runs sequentially, so it is not going to get any faster with more GPUs.
The bus communication overhead will be considerable.
Using multiple GPUs is usually done because a single GPU does not have enough memory for training.
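If the model in question is Llama-3-8B (roughly 16 GB of weights in bfloat16), it fits on a single H100 with plenty of headroom, so one thing worth trying is pinning the whole model to one card instead of letting device_map="auto" split it over both. A minimal sketch, reusing the settings from the test code above:

import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# device_map={"": 0} places every module on GPU 0, so generation never has to
# hop across the PCIe bus between layers.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
).eval()

# hf_device_map shows where each module ended up; with device_map="auto" on two
# cards it lists a split across GPU 0 and GPU 1, here it should be a single entry.
print(model.hf_device_map)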
Anyway, the H100 really is fast. Haha.