Issues: microsoft/DeepSpeed
Issues list
[BUG] Multi-node fine-tuning with thunderbolt
bug
Something isn't working
training
#5766
opened Jul 11, 2024 by
Raywang0211
[BUG] Multi-gpu stuck when the computation graph is not complete for each process.
bug
Something isn't working
training
#5762
opened Jul 10, 2024 by
gary-young
[BUG] I can't run fp8 with pipeline parallel
bug
Something isn't working
training
#5760
opened Jul 10, 2024 by
exnx
[BUG] Learning rate scheduler and optimizer logical issue
bug
Something isn't working
training
#5731
opened Jul 5, 2024 by
zhourunlong
[BUG] lr scheduler defined in config cannot be overwritten by lr scheduler defined in code and pass to deepspeed.initialize
bug
Something isn't working
training
#5726
opened Jul 5, 2024 by
xiyang-aads-lilly
[BUG] ImportError: /home/nlp/.cache/torch_extensions/py310_cu121/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
bug
Something isn't working
training
#5723
opened Jul 4, 2024 by
PhysicianHOYA
[REQUEST] Asynchronous Checkpointing
enhancement
New feature or request
#5721
opened Jul 2, 2024 by
zaptrem
Issue with LoRA Tuning on llama3-70b using PEFT and TRL's SFTTrainer
training
#5719
opened Jul 2, 2024 by
yutanozaki1
Different seeds are giving the exact same loss on Zero 1,2 and 3 during multi gpu training [BUG]
bug
Something isn't working
training
#5717
opened Jul 2, 2024 by
selenerkan
[REQUEST] Does Universal Checkpoint supports for MoE Checkpoint?
enhancement
New feature or request
#5716
opened Jul 2, 2024 by
tiggerwu
[BUG] localhost: Permission denied, please try again. with single node and multi-gpus with --autotuning run
bug
Something isn't working
training
#5709
opened Jul 1, 2024 by
Looong01
[BUG] 1-bit LAMB not compatible with bf16
bug
Something isn't working
training
#5708
opened Jun 28, 2024 by
catid
on Activation Checkpointing
bug
Something isn't working
training
#5704
opened Jun 28, 2024 by
ChaunceyWang
[BUG] Mixed-precision: fp16 will cast input_ids into torch.cuda.HalfTensor instead of Long or Int.
#5701
opened Jun 28, 2024 by
zhaoyang02
Tensor(hidden states)missing across GPU in Pipeline Parallelism Training[BUG]
bug
Something isn't working
training
#5696
opened Jun 25, 2024 by
Youngluc
[BUG] Regression: 0.14.3 causes grad_norm to be zero
bug
Something isn't working
training
#5692
opened Jun 21, 2024 by
rosario-purple
[ERROR] [launch.py:321:sigkill_handler] exits with return code = -11
bug
Something isn't working
training
#5690
opened Jun 21, 2024 by
shag1802
Running out of CPU memory. Dataset is loaded for each created process
bug
Something isn't working
training
#5689
opened Jun 21, 2024 by
MikeMitsios
[BUG] inference ValueError
bug
Something isn't working
inference
#5685
opened Jun 19, 2024 by
zxrneu
[BUG] Logs full of FutureWarning when training with nightly PyTorch
bug
Something isn't working
training
#5682
opened Jun 18, 2024 by
rosario-purple
[BUG] Using and Building DeepSpeedCPUAdam
bug
Something isn't working
training
#5677
opened Jun 18, 2024 by
oabuhamdan