Hardware for Machine
Learning
CS6787 Lecture 11 — Fall 2017
Recap: modern ML hardware
- Lots of different types
- CPUs
- GPUs
- FPGAs
- Specialized accelerators
- Right now, GPUs are dominant …we’ll get to why later
What does a modern machine learning pipeline
look like?
- Many different components
- Preprocessing of the training set feeds DNN training
- The trained DNN then runs inference on new examples to be processed
- A toy end-to-end sketch of these stages is given below
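Purely as an illustration (not from the lecture), here is a minimal sketch of those three stages, with a tiny logistic model standing in for the DNN; all function names and sizes are hypothetical.

    import numpy as np

    def preprocess(X):
        # Preprocessing of the training set: standardize each feature.
        return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    def train(X, y, lr=0.1, epochs=200):
        # "DNN training" stand-in: gradient descent on a logistic model.
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-X @ w))
            w -= lr * X.T @ (p - y) / len(y)
        return w

    def infer(w, X_new):
        # Inference: run the trained weights on new examples.
        return (1.0 / (1.0 + np.exp(-X_new @ w)) > 0.5).astype(int)

    X = np.random.randn(200, 5)
    y = (X[:, 0] > 0).astype(int)
    w = train(preprocess(X), y)
    print(infer(w, preprocess(X))[:10])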
Where can hardware help?
- Everywhere!
- There’s interest in using hardware everywhere in the pipeline
- both adapting existing hardware architectures, and
- developing new ones
- What improvements can we get?
- Lower latency inference
- Higher throughput training
- Lower power cost
Why are GPUs so popular for
machine learning?
Why are GPUs so popular for
training deep neural networks?
FLOPS: GPU vs CPU
- FLOPS: floating point operations per second
- Figure: FLOPS over time for CPUs, GPUs, and Xeon Phis, from Karl Rupp’s blog: https://www.karlrupp.net/2016/ /flops-per-cycle-for-cpus-gpus-and-xeon-phis/ (this was the best diagram I could find that shows trends over time)
- GPU FLOPS consistently exceed CPU FLOPS; a quick way to measure this on your own machine is sketched below
- Intel Xeon Phi chips are compute-heavy manycore processors that compete with GPUs
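A rough way to see where a particular machine sits on that plot is to time a dense matrix multiply and convert the timing to FLOPS. The sketch below uses NumPy, so it measures the CPU through whatever BLAS backs it; the matrix size is an arbitrary choice.

    import time
    import numpy as np

    # An n x n by n x n matrix multiply takes about 2*n^3 floating point operations.
    n = 2048
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    A @ B                          # warm-up so one-time setup costs are not timed
    start = time.time()
    C = A @ B
    elapsed = time.time() - start

    print(f"~{2 * n**3 / elapsed / 1e9:.1f} GFLOPS achieved on this CPU (via BLAS)")
    # Running the same multiply on a GPU (e.g. through CuPy or PyTorch with CUDA)
    # typically reports roughly an order of magnitude more FLOPS.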
Memory bandwidth: CPU vs GPU
- GPUs have higher memory bandwidths than CPUs
- E.g. the new NVIDIA Tesla V100 has a claimed 900 GB/s memory bandwidth
- Whereas an Intel Xeon E7 has only about 100 GB/s of memory bandwidth
- But, this comparison is unfair!
- GPU memory bandwidth is the bandwidth to GPU memory
- E.g. over a PCIe 2 link, the bandwidth to the GPU is only about 32 GB/s (see the back-of-the-envelope numbers below)
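To see why the comparison needs care, consider a memory-bound step like an SGD-style weight update, where time is set by how many bytes move and over which link. The bandwidth figures below are the ones quoted on this slide; the parameter count is an arbitrary choice.

    # Memory-bound step: w -= lr * g over a large float32 vector.
    n = 100_000_000                      # 100M parameters (arbitrary)
    bytes_moved = 3 * n * 4              # read w, read g, write w (4 bytes each)
    flop = 2 * n                         # one multiply and one add per element

    for name, bw in [("GPU memory (V100, ~900 GB/s)", 900e9),
                     ("CPU memory (Xeon, ~100 GB/s)", 100e9),
                     ("PCIe link to the GPU (~32 GB/s)", 32e9)]:
        print(f"{name}: ~{bytes_moved / bw * 1e3:.1f} ms per update")

    intensity = flop / bytes_moved       # ~0.17 flop/byte: bandwidth-bound, not compute-bound
    print(f"arithmetic intensity: {intensity:.2f} flop/byte")
    # The 9x bandwidth gap is real, but it only helps once the data already
    # sits in GPU memory rather than on the other side of the PCIe link.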
Challengers to the GPU
- More compute-intensive CPUs
- Like Intel’s Phi line, which promises the same level of compute performance and better handling of sparsity
- Low-power devices
- Like mobile-device-targeted chips
- Configurable hardware like FPGAs and CGRAs
- Accelerators that speed up matrix-matrix multiply
Will all computation become
dense matrix-matrix multiply?
What if dense matrix multiply takes over?
- Great opportunities for new highly specialized hardware
- The TPU is already an example of this
- It’s a glorified matrix-matrix multiply engine (see the sketch below)
- Significant power savings from specialized hardware
- But not as much as we could get by exploiting something like sparsity
- It might put us all out of work
- Who cares about researching algorithms when there’s only one algorithm anyone cares about?
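One concrete reason a "glorified matrix-matrix multiply engine" covers so much: the forward pass of a fully-connected network is just a chain of dense GEMMs plus cheap elementwise operations. A minimal sketch, with layer sizes chosen arbitrarily:

    import numpy as np

    def forward(x_batch, weights):
        # Each layer is one dense matrix-matrix multiply followed by an elementwise ReLU.
        h = x_batch
        for W in weights:
            h = np.maximum(h @ W, 0.0)
        return h

    batch = np.random.randn(64, 784).astype(np.float32)
    layers = [np.random.randn(784, 256).astype(np.float32),
              np.random.randn(256, 256).astype(np.float32),
              np.random.randn(256, 10).astype(np.float32)]
    print(forward(batch, layers).shape)   # (64, 10)
    # Convolutions can be lowered to the same primitive (e.g. via im2col),
    # which is why so much of deep learning maps onto dense matmul hardware.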
What if matrix multiply doesn’t take over?
- Great opportunities for designing new heterogeneous, application-specific hardware
- We might want one chip for SVRG, one chip for low-precision
- Interesting systems/framework opportunities to give users suggestions for which chips to use
- Or even to automatically dispatch work within a heterogeneous datacenter (a toy dispatcher is sketched below)
- Community might fragment
- Into smaller subgroups working on particular problems
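As a purely hypothetical illustration of automatic dispatch in a heterogeneous datacenter, a framework might inspect a few workload features and route each job to a matching chip; every device name and threshold below is invented.

    # Toy dispatcher: route a job to a device based on simple workload features.
    # All device names and thresholds are invented for illustration.
    def pick_device(job):
        if job["sparsity"] > 0.9:
            return "sparse-accelerator"   # hypothetical chip specialized for sparse ops
        if job["precision_bits"] <= 8:
            return "low-precision-asic"   # hypothetical dedicated low-precision chip
        if job["kind"] == "training":
            return "gpu"
        return "cpu"

    jobs = [{"kind": "training", "sparsity": 0.95, "precision_bits": 32},
            {"kind": "inference", "sparsity": 0.10, "precision_bits": 8}]
    print([pick_device(j) for j in jobs])   # ['sparse-accelerator', 'low-precision-asic']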
Recent work on hardware for
machine learning
Abstracts from papers at architecture conferences this year
Questions?
- Conclusion
- Lots of interesting work on hardware for machine learning
- Lots of opportunities for interdisciplinary research
- Upcoming things
- Paper Review #10 — due today
- Project proposal — due today
- Paper Presentation #11 on Wednesday — TPU