Applied Computational AI: Model, Software, Hardware @ Stanford University Summer ’24
Dr. Wei Li, VP/GM of AI Software Engineering at Intel, leads a top-tier team that enhances the performance of AI workloads such as large language models and improves developer efficiency, working toward an AI Everywhere future. He also drives engagements with industry and ecosystem leaders, from PyTorch/Meta, TensorFlow/Google, DeepSpeed/Microsoft, and Hugging Face to various Gen AI startups. With a Ph.D. from Cornell University focused on supercomputing, he is a renowned speaker at universities such as Harvard and Stanford and at events including the Fortune and AI Summits. A lifelong champion of open source, Wei serves on the boards of the PyTorch Foundation and the LF AI & Data Foundation.
Speaker: Dr. Wei Li, VP/GM of AI Software Engineering at Intel
Place: Skilling Auditorium, Stanford, CA 94305
In this article, I review three highly relevant AI topics presented by Dr. Li:
- Performance enhancement for LLMs;
- Open-source ecosystem;
- AI-optimized hardware.
He elaborated on performance enhancement techniques for large language models (LLMs), emphasizing the importance of optimizing across multiple layers — model, software, and hardware. He discussed how “performance will come from multiple layers, you know, doing operation at the multiple layers here,” and highlighted the significance of reducing memory traffic, particularly in transformer-based architectures. A specific example he provided was the use of “grouped query” in multi-head attention mechanisms, which helps to “reduce the amount of traffic into memory” by grouping multiple attention heads together, thereby making fewer trips to memory. Dr. Li noted that while this method may slightly compromise accuracy, it significantly boosts performance by minimizing memory access, a crucial factor as models scale in size. He also touched on the importance of utilizing different data types, such as “FP32, FP16,” and even “four-bit integers,” to optimize computations, especially given the increasing size of LLMs. These techniques, combined with advancements in hardware like the integration of systolic arrays, are essential to achieving efficient, high-performance AI systems.
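To make the grouped-query idea concrete, here is a minimal sketch in PyTorch. This is my own illustration rather than code from the talk: several query heads share a single key/value head, so the K/V tensors that must be fetched from memory (and kept in the KV cache during generation) shrink by the grouping factor.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_kv_heads):
    """Illustrative grouped-query attention.

    q: (batch, num_q_heads, seq, head_dim)
    k, v: (batch, num_kv_heads, seq, head_dim)
    Several query heads share one K/V head, so far fewer key/value values
    have to be read from memory than with full multi-head attention.
    """
    batch, num_q_heads, seq, head_dim = q.shape
    group_size = num_q_heads // num_kv_heads
    # Expand each shared K/V head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return F.softmax(scores, dim=-1) @ v

# 8 query heads share 2 K/V heads -> the K/V cache is 4x smaller.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v, num_kv_heads=2)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

With 8 query heads sharing 2 key/value heads, the K/V cache is four times smaller than in standard multi-head attention, at the cost of a small potential accuracy drop, which is exactly the trade-off Dr. Li described.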
He also underscored the pivotal role of the open-source ecosystem in advancing AI technologies, particularly highlighting Intel’s extensive contributions. He stated, “Intel is a very, very big, major player on the open-source ecosystem side,” and emphasized Intel’s active involvement in major initiatives like the PyTorch Foundation and the LF AI & Data Foundation. Dr. Li expressed a strong belief in the value of open-source development, aligning with industry leaders’ perspectives, as he referenced Mark Zuckerberg’s remarks: “Open source is the right thing to do from a broader perspective.” He detailed Intel’s long-standing partnership with Hugging Face, noting their collaboration from the company’s early days in natural language processing to its current central role in large language models. He framed this commitment to open source as essential for fostering innovation and ensuring that AI technologies are accessible and beneficial to a wider audience, ultimately facilitating the broader application of AI across industries.
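As a rough illustration of how this ecosystem work surfaces for developers, the sketch below loads a Hugging Face model and passes it through Intel Extension for PyTorch. The model name and generation settings are placeholders of my own choosing, and the snippet assumes the transformers and intel_extension_for_pytorch packages are installed; it is not code shown in the talk.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize applies Intel-specific kernel and graph optimizations
# (e.g., operator fusion, bfloat16 execution paths) to the model.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Open source AI is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```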
Finally, Dr. Wei Li detailed the critical advancements in AI-optimized hardware, emphasizing its foundational role in achieving high-performance AI applications. He described how various hardware components, such as CPUs, GPUs, and specialized AI accelerators, are crucial for supporting the computational demands of machine learning, particularly matrix computations. Dr. Li highlighted the widespread use of systolic array implementations, a concept dating back to early research but now integral to modern AI hardware. He explained, “We have been adding systolic implementation to the CPU side,” referring to Intel’s Advanced Matrix Extensions (AMX), and noted similar advancements in GPU technology, such as Tensor Cores, which also use a systolic-array architecture. This hardware innovation allows for efficient matrix multiplication, a fundamental operation in AI computations. Dr. Li further pointed out that these hardware solutions are being optimized to handle various data types, including “FP32, FP16, and even narrower data types,” such as four-bit integers, to enhance performance and efficiency. This level of hardware optimization is essential for supporting the large-scale computations required by contemporary AI models, particularly large language models, ensuring they can operate swiftly and effectively across diverse applications.
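To show what a systolic array actually does, here is a small NumPy simulation. Again, this is my own sketch of the general technique, not anything Intel-specific: each processing element owns one output value, and operands are skewed in time so that the matching pair of A and B elements meets at the right element on each clock step.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Each processing element PE(i, j) holds one output c[i][j]. On each clock
    step it multiplies the A value arriving from the left by the B value
    arriving from above and adds the product to its accumulator. Inputs are
    skewed in time so that matching operands meet at the right PE.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    # The last PE (M-1, N-1) finishes at step (M-1) + (N-1) + (K-1).
    for t in range(M + N + K - 2):
        for i in range(M):
            for j in range(N):
                k = t - i - j  # which element of the dot product arrives now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(4, 6).astype(np.float32)
B = np.random.rand(6, 5).astype(np.float32)
assert np.allclose(systolic_matmul(A, B), A @ B, atol=1e-5)
```

Real hardware performs these multiply-accumulates in parallel across the whole grid of processing elements, and it can run them in FP16, BF16, or even narrower integer formats rather than FP32, which is where the data-type savings Dr. Li mentioned come from.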
About my experience
I had the privilege of attending a fascinating series of seminars at Stanford University, part of the HPC-AI Summer Seminar Series. These seminars were led by Professor Steve Jones, co-creator of Rocky Linux and the director of Stanford’s High Performance Computing Center, along with various distinguished guest speakers.
Inspired by the insights shared during these sessions, I felt compelled to share them with a broader audience. This post aims to provide valuable information for anyone interested in the field of High Performance Computing.