Parameter-Efficient Tuning of Large Language Models (LLMs) with Novel Ensemble Knowledge Distillation Framework – Rohit Sroch

In a captivating talk, Rohit Sroch, Senior AI Scientist at Course5i, focused on parameter-efficient tuning (PET) and knowledge distillation techniques for optimizing large language models (LLMs). These methods address the challenges of fine-tuning and inference, particularly the extensive computational resources they require. By combining the SEAD framework with the PET framework, Sroch showed how the performance of LLMs can be boosted while mitigating resource constraints.

The SEAD Framework: Unleashing the Power of Multiple Teachers

Sroch introduced the SEAD framework, inspired by the concept of knowledge transfer from multiple teachers. The framework comprises two key components: creating multiple teachers and distilling their knowledge into a student. Two approaches to creating multiple teachers were explored: Average Ensemble, which averages the weights of several teachers, and Multi-Seed, which trains each teacher with a different seed value and captures the variance in their knowledge. The teachers' knowledge is then blended through one of three methods, Noisy, Weighted, or Random, each suited to the approach used. Distillation operates on soft logits, allowing a task-specific loss to compare the student's predictions with the teacher's.
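The sketch below illustrates these two ideas in PyTorch: averaging teacher weights to form a single ensemble teacher, and blending soft logits from several seed-trained teachers. The function names (average_teacher_weights, blend_soft_logits) and the exact noise and weighting choices are illustrative assumptions, not details from the talk.

```python
import torch
import torch.nn.functional as F

def average_teacher_weights(teacher_state_dicts):
    """Average Ensemble: average the weights of several teachers
    (assumes identical architectures) into one ensemble teacher."""
    avg = {}
    for key in teacher_state_dicts[0]:
        avg[key] = torch.stack(
            [sd[key].float() for sd in teacher_state_dicts]
        ).mean(dim=0)
    return avg

def blend_soft_logits(teacher_logits, method="weighted", weights=None, noise_std=0.01):
    """Blend per-teacher logits (shape: [num_teachers, batch, classes])
    using one of the three blending strategies mentioned in the talk."""
    if method == "weighted":
        w = torch.tensor(weights if weights is not None
                         else [1.0 / len(teacher_logits)] * len(teacher_logits))
        return (teacher_logits * w.view(-1, 1, 1)).sum(dim=0)
    if method == "random":
        idx = torch.randint(0, teacher_logits.size(0), (1,)).item()
        return teacher_logits[idx]
    if method == "noisy":
        return teacher_logits.mean(dim=0) + noise_std * torch.randn_like(teacher_logits[0])
    raise ValueError(f"unknown blending method: {method}")

# Example: three teachers trained with different seeds (Multi-Seed approach)
logits = torch.randn(3, 4, 5)   # 3 teachers, batch of 4, 5 classes
soft_targets = F.softmax(blend_soft_logits(logits, "weighted") / 2.0, dim=-1)  # temperature 2.0
```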

Knowledge Distillation: Empowering the Student Model

Knowledge distillation is a powerful technique wherein a student model learns from a teacher model, improving its performance on a specific task. Sroch highlighted various knowledge distillation losses, including KL divergence, Jaccard similarity, MSE, and cross-entropy. The SEAD framework incorporates these techniques, alongside sample choices and blending methods, to guide the student model's learning process. Notably, a small weightage is assigned to the distilled knowledge, facilitating an effective transfer from teacher to student.
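A minimal sketch of such an objective is shown below, assuming the standard formulation of temperature-scaled KL divergence on soft logits combined with the task's cross-entropy loss. The small weight alpha on the distilled term reflects the talk's point about assigning a small weightage to distilled knowledge; the specific values and weighting scheme used in SEAD were not given in this summary.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.1):
    """Combine the task loss (cross-entropy against gold labels) with a
    KL-divergence term between temperature-softened teacher and student
    distributions; alpha keeps the weight on distilled knowledge small."""
    # Hard-label task loss for the student
    task_loss = F.cross_entropy(student_logits, labels)

    # Soft-logit distillation loss (KL divergence, scaled by T^2 as is standard)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return (1 - alpha) * task_loss + alpha * kd_loss

# Example usage with dummy tensors
student = torch.randn(4, 5, requires_grad=True)   # batch of 4, 5 classes
teacher = torch.randn(4, 5)
labels = torch.randint(0, 5, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```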

The PET Framework: Optimizing Parameter Efficiency

The PET framework, another approach explored by Sroch, focuses on parameter-efficient tuning. This technique freezes the weights of the large language model and introduces a small number of new weights externally. By incorporating adapter modules such as Adapter-S, Adapter-P, and LoRA, the PET framework achieves fine-tuning with minimal additional weights. These methods significantly reduce the computational burden during inference, as the model only needs to load task-specific modules based on user input. Sroch showcased how the PET framework enables tuned models to match or outperform larger models such as GPT-3, with performance comparable or superior to the teacher model.
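The snippet below sketches the LoRA idea in plain PyTorch: the pretrained weight is frozen and only a small low-rank update (matrices A and B) is trained. The class name and hyperparameters (rank r, scaling alpha) are illustrative assumptions; in practice a library such as Hugging Face's peft would typically manage these adapter modules.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained linear layer with a trainable low-rank update:
    output = W x + (alpha / r) * B A x, where only A and B are trained."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)       # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: adapt one projection layer; only 2 * r * d new weights are trained
frozen = nn.Linear(768, 768)
adapted = LoRALinear(frozen, r=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # 8*768 + 768*8 = 12288
```

Because the base weights stay frozen, many such task-specific adapter modules can be stored cheaply and swapped in at inference time depending on the user's request, which is the practical benefit highlighted in the talk.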

Conclusion

Rohit Sroch's enlightening talk delved into parameter-efficient tuning and knowledge distillation techniques for large language models. The SEAD framework leverages knowledge transfer from multiple teachers, harnessing the power of ensemble learning and blending methodologies. Meanwhile, the PET framework optimizes parameter efficiency, enabling efficient fine-tuning and inference with limited computational resources. These advancements not only enhance the performance of LLMs but also offer practical benefits, such as reduced compute requirements and the ability to load task-specific modules on demand. As the field of large language models continues to evolve, these strategies hold immense promise for creating more efficient and powerful models that can elevate a wide range of NLU and NLG tasks.
