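Target audience: students who have taken the small-group machine learning course, or who already have a solid machine learning foundation, and who want to strengthen their machine learning skills, specialize further in the field, and aim for senior ML positions.

Main topics: Training Large Language Models (language model training from introductory to advanced), Recurrent Neural Networks and Language Models, NLP text classification, model pre-training, and hands-on case analysis.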

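#### Lecture 1: From Word Vectors to Language Models

Text representation is the translator between natural language and machine learning models. It involves two parts: text processing and the text representation itself. The interview focus areas in this lecture are text processing, tokenization, and self-supervised learning. We then introduce another NLP task, language modeling, which is very common in industry, for example in input methods and Google Search. Finally, we cover how to use RNNs for language modeling and analyze the problems RNNs run into on this task.

Interview questions:

1. What is text processing? List some common text processing tasks.
2. Why is text representation needed in NLP?
3. What are Word2Vec and GloVe? How do they generate word vectors?
4. In text processing, why do we perform tokenization, stop-word removal, and part-of-speech tagging?
5. Explain what a language model is.
6. What is the sequence generation problem? Give an example.

#### Lecture 2: Large Language Model Fundamentals Interview

This lecture covers the fundamentals tested in LLM interviews: the basic definition of an LLM and common issues in LLM training, such as the repetition ("parrot") problem and whether training input sequences can be arbitrarily long.

Interview questions:

1. What are the current mainstream open-source models?
2. What is the difference between prefix LM, causal LM, and encoder-decoder architectures?
3. What is the training objective of an LLM?
4. What is the repetition problem in LLMs?
5. Why does the repetition problem occur?

#### Lecture 3: Large Language Model Advanced Interview - Prompt Engineering

This lecture focuses on prompt engineering, an emerging approach at the current stage of LLM technology that departs from the prediction paradigm of traditional ML. Most industrial LLM work is centered on prompt engineering.

Interview questions:

1. Why is prompting needed?
2. What is prompting?
3. What prompting methods are there?
4. What are the advantages of prefix tuning?
5. What is the idea behind prompt tuning?

#### Lecture 4: Large Language Model Advanced Interview - LLM Fine-tuning

This lecture covers methods and techniques for LLM fine-tuning, with a focus on reducing training time and training resource consumption. We also take a close look at LoRA, currently the most commonly used efficient training method.

Interview questions:

1. How much GPU memory is needed for full-parameter fine-tuning of a model?
2. Why does an LLM perform worse after SFT?
3. How are SFT instructions constructed?
4. What methods are there for fine-tuning domain-specific models?
5. Why does catastrophic forgetting occur, and how can it be addressed?
6. What is the idea behind LoRA?
7. Can LoRA weights be merged into the original model?

#### Lecture 5: Large Language Model Advanced Interview - LangChain

LangChain is one of the most widely used open-source frameworks in the NLP community. Knowing how to use it helps us solve a wide range of business problems.

Interview questions:

1. What is LangChain?
2. What core concepts does LangChain include?
3. What is a LangChain Agent?
4. How is LangChain used?
5. What problems exist in LangChain, and what are the solutions?

#### Lecture 6: Large Language Model Advanced Interview - RLHF

RLHF helps align large models with human knowledge. What is RLHF, and how is it implemented? In this lecture we learn how to fine-tune an LLM with reinforcement learning.

Interview questions:

1. Briefly introduce RLHF.
2. Is the reward model the same as the base model?
3. Human feedback is expensive and hard to scale. How can this be addressed?
4. The three training stages (SFT, RM, PPO) make for a long process and slow iteration. How can this be addressed?
5. PPO training involves four models at the same time. How can this be handled?
6. What is the data format for PPO training?

#### Lecture 7: LLM in Practice - Seq2Seq Models from Small to Large

Most NLP problems can be framed as Seq2Seq problems, for example machine translation, dialogue systems, and summarization. What are the common approaches to Seq2Seq problems, and how do small and large models differ in handling them?

Interview questions:

1. What is a Seq2Seq model, and how does it work?
2. Explain the encoder-decoder architecture in Seq2Seq models.
3. What is the purpose of the encoder and decoder in a Seq2Seq model?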
4. Explain the concept of teacher forcing and its role in Seq2Seq models.
5. How do you handle the issue of different sentence lengths in Seq2Seq models?
6. What are some evaluation metrics used for assessing the quality of machine translation outputs?
7. Can you explain the difference between beam search and greedy decoding in Seq2Seq models?
8. What are the limitations of Seq2Seq models for machine translation?

Project description:

The objective of this project is to train a translation model from scratch using the Transformer architecture. Specifically, we aim to build a German-to-English translation model using the torchtext library and the Multi30k dataset. The Transformer has been highly successful across natural language processing tasks, including machine translation. By training our own translation model, we can examine the architecture in detail and gain a deeper understanding of its inner workings.

To begin, we will use the torchtext library, a popular tool for text processing in PyTorch. This library provides a convenient way to access the Multi30k dataset, which contains parallel sentences in German and English. Multi30k is a widely used benchmark dataset in the machine translation community and offers a diverse range of sentence pairs for training and evaluating our model.

Training a translation model from scratch involves several key steps. We will preprocess the Multi30k dataset, tokenizing the text and building a vocabulary for both the source (German) and target (English) languages. Then we will design and implement the Transformer architecture, taking into account the specific requirements of machine translation. The model will be trained on the preprocessed data, with its parameters optimized by a suitable algorithm such as stochastic gradient descent (SGD) or Adam.

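
The preprocessing step described above (tokenization, vocabulary building, and handling sentences of different lengths) can be sketched in plain Python. This is a minimal illustration and not the project's actual pipeline: the whitespace tokenizer, the special-token names, and the helper names `build_vocab`, `encode`, and `pad_batch` are all assumptions made for this example, and a real setup would typically rely on torchtext utilities and a proper tokenizer.

```python
from collections import Counter

# Special tokens (names are an assumption for this sketch).
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]


def tokenize(sentence):
    """Lowercase whitespace tokenizer; a stand-in for a real tokenizer."""
    return sentence.lower().strip().split()


def build_vocab(sentences, min_freq=1):
    """Map every token seen at least min_freq times to an integer id."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def encode(sentence, vocab):
    """Convert a sentence to ids, wrapped in <bos>/<eos>; unknown words map to <unk>."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(sentence)]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]


def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


# Toy source-side (German) data; the target side would get its own vocabulary.
german = ["Ein Mann liest ein Buch .", "Eine Frau trinkt Kaffee ."]
src_vocab = build_vocab(german)
batch = pad_batch([encode(s, src_vocab) for s in german], src_vocab["<pad>"])
print(batch)
```

Padding to a common length is one common answer to the "different sentence lengths" interview question: the model receives rectangular batches, and the `<pad>` positions are masked out in attention and in the loss.
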
During the training process, we will monitor the model's performance using appropriate evaluation metrics, such as BLEU score, to assess the quality of the translations it produces. We will also use teacher forcing during training and beam search at decoding time to improve translation accuracy and fluency.

Upon completion of the training process, we will evaluate the trained translation model on a separate test set from the Multi30k dataset. This evaluation will provide insight into the model's generalization capabilities and its ability to translate German sentences into English accurately.

Overall, this project provides a comprehensive guide to training a translation model from scratch using the Transformer architecture. Using the torchtext library and the Multi30k dataset, we will develop a German-to-English translation model that captures the complexities of language translation. The knowledge and experience gained from this project can serve as a solid foundation for further exploration and improvement in machine translation.

#### Lecture 8: LLM in Practice - Text Classification

High-level NLP tasks can be roughly divided into representation and classification. Classification applies to topic classification, news classification, sentiment analysis, intent recognition, and more. Evaluation metrics: What is a confusion matrix? How are accuracy, precision, recall, and F1 score related? How do the ROC curve and AUC relate? Why do we use log loss to evaluate a model? The way model designs have evolved reflects trade-offs among the strengths and weaknesses of different neural network architectures. There is no best model, only a more suitable one.

Interview questions:

1. What are the applications of text classification in NLP?
2. Explain the definitions of accuracy, precision, recall, and F1 score, and how each is calculated.
3. What is a confusion matrix? How is it used to evaluate the performance of a classification model?
4. What do the ROC curve and AUC mean? How are they used to evaluate the performance of a classification model?
5. Why is log loss used to evaluate model performance? What are its advantages and use cases?

Project introduction:

This notebook serves as a comprehensive guide to fine-tuning BERT, a widely used pre-trained language model, to perform intent classification. Intent classification aims to map natural language instructions or sentences to a predefined set of intents. By leveraging BERT's language representation capabilities, we can build an accurate and robust intent classification model.

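
The evaluation metrics raised in this lecture (accuracy, precision, recall, F1, and log loss) can be computed directly from confusion-matrix counts. The sketch below covers only the binary case in plain Python; the function names are illustrative assumptions, and for a multi-class intent classifier one would normally use a library implementation and average the per-class scores.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 derived from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_prob holds predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
print(log_loss(y_true, [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]))
```

Unlike accuracy, log loss looks at the predicted probabilities rather than the hard labels, so it penalizes confident wrong predictions more heavily than uncertain ones.
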
Throughout this notebook, you will gain hands-on experience in several key aspects of intent classification using BERT. The following topics will be covered:

Data Loading and Preprocessing: You will learn how to load data from a CSV file and preprocess it for training and testing. This includes cleaning the data, tokenization, and creating suitable input representations for BERT.

BERT Model Loading: You will learn how to load a BERT model from TensorFlow Hub. TensorFlow Hub provides pre-trained BERT models that have been trained on large-scale datasets, so we can benefit from the model's rich contextual understanding of language.

Building a Custom Model: You will learn how to build your own intent classification model by combining the BERT model with a classifier. This involves adapting BERT to the intent classification task by adding a classification layer on top of it.

Fine-tuning BERT: You will work through the process of fine-tuning BERT as part of training your intent classification model. Fine-tuning lets the model adapt to the nuances and specific patterns of the intent classification task by updating its parameters on a smaller, task-specific dataset.

Model Saving and Utilization: Once your model is trained and fine-tuned, you will learn how to save it for future use. This includes saving the trained model's weights and architecture, so that you can deploy the model and use it to recognize the intent of new instructions or sentences.

By following this notebook, you will develop a strong understanding of the entire pipeline involved in fine-tuning BERT for intent classification. This includes data preprocessing, model loading, architecture customization, training, and model saving.
The knowledge gained from this project will equip you with the skills to build your own intent classification models using BERT and apply them to real-world applications where understanding user instructions or queries is crucial.

#### Lecture 9: LLM in Practice - Question Answering Systems

In this lecture we learn the concepts behind question-answering systems and how to build them, how to implement reading comprehension, question answering over a single passage of text, and question answering over multiple documents.

Interview questions:

1. What is a question-answering system?
2. How do you build a question-answering system?
3. How is reading comprehension done?
4. How do you produce an answer from a single passage of text?
5. How do you answer a question given a large collection of documents?

Project description:

This project aims to develop a question-answering system using GPT-3 and the OpenAI platform. By leveraging the natural language processing capabilities of GPT-3, we can create a system that understands questions posed by users and provides accurate and informative answers.

The question-answering system will be built on GPT-3, a large language model developed by OpenAI. GPT-3 has been trained on a vast amount of diverse textual data, enabling it to comprehend a wide range of questions and generate human-like responses.

The project workflow involves several key steps:

Data Collection and Preprocessing: We will gather a suitable dataset of questions and corresponding answers relevant to the desired domain or topic. The dataset will be preprocessed so that it is in a format suitable for training and evaluation.

Model Training and Fine-tuning: GPT-3 will be used as the base model for our question-answering system. We will use the OpenAI platform to train and fine-tune the model on the collected dataset. Fine-tuning allows the model to specialize in providing accurate answers for the specific domain or topic.

User Interaction: We will develop an interface where users can input their questions. The system will then use GPT-3 to process the question and generate an appropriate response, which will be presented to the user in a user-friendly format.

Evaluation and Improvement: To verify the system's performance and accuracy, we will evaluate its answers against a benchmark dataset or through user feedback. This evaluation will help identify areas where the system needs improvement and further fine-tuning.

Deployment and Integration: Once the question-answering system is trained and refined, it will be deployed to a suitable environment for easy access. This can involve integrating the system with a web application or creating an API that users can call.

By completing this project, you will gain practical experience in developing a question-answering system using GPT-3 and the OpenAI platform. This project will deepen your understanding of natural language processing, machine learning, and the capabilities of GPT-3. The resulting question-answering system can be used in various domains, such as customer support, knowledge bases, or interactive chatbots, where providing accurate answers to user queries is essential.

#### Lecture 10: Advanced LLM Technical Interview

This lecture covers advanced LLM techniques, mainly how to set up and handle problems involving inference, hallucination, and multi-agent systems.

Interview questions:

1. What types of LLM hallucination are there?
2. How do you build a multi-agent LLM system?
3. What are CoT (chain of thought) and ToT (tree of thoughts)?
4. How can we help an LLM reason?
5. What categories of methods are there for addressing hallucination?
6. How do you coordinate multiple agents to solve a task?
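
As a small illustration for the CoT question above, the sketch below assembles a few-shot chain-of-thought prompt and pulls the final answer out of a model completion. It only builds and parses strings and does not call any model. The `Q:`/`A:` layout, the "The answer is" marker, and the helper names `build_cot_prompt` and `extract_answer` are assumptions chosen for this example, not a fixed standard.

```python
def build_cot_prompt(question, examples):
    """Compose a few-shot chain-of-thought prompt.

    Each example is a (question, reasoning, answer) tuple; the worked
    reasoning shows the model the step-by-step style we want it to imitate.
    """
    parts = []
    for q, reasoning, answer in examples:
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # End with the new question and a cue that invites step-by-step reasoning.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


def extract_answer(completion, marker="The answer is"):
    """Return the text after the last marker, or None if the marker is absent."""
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    return completion[idx + len(marker):].strip().rstrip(".")


examples = [
    (
        "Tom has 3 apples and buys 4 more. How many apples does he have?",
        "Tom starts with 3 apples. Buying 4 more gives 3 + 4 = 7.",
        "7",
    )
]
prompt = build_cot_prompt("A box holds 6 eggs. How many eggs are in 5 boxes?", examples)
print(prompt)
print(extract_answer("Each box holds 6 eggs, so 5 boxes hold 5 * 6 = 30. The answer is 30."))
```

A direct prompt would contain only the question and ask for the answer. The chain-of-thought version differs in that the demonstrations include intermediate reasoning, which encourages the model to write out its own reasoning before committing to an answer.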