Machine Learning stories roundup 2023.9
General
Upgini can search public and premium external data sources to enrich features for LLM or neural network models. It can also use OpenAI GPT to generate features.
LLM
LLM Ecosystem
Costs of customizing LLMs:
- Closed-source APIs + document embedding database: This first solution is probably the easiest to get started with, and given the high quality of the ChatGPT API, it might even give you good enough (if not the best) performance. And it's cheap!
- Fine-tune LLMs: Recent progress from fine-tuning LLaMA-like models has shown this costs ~$500 to reach a baseline performance similar to ChatGPT in certain domains. It could be worthwhile if you have ~50–100k instructions or conversations to fine-tune a base model on.
- Train from scratch: As LLaMA and the more recent MPT-7B have shown, this costs ~$100–200k and takes a week or two.
Building with Instruction-Tuned LLMs, datasets used: instruction tuning (aligning with human preferences): https://huggingface.co/datasets/databricks/databricks-dolly-15k; fine-tuning: https://huggingface.co/datasets/FourthBrainGenAI/MarketMail-AI
Efficient Fine-Tuning for Llama-v2-7b on a Single GPU: PEFT, LoRA, QLoRA, quantization, paged optimizer (Adam optimizer state offloaded to CPU), gradient accumulation; fine-tuning for coding, dataset: https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K, using the declarative/low-code ML framework Ludwig.
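For reference, a minimal sketch of the PEFT + QLoRA setup the article describes, using transformers, peft, and bitsandbytes rather than the article's Ludwig config (the model name and hyperparameters here are illustrative assumptions):

```python
# Minimal QLoRA-style sketch (not the article's Ludwig config): load the base
# model with 4-bit quantized weights and attach LoRA adapters so only a tiny
# fraction of the parameters is trained.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; assumes you have been granted access

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: NF4 4-bit quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # adapters on the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA weights are trainable

# coding dataset mentioned in the article
dataset = load_dataset("HuggingFaceH4/CodeAlpaca_20K", split="train")
```

From here, a standard transformers Trainer run with gradient_accumulation_steps set and optim="paged_adamw_8bit" covers the gradient-accumulation and paged-optimizer points above (assuming a transformers version recent enough to support them).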
Deep Dive into LLM Evaluation with Weights & Biases: besides traditional accuracy, F1, exact match, and ROUGE, it also mentions SQuAD metrics, semantic answer similarity, and G-Eval.
There are many LLMs, but fine-tuning techniques such as reinforcement learning with human feedback (RLHF) require a particularly complicated workflow. Lamini is one of the open-source initiatives to streamline the LLM fine-tuning process. Main capabilities:
· The Lamini library includes optimized prompt-tuning and typed outputs, which you can try out in their playground right now.
· With only a few lines of code, you can access the advanced Lamini library for fine-tuning and RLHF by signing up for early access.
· The hosted data generator provides the building blocks for creating the data needed to train instruction-following LLMs.
· An instruction-following LLM that can be used with a few lines of code.
The Hugging Face trl package has pipeline support for SFT, reward modelling, and reinforcement learning training with human feedback, under its research projects.
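As a rough illustration, the SFT piece of trl looks like the sketch below (argument names follow the 2023-era trl quickstart with its imdb/opt-350m toy example; check your installed version):

```python
# Supervised fine-tuning with trl's SFTTrainer on a plain text dataset.
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    model="facebook/opt-350m",        # model name or a preloaded model object
    train_dataset=dataset,
    dataset_text_field="text",        # column holding the raw training text
    max_seq_length=512,
)
trainer.train()
```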
PPO: Instruction Tuning using RLHF involves training a reward model and then using RL to find a policy that maximizes the learned reward.
DPO: Direct Preference Optimization (DPO) is an efficient alternative to RLHF. It eliminates the need to train a reward model (finding a perfect reward function is hard) and then run reinforcement learning.
Human preference dataset from Stack Exchange, processed into human-accepted and human-rejected answers.
DPO evaluates the consistency of a reward function with empirical preference data using a theoretical preference model. While conventional approaches use the preference model to define a preference loss for training a reward model, DPO uses a change of variables to train a policy that directly maximizes the learned reward. As a result, given a dataset of human preferences over model responses, DPO can optimize a policy with a simple binary cross-entropy objective, without explicitly learning a reward function or sampling from the policy during training.
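To make that binary cross-entropy objective concrete, here is a minimal sketch of the DPO loss computed from summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model (variable names are mine; beta is the usual scaling coefficient):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: binary cross-entropy on the implicit reward margin.
    Each argument is the summed log-probability of a response under either
    the trained policy or the frozen reference model."""
    # Implicit rewards are beta-scaled log-ratios between policy and reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected via a logistic loss
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# toy usage with made-up log-probabilities for a batch of 2 preference pairs
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -16.0]),
                torch.tensor([-13.0, -15.5]), torch.tensor([-13.5, -15.5]))
```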
How Google Vertex AI can prevent LLM hallucination: grounding with embeddings and vector search (the general pattern is sketched after the list below). Enablers:
Vertex AI Embeddings for Text/Image enabling Semantic Search, Recommendation, Clustering, Anomaly Detection, Sentiment Analysis
Vertex AI Matching Engine: vector search
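A minimal sketch of the grounding pattern these two pieces enable; this is not the Vertex AI SDK, just the embed-retrieve-prompt idea using sentence-transformers and cosine similarity:

```python
# Grounding pattern behind "embeddings + vector search": embed documents,
# retrieve the nearest ones for a question, and have the LLM answer only
# from that retrieved context.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Vertex AI Matching Engine provides low-latency vector search.",
    "Embeddings map text to dense vectors so similar meanings end up nearby.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(question, k=1):
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(-scores)[:k]]

question = "What does Matching Engine do?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` is then sent to the LLM; grounding it in retrieved text reduces hallucination
```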
Interesting project: build your own self-hosted GPT with LangChain, GPT4All, LlamaCpp, Chroma and SentenceTransformers, in case data leakage is a serious risk.
Another article on privateGPT; notable things besides the private GPT model and the private embeddings provider (the example uses the SentenceTransformers all-MiniLM-L6-v2 embedding model): there is a token limit setting (a minimal sketch of the stack follows these notes)
- MODEL_N_CTX — Maximum token limit for both embeddings and LLM models
ingest.py allows you to ingest documents and list supported file types
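A minimal sketch of that self-hosted retrieval-QA stack with 2023-era LangChain imports (the model path and persist directory are placeholders, and import paths may differ in newer LangChain versions):

```python
# Self-hosted retrieval QA: local embeddings + Chroma + a local LlamaCpp model,
# so neither documents nor queries ever leave the machine.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

model_n_ctx = 1000  # MODEL_N_CTX: max token limit for the local model

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)  # built by ingest.py

llm = LlamaCpp(model_path="models/ggml-model.bin", n_ctx=model_n_ctx)  # placeholder path
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())

print(qa.run("What is in my documents?"))
```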
Automatic Metrics
Text-generation/Summarization tasks
- BLEU: Measures precision of n-grams between generated and reference texts. Useful for evaluating language fluency.
- ROUGE: Measures recall of n-grams between generated and reference texts. Also useful for language fluency.
- BLEU vs ROUGE: BLEU is a precision-oriented score, ROUGE is a recall-oriented score.
- BERTScore: Calculates similarity between BERT embeddings of generated and reference texts. Evaluates semantic similarity.
Classification tasks
- Accuracy: Fraction of examples predicted correctly. Good for classification tasks.
- F1 Score: Harmonic mean of precision and recall. Useful when classes are imbalanced.
Extraction tasks
- Exact Match: Binary score indicating whether the generated text exactly matches the reference. Useful for extraction tasks (a short sketch computing these automatic metrics follows this list).
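All of the automatic metrics above are available from the Hugging Face evaluate library; a minimal sketch (BERTScore downloads a model on first use):

```python
# Computing the automatic metrics above with the Hugging Face `evaluate` library.
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

bleu = evaluate.load("bleu").compute(predictions=predictions, references=[references])
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)
exact = evaluate.load("exact_match").compute(predictions=predictions, references=references)

print(bleu["bleu"], rouge["rougeL"], bertscore["f1"], exact["exact_match"])
```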
Human Evaluation
- Language Quality: Have humans rate grammar, fluency, consistency on a Likert scale.
- Engagingness: Score how interesting, diverse, and engaging conversations are.
- Correctness: Evaluate if responses are factually accurate and logically valid.
- Helpfulness: Assess if conversations resolve user queries appropriately.
- User Satisfaction: Overall subjective rating of conversation experience.
- Soundness: Assess the logical validity of recommendations provided by the chatbot. Have human experts review conversations to check that the reasoning is analytically sound.
- Ethicality: Evaluate whether recommendations adhere to ethical principles like transparency, fairness, avoiding bias etc. Human evaluations needed.
- Actionability: Score how precise and actionable the decision support provided is. Rate on a scale whether humans can act on the advice easily.
Vicuna is an open-source chatbot that has been fine-tuned (supervised instruction fine-tuning) from a LLaMA base model using approximately 70,000 user-shared conversations collected from ShareGPT.com with public APIs.
The team expanded the max context length from 512 in Alpaca to 2048 to enable a better understanding of long conversations.
Vicuna beats LLaMA and Alpaca in most tasks.
Streaming ChatGPT responses: the new API is simpler; set stream=True, then for each event in the response, get the event's delta and retrieve its content.
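A minimal sketch with the openai Python package as of late 2023 (the pre-1.0 interface; newer versions renamed these calls):

```python
# Streaming a ChatCompletion response: set stream=True, then read each
# event's delta as it arrives instead of waiting for the full answer.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)

for event in response:
    delta = event["choices"][0]["delta"]      # incremental part of the message
    print(delta.get("content", ""), end="", flush=True)
```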
Easy-to-digest explanation of the seminal SELF-INSTRUCT paper that led to another influential work, Stanford Alpaca.
Motivation for instruction following: Large Language Models are trained to predict the next token which, in general, can lead to untruthful, toxic, and unhelpful token generations. A better goal is to follow the user's instructions. In doing so, LLM generations can be truthful, helpful, and safe. This process is known as alignment.
SELF-INSTRUCT’s Motivation: reduce the dependence on human annotators, from cost, diversity and creativity perspectives.
6 steps:
- A bootstrapped pipeline generates tasks (and instances of those tasks) with a pre-trained model. This can be broken down into steps zero through four.
- Step 0 — Manual task creation seeding
- Step 1 — Instruction generation (with a prompt like "come up with a series of new tasks")
- Step 2 — Classification task identification
- Step 3 — Instance generation (input-first approach for non-classification, output-first approach for classification)
- Step 4 — Filter out similar tasks (ROUGE-L should be less than 0.7 to ensure diversity; see the sketch after this list)
- Step 5 — The final step fine-tunes the pre-trained language model on the generated tasks so that it follows instructions better.
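Step 4's diversity filter is easy to sketch with the rouge_score package: a candidate instruction is kept only if its ROUGE-L similarity with every instruction already in the pool stays below 0.7 (the paper's threshold); the example instructions below are made up:

```python
# SELF-INSTRUCT step-4-style filter: drop candidates whose ROUGE-L similarity
# to anything already in the pool is 0.7 or higher.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate, pool, threshold=0.7):
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < threshold
        for existing in pool
    )

pool = ["Write a short poem about autumn."]
for cand in ["Write a short poem about winter.", "Summarize the following article."]:
    if is_novel(cand, pool):
        pool.append(cand)
print(pool)  # the near-duplicate poem prompt is filtered out, the summarization task is kept
```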
Other
Contrastive learning lets a machine learning model learn which pairs of data points are “similar” and “different”, and thereby pick up higher-level features about the data, before it even has a task such as classification or segmentation. The process is:
- Create different versions of the same image with two augmentation combinations (e.g. crop + resize + recolor, resize + recolor, crop + recolor, etc.)
- Train the model to output similar representations for similar images.
- Maximize the similarity of the two vector representations by minimizing a contrastive loss function (sketched below).
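A minimal sketch of such a contrastive loss (SimCLR-style NT-Xent) for a batch of paired augmented views; the random tensors stand in for encoder outputs:

```python
# SimCLR-style contrastive loss: pull the two augmented views of the same image
# together and push all other images in the batch apart.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: [batch, dim] embeddings of two augmented views of the same images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # [2B, dim]
    sim = z @ z.T / temperature                    # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # ignore self-similarity
    batch = z1.size(0)
    # the positive for sample i is its other augmented view, at index (i + B) mod 2B
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

# toy usage with random "embeddings" standing in for encoder outputs
loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```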