The ChatGPT-rivaling "little alpaca" can now run on a Mac with 2 lines of commands and a single GPU, as UC Berkeley releases another 7-billion-parameter open-source model


Shulou (Shulou.com) 11/24 report --

Shortly after announcing the weights of its 13-billion-parameter model, UC Berkeley's LMSYS Org has released a 7-billion-parameter "little alpaca". On the same day, Hugging Face released StackLLaMA, another 7-billion-parameter model.

Ever since Meta released LLaMA, the "open-source version of ChatGPT", the academic community has been on a roll.

First Stanford proposed the 7-billion-parameter Alpaca; then UC Berkeley, together with CMU, Stanford, UCSD and MBZUAI, released the 13-billion-parameter Vicuna, which rivals ChatGPT and Bard in more than 90% of cases.

Today, the "roll king" UC Berkeley LMSys org released another 7 billion parameter Vicuna--.

It is not only small, efficient and capable; it can also run on M1/M2 Macs with just two lines of commands, and even with GPU acceleration turned on.

Project address: https://github.com/lm-sys/FastChat/#fine-tuning

Also today, researchers at Hugging Face released a 7-billion-parameter model, StackLLaMA, fine-tuned from LLaMA-7B with reinforcement learning from human feedback (RLHF).

Vicuna-7B: runs on a single GPU, or even a Mac

Less than a week after the model's release, UC Berkeley LMSYS Org announced the weights of Vicuna-13B.

Running the 13B model on a single GPU requires about 28 GB of VRAM; with CPU only, it needs about 60 GB of RAM.

The 7-billion-parameter version released this time is much lighter: the requirements are roughly cut in half.

In other words, running Vicuna-7B on a single GPU requires only 14 GB+ of VRAM, while CPU-only inference needs only 30 GB+ of RAM.

Not only that, GPU acceleration can also be enabled through the Metal backend on Macs equipped with Apple silicon or an AMD GPU.
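As a rough sketch, the "two lines" look something like this (the fschat package name and the --model-name flag match what the article shows below; the --device mps flag for the Metal backend is an assumption to check against the FastChat README):

pip3 install fschat
python3 -m fastchat.serve.cli --model-name /path/to/vicuna/weights --device mps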

When the 13B model was released, many netizens complained:

"Single GPU" as I imagined it: an RTX 4090.

"Single GPU" in reality: 28 GB of VRAM or more.

Now there is also a new solution to this problem: 8-bit compression, which directly cuts memory usage roughly in half, at the cost of a slight drop in model quality.

The 13B model's 28 GB instantly becomes 14 GB, and the 7B model's 14 GB instantly becomes 7 GB (though because of activations, actual usage will be somewhat higher than this).

In response, LMSYS Org researchers note that if you do not have enough RAM or VRAM, you can enable 8-bit compression by adding --load-8bit to the command below.

Moreover, it works whether you are running on CPU, GPU or Metal, and for both the 7B and 13B models.

python3 -m fastchat.serve.cli --model-name /path/to/vicuna/weights --load-8bit

StackLLaMA: a comprehensive RLHF training tutorial

Also today, Hugging Face researchers published a blog post, "StackLLaMA: a practical guide to training LLaMA with RLHF".

Today's large language models, such as ChatGPT, GPT-4 and Claude, all use reinforcement learning from human feedback (RLHF) to fine-tune the model's behavior so that its responses better match user intent.

Here, the Hugging Face researchers trained a LLaMA model to answer questions on Stack Exchange with RLHF, combining all of the following steps:

Supervised fine-tuning (SFT)

Reward / preference modeling (RM)

Reinforcement learning from human feedback (RLHF)

Note:

The main goal of training StackLLaMA is to provide a tutorial and guide on how to train models with RLHF, rather than to chase model performance.

In other words, the model can be quite funny when generating answers. For example, when asked, "There is a camel in my garden, how can I drive it away?",

StackLLaMA wraps up its answer with: "If none of the above works, call in reinforcements. If more than one person wants to catch this strange little guy, why not assemble a team? By working together and pooling our efforts, the problem should be solved soon."

When doing RLHF, the most important thing is to start from a strong model, because RLHF is only a fine-tuning step that aligns the model with the way we expect it to interact and respond.

Meta's open-source LLaMA models range from 7B to 65B parameters and were trained on 1T to 1.4T tokens, making them among the most capable open-source models available today.

Therefore, the researchers used the 7B model as the basis for subsequent fine-tuning.

For the dataset, the researchers used the StackExchange dataset, which includes questions and answers from all of its sites (including StackOverflow and other topics).

The advantage of this dataset is that each answer comes with its number of upvotes and a label indicating whether it is the accepted answer.

The researchers scored each answer using the method described in the paper "A General Language Assistant as a Laboratory for Alignment":

score = log2(1 + upvotes), rounded to the nearest integer, plus 1 if the questioner accepted the answer (a score of −1 is assigned if the number of upvotes is negative). For the reward model, two answers to each question are always needed for comparison.
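A minimal sketch of this scoring rule in Python (the function name and signature are illustrative, not taken from the Hugging Face code):

import math

def answer_score(upvotes: int, accepted: bool) -> int:
    # Negative upvote counts map directly to a score of -1.
    if upvotes < 0:
        return -1
    # log2(1 + upvotes), rounded to the nearest integer.
    score = round(math.log2(1 + upvotes))
    # One extra point if the questioner accepted this answer.
    return score + 1 if accepted else score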

Some questions have dozens of answers, which would yield a very large number of possible pairs, so the researchers sampled at most ten answer pairs per question to limit the number of data points per question.

Finally, the text was cleaned up by converting HTML to Markdown, which makes the model's output more readable.

Training strategy

Training even the smallest LLaMA model requires a lot of memory. A quick estimate for the 7B-parameter model: with 2 bytes per parameter for half-precision weights plus roughly 8 bytes per parameter of optimizer state (e.g. Adam), (2 + 8) * 7B = 70 GB of memory is needed, and even more may be required when computing intermediate values such as attention scores. So the model cannot be trained even on a single 80 GB A100.

One option is to use more efficient optimizers and half-precision training to fit more into memory, but it is still not enough.

Another option is to use parameter-efficient fine-tuning (PEFT) techniques, such as the peft library, which can apply low-rank adaptation (LoRA) to a model loaded in 8-bit.

(Figure caption) Low-rank adaptation of a linear layer: additional adapter parameters (orange) are added next to the frozen layer (blue), and the hidden states they produce are added to the hidden states of the frozen layer.

Loading the model in 8-bit greatly reduces the memory footprint, because each weight then needs only one byte; 7B LLaMA, for example, takes 7 GB of memory.

LoRA does not train the original weights directly; instead it adds small adapter layers on top of certain specific layers (usually the attention layers), so the number of trainable parameters is drastically reduced.
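A minimal sketch of this combination (8-bit loading plus LoRA adapters) using the transformers and peft libraries; the model path, LoRA rank and target modules here are illustrative placeholders, not the blog's exact settings:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the base model with 8-bit weights (requires bitsandbytes).
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b",          # hypothetical local path to the LLaMA weights
    load_in_8bit=True,
    device_map="auto",
)
# Older peft versions; newer ones call this prepare_model_for_kbit_training.
model = prepare_model_for_int8_training(model)

# Attach small trainable LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=16,                        # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable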

In this setup, a rule of thumb is to allocate roughly 1.2-1.4 GB of memory per billion parameters (depending on batch size and sequence length) to fit the entire fine-tuning setup; for the 7B model, that is roughly 8.4-9.8 GB.

This makes it possible to fine-tune larger models at low cost (up to 50-60B-parameter models on an 80 GB NVIDIA A100). These techniques have already made it possible to fine-tune large models on consumer hardware such as a Raspberry Pi, a mobile phone, or Google Colab.

The researchers found that although a very large model can now fit on a single GPU, training may still be very slow.

Here, the researchers used the simplest data-parallelism strategy: replicate the same training setup on each GPU and pass a different batch to each.
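A toy sketch of that pattern with the accelerate library (the model, data and hyperparameters below are placeholders; the blog trains LLaMA with its own scripts):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                  # one process per GPU when started with `accelerate launch`
model = nn.Linear(16, 1)                     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
loader = DataLoader(data, batch_size=32)

# prepare() moves everything to the right device and shards batches across processes.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:                          # each GPU sees different batches
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)               # gradients are synchronized across GPUs
    optimizer.step()
    optimizer.zero_grad()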

Supervised fine-tuning

Before starting to train the reward model and tuning the model with RL, the model needs instruction tuning so that it follows instructions in the first place.

The easiest way to do this is to continue training the language model on text from the target domain or task.

To use the data efficiently, the researchers used a technique called "packing": many texts are concatenated, separated by an EOS token, and the resulting stream is cut into context-sized chunks to fill each batch without any padding.

This makes training more efficient, because every token passed through the model is also trained on.
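A minimal sketch of the packing idea (the function below works on already-tokenized texts; its name and signature are illustrative):

from typing import List

def pack(tokenized_texts: List[List[int]], eos_token_id: int, block_size: int) -> List[List[int]]:
    # Concatenate all texts into one long stream, inserting an EOS token between them.
    stream: List[int] = []
    for ids in tokenized_texts:
        stream.extend(ids)
        stream.append(eos_token_id)
    # Cut the stream into fixed-size, padding-free blocks (the incomplete tail is dropped).
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]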

In principle, the model could be fine-tuned with RLHF directly on human annotations; however, that would require sending samples to humans for rating after every optimization iteration.

Because many training samples are needed to reach convergence, and humans read and label at an inherently limited speed, this approach would be not only expensive but also very slow.

Therefore, the researchers train a reward model on the collected human annotations before the RL step. The goal of reward modeling is to imitate how a human would rate a piece of text, which is much more efficient than direct human feedback.

In practice, the best approach is to predict the ranking between two examples: given a prompt x, the reward model is shown two candidate answers and has to predict which one a human annotator would rate higher.
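Concretely, this corresponds to the pairwise ranking loss implemented in the code below (standard notation, not from the article):

loss(x, y_j, y_k) = -log( sigmoid( r(x, y_j) - r(x, y_k) ) )

where r(x, y) is the scalar score the reward model gives answer y for prompt x, and y_j is the answer the annotators preferred.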

Using the StackExchange dataset, the researchers can infer which of two answers users preferred from their scores. With that information and the loss defined above, the transformers.Trainer can be modified by adding a custom loss function.

import torch.nn as nn
from transformers import Trainer

class RewardTrainer(Trainer):
    # Pairwise ranking loss: answer "j" was preferred by annotators over answer "k".
    def compute_loss(self, model, inputs, return_outputs=False):
        rewards_j = model(input_ids=inputs["input_ids_j"],
                          attention_mask=inputs["attention_mask_j"])[0]
        rewards_k = model(input_ids=inputs["input_ids_k"],
                          attention_mask=inputs["attention_mask_k"])[0]
        loss = -nn.functional.logsigmoid(rewards_j - rewards_k).mean()
        if return_outputs:
            return loss, {"rewards_j": rewards_j, "rewards_k": rewards_k}
        return loss

The researchers trained on a subset of 100,000 candidate pairs and evaluated on a held-out set of 50,000 candidate pairs.

Training was tracked with Weights & Biases and took a few hours on 8 A100 GPUs; the final accuracy of the model was 67%.

Although this may not sound like a high score, the task is very hard even for human annotators.

Reinforcement learning from human feedback

With a fine-tuned language model and a reward model in hand, the RL loop can now be run. It consists of roughly three steps (a sketch with the trl library follows the list):

Generate a response according to the prompt

Grade the answers according to the reward model

Optimize the policy with reinforcement learning against those ratings
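A condensed sketch of this loop using the trl library (the gpt2 stand-in model, generation settings and the placeholder reward function are illustrative, and exact trl APIs vary by version):

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base = "gpt2"  # small stand-in for the supervised-fine-tuned LLaMA policy
model = AutoModelForCausalLMWithValueHead.from_pretrained(base)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(base)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=2, mini_batch_size=1), model, ref_model, tokenizer)

def reward_fn(text: str) -> torch.Tensor:
    # Placeholder for the trained reward model's scalar score.
    return torch.tensor(1.0 if "Answer:" in text else 0.0)

prompts = ["Question: How do I flatten a list of lists in Python?\n\nAnswer: "] * 2
queries = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# 1. Generate responses from the current policy.
responses = []
for q in queries:
    out = ppo_trainer.generate(q, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    responses.append(out.squeeze(0)[q.shape[0]:])   # keep only the newly generated tokens

# 2. Score prompt+response pairs with the reward model.
texts = [tokenizer.decode(torch.cat([q, r])) for q, r in zip(queries, responses)]
rewards = [reward_fn(t) for t in texts]

# 3. One PPO optimization step; the KL penalty against ref_model is applied internally.
stats = ppo_trainer.step(queries, responses, rewards)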

Query and response prompts are wrapped in a template before being passed to the model; the same template is used in the SFT, RM and RLHF stages:

Question: ...

Answer: ...

A common problem when training language models with RL is that the model can learn to exploit the reward model by generating complete nonsense that the reward model nonetheless scores highly.

To counteract this, the researchers add a penalty to the reward: an untrained copy of the model is kept as a reference, and the new model's generations are compared against the reference model's by computing the KL divergence between them.
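Schematically, following the standard RLHF setup (the coefficient β is not given in the article), the quantity being maximized becomes:

R(x, y) = r(x, y) - β * KL( π(y | x) || π_ref(y | x) )

where r is the reward model's score, π is the policy being trained and π_ref is the frozen reference model.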

Rewards are computed per batch at each training step, and the model's performance levels off after about 1,000 steps.

References:

https://twitter.com/lmsysorg/status/1644060638472470528?s=20

https://huggingface.co/blog/stackllama

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era).
