LLaMA: Your Personal AI Language Model in Your Pocket

Large AI language model on your own computer: A LLaMA for your pocket?

Facebook’s parent company Meta has released its own large language model (LLM) called LLaMA (Large Language Model Meta AI). Meta made the pre-trained model, which was trained on publicly available data, accessible to the global AI research community. Meta has not tied the release to a specific product; instead, LLaMA is an open model that can be customized and run on personal hardware. LLaMA is also much smaller than models like ChatGPT and GPT-4, and a research team at Stanford has already used it to build a fine-tuned model, Alpaca, that can compete with ChatGPT.

Modern language models are based on complex neural networks with many layers that rely on techniques such as the attention mechanism. An early and influential model of this kind was Google’s BERT (Bidirectional Encoder Representations from Transformers), which helped revolutionize an entire industry. Over time, such models have grown in size as computing capacity and the amount of available training data increased. Training a language model is an extremely complex and expensive process in which billions of parameters have to be adjusted.
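The attention mechanism itself fits into a few lines of code. The following NumPy sketch shows scaled dot-product attention in its simplest form; the shapes and variable names are illustrative and not taken from any particular model.

```python
# Minimal sketch of scaled dot-product attention (NumPy), the core building
# block of Transformer models such as BERT and LLaMA. Shapes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of queries, keys and values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V                   # weighted mix of the value vectors

# Toy example: 4 tokens, 8-dimensional projections
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)          # -> (4, 8)
```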

For predicting tokens, that is, for applying the neural network, it is enough to evaluate the trained network (inference). Even with reduced precision this still takes a while on CPUs. Training such large models, by contrast, requires an enormous amount of hardware and computing power, with the cost of a single training run in the millions. Meta’s LLaMA training was probably no exception. The model is available in several sizes, ranging from 7 billion to 65 billion parameters.
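The parameter count translates directly into the memory needed just to hold the weights. A rough back-of-the-envelope calculation, which ignores activations and any framework overhead, shows why smaller models and reduced precision matter for running LLaMA on personal hardware:

```python
# Rough memory estimate for storing LLaMA weights at different precisions.
# Back-of-the-envelope math only: activations, caches and overhead are ignored.
SIZES_B = [7, 13, 33, 65]                    # LLaMA sizes in billions of parameters

for params_b in SIZES_B:
    fp16_gb = params_b * 1e9 * 2 / 2**30     # 2 bytes per parameter (16-bit float)
    q4_gb   = params_b * 1e9 * 0.5 / 2**30   # 4 bits = 0.5 bytes per parameter
    print(f"{params_b:>2}B params: ~{fp16_gb:5.1f} GiB in fp16, ~{q4_gb:5.1f} GiB at 4 bit")
```

Even the smallest 7B model needs around 13 GiB in 16-bit floating point, while a 4-bit version fits into roughly 3.5 GiB, which is within reach of an ordinary laptop.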

For both training and inference, text is split into tokens, with each token corresponding to a specific entry in the vocabulary. The token numbers of an entire passage are fed into the first layer of the Transformer model, where each token selects its own row of weights. Through the following layers the network predicts the next token as a probability distribution over the vocabulary; the chosen token is appended, and the prediction starts again with one more token than before. Each step boils down to matrix multiplications with billions of operations, which GPUs can carry out much faster than CPUs.
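The loop described here, predicting a distribution, picking a token, appending it and starting over, can be sketched independently of any concrete model. In the example below the "model" only returns random logits over a toy vocabulary; a real system would run the token IDs through its Transformer layers instead:

```python
# Sketch of greedy autoregressive generation over a toy vocabulary.
# The toy_model function is a stand-in for a real Transformer forward pass.
import numpy as np

VOCAB = ["<eos>", "the", "llama", "runs", "on", "a", "laptop"]
rng = np.random.default_rng(42)

def toy_model(token_ids):
    """Stand-in for a forward pass: logits over the vocabulary for the next token."""
    return rng.standard_normal(len(VOCAB))

def generate(prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids)                        # one forward pass per new token
        probs = np.exp(logits) / np.exp(logits).sum()  # probability distribution over vocab
        next_id = int(np.argmax(probs))                # greedy choice; sampling is also common
        if VOCAB[next_id] == "<eos>":
            break
        ids.append(next_id)                            # the sequence grows by one token each step
    return ids

print(" ".join(VOCAB[i] for i in generate([1, 2])))    # starts from "the llama"
```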

While GPUs are essential for training large language models, some developers have made approximate inference available on CPUs with reduced precision. To shrink the model and speed up the computation, the weights, which are stored as 16-bit floating point numbers in the original model, are quantized down to just four bits. Models evaluated on CPUs are slow but can still produce useful results.
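The idea behind this kind of quantization can be illustrated with a simple symmetric scheme: each block of weights is rescaled so that its values fit into the range a 4-bit integer can represent. This is only a simplified sketch of the principle, not the exact scheme used by any particular tool such as llama.cpp:

```python
# Minimal sketch of blockwise symmetric 4-bit quantization of fp16 weights.
# Simplified for illustration; real quantization schemes differ in detail.
import numpy as np

def quantize_4bit(weights, block_size=32):
    w = weights.astype(np.float32).reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0    # map each block into [-7, 7]
    scale = np.maximum(scale, 1e-8)                       # avoid division by zero
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale                                       # 4-bit values plus one scale per block

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)     # approximate original weights

weights = np.random.randn(1024).astype(np.float16)        # pretend these are fp16 model weights
q, scale = quantize_4bit(weights)
error = np.abs(dequantize(q, scale) - weights.astype(np.float32)).mean()
print(f"mean absolute quantization error: {error:.4f}")
```

The quantized weights take up a quarter of the memory, and the small per-block scale factors keep the approximation error low enough that the model still produces sensible output.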

In summary, LLaMA is an open language model that can be customized and run on personal hardware, which sets it apart from models like ChatGPT and GPT-4 that are only accessible as black boxes. While training large language models is expensive, quantized approximations make inference on CPUs feasible at the cost of some accuracy. As language models continue to grow in size, so will the hardware and computing power they require.
