Fine Tune Training AI Models: OpenAI, Anyscale & Infinite.Tech
Date Updated: January 14, 2024
Language models have only been mainstream for about a year, and the pace of improvement and breadth of access are hard to keep up with. Within this time, we have seen power struggles, literal and figurative, that have revealed the brittleness at the cutting edge of technology.
With the progression of open models and fine-tuning, we can expect a new slew of impressive and not-so-impressive implementations to understand, use, and balance. As you will learn, fine-tuning can drastically reshape a model's behavior, sometimes in unexpected directions.
As with many AI systems today, these techniques can provide quick and worthwhile results, but pushing them toward greatness can be expensive in terms of work, energy, and capital. In this series, we will use techniques and services like OpenAI and Anyscale for fine-tune training, and Infinite.Tech to build and export our training sets. Through this workflow, you can improve your custom AI application and remove the mystery and headache from the process.
Science Stanley, one of our greatly valued advisors and a seasoned LLM machine learning expert, uses a spider web to describe a trained model. The different bunchings of the web can be seen as peaks or key terms in the dataset, and when the model is prompted, you can see these bunchings being accentuated as they react to the new conditions.
Effective prompting can produce repeatable results within this structure, given stability in the system. To change that stability, we can fine-tune, adjusting the web into new configurations that meet our use case.
In the ever-evolving AI landscape, the focus on model sizes, data quality, and specific applications has brought a new understanding of how AI models are classified and utilized. The traditional classification into large, medium, and small models based on parameter count, while a useful starting point, is now supplemented with a deeper understanding of efficiency, versatility, and application-specific requirements.
Good models to start with for fine-tune training are the Llama family from Meta and mistralai/Mistral-7B-Instruct-v0.1, while the OpenAI playground allows training of babbage-002, davinci-002, and GPT-3.5-Turbo.
Check out the Atlas data visualizer from Nomic! Exploring the kinds of data used and how the model interprets them gives you an idea of the correlations through which each token is considered.
In the intricate web of AI model training, envisioning each keyword, prompt, and response as individual threads woven into the fabric of the dataset offers a vivid metaphor. These threads can bunch up, creating dense clusters of information, or spread out, weaving a broader but less concentrated tapestry. This imagery is particularly apt in the context of AI development in 2024, where the focus and granularity of data detailing and annotation have become a cornerstone of effective model training.
Certain applications, particularly those requiring repeatable data retrieval, might find a dense clustering of similar data threads beneficial. However, when applied indiscriminately, this approach can lead to peaks of over-concentration and potential information leakage in more general applications. Striking the right balance in the dataset’s structure is thus not just a technical challenge but an art form, requiring a nuanced understanding of the dataset’s nature and purpose.
Embeddings, a key natural language processing (NLP) technique, have become essential in AI and machine learning for identifying keywords and domain niches. They represent words and phrases as numerical vectors, enabling the analysis of text data by calculating similarities between terms. For example, in the automotive industry, embeddings can analyze search queries related to a specific brand like ‘Porsche.’ Using OpenAI embedding models or SentenceTransformers, embeddings create a multidimensional vector space where similar queries cluster together, revealing patterns like ‘buy a new Porsche’ or ‘Porsche maintenance issues.’ This approach is crucial for extracting relevant keywords and gaining insights into customer interests, significantly aiding the development of targeted content and marketing strategies. Embeddings are a powerful tool for dissecting and understanding text data, vital for AI-driven keyword and niche identification.
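To make this concrete, here is a minimal sketch of similarity analysis with SentenceTransformers; the model name ("all-MiniLM-L6-v2") and the example queries are illustrative assumptions, not part of any real dataset.

from sentence_transformers import SentenceTransformer, util

queries = [
    "buy a new Porsche",
    "Porsche maintenance issues",
    "best family SUV",
]

# Assumed general-purpose embedding model; swap in any model you prefer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(queries)  # one numerical vector per query

# Cosine similarity between all pairs: high values show queries that cluster together
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)

Queries about the same topic score close to 1.0 against each other, while unrelated queries score lower, which is the clustering behavior described above.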
System / User / Assistant
[
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello, How can I help you?"},
{"role": "user", "content": "How is the weather today?"},
{"role": "assistant", "content": "get_weather_status()"}
]
User / Assistant
[
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello, How can I help you?"},
{"role": "user", "content": "How is the weather today?"},
{"role": "assistant", "content": "get_weather_status()"}
]
Something important that is not always obvious is how JSONL files must be formatted: each training example is a single JSON object on its own (often very long) line. Saving your JSONL incorrectly will throw errors. Using the Save to Training Set feature in Infinite.Tech, you can export your projects to the correct file format, so you don’t have to be a data scientist to get started.
{"messages":[{"role": "system", "content": "You are a friendly assistant."},{"role": "user", "content": "Hello"},{"role": "assistant", "content": "Hi there! How can I assist you today?"},{"role": "user", "content": "Can you tell me the current time?"},{"role": "assistant", "content": "Certainly! get_current_time()"}]}
{"messages": [{"role": "system", "content": "You are a serious assistant. "},{"role": "user", "content": "Hello"},{"role": "assistant", "content": "Good day. How may I be of service?"},{"role": "user", "content": "Can you tell me the current time?"},{"role": "assistant", "content": " Of course. get_current_time"}]}
{"messages": [{"role": "system", "content": "You are a professional assistant."},{"role": "user", "content": "Hello"},{"role": "assistant", "content": "Hello! How may I assist you today?"},{"role": "user", "content": "Can you tell me the current time?"},{"role": "assistant", "content": "Absolutely, let me check that for you. get\_current\_time()"}]}
On Anyscale, every job is a minimum of $5.
Then, jobs are charged per million tokens, depending on the model. A good way of estimating token counts is the OpenAI tokenizer.
Using OpenAI, jobs are priced based on tokens times epochs. As of now, the playground does not let you specify the epoch number.
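As a rough sketch, you can estimate token counts with the tiktoken library and multiply by a per-million-token rate and an epoch count. The rate and epoch values below are placeholders, not actual prices, so check the current Anyscale and OpenAI pricing pages.

import tiktoken

# Count tokens in each message's content; this approximation ignores the
# small per-message overhead of the chat format.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
messages = [
    {"role": "user", "content": "How is the weather today?"},
    {"role": "assistant", "content": "get_weather_status()"},
]
total_tokens = sum(len(enc.encode(m["content"])) for m in messages)

price_per_million = 8.00   # placeholder rate, USD per million training tokens
epochs = 3                 # assumed epoch count
estimated_cost = total_tokens / 1_000_000 * price_per_million * epochs
print(total_tokens, estimated_cost)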
Now, it is time to create your fine-tuning job. This can be done, as demonstrated here, through the Anyscale fine-tuning console or the OpenAI fine-tuning playground, or programmatically through their APIs.
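Programmatically, the flow with the OpenAI Python SDK looks roughly like the sketch below; the file name and epoch count are assumptions, and Anyscale exposes a similar upload-then-create pattern through its own endpoints.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the JSONL training set
training_file = client.files.create(
    file=open("training_set.jsonl", "rb"),  # assumed file name
    purpose="fine-tune",
)

# 2. Create the fine-tuning job against the uploaded file
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3},  # assumed; omit to let the service choose
)

# 3. Check the job status while it runs
print(client.fine_tuning.jobs.retrieve(job.id).status)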
To further use and test the models in different ways, navigate to app.Infinite.Tech and follow these steps:
1. Generate or Interact: Use the selected model to generate design variations, interact with the design, or receive AI-generated insights and suggestions.
2. Evaluate Results: Evaluate the generated designs or insights based on your project goals, design requirements, or any specific parameters you’ve set.
3. Iterate and Refine: Based on the results, iterate on your design, make refinements, and continue using the fine-tuned model as needed to enhance your design process.
With a little prompt engineering, the concept-tuned model performs as expected and formats the instruction manuals.
We hope this inspires you to train specialized models. With huge potential for impact across industries and use cases, we are still in the early days of what will come and how we will create new and exciting products with fine-tuning.