Fine Tune Training AI Models: OpenAI, Anyscale & Infinite.Tech

Date Updated: January 14, 2024

Language models have only been mainstream for about a year, and the pace of improvement and broadening access is hard to keep up with. Within this time, we have seen power struggles, literal and figurative, that have revealed the brittleness at the cutting edge of technology.

With the progression of open models and fine-tuning, we can be sure that we will see a new slew of impressive and not-so-impressive implementations to understand, use, and balance. As you will learn, fine-tuning can push a model's internal structure into drastically different, sometimes far-out, configurations.

As with many AI systems today, these techniques can provide quick and worthwhile results, but pushing them toward greatness can be expensive in terms of work, energy, and capital. In this series, we will use services like OpenAI and Anyscale for fine-tune training, while Infinite.Tech builds and exports our training sets. This workflow should improve your custom AI application and take the mystery and headache out of the process.

First-Principles Knowledge Webs

Science Stanley, one of our greatly valued advisors and a seasoned LLM machine-learning expert, uses the spider web as a metaphor for a trained model. The different bunchings of the web can be viewed as peaks, or key terms, in the dataset; when the model is prompted, you can see these bunchings being accentuated as they react to the new conditions.

Effective prompting can produce repeatable results here, given stability in the system. To change that stability, we can fine-tune, adjusting these webs into new configurations that fit our use case.

Models to Train and Data Needed

In the ever-evolving AI landscape, the focus on model sizes, data quality, and specific applications has brought a new understanding of how AI models are classified and utilized. The traditional classification into large, medium, and small models based on parameter count, while a useful starting point, is now supplemented with a deeper understanding of efficiency, versatility, and application-specific requirements.

  • Size and Efficiency: Contrary to earlier beliefs that larger models are inherently better, there is now a recognition of the value of smaller, more efficient models. For instance, Small Language Models (SLMs) offer several advantages over their larger counterparts. They are more efficient and versatile and can be deployed even on devices with limited processing capabilities. Tailored to specific tasks or domains, these SLMs can deliver improved performance with reduced training times, marking a significant shift in usability.
  • Data Quality Over Quantity: The focus in AI model training has shifted from the sheer volume of data to the quality of data. High-quality training data has become essential, as the reliance on outputs from existing models for training purposes can result in less capable AI systems. The challenge now lies in determining the right mix of training data to enhance the capabilities of these models.
  • Diverse Data Types and Applications: Modern AI models are being trained on a combination of data types, such as natural language, computer code, images, and videos, to broaden their applicability and improve their effectiveness. This multidimensional approach to training is vital for enhancing the models’ capabilities to address various real-world challenges.

Good models to start with for fine-tune training are the Llama family from Meta and mistralai/Mistral-7B-Instruct-v0.1; the OpenAI playground allows training of babbage-002, davinci-002, and GPT-3.5-Turbo.

Check out the Atlas data visualizer from Nomic! Looking around at the type of data used, and how the model interprets it, gives you an idea of the correlations through which each token is considered.

Techniques and Philosophies of Good AI Model Training Datasets

In the intricate web of AI model training, envisioning each keyword, prompt, and response as individual threads woven into the fabric of the dataset offers a vivid metaphor. These threads can bunch up, creating dense clusters of information, or spread out, weaving a broader but less concentrated tapestry. This imagery is particularly apt in the context of AI development in 2024, where the focus and granularity of data detailing and annotation have become a cornerstone of effective model training.

Certain applications, particularly those requiring repeatable data retrieval, might find a dense clustering of similar data threads beneficial. However, when applied indiscriminately, this approach can lead to peaks of over-concentration and potential information leakage in more general applications. Striking the right balance in the dataset’s structure is thus not just a technical challenge but an art form, requiring a nuanced understanding of the dataset’s nature and purpose.

Identifying Keywords, Concepts, Focuses, and Domain Niches

Embeddings, a key natural language processing (NLP) technique, have become essential in AI and machine learning for identifying keywords and domain niches. They represent words and phrases as numerical vectors, enabling the analysis of text data by calculating similarities between terms. For example, in the automotive industry, embeddings can analyze search queries related to a specific brand like ‘Porsche.’ Using OpenAI embedding models or SentenceTransformers, embeddings map queries into a multidimensional vector space where similar queries cluster together, revealing patterns like ‘buy a new Porsche’ or ‘Porsche maintenance issues.’ This approach is crucial for extracting relevant keywords and gaining insight into customer interests, significantly aiding the development of targeted content and marketing strategies. Embeddings are a powerful tool for dissecting and understanding text data, vital for AI-driven keyword and niche identification.
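
As a rough illustration of how this clustering works, here is a minimal sketch using the sentence-transformers and scikit-learn libraries; the model name, example queries, and cluster count are illustrative assumptions rather than recommendations.

# Minimal sketch: cluster related search queries by embedding similarity.
# Assumes sentence-transformers and scikit-learn are installed; the model,
# queries, and number of clusters are illustrative only.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

queries = [
    "buy a new Porsche",
    "Porsche maintenance issues",
    "Porsche 911 price",
    "how to fix Porsche brake noise",
]

model = SentenceTransformer("all-MiniLM-L6-v2")    # small general-purpose embedding model
vectors = model.encode(queries)                    # one vector per query

labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
for query, label in zip(queries, labels):
    print(label, query)                            # queries in the same cluster share a theme

Queries that land in the same cluster point to a shared topic, which is a quick way to surface candidate keywords and niches for a training set.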

Formatting Training Data and Prepping for Training

s/u/a/u/a Pattern

System / User / Assistant

[
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello, How can I help you?"},
{"role": "user", "content": "How is the weather today?"},
{"role": "assistant", "content": "get_weather_status()"}
]

u/a/u/a Pattern

User / Assistant

[
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello, How can I help you?"},
{"role": "user", "content": "How is the weather today?"},
{"role": "assistant", "content": "get_weather_status()"}
]

Saving the Training Set and Putting the L in JSON Lines

Something important that is not always obvious is how JSONL files should be formatted: each training example must be a single long line of JSON. Saving your JSONL incorrectly will throw errors. Using the Save to Training Set feature in Infinite.Tech, you can export your projects to the correct file format, so you don’t have to be a data scientist to get started.

{"messages":[{"role": "system", "content": "You are a friendly assistant."},{"role": "user", "content": "Hello"},{"role": "assistant", "content": "Hi there! How can I assist you today?"},{"role": "user", "content": "Can you tell me the current time?"},{"role": "assistant", "content": "Certainly! get_current_time()"}]}
{"messages": [{"role": "system", "content": "You are a serious assistant. "},{"role": "user", "content": "Hello"},{"role": "assistant", "content": "Good day. How may I be of service?"},{"role": "user", "content": "Can you tell me the current time?"},{"role": "assistant", "content": " Of course. get_current_time"}]}
{"messages": [{"role": "system", "content": "You are a professional assistant."},{"role": "user", "content": "Hello"},{"role": "assistant", "content": "Hello! How may I assist you today?"},{"role": "user", "content": "Can you tell me the current time?"},{"role": "assistant", "content": "Absolutely, let me check that for you. get\_current\_time()"}]}

Pricing Structures

Anyscale

On Anyscale, every job has a minimum charge of $5.

Beyond that, jobs are charged per million tokens, with the rate depending on the model. A good way to estimate token counts is the OpenAI tokenizer.
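
For a local estimate, the tiktoken library exposes the same tokenizers; the model name and sample text below are placeholders.

# Minimal sketch: estimate token counts with tiktoken (pip install tiktoken).
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
sample = "Hello! How can I assist you today?"
print(len(encoding.encode(sample)))   # number of tokens in the sample text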

Open AI

Using OpenAI, jobs are priced based on tokens multiplied by epochs. As of now, the playground does not let you specify the number of epochs.
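
As a rough back-of-the-envelope sketch of that formula (the per-token rate and counts below are placeholders; check the current pricing page):

# Rough cost estimate: training tokens x epochs x price per token.
# The rate and counts below are placeholders, not published prices.
training_tokens = 250_000        # total tokens across the training file
epochs = 3                       # number of passes over the data
price_per_1k_tokens = 0.008      # placeholder rate in USD per 1K training tokens

estimated_cost = training_tokens / 1000 * price_per_1k_tokens * epochs
print(f"Estimated fine-tuning cost: ${estimated_cost:.2f}")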

Fine-Tune Training the Model!

Now it is time to create your fine-tuning job. This is accomplished, as demonstrated here, through the Anyscale Fine-Tuning Console or the OpenAI Fine-Tune Playground, or programmatically through their APIs.

Anyscale

    1. Go to the Anyscale Fine Tuning Console and hit the Create button.
    2. Next, upload the .jsonl dataset and set the model to your preference; a good starter is meta-llama/Llama-2-13b-chat-hf.
    3. If the data is correctly formatted, clicking Submit starts the training job.
    4. To check out the model, navigate to the model tab.
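
The same job can also be submitted programmatically. The sketch below assumes Anyscale exposes an OpenAI-compatible endpoint for file uploads and fine-tuning jobs; the base URL, environment variable, and file name are assumptions to verify against Anyscale's documentation.

# Sketch: submit an Anyscale fine-tuning job through an OpenAI-compatible API.
# Base URL, env var name, and file name are assumptions; check Anyscale's docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",   # assumed Anyscale endpoint
    api_key=os.environ["ANYSCALE_API_KEY"],
)

training_file = client.files.create(
    file=open("training_set.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="meta-llama/Llama-2-13b-chat-hf",
)
print(job.id)   # poll this job ID to track training progress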

Open AI

    1. Go to the Open AI Fine Tune Playground and hit the green Create button.
    2. Next, upload the .jsonl dataset and set the base model to your preference; a good starter is gpt-3.5-turbo-1106.
    3. If the data is correctly formatted, clicking Submit starts the training job.
    4. To check out the model, navigate to the model tab.
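
The equivalent job can be created through the OpenAI Python SDK; the file name and model snapshot below are illustrative.

# Sketch: create an OpenAI fine-tuning job with the official Python SDK.
# File name and model snapshot are illustrative; the API key is read from OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()   # uses the OPENAI_API_KEY environment variable

training_file = client.files.create(
    file=open("training_set.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo-1106",
)
print(job.id, job.status)   # the job trains asynchronously; check status until it succeeds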

Using & Testing the Fine-Tuned Model In Infinite.Tech

To further use and test the model in different ways, navigate to app.Infinite.Tech and follow these steps:

    1. Select Fine-Tuned Model: Add the created Model ID to the custom models.
    2. Generate or Interact: Use the selected model to generate design variations, interact with the design, or receive AI-generated insights and suggestions.
    3. Evaluate Results: Evaluate the generated designs or insights based on your project goals, design requirements, or any specific parameters you’ve set.
    4. Iterate and Refine: Based on the results, iterate on your design, make refinements, and continue using the fine-tuned model as needed to enhance your design process.

With a little prompt engineering, the concept-tuned model performs as expected and formats the instruction manuals.
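
Outside the app, the fine-tuned model can also be called directly with its returned model ID; the ID and prompt below are placeholders.

# Sketch: call a fine-tuned chat model by the model ID returned from the job (placeholder shown).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-1106:your-org::abc123",   # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Draft the first page of the instruction manual."},
    ],
)
print(response.choices[0].message.content)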

Get Started Today

We hope this inspires you to train specialized models. With huge potential for impact across industries and use cases, we are still in the early days of what fine-tuning will enable and the new and exciting products it will help create.