Open Source Language Model Named Dolly 2.0 Trained Similarly To ChatGPT (2024)

Databricks announced the release of Dolly 2.0, which it describes as the first open source instruction-tuned language model. It was trained using a methodology similar to that of InstructGPT, but with a claimed higher quality dataset that is 100% open source.

This model is free to use, including for commercial purposes, because every part of the model is 100% open source.

Open Source Instruction Training

What makes ChatGPT able to follow directions is the training it receives using techniques outlined in the InstructGPT research paper.

The breakthrough discovered with InstructGPT is that language models don’t need larger and larger training sets.

By training on question and answer data evaluated by humans, OpenAI was able to train a better language model using one hundred times fewer parameters than the previous model, GPT-3.

Databricks used a similar approach to create a prompt-and-response dataset they call databricks-dolly-15k.

Their prompt/response dataset was created without scraping web forums or Reddit.

databricks-dolly-15k is a dataset of 15,000 prompt and response pairs, 100% original and human generated by Databricks employees, designed to train the Dolly 2.0 language model in the same way that the ChatGPT model was created with InstructGPT.

The Hugging Face page for the dataset explains how they did it:

“databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b that was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

…Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category.

The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.”
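To illustrate what such a dataset looks like in practice, the sketch below builds a few hypothetical records mirroring the fields published for databricks-dolly-15k (instruction, optional context, response, and category) and tallies examples per category. The example records themselves are invented for illustration, not drawn from the dataset:

```python
from collections import Counter

# Hypothetical records mirroring the published field layout of
# databricks-dolly-15k: instruction, context, response, category.
records = [
    {
        "instruction": "Classify each as a fruit or a vegetable: apple, carrot.",
        "context": "",
        "response": "Apple is a fruit; carrot is a vegetable.",
        "category": "classification",
    },
    {
        "instruction": "When was Databricks founded?",
        "context": "Databricks was founded in 2013 by the creators of Apache Spark.",
        "response": "Databricks was founded in 2013.",
        "category": "closed_qa",
    },
    {
        "instruction": "Brainstorm names for a pet goldfish.",
        "context": "",
        "response": "Bubbles, Goldie, Finley.",
        "category": "brainstorming",
    },
]

# Tally how many examples fall into each instruction category.
per_category = Counter(r["category"] for r in records)
print(per_category["classification"])  # 1
```

Note that closed QA records carry a context field (e.g. a Wikipedia excerpt), while open-ended categories such as brainstorming typically leave it empty.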

Databricks claims that this may be the very first human generated instruction dataset created to train a language model to follow instructions, just like ChatGPT does.

The challenge was to create a 100% original dataset that had zero ties to ChatGPT or any other source with a restrictive license.

Employees were incentivized by a contest to contribute to generating the 15,000 prompt/responses along seven categories of tasks such as brainstorming, classification, and creative writing.

Databricks asserts that the databricks-dolly-15k training set may be superior to the dataset used to train ChatGPT.

They note that although their dataset is smaller than the one used to train the Stanford Alpaca model, their model performed better because their data is higher quality.

They write:

“Dolly 2.0 model, based on EleutherAI’s pythia-12b, exhibited high-quality instruction following behavior. In hindsight, this isn’t surprising.

Many of the instruction tuning datasets released in recent months contain synthesized data, which often contains hallucinations and factual errors.

databricks-dolly-15k, on the other hand, is generated by professionals, is high quality, and contains long answers to most tasks.

…we don’t expect Dolly to be state-of-the-art in terms of effectiveness.

However, we do expect Dolly and the open source dataset will act as the seed for a multitude of follow-on works, which may serve to bootstrap even more powerful language models.”
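Instruction tuning of this kind typically works by rendering each prompt/response pair into a fixed text template before fine-tuning the base model on the resulting strings. A minimal sketch of that formatting step, assuming an Alpaca/Dolly-style template (the exact wording and section markers Dolly uses may differ):

```python
# Assumed preamble in the style of Alpaca/Dolly instruction templates.
INTRO = ("Below is an instruction that describes a task. "
         "Write a response that appropriately completes the request.")

def format_example(instruction: str, response: str, context: str = "") -> str:
    """Render one prompt/response pair into a single training string."""
    parts = [INTRO, f"### Instruction:\n{instruction}"]
    if context:  # optional reference text, e.g. a Wikipedia excerpt
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)

text = format_example("Summarize what Dolly 2.0 is.",
                      "An open source instruction-tuned language model.")
print(text.startswith("Below is an instruction"))  # True
```

At inference time the same template is filled in with the user's instruction only, and the model continues the text after the response marker.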

Limitations to the Dataset

The GitHub page for the dataset acknowledges that there may be some shortcomings to the dataset.

Wikipedia data was used for some of the prompts and responses. Thus, it's possible that any bias contained in Wikipedia may end up reflected in the resulting dataset.

Some of the employees who worked to create the dataset were not native speakers of English, which could introduce some anomalies in the dataset.

The demographic makeup of the employees who created the dataset may itself influence the dataset to contain biases that are peculiar to those employees.

Despite those possible shortcomings, Databricks maintains that its dataset is of higher quality.

Additionally, Dolly 2.0 is meant to serve as a starting point for others to create and innovate even better versions.

Databricks Insists that Open Source AI Is Better

One of the motivations behind creating Dolly 2.0 is that users can own the models they create and can better safeguard their data by not having to share it with a third party.

They also believe that AI safety should not be concentrated in the hands of three large corporations but spread out among all the stakeholders.

Open source is picking up momentum and it will be interesting to see where this industry is at within the next two years.

More information on where to download the Dolly 2.0 model and how to use it can be found in their announcement.

Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

Featured image by Shutterstock/Kamil Macniak
