Intro to Language Models
Lately, it seems that generative AI models such as ChatGPT and Midjourney are everywhere.
Understanding what’s happening “under the hood” of generative AI models will be key to navigating the digital world.
This course will focus first on language models, then dive into image models.
What is a Corpus?
A corpus (plural: corpora) is a large, structured collection of texts used for studying language. Corpora are commonly used in fields like linguistics, natural language processing (NLP), and AI.
“... collection of data that was used to train the AI. This corpus is the material the AI reviews to become intelligent in whatever it was designed for.
Every AI’s corpus will be different, because it is humans who decide what kind of data they want to train an AI on. And the corpus the humans decide to train the AI on will depend on what they want the AI to be proficient in.” (Source: Fast Company)
Language Models
Chatbots, predictive text, and virtual assistants all use language models. Each of these models is built differently, but they all turn language into numbers and then back into language.
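To make "language into numbers and back" concrete, here is a minimal sketch in Python. It is only an illustration: the example sentence, the split-on-spaces approach, and the helper names encode and decode are assumptions made up for this lesson, and real language models use far more sophisticated schemes.

```python
# Toy illustration: turn words into integer IDs and back again.
# The sentence and the split-on-spaces tokenizer are assumptions for this sketch.
text = "the cat sat on the mat"
words = text.split()

# Give each distinct word an ID, in order of first appearance.
vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)

# Reverse lookup: ID -> word.
id_to_word = {i: w for w, i in vocab.items()}

def encode(sentence):
    """Language -> numbers (hypothetical helper)."""
    return [vocab[w] for w in sentence.split()]

def decode(ids):
    """Numbers -> language (hypothetical helper)."""
    return " ".join(id_to_word[i] for i in ids)

ids = encode(text)
print(ids)          # [0, 1, 2, 3, 0, 4]
print(decode(ids))  # the cat sat on the mat
```

Notice that "the" gets the same number both times it appears. Once the words are numbers, a model can do math with them.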
Modern email programs try to predict the next word in a sentence. How would you guess they do this?
They use probability. Language models store probabilities about which words might come next, given the preceding words. They estimate these probabilities by counting sequences of words in the corpus.
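As a rough sketch of that idea, the Python below counts which word follows which in a tiny corpus and turns the counts into probabilities. The three-sentence corpus and the function name next_word_probabilities are invented for illustration; a real predictive-text system trains on vastly more text and is built differently.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus (an assumption for this sketch).
corpus = [
    "i like green tea",
    "i like black coffee",
    "i drink green tea",
]

# For each word, count the words that immediately follow it.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

def next_word_probabilities(word):
    """Share of times each word followed `word` in the corpus (hypothetical helper)."""
    counts = next_word_counts.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# "i" is followed by "like" twice and "drink" once,
# so "like" gets about 0.67 and "drink" about 0.33.
print(next_word_probabilities("i"))
# "green" is always followed by "tea" in this corpus.
print(next_word_probabilities("green"))  # {'tea': 1.0}
```

A word that never appears in the corpus returns an empty result here, which hints at a real limitation: a model can only predict from what it has seen.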
Next, we’ll look at how to predict one word from another using a simple language model called an “N-gram model.”