Embracing GenAI for Business Success- Part VII

Hema Seshadri, Ph.D.


In the dynamic realm of business technology, integrating GenAI into enterprise data systems heralds significant transformative opportunities for organizations. As we explored earlier in this blog series, these cutting-edge AI-driven tools and solutions are reshaping how businesses engage with their data, opening unprecedented avenues of efficiency and accessibility.

Whether your organization comprises citizen developers or hands-on-keyboard, tech-savvy individuals, this blog series will aid you in your journey by introducing you to the foundational constructs required to develop today’s class of GenAI tools and solutions. The series covers terms and concepts pertinent to GenAI, deep learning (DL), and natural language processing (NLP) in simple, easy-to-understand language. This blog post will trace the progression of the seminal research that led us to the GenAI revolution, while highlighting NLP models of yesteryear that are still used in generating GenAI applications.

This blog is not intended to be an in-depth technical study of NLP techniques. Familiarity with the terms and models introduced here will give you the vocabulary to communicate with your senior technical staff. We focus today’s discussion on the first language model, which debuted in 1950 and introduced a foundational approach to representing text data for machines that is still used today. 

First, let us introduce the foundational terms used in the context of AI/ML. The terms defined here will help you have productive conversations with AI practitioners and identify the engineering skill sets required for AI projects in your organization.

Algorithms: An algorithm is a set of predefined instructions and computations followed to accomplish and automate repeatable tasks, solve problems, and achieve goals. Algorithms are typically written by software engineers.

AI/ML algorithms: ML algorithms are learned from vast amounts of training data. Here, the software engineer’s job shifts from writing the algorithm to writing the training logic that creates it. Software engineers, data scientists, ML engineers, AI engineers, and analytics engineers with programming skills (e.g., Python) are typically tasked with writing AI/ML algorithms.

A model is created through the assimilation of real-world relationships (represented by data sets whose relationships and logic are defined by an algorithm) in a process called modeling or model development.5 Data scientists, or software, ML, AI, or advanced analytics engineers with programming skills (e.g., Python), are typically tasked with AI/ML model building.

A language model is a computer algorithm trained to receive written text (in English or other languages) and produce output as written text (in the same language or a different one).6 

“Large” in the context of the Large Language Model (LLM) refers to the size of these models in terms of their training data and the parameters used during the learning process. OpenAI’s GPT-4, for example, is estimated to have 1.7 trillion parameters, equivalent to an Excel spreadsheet stretching across thirty thousand soccer fields, and was reportedly trained on 450 terabytes of text data.

Parameters, in the context of DL and LLMs, are the numbers that control the outputs of the constructs of the DL layers (neurons) and the relative weights of their connections with neighboring neurons.

Now that we have a few definitions squared away, let us probe into the history of the first language model and the discoveries that led to the LLMs of today. 

Bag-of-Words (BOW): One of the earliest works in language modeling, the BOW model, is a simple feature-extraction technique for unstructured data, specifically natural language text data. Introduced in the 1950s, the BOW model became popular in the 2000s. The BOW model launched a journey of constant invention and innovation that led to breakthroughs in natural language processing (NLP) and ultimately culminated in the release of OpenAI’s ChatGPT in 2022, followed by Anthropic’s Claude and Google’s Bard in 2023. Let us examine what unstructured data is.

Unstructured data: Unstructured data consists of items (objects, files) generated from various data sources, such as media and entertainment data, surveillance data, geospatial data, weather data, sensor data, stock ticker data, emails, productivity applications, document collections, and invoices, and can range from a few bytes to terabytes. Unstructured data is not organized in a predefined, searchable format like structured data, which follows a conventional data model and fits neatly into a database. 

BOW models use text data for information retrieval and document processing by counting word occurrences. Bag-of-words, one of several steps in many text mining pipelines, is solely responsible for counting the frequency of word occurrences in a text document; it ignores word order and context.

Unstructured textual data by itself is often quite hard to process. This data does not provide values that allow us to directly process and visualize it and create actionable results. We first have to convert this textual data into a form that machines can readily process: numeric representations. This process, through which the input is converted to outputs in the form of usable vectors (embeddings), is often referred to as embedding.

Fig. 1: Converting textual input into embeddings

Tokenization involves separating input text into individual tokens, typically using whitespace and punctuation as boundaries. It is a crucial step in preparing data for natural language processing (NLP) tasks. Some models adopt subword tokenization, dividing words into smaller segments that retain meaningful linguistic elements. Tokenization is the first step of the bag-of-words model and the initial phase of interacting with LLMs.

Tokens: In NLP and LLMs, the fundamental linguistic unit is the token, a sequence of characters typically forming a word, punctuation mark, or number. Tokens can range from single characters to subwords to entire words. A helpful way to understand the size of text data is by the number of tokens it comprises. For instance, a text of 100 tokens roughly equates to about 75 words. This comparison can be essential for managing the processing limits of LLMs, as different models may have varying token capacities.

The most common tokenization method involves splitting up a sentence using whitespace to create individual words. After tokenization, all unique words from each sentence are combined to create a vocabulary that we can use to represent the sentences (Fig. 1).

Language modeling is a subfield of NLP that involves the creation of statistical/DL models to predict the likelihood of a sequence of tokens in a specified vocabulary (a limited and known set of tokens). 1
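The idea of predicting the likelihood of a token sequence can be sketched with a simple bigram count model. The toy corpus and probabilities below are purely illustrative; real statistical language models are trained on far larger corpora with smoothing techniques.

```python
from collections import Counter

# Toy corpus: the two example sentences from this post, pre-tokenized.
corpus = "biscuit is a cute dog . biscuit loves to go for walks .".split()

# Count bigrams (pairs of adjacent tokens) and the tokens that start them.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """Estimate P(word | prev) from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# "biscuit" is followed once by "is" and once by "loves", so each gets 0.5.
print(bigram_prob("biscuit", "is"))   # 0.5
print(bigram_prob("a", "cute"))       # 1.0
```

A model like this assigns higher probability to token sequences that resemble its training data, which is, at a much larger scale, what LLMs do as well.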

Let us take two sentences and walk through the individual steps in the BOW model. 

  • Biscuit is a cute dog.
  • Biscuit loves to go for walks.

Fig. 2: Tokenization example

First, we tokenize the sentences and create a vocabulary of unique words. Then, using this vocabulary, we count how often a word appears in each sentence, making a bag of words. With its straightforward approach, the BOW model aims to represent text in the form of numbers, also known as vectors or vector representations.
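The steps above can be sketched in a few lines of Python. This is a minimal illustration of tokenization, vocabulary building, and count vectorization, not a production pipeline; libraries such as scikit-learn provide the same functionality in a single class.

```python
import re

sentences = ["Biscuit is a cute dog.", "Biscuit loves to go for walks."]

def tokenize(text):
    # Lowercase the text and keep only alphabetic word tokens.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(s) for s in sentences]

# Vocabulary: every unique word, in a fixed (sorted) order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Each sentence becomes a vector of word counts over the vocabulary.
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, and the value is how many times that word appears in the sentence, which is exactly the "bag of words" for that sentence.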

Vectors: A vector is a mathematical representation of your data.2 Vectors are lists of numbers representing text (or images), where each number corresponds to a specific dimension or feature in the vector space. 

Vector embeddings: In GenAI, vector embedding is a numerical representation that encapsulates semantic content while discarding irrelevant details so that machines can process and understand it. 

The BOW model, also known as a representation model, performs remarkably well in many practical applications. It underpins more sophisticated applications such as sentiment analysis and text classification, and it paved the way toward large language models like OpenAI’s ChatGPT.3 However, the vectors in the BOW do not capture the order of the words or their semantic context, which is critical for language understanding in GenAI applications. Let us explore this with an example.

“I like to eat a Biscuit with my cuppa.” 

The BOW model cannot distinguish between Biscuit, the dog, and biscuit, the baked treat I like to have with my afternoon cup of tea, because the BOW does not consider word order or context.
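The same limitation shows up with word order: two sentences built from the same words in opposite order receive identical BOW vectors. A minimal sketch (the vocabulary and sentences here are illustrative):

```python
def bow(sentence, vocab):
    # Tokenize by lowercasing, stripping the final period, and splitting on spaces.
    tokens = sentence.lower().rstrip(".").split()
    return [tokens.count(word) for word in vocab]

vocab = ["biscuit", "chased", "the", "cat"]
a = bow("The cat chased Biscuit.", vocab)
b = bow("Biscuit chased the cat.", vocab)

# Opposite meanings, identical vectors: word order is lost.
print(a == b)  # True
```

To a BOW model, who chased whom is invisible; only the word counts survive.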

In 1972, more than two decades after the BOW model was introduced, a new strategy emerged: Term Frequency-Inverse Document Frequency (TF-IDF). This model, an evolution of the BOW, improved its strategy by adjusting word counts based on rarity or frequency. TF-IDF is a measure that quantifies the importance or relevance of word representations in a document within a collection of documents, also known as a corpus.4

TF-IDF can be divided into TF (term frequency) and IDF (inverse document frequency). Term frequency works by looking at the frequency of a particular term relative to the document. Inverse document frequency looks at how common (or uncommon) a word is amongst the corpus.5 The reason we need IDF is to correct for words that frequently appear in an English corpus, such as “a,” “as,” “of,” and “the.” Thus, by taking inverse document frequency, we can minimize the weighting of frequent terms while giving more weight to infrequent terms, improving the model’s ability to detect document relevancy.
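The TF and IDF pieces described above can be sketched directly. This is a simplified version (real implementations often add smoothing to the IDF term), using our two example sentences as a two-document corpus:

```python
import math

docs = [
    ["biscuit", "is", "a", "cute", "dog"],
    ["biscuit", "loves", "to", "go", "for", "walks"],
]

def tf(word, doc):
    # Term frequency: occurrences of the word relative to document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: words in fewer documents score higher.
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "biscuit" appears in every document, so its IDF, and hence TF-IDF, is 0.
print(tf_idf("biscuit", docs[0], docs))
# "cute" appears in only one document, so it carries more weight.
print(tf_idf("cute", docs[0], docs))
```

Note how the corpus-wide word "biscuit" is zeroed out while the rarer "cute" is up-weighted, which is exactly the relevancy correction IDF provides.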

TF-IDF is still relevant and in use today. Similar to BOW, the TF-IDF model does not consider word context. 

Deep Learning (DL), a subdomain of machine learning (ML), was introduced for language tasks in 2010. DL-based models are adept at learning patterns in sequences, allowing them to effectively handle documents of varied lengths. They maintain an internal state that retains information from previous words, facilitating sequential understanding. They are capable of computing document embeddings and adding word context understanding.

Word2vec, which surfaced in 2013, used word embeddings to capture subtle semantic links between words that previous models could not. Word2vec marked a significant breakthrough in natural language processing (NLP), one of the foundations on which GenAI was built. 


Word embeddings are high-dimensional vectors encapsulating semantic associations, as seen in the word2vec model. Word2vec was one of the first successful attempts at capturing the meaning of text in embeddings. During the model training process, word2vec, built on neural networks, a DL technology, learns the relationships between words and distills that information into word embeddings. If two words in a document tend to have the same neighbors, their embeddings will be closer to one another, and vice versa.6 Word embeddings represent a substantial advancement in capturing textual semantics, where words with similar meanings have similar numeric vectorial representations.7
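"Closer to one another" is typically measured with cosine similarity between embedding vectors. The tiny three-dimensional embeddings below are made up for illustration; real word2vec vectors usually have 100-300 dimensions learned from a large corpus.

```python
import math

# Hypothetical embeddings: "dog" and "puppy" point in similar directions
# because they would share neighbors in training text.
embeddings = {
    "dog":   [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.75, 0.20],
    "walks": [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))  # close to 1
print(cosine_similarity(embeddings["dog"], embeddings["walks"]))  # much lower
```

Unlike BOW counts, these dense vectors let a model recognize that "dog" and "puppy" are related even if they never co-occur in a given document.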

Context understanding is considered the holy grail of natural language understanding, which the BOW, TF-IDF, and word2vec models could not accomplish. Despite falling short, these discoveries in NLP and DL were the impetus for the development of the transformer architecture in 2017, which we discussed in the fifth post in this series. The transformer architecture is the foundational building block for GenAI.

Embracing GenAI in business means being open to radical change, questioning existing business processes without fear of disrupting the status quo, and being dauntless in throwing out the rulebook and starting anew to achieve better business outcomes. Trailblazers, innovators, and those who are curious and on the lookout for technological developments that lie around the corner will reap the greatest benefit from GenAI. AI will not replace the role of humans in critical functions, but those incapable of embracing AI technologies will find themselves at a disadvantage, unable to partner and collaborate with AI practitioners within their organizations and beyond.

References:

  1. https://learning.oreilly.com/library/view/quick-start-guide/9780135346570/ch01.xhtml#ch01lev1sec1
  2. Top 10 Best Vector Databases for AI. https://www.purelogics.net/what-is-a-vector-database-top-10-best-vector-databases-for-ai/
  3. Anthony, G. (2024). Developing a Framework to Identify Professional Skills Required for Banking Sector Employee in UK using Natural Language Processing (NLP) Techniques. https://core.ac.uk/download/603398323.pdf
  4. Yin, C., Zhang, L., Tu, M., Wen, X., & Li, Y. (2019). TF-IDF Based Contextual Post-Filtering Recommendation Algorithm in Complex Interactive Situations of Online to Offline: An Empirical Study. https://doi.org/10.17559/TV-20190515161539
  5. Understanding TF-IDF for Machine Learning | Capital One. https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/
  6. https://learning.oreilly.com/library/view/hands-on-large-language/9781098150952/ch02.html#tokens_and_embeddings
  7. https://learning.oreilly.com/library/view/building-llms-for/9798324731472/index_split_009.html
  8. https://hbr.org/2017/01/deep-learning-will-radically-change-the-ways-we-interact-with-technology
