Evaluating Large Language Model Generated Content with TruEra’s TruLens

Posted March 17, 2024 by Gowri Shankar ‐ 12 min read

It's been an eternity since I last endured Dr. Andrew Ng's sermon on evaluation strategies and metrics for scrutinizing the AI-generated content. Particularly, the cacophony about Large Language Models (LLMs), with special mentions of the illustrious OpenAI and Llama models scattered across the globe. How enlightening! It's quite a revelation, considering my acquaintances have relentlessly preached that Human Evaluation is the holy grail for GAI content. Of course, I've always been a skeptic, pondering the statistical insignificance lurking beneath the facade of human judgment. Naturally, I'm plagued with concerns about the looming specter of bias, the elusive trustworthiness of models, the Herculean task of constructing scalable GAI solutions, and the perpetual uncertainty regarding whether we're actually delivering anything of consequence. It's quite amusing how the luminaries and puppeteers orchestrating the GAI spectacle remain blissfully ignorant of the metrics that could potentially illuminate the quality of their creations. But let's not be too harsh; after all, we're merely at the nascent stages of transforming GAI content into a lucrative venture. The metrics and evaluation strategies are often relegated to the murky depths of technical debt, receiving the customary neglect from the business overlords.

As I embark on my inaugural discourse concerning Large Language Models (LLMs), I find myself at a pivotal juncture. My journey involves not only observing the construction of GAI applications but also endeavoring to establish a robust evaluation strategy for validating the generated content. Within this discourse, I aim to accomplish two primary objectives:

Provide a comprehensive list of key metrics that warrant collection, ongoing monitoring, and subsequent feedback to the model, ensuring statistical significance.
Introduce a transformative resource, trulens_eval, meticulously crafted by the profoundly ambitious technology startup known as TruEra, coupled with a gentle advisory note.
If those don’t sound ambitious enough, lemme level up and aspire to write a Hello World program for TruEra.


Introduction

The concept of evaluation metrics for assessing Generative AI content has been evolving over several decades alongside advancements in artificial intelligence and machine learning. While some metrics may have been established relatively recently due to the emergence of new technologies and methodologies, the fundamental principles of evaluating AI-generated content have been studied and refined over time.

Metrics such as BLEU score, ROUGE score, and perplexity might sound like metrics of the past, but they are still commonly used today to evaluate Large Language Models (LLMs) and other natural language processing tasks. These metrics provide valuable insights into the quality, coherence, and fluency of generated text, and they continue to be important tools for assessing the performance of LLMs in various applications.
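To make one of these old-timers concrete, here is a minimal sketch of perplexity, assuming you already have the per-token probabilities the model assigned to its own output. The probability lists below are made-up numbers for illustration, not measurements from any real model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities for two generated sentences.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.3, 0.25]

print(round(perplexity(confident), 3))  # lower: the model was sure of itself
print(round(perplexity(uncertain), 3))  # higher: the model was guessing
```

A model that assigns probability 0.5 to every token has a perplexity of exactly 2, which is a handy sanity check: it is as confused as a coin toss at every step.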

Metrics That Matter More

Quality comes next; we all want to create that wow factor, now. That is, the instant kick matters more than getting kicked in the butt eventually. I am being critical and sad. We all know why Test Driven Development (TDD) never got the buy-in from the business, and we are also aware why we are fine spending hours and hours on YouTube Shorts and Insta Reels. If you don’t create reels and instead write long blogs, you are an anti-social element.

Kick over Avoid Getting Kicked - Is it our True Era?

I believe NO! Ah, amidst the chaos and cacophony of the digital world, Andrew Ng graciously unveiled to me the existence of a wondrous startup known as TruEra. Oh, the magnanimity of Andrew, bestowing upon every Tom, Dick, Harry, and their mother-in-law the title of data scientist! How noble, how democratic. Surely, we owe our deepest gratitude to Andrew - for suddenly transforming us into data gobbling monsters over a single weekend!

After all, who needs years of experience in training and evaluating deeply learned dystopian deeds when one can achieve enlightenment and convergence by Sunday dawn? Such is the magic of Andrew Ng! Such is the promise of TruEra -

TruEra’s trulens_eval and its supporting packages promise to deliver the following metrics in a seamless way.

Context Relevance
Groundedness
Answer Relevance
Comprehensiveness
Harmful or toxic language
User sentiment
Language mismatch
Fairness and bias

& BTW, you can customize the TruEra platform to use your own metrics.

Is that not the platter you are longing for longer than your mortal existence? Let us dive deeper.


Hello World!

Ideally I should have titled this section Hello TruEra! but unfortunately I went for Hello World! because there was no Hello World! program in the TruEra documentation. It is the instinct of any developer to write a Hello World program when they take the first few steps with the new thing they stumbled upon. Unfortunately, TruEra makes it truly erratic. I agree, great things aren’t gonna come greatly easy.

My immediate verdict on TruEra: onboarding is difficult, OR almost impossible.

You install the package and end up with a scary warning that makes you wonder whether you have broken your virtual environment.
Pick any Colab notebook from their repos and docs - It will break at some point, so much clutter. Even those who authored the notebooks cannot run them now. Refer here
The GitHub README was promising - I was elated that there is a link to a quick start, but I landed on a 404! 💥
& you somehow land at the real Quickstart and get elated - Wait, what? import chromadb? Are you expected to add a 14th database to your CV, when you have installed 10 of the other 13 only once in your lifetime? Jitters.

You are about to give up now. You recall the promises you have made to yourself - You truly want to crack this problem. You want to see a number that represents Groundedness and Context Relevance.

& you accept that TruEra, too, is caught up in the Kick over Avoid Getting Kicked syndrome, and you want to be empathetic to yourself and to TruEra. Basically, you can’t give up now because you have already invested enormously.

PS: Is this the effect of engineers building a product? Do you guys have a CPO? Or a Product Manager?

TruLens Feedback APIs

Our goal is to quick start TruLens Eval from TruEra. Let us set some expectations to be met for this quick start:

Quickly install the dependent packages
Run a basic evaluation logic and observe the results
Deep dive into a set of metrics and their utility
Understand the rationale for the given scores

Let us dive deep. Our first metric is Ground Truth.

Ground Truth

Ground truth refers to the absolute or objective reality against which the generated content is evaluated.

It’s like the Holy Grail of AI evaluation, the sacred beacon guiding us through the murky waters of machine-generated content. Picture this: the Ground Truth is the ultimate arbiter of reality, the unwavering yardstick against which our AI overlords’ creations are mercilessly judged. It’s the standard, the pinnacle of perfection that our humble generative models must strive to attain. Take text summarization, for example – the Ground Truth sits there, smugly crafted by human hands, daring our algorithms to match its brilliance. So, in this grand theater of artificial intelligence, the Ground Truth reigns supreme, casting its shadow over every byte of generated content, ensuring that our machines toe the line between genius and gibberish.

from dotenv import find_dotenv, load_dotenv
from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement

# Load OPENAI_API_KEY from a local .env file before any OpenAI client is created.
load_dotenv(find_dotenv())

Package protobuf is installed but has a version conflict:
	(protobuf 3.20.3 (/Users/shankar/anaconda3/envs/trulens/lib/python3.10/site-packages), Requirement.parse('protobuf>=4.23.2'))

This package is optional for trulens_eval so this may not be a problem but if
you need to use the related optional features and find there are errors, you
will need to resolve the conflict:

    ```bash
    pip install 'protobuf>=4.23.2'
    ```

If you are running trulens_eval in a notebook, you may need to restart the
kernel after resolving the conflict. If your distribution is in a bad place
beyond this package, you may need to reinstall trulens_eval so that all of the
dependencies get installed and hopefully corrected:

    ```bash
    pip uninstall -y trulens_eval
    pip install trulens_eval
    ```

Using legacy llama_index version None. Consider upgrading to 0.10.0 or later.

Setting up TruLens Eval isn’t straightforward. You may encounter several issues, like the llama_index version mismatch problem. After some trial and error, I managed to get my environment functioning properly. Find the requirements.txt to recreate my working environment.
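Before blaming yourself for the warning above, it may help to check what is actually installed. The following is a rough sketch, assuming the only constraint you care about is the protobuf>=4.23.2 requirement quoted in the warning. The helper names are mine, and the version parsing is deliberately naive (the third-party packaging library handles pre-releases and odd suffixes far better).

```python
from importlib import metadata

def parse_version(text):
    """Turn '4.23.2' into (4, 23, 2); anything after the digits is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Naive tuple comparison of two dotted version strings."""
    return parse_version(installed) >= parse_version(minimum)

def check(package, minimum):
    """Report whether an installed package satisfies a minimum version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"needs >= {minimum}"
    return f"{package} {installed}: {status}"

# The minimum comes from the warning text printed by trulens_eval.
print(check("protobuf", "4.23.2"))
```

With the environment from the warning (protobuf 3.20.3), this would report that the package needs an upgrade, which is at least less spooky than a wall of text.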


from openai import OpenAI
client = OpenAI()

Agreement Score

A function that measures similarity to ground truth. A second template is given to Chat GPT with a prompt that the original response is correct, and measures whether previous Chat GPT's response is similar.

– TruLens Docs

data = [
    {
        "query": "Who is the captain of CSK team",
        "response": "Mahendra Singh Dhoni"
    },
]
gta = GroundTruthAgreement(data)
f_groundtruth = Feedback(
    gta.agreement_measure,
    name="Ground Truth"
).on_input_output()

✅ In Ground Truth, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Ground Truth, input response will be set to __record__.main_output or `Select.RecordOutput` .

Did you notice, even though I created the OpenAI client - I didn’t utilize it in the Feedback method? If I don’t create the client, the Feedback method breaks with an error. It seems like a bug, but the real concern is - TruLens is using the OpenAI API key without the user’s consent.

Anyway, here are the results. Take note of the scores: the agreement score for the response ‘Hardik Pandya’ is 0, while the pet names for MS Dhoni, [‘Thala’, ‘Thalaivan’], are accepted as valid output.

Inputs

prompt (str): A text prompt to an agent.

response (str): The agent’s response to the prompt.

f_groundtruth('Who is the captain of CSK team', 'MS Dhoni')

(1.0, {'ground_truth_response': 'Mahendra Singh Dhoni'})

f_groundtruth('Who is the captain of CSK team', 'Thala')

(1.0, {'ground_truth_response': 'Mahendra Singh Dhoni'})

f_groundtruth('Who is the captain of CSK team', 'Hardik Pandya')

(0.0, {'ground_truth_response': 'Mahendra Singh Dhoni'})

f_groundtruth('Who is the captain of CSK team', 'Thalaivan')

(1.0, {'ground_truth_response': 'Mahendra Singh Dhoni'})
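To appreciate why an LLM-judged agreement score is attractive, compare it with the dumbest possible baseline: exact string matching against the golden answer. This is my own throwaway sketch, not anything TruLens does internally, and the function names are hypothetical.

```python
def normalize(text):
    """Lowercase and drop punctuation and extra spaces for a fair comparison."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def exact_match_score(golden_set, query, response):
    """Return 1.0 if the response equals the stored answer, else 0.0."""
    for row in golden_set:
        if normalize(row["query"]) == normalize(query):
            matched = normalize(row["response"]) == normalize(response)
            return 1.0 if matched else 0.0
    raise KeyError(f"no ground truth for query: {query!r}")

golden_set = [
    {"query": "Who is the captain of CSK team", "response": "Mahendra Singh Dhoni"},
]

for answer in ["Mahendra Singh Dhoni", "MS Dhoni", "Thala", "Hardik Pandya"]:
    score = exact_match_score(golden_set, "Who is the captain of CSK team", answer)
    print(answer, score)
```

Only the verbatim answer scores 1.0 here; 'MS Dhoni' and 'Thala' both get 0.0, even though any cricket fan would accept them. That gap is exactly what the LLM-based agreement measure closes.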

Does that sound like a Hello World! program for TruLens? Only the TruEra team can answer!

There is more. The Ground Truth Agreement API also provides the following metrics:

Mean Absolute Error gta.mae
BERT Score gta.bert_score
BLEU Score gta.bleu
ROUGE Score gta.rouge

Catch: You have to install the dependent packages yourself. Don’t we expect pip install trulens_eval to install all the dependent packages?
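If you would rather not install anything extra just to get a feel for the overlap-style metrics, here is a simplified ROUGE-1 flavoured sketch in plain Python. It only counts shared unigrams, so treat it as an illustration of the idea and not as a stand-in for whatever gta.rouge computes.

```python
from collections import Counter

def unigram_overlap(reference, candidate):
    """Precision, recall and F1 over the words shared by two strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "Mahendra Singh Dhoni"
for candidate in ["MS Dhoni", "Thala", "Mahendra Singh Dhoni"]:
    print(candidate, unigram_overlap(reference, candidate))
```

'MS Dhoni' shares one word out of two with the reference (precision 0.5, recall 1/3), while 'Thala' shares none and scores a flat zero. Word overlap knows nothing about nicknames, which is worth remembering before trusting these numbers on short answers.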

Groundedness

Groundedness relates to the coherence and logical consistency of the generated content with respect to the context/prompt provided. It assesses how well the LLM's output aligns with the input information, with no inconsistencies.

Groundedness in the realm of evaluating Large Language Models (LLMs), or as I like to call it, the art of keeping these virtual wizards from flying off to la-la land! Picture this: you’re trying to gauge the reliability of these models, and suddenly, they start spewing out Shakespearean gibberish or worse, conspiracy theories that would make even your eccentric uncle blush. That’s when groundedness swoops in, like a superhero with a reality check cape, reminding these LLMs to keep their feet firmly planted on the ground of logic and coherence. Because, you know, we definitely need AI that can distinguish between a profound insight and a recipe for unicorn lasagna!

PS: Pls note, groundedness and ground truth are entirely different topics.

Groundedness Measure With Chain of Thoughts & Reasons

According to the TruLens documentation, the groundedness measure tracks whether the source material supports each sentence in the statement, using an LLM provider. The LLM processes the entire statement at once, using chain of thought methodology to emit the reasons.

In the LLM world, CoT stands for Chain of Thought.

The Chain of Thought (CoT) technique in prompt engineering is a method that mimics human cognitive processes to solve complex problems. It involves breaking down a query into logical, interconnected steps, leading the AI to a reasoned conclusion. This approach is particularly effective in enhancing the reasoning ability of AI models, especially in tasks that require a multi-step analytical process.

– Maurice Bretzfield, 2024
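To give a flavour of what a chain-of-thought prompt for a judge LLM might look like, here is a small sketch that assembles one as a plain string. The wording of the template is entirely my invention for illustration; I have not copied it from TruLens, and the function name is hypothetical.

```python
def build_cot_groundedness_prompt(source, statement_sentences):
    """Assemble a hypothetical chain-of-thought prompt for a judge LLM."""
    lines = [
        "You are checking whether a SOURCE supports each STATEMENT sentence.",
        "For every sentence, think step by step:",
        "1. Quote the part of the SOURCE that supports it, or write NOTHING FOUND.",
        "2. Explain in one line why the quote does or does not support it.",
        "3. Give a score from 0 (unsupported) to 10 (fully supported).",
        "",
        f"SOURCE: {source}",
        "",
    ]
    # Number the sentences so the judge can answer them one by one.
    for index, sentence in enumerate(statement_sentences):
        lines.append(f"STATEMENT {index}: {sentence}")
    return "\n".join(lines)

prompt = build_cot_groundedness_prompt(
    "Joe Biden is the president of the US",
    ["There was a commotion in the Capitol Hill.", "Then Joe Biden was declared victory"],
)
print(prompt)
```

The point of the numbered steps is that the model has to show its evidence before it commits to a score, which is what makes the resulting number explainable.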

from trulens_eval.feedback.provider.openai import OpenAI as TruOpenAI
from trulens_eval.feedback import Groundedness
from trulens_eval import Select

provider = TruOpenAI()
grounded = Groundedness(groundedness_provider=provider)
f_groundtruth_cots_reason = Feedback(
    grounded.groundedness_measure_with_cot_reasons, name = "Groundedness"
).on(
    Select.RecordCalls.retrieve.args.query
).on_output()

✅ In Groundedness, input source will be set to __record__.app.retrieve.args.query .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .

reasons = f_groundtruth_cots_reason(
    'Joe Biden is the president of the US',
    ['Once the Donald Trump got defeated in the election.',
    'There was a commotion in the Capitol Hill.',
    'Then Joe Biden was declared victory']
)

from pprint import pprint
pprint(reasons)

({'statement_0': 0.0, 'statement_1': 0.0, 'statement_2': 1.0},
 {'reasons': 'TEMPLATE: \n'
             'Statement Sentence: Once the Donald Trump got defeated in the '
             'election., \n'
             'Supporting Evidence: NOTHING FOUND\n'
             'Score: 0\n'
             '\n'
             'TEMPLATE: \n'
             'Statement Sentence: There was a commotion in the Capitol '
             'Hill., \n'
             'Supporting Evidence: NOTHING FOUND\n'
             'Score: 0\n'
             '\n'
             'TEMPLATE: \n'
             'Statement Sentence: Then Joe Biden was declared victory, \n'
             'Supporting Evidence: Joe Biden is the president of the US\n'
             'Score: 10'})
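Notice that the reasons text carries a 0 to 10 score per sentence while the dictionary reports values between 0 and 1. Here is a small sketch that recovers the per-sentence numbers from the reasons string and averages them into one figure. The regex and the plain mean are my assumptions about a sensible post-processing step, not a description of what TruLens does internally.

```python
import re

def scores_from_reasons(reasons_text):
    """Pull each 'Score: N' (0 to 10) out of the text and rescale it to 0 to 1."""
    raw_scores = re.findall(r"Score:\s*(\d+)", reasons_text)
    return [int(raw) / 10 for raw in raw_scores]

# The reasons string exactly as printed in the output above.
reasons_text = (
    "TEMPLATE: \n"
    "Statement Sentence: Once the Donald Trump got defeated in the "
    "election., \n"
    "Supporting Evidence: NOTHING FOUND\n"
    "Score: 0\n"
    "\n"
    "TEMPLATE: \n"
    "Statement Sentence: There was a commotion in the Capitol "
    "Hill., \n"
    "Supporting Evidence: NOTHING FOUND\n"
    "Score: 0\n"
    "\n"
    "TEMPLATE: \n"
    "Statement Sentence: Then Joe Biden was declared victory, \n"
    "Supporting Evidence: Joe Biden is the president of the US\n"
    "Score: 10"
)

scores = scores_from_reasons(reasons_text)
overall = sum(scores) / len(scores)
print(scores)             # [0.0, 0.0, 1.0]
print(round(overall, 3))  # 0.333
```

One supported sentence out of three gives an overall groundedness of about 0.33, which matches the intuition that most of the statement is not backed by the one-line source.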

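I found this the most exciting piece of TruLens. What must be going on behind the scenes? My guess is the logprobs capabilities of the OpenAI chat completion APIs, although the output above also shows the LLM writing a 0 to 10 score per sentence ('Score: 10' is reported as 1.0), so the number may simply be read from the generated text and rescaled. Check this cookbook by Hills and Anadkat on Using Logprobs.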


Answer Relevance with Reasons

Answer Relevance in the context of Large Language Models (LLMs) and Generative AI (GAI) content evaluation refers to the degree to which a generated response aligns with the query or task given to the model. It’s a measure of how well the generated content addresses the specific needs or requirements of the user’s input.

f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name = "Answer Relevance")
    .on(Select.RecordCalls.retrieve.args.query)
    .on_output()
)

✅ In Answer Relevance, input prompt will be set to __record__.app.retrieve.args.query .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .

f_answer_relevance('Who is the captain of CSK team', 'Thalaivan')

(1.0,
 {'reason': 'Criteria: The response directly answers the prompt.\nSupporting Evidence: The response "Thalaivan" is a direct reference to MS Dhoni, who is the captain of the CSK team. Therefore, the response is highly relevant and completely answers the prompt.'})

f_answer_relevance(
    '''
    Who is the captain of CSK team. Just give the the captains name, there is no
    need to elaborate or storytelling. 2 words answer
    ''',
    '''
    MS Dhoni, also known as Thala, is the captain of the Chennai Super Kings (CSK)
    cricket team. He hails from Ranchi, Jharkhand, and is renowned for his exceptional
    leadership skills, cool demeanor, and iconic finishing abilities. With a plethora
    of records under his belt, including multiple IPL titles with CSK, Dhoni remains a
    beloved figure in the cricketing world.
    '''
)

(0.2,
 {'reason': "Criteria: The response is minimally relevant to the prompt.\nSupporting Evidence: The response provides the name of the captain of the CSK team, which is MS Dhoni. However, it includes additional information about Dhoni's background, skills, and achievements, which is not necessary as per the prompt. The prompt only asked for the name of the captain, without any elaboration. Therefore, the response is minimally relevant and scores a 2."})

The score reads 0.2, and my brain screams, “Two words!” Yet, my response? An epic saga. TruLens disapproves!

“My prompt was crystal clear!” Yet, instead of the succinct reply you anticipated, you’ve inadvertently fed the system a Shakespearean soliloquy. TruLens, being the strict judge it is, doesn’t take kindly to this poetic deviation and penalizes your response. So, there you are, caught in the midst of an epic drama of your own making, all because you couldn’t stick to the script – or, in this case, the word count!
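You do not always need an LLM judge to catch this particular sin. When the prompt states a word budget, a deterministic check is cheap and never changes its mind. Below is a minimal sketch of such a score; the function name and the scoring rule (budget divided by actual length, capped at 1) are my own choices, and I have not verified how best to register it as a custom metric in TruLens.

```python
def word_budget_score(response, max_words):
    """1.0 if the response fits the budget, shrinking toward 0 as it overruns."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    word_count = len(response.split())
    if word_count == 0:
        return 0.0  # an empty answer is not a concise answer
    return min(1.0, max_words / word_count)

short_answer = "MS Dhoni"
long_answer = (
    "MS Dhoni, also known as Thala, is the captain of the Chennai Super Kings "
    "(CSK) cricket team."
)

print(word_budget_score(short_answer, 2))
print(round(word_budget_score(long_answer, 2), 3))
```

The two-word answer scores a full 1.0, while the opening sentence of the epic saga already drops well below the 0.2 the LLM judge handed out for the whole thing.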

& What Else

What I have rendered above is just the tip of the iceberg. I covered 3 items under Feedback and there are more. To mention a few,

Relevance Measure, relevance of the statement to the question OR response to a prompt
Sentiment, -ve to +ve sentiment from 0-1
Model Agreement, Whether the model is consistent for the same prompt (well yeah, we live in a stochastic world)
Conciseness
Correctness
Coherence
Harmfulness
Maliciousness
Helpfulness
Controversiality
Misogyny
Criminality
Insensitivity
Comprehensiveness
Summarization, distills the main points and compares a summary against those main points
Stereotypes
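Of the items above, model agreement as I described it (is the model consistent for the same prompt?) is the easiest to approximate by hand: ask the same question several times and see how often the answers coincide. This is a rough sketch under that reading, with a hypothetical function name, and it makes no claim about how TruLens implements its own Model Agreement feedback.

```python
from collections import Counter

def self_consistency(responses):
    """Fraction of responses that match the most common (normalized) answer."""
    if not responses:
        raise ValueError("need at least one response")
    # Lowercase and collapse whitespace so trivial differences do not count.
    normalized = [" ".join(r.lower().split()) for r in responses]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)

# Hypothetical answers from four runs of the same prompt.
runs = ["MS Dhoni", "ms dhoni", "Hardik Pandya", "MS  Dhoni"]
print(self_consistency(runs))  # 0.75
```

Three of the four runs agree once case and spacing are ignored, so the consistency is 0.75. Matching on strings will still treat 'Thala' and 'MS Dhoni' as a disagreement, which is where an LLM judge earns its keep.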

Conclusion

In this post, our objective was to write a Hello ~~World~~ TruLens program & I believe we have successfully completed our goals. A big thumbs up to the TruEra team for building a thoughtful product.

I’d like to offer some constructive feedback. While TruEra tests its product thoroughly, it seems they may overlook testing documentation, website content, and notebooks. Addressing this clutter can prevent frustration among new users and enhance onboarding, ultimately boosting adoption and retention rates.