Financial services has been transforming how it operates in recent years, keeping pace with changing technology trends to serve customers better. The adoption of generative AI (GenAI), with all its benefits, is the natural next step. However, as GenAI continues to evolve, evaluating its outputs is becoming more critical.
The evolution of AI in content creation
Traditional AI models were evaluated with metrics like accuracy and precision. They operated in structured, data-driven environments. These models worked on straightforward tasks where outputs were measured against established benchmarks.
GenAI generates more complex, contextual outputs that can vary based on user input. In customer-facing applications, interactions can influence trust and relationships. Here, GenAI’s complexity introduces new evaluation challenges. For instance, it can produce different yet equally correct responses to similar inputs. This variance in tone or emphasis can create significant differences in the customer experience. As a result, more sophisticated assessment methods are required, likely a combination of traditional, new, and human-based metrics.
In banking, where regulations are strict and mistakes can be costly, it’s not just about whether an answer is technically correct; it’s about whether it aligns with the bank’s policies, the client’s needs, and regulatory requirements.
Challenges in evaluating GenAI-based content
While traditional AI relied on clear, quantitative metrics, GenAI introduces a subjectivity that makes it harder to apply similar measures. A summary or response generated by GenAI might be accurate, yet its tone could clash with customer expectations or regulatory guidelines.
Moreover, many banks are still in the proof-of-concept (POC) phase with GenAI, exploring its potential without robust evaluation frameworks. This lack of standardized frameworks introduces risk, especially when banks seek to scale AI applications from small internal projects to large-scale customer-facing solutions. Without a way to evaluate the outputs of these AI systems, banks face the potential for costly mistakes.
The regulated nature of the financial industry adds another layer of complexity. Unlike other sectors, where errors may just lead to poor customer experience, in financial services, an AI-generated mistake could cause fines and losses. This makes the development of effective evaluation frameworks essential for GenAI’s safe and effective use.
To address these challenges, institutions can implement tailored metrics and evaluation frameworks that go beyond traditional methods.
Potential evaluation metrics for GenAI
In the context of financial services, evaluation of generated responses needs to be rooted in the domain and context. Relying solely on metric-based methods can strip domain knowledge and expertise out of the assessment, which eventually impacts customer experience.
There needs to be a structured approach that combines use-case-specific metrics, standard frameworks that evaluate the algorithms, and human judgment that measures how well grounded the response is in the domain.
Some key considerations needed for a comprehensive evaluation framework are as follows:
1. Relevance & Coherence
• On-Topic Accuracy: Does the generated content address the subject or topic it was asked to cover?
• Contextual Appropriateness: Is the content appropriate for the context in which it will be used (e.g., technical report, marketing material, casual chat)?
• Logical Flow: Does the content have a clear and logical progression of ideas?
• Consistency: Are facts, figures, and narratives consistent throughout the text?
2. Accuracy
• Factual Accuracy: Does the text contain accurate data, figures, or factual information?
• Correct Terminology: For specialized content (e.g., scientific, legal), does the text use accurate and domain-specific terminology?
3. Originality and Creativity
• Uniqueness: Is the content original, or does it appear to plagiarize or mimic existing sources?
• Creativity: For creative writing, is the content engaging, imaginative, or innovative?
4. Tone and Style
• Appropriate Tone: Does the text match the intended tone suitable for the use case and commensurate domain?
5. Bias and Ethical Considerations
• Bias-Free: Does the text avoid stereotypes, prejudices, biases and anything restricted per regulations in financial services?
• Inclusivity: Is the language inclusive, avoiding gender, cultural, or social biases?
6. Engagement and Usability
• Engagement: Is the content engaging, interesting, and informative, providing the necessary information to the customer with minimal iterations?
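One lightweight way to operationalize a checklist like this is a weighted rubric, where a human or automated reviewer scores each dimension and a single quality score is rolled up. The dimension names echo the considerations above, but the weights and the 0–5 scale below are purely illustrative assumptions, not a prescribed standard:

```python
# Illustrative weighted rubric for scoring a generated response.
# Dimension names mirror the checklist above; the weights and the
# 0-5 scores are hypothetical, assigned by a human or LLM reviewer.
RUBRIC_WEIGHTS = {
    "relevance": 0.25,
    "accuracy": 0.30,
    "tone": 0.15,
    "bias_safety": 0.20,
    "engagement": 0.10,
}

def rubric_score(scores, weights=RUBRIC_WEIGHTS):
    """Weighted average of per-dimension scores on a 0-5 scale."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(weights[d] * scores[d] for d in weights)

review = {"relevance": 5, "accuracy": 4, "tone": 5,
          "bias_safety": 5, "engagement": 3}
print(rubric_score(review))  # weighted score out of 5
```

In practice, an institution would calibrate the weights per use case, for instance weighting bias and accuracy more heavily for customer-facing content than for internal drafts.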
A few examples of metrics used in text generation are:
• Perplexity: Measures how well the model predicts a sample. Lower perplexity indicates better performance.
• ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation): Measures recall by comparing overlapping units such as n-grams, word sequences, or longest common subsequences (LCS) between the machine-generated and human reference summaries.
• BLEU (Bilingual Evaluation Understudy): Measures precision by comparing n-grams (sequences of n words) between the machine-generated and human reference translations.
• METEOR: Measures the quality of machine-generated translations by considering not only exact word matches but also partial matches, synonyms, and reordering of words.
In addition, traditional classification metrics like Precision, Recall, F1 score and Accuracy are used to evaluate model effectiveness.
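At their core, ROUGE recall and BLEU precision both count overlapping n-grams between a candidate and a reference; they differ in the denominator. A minimal, dependency-free sketch of the unigram case (production use would rely on an established library with smoothing and multi-reference support):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams / n-grams in the reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

def bleu_n_precision(candidate, reference, n=1):
    """BLEU-style modified n-gram precision: overlap / n-grams in the candidate."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0

ref = "the fund invests primarily in investment grade bonds"
hyp = "the fund invests mainly in investment grade bonds"
print(round(rouge_n_recall(hyp, ref), 3))   # 0.875 (7 of 8 reference unigrams)
print(round(bleu_n_precision(hyp, ref), 3)) # 0.875 (7 of 8 candidate unigrams)
```

The example shows why these scores alone are insufficient in financial services: "primarily" versus "mainly" barely moves the score, but a single swapped word such as "guaranteed" for "targeted" would be penalized just as lightly while changing the compliance picture entirely.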
In the case of retrieval systems, additional metrics can be used, for example:
• QaEval: A question-answering based metric for estimating the content quality of a summary.
• Flesch-Kincaid readability: Used to evaluate the readability or complexity of written text based on factors like sentence length and syllable count per word.
• Faithfulness: Measures the factual consistency of the generated answer against the given context.
• Answer Relevancy: Focuses on assessing how pertinent the generated answer is to the given prompt.
• Context Precision: Measures the signal-to-noise ratio of the retrieved context.
• Context Recall: Measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth.
• Answer correctness: Gauges the accuracy of the generated answer when compared to the ground truth.
• Answer semantic similarity: Pertains to the assessment of the semantic resemblance between the generated answer and the ground truth.
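Of the metrics above, Flesch-Kincaid readability is simple enough to sketch directly. The grade-level variant combines average sentence length with average syllables per word; the syllable counter below is a rough vowel-group heuristic, so treat the output as approximate:

```python
import re

def count_syllables(word):
    """Rough syllable count: contiguous vowel groups, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)

print(round(flesch_kincaid_grade(
    "The fund invests in bonds. Returns may vary."), 1))
```

For customer-facing banking content, a readability check like this can flag responses that drift into jargon-heavy language, though the other metrics in the list (faithfulness, context precision and recall) require an LLM- or annotation-based evaluation pipeline rather than a closed-form formula.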
This framework needs to be augmented with human evaluation in a domain-knowledge-heavy industry like financial services. Human practitioners need to regularly assess samples of generated content to ensure that GenAI-based responses remain useful. This feedback needs to be fed into the models so that subsequent iterations are closer to how a human would have responded.
Evaluating bias and safety in AI outputs is crucial, particularly as bias in financial services can lead to ethical and legal challenges.
The role of human oversight
In financial services, where a single incorrect response can damage customer relationships or lead to financial losses, human evaluators are needed to ensure AI-generated content is accurate and appropriate. This is particularly important in client-based areas like wealth and asset management, where relationship managers rely on their intuition and personal knowledge of client needs to make nuanced recommendations. While GenAI can assist these professionals by providing data-driven insights, it cannot replace the human understanding that drives personalized client interactions.
For example, when AI suggests an investment strategy, human review must ensure that the recommendation aligns with the behavioral nuances of a client’s personal risk profile and financial goals.
One key point to remember is that it is not “human in the loop” but “AI in the loop”. We can never replace human input; we can only augment it with AI. For example, AI, and specifically GenAI, can bring efficiency to KYC operations, but the decisioning will remain with the human agent.
Continuous human oversight ensures that AI-generated responses remain relevant and accurate. This balance between automation and human expertise is crucial to mitigating risks and maintaining trust in GenAI systems.
Tips for Effective Evaluation
• Diverse Dataset: Use a diverse set of inputs for evaluation to cover various aspects of language understanding and generation.
• Custom Metrics: Depending on your specific use case, you might need to define custom metrics that better capture the performance of your model.
• Regular Updates: Continuously update your evaluation process to include new metrics and techniques as they become available in the field.
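A custom metric can be very simple. As a hypothetical example in a compliance context, a bank might check what fraction of required disclosure language appears in a generated response; the phrases below are illustrative, not drawn from any actual regulatory list:

```python
# Hypothetical custom metric: fraction of required disclosure phrases
# present in a generated response. The phrase list is illustrative
# only and would come from the bank's compliance team in practice.
REQUIRED_PHRASES = [
    "past performance is not indicative of future results",
    "investments may lose value",
]

def disclosure_coverage(response, required=REQUIRED_PHRASES):
    """Return the share of required phrases found in the response."""
    text = response.lower()
    found = sum(1 for phrase in required if phrase in text)
    return found / len(required)

draft = ("This fund targets long-term growth. "
         "Past performance is not indicative of future results.")
print(disclosure_coverage(draft))  # 0.5: one of two phrases present
```

A score below 1.0 could route the draft to a human reviewer rather than blocking it outright, keeping the metric a triage signal instead of a hard gate.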
Case study: GenAI in asset management
One example of GenAI’s successful application in financial services is a project in the asset management sector. An AI-powered tool was developed to assist sales and distribution teams by providing real-time information about various funds. This application allows sales teams to query data, receive summaries of research papers, and offer customers informed recommendations on the fly.
The evaluation of this tool involved a combination of automated metrics and human oversight. The automated system assessed fluency, coherence, and relevance. Human analysts then reviewed the outputs to ensure they met the company’s standards for interactions with the relevant teams (given that these teams would eventually interact with end-clients). This hybrid approach allowed the tool to scale effectively while maintaining high levels of accuracy and reliability. By ensuring the tool’s outputs were monitored and refined, the bank could leverage GenAI without compromising the quality of its client interactions.
Scaling and operationalizing GenAI
Moving from POC to full-scale deployment of GenAI presents several challenges, such as data management, underlying architecture and infrastructure set-up, and evaluation frameworks.2 Many banks still operate with fragmented, on-premises data systems, making it difficult to consolidate data for AI applications. This fragmentation hinders GenAI’s full potential and introduces risks related to data accuracy and integrity. Without a unified data environment, AI models may deliver inconsistent or incomplete outputs.
In addition, banks face regulatory and compliance challenges. The use of GenAI in customer interactions raises concerns about meeting stringent regulatory requirements across different jurisdictions. For example, in the U.S., the Consumer Financial Protection Bureau has raised concerns about the reliability of automated responses, particularly in highly regulated industries like banking. In Europe, regulations such as GDPR add another layer of complexity when dealing with customer data. As a result, banks are often hesitant to deploy GenAI beyond internal applications until they can ensure compliance without sacrificing AI’s effectiveness.
The industry is making strides with LLMs as well. Cloud providers like Azure and AWS have started implementing standard metrics for evaluating AI-generated content. These frameworks offer a starting point but often require customization to meet the specific needs of each financial services domain, such as retail banking, asset management, and wealth management. Banks must work to create tailored frameworks that define what quality means for each use case.
Standardizing evaluation in GenAI
As the banking industry continues to embrace GenAI, one of the key challenges will be developing standardized evaluation frameworks. While metrics like fluency, coherence, and groundedness can be applied across different use cases, there is no one-size-fits-all solution. Banks must develop customized frameworks that account for the unique demands of their domains.
For instance, in wealth management, coherence and groundedness may take precedence to ensure that client communications are both logically consistent and factually accurate. In retail banking, fluency and risk management might be more important to ensure that customers receive clear and compliant information that adheres to regulatory standards.
Unlocking the full potential of GenAI in banking
As banks move from POC to full-scale deployment, incorporating tailored metrics will be essential for ensuring the reliability of GenAI. Human oversight will continue to play a critical role, ensuring AI-generated responses are accurate, aligned with regulatory requirements, and maintain customer trust. By addressing these challenges, banks can unlock the full potential of GenAI while mitigating risks and ensuring compliance.
1. Generative AI in Banking Market Touching USD 13,957 Mn By 2033, Market.us Scoop (May 2024)
2. New Study Reveals Data Management Is a Top Challenge in the AI Revolution, Inside AI News (August 2023)