Alina Gorbunova
Georgia Institute of Technology
Generative AI is becoming commonplace across every industry, yet the same model can deliver brilliant insights one moment and blatant errors the next. This article explores the sources of Gen AI unreliability and shows how concepts from reliability engineering can help make these systems more stable, transparent, and dependable.
The past several years have seen the emergence of generative artificial intelligence (Gen AI) models such as ChatGPT, Gemini, and Copilot. These models are now widely used in almost every field, from healthcare to higher education to manufacturing to customer service. While they are powerful tools that have changed the way many of us work and learn, they are not without flaws. Because they are data-driven, Gen AI models can produce biased results, and small changes to their input prompts can produce vastly different outputs. They also introduce a new risk - hallucinations, where the generated output has little basis in reality and is often simply wrong. To mitigate these risks and improve the performance of these models, can traditional reliability engineering concepts be applied to Gen AI?
What makes Gen AI unreliable?
Although artificial intelligence may seem new to the public, its roots go back decades. One of its core technologies, the artificial neural network, was introduced by Frank Rosenblatt in the 1950s. Despite early excitement, its inability to learn complex patterns led to harsh criticism, most notably from Minsky and Papert in 1969. Such criticisms led to the first "AI winter" (1974–1980), a period marked by declining AI research. Neural networks were largely abandoned until the mid-1980s, when the backpropagation algorithm enabled the training of multi-layer perceptrons, reigniting interest. This laid the groundwork for today’s revolution, which accelerated in the 2010s thanks to advances in data, computing power, and algorithm design.
Gen AI is by definition probabilistic: it does not reason from logic, but rather generates sequences of words, images, or sounds based on what is statistically most likely given its training. As Weise and Metz (2023) note, these models function like advanced autocomplete tools: they predict the next likely token, not the most truthful one. How smart or silly their responses are reflects patterns learned from data, not genuine understanding. From this arise three sources of unreliability in Gen AI: data bias, hallucinations, and environmental uncertainty.
Let’s break down what each of these sources of unreliability is and where it comes from. Data bias comes from the datasets used to train the model. Gen AI models are commonly trained on data drawn from internet sources, which contain both accurate and inaccurate content, and that mix carries bias into the model. Another example comes from Stable Diffusion: Nicoletti and Bass (2023) analyzed more than 5,000 images generated by the model and found that it amplified both gender and racial stereotypes, likely because its training data was never properly screened.
Even if a model is well-trained, its real-world performance depends on how users prompt it. Small changes in input phrasing can produce vastly different outputs because the model weighs tokens differently and samples its outputs probabilistically. Even asking ChatGPT the same question twice can yield different answers. When applied to new or domain-specific contexts, such as specialized medical or educational settings, the model extrapolates beyond its training distribution, much like an engineering system operating beyond its design tolerances.
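To make this concrete, here is a minimal, self-contained sketch of probabilistic next-token sampling. The "model" is just a hand-assigned table of token scores, an assumption made purely for illustration rather than any real system, but it shows why the same prompt can yield different outputs from run to run and how the temperature parameter controls that variability.

```python
# A toy illustration of probabilistic generation: hand-assigned next-token scores
# (purely hypothetical) sampled with a temperature-scaled softmax.
import math
import random

# Hypothetical scores for tokens that could follow the prompt "The system is".
next_token_scores = {"reliable": 2.0, "down": 1.5, "complex": 1.2, "purple": 0.1}

def sample_next_token(scores, temperature=1.0):
    """Convert scores to probabilities via a temperature-scaled softmax and sample one token."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    exp_s = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exp_s.values())
    r, cumulative = random.random(), 0.0
    for tok, e in exp_s.items():
        cumulative += e / total
        if r <= cumulative:
            return tok
    return tok  # fallback for floating-point edge cases

# The same "prompt" sampled five times: low temperature is nearly deterministic,
# high temperature makes unlikely (even nonsensical) tokens more probable.
for temperature in (0.2, 1.5):
    samples = [sample_next_token(next_token_scores, temperature) for _ in range(5)]
    print(f"temperature={temperature}: {samples}")
```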
Model hallucinations occur when a Gen AI model produces a result that is not true. A New York lawyer relying on ChatGPT to draft legal documents found this out the hard way when the model made up multiple case citations and presented them as real (Weiser, 2023). Because these models have no internal mechanism for separating fact from fiction, they confidently present incorrect results. This raises the need for safeguards that both help users identify hallucinations when they occur and minimize how often they occur.
Applying reliability engineering concepts to generative AI reframes hallucinations and bias as manageable system vulnerabilities. With the right safeguards, monitoring, and design principles, we can build Gen AI systems that perform more consistently and reliably in the real world.
How can we make Gen AI more reliable?
Reliability engineering studies how complex systems perform under uncertainty and offers a useful perspective on how to pinpoint and address system vulnerabilities. Can its concepts be extended from mechanical systems that fail due to material fatigue and environmental stress to Gen AI models that “fail” due to data and architecture limitations? Let’s look at some ways reliability engineering principles carry over.
- Fault Detection and Diagnostics: Within physical systems, sensors monitor for abnormal conditions. In AI systems, these sensors are replaced with fact-checking and verification mechanisms. After results are generated, diagnostic tools can compare them against verified data sources to check their accuracy. This is already being implemented with Retrieval-Augmented Generation (RAG), which combines neural generation with real-time data retrieval to produce responses grounded in fact (Li et al., 2024). This acts as a “sensor” that detects abnormal content before it reaches the user; a minimal sketch of such a check appears after this list.
- AI Ensembles as System Redundancies: No system is ever completely perfect, so reliability engineers design safeguards through redundancy - when one component fails, others keep working so the entire system does not come to a halt. In Gen AI, redundancy can take the form of model ensembles, cross-model verification, or hybrid systems that combine logical reasoning with generation. Having a Gen AI model check its work against other models or data sets adds an extra layer of protection against inaccurate or biased responses (see the ensemble sketch following this list).
- Fault Tolerance and Degradation: Just as no manufactured component is perfect, its final dimensions fall within specified tolerances, and robust design accounts for this. The same thinking applies to Gen AI models. For example, temperature tuning, which controls the randomness of an output, can be used to balance when the model should produce more consistent and factual results (lower temperature) against when it should produce more creative and variable responses (higher temperature) (Noble, 2025); the sampling sketch earlier in this article illustrates the effect of this parameter.
- Retraining for Gen AI Maintenance: Every physical system degrades over time as tools and components wear out under stress. To address this, reliability engineers schedule regular maintenance to prevent severe degradation and potential system failure. Similarly, AI systems require continual retraining and monitoring to prevent concept drift, which occurs when there is a mismatch between the training data and current reality. Establishing schedules for retraining and operational validation helps ensure that Gen AI models continue to produce consistent and accurate results; a simple drift-monitoring sketch also follows this list.
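The following sketch illustrates the fault-detection idea behind RAG-style verification, under the assumption that verified reference snippets are already available locally. Real systems use embedding-based retrieval and an actual language model for grounding; here, crude word overlap stands in for both, and the names (VERIFIED_SOURCES, support_score, flag_unsupported) are hypothetical. The point is only to show where the "sensor" sits in the pipeline: between generation and the user.

```python
# A minimal, illustrative "sensor" that flags generated claims with no support
# in a small set of verified sources. Word overlap is a crude stand-in for
# semantic retrieval and is used here only for illustration.
VERIFIED_SOURCES = [
    "The turbine was commissioned in 2015 and overhauled in 2021.",
    "Scheduled maintenance intervals for the pump are every 6 months.",
]

def support_score(claim: str, source: str) -> float:
    """Fraction of the claim's words that also appear in the source."""
    claim_words = set(claim.lower().split())
    source_words = set(source.lower().split())
    return len(claim_words & source_words) / max(len(claim_words), 1)

def flag_unsupported(generated_claims, threshold=0.8):
    """Flag any generated claim that no verified source sufficiently supports."""
    flagged = []
    for claim in generated_claims:
        best = max(support_score(claim, s) for s in VERIFIED_SOURCES)
        if best < threshold:
            flagged.append((claim, best))
    return flagged

claims = [
    "The turbine was overhauled in 2021.",
    "The turbine was decommissioned in 2019.",  # a fabricated "hallucination"
]
for claim, score in flag_unsupported(claims):
    print(f"Possible hallucination (support={score:.2f}): {claim}")
```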
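The ensemble idea can be sketched just as simply. The three "models" below are stub functions standing in for independent model calls (an assumption for illustration only); the principle is that an answer is reported only when a quorum agrees, and disagreement is treated as a detected fault to escalate, much as a redundant physical system falls back to a safe state.

```python
# A minimal sketch of redundancy via a model ensemble with majority voting.
from collections import Counter

def model_a(question): return "Paris"
def model_b(question): return "Paris"
def model_c(question): return "Lyon"   # a dissenting (possibly faulty) component

def ensemble_answer(question, models, quorum=2):
    """Return the majority answer if a quorum agrees, else signal for review."""
    answers = [m(question) for m in models]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes >= quorum:
        return answer
    return None  # no consensus: treat as a detected fault and escalate

print(ensemble_answer("What is the capital of France?", [model_a, model_b, model_c]))
```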
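Finally, here is a sketch of preventive-maintenance-style monitoring for concept drift. It assumes a periodic evaluation that returns an accuracy score on fresh labeled data; the baseline, tolerance, and weekly scores are placeholders chosen for illustration, not measurements from any real deployment.

```python
# A minimal drift monitor: trigger retraining once performance degrades
# past an allowed tolerance, analogous to scheduling maintenance on wear.
BASELINE_ACCURACY = 0.92       # accuracy measured right after the last (re)training
DRIFT_TOLERANCE = 0.05         # allowed degradation before maintenance is triggered

def needs_retraining(current_accuracy: float) -> bool:
    """Trigger 'maintenance' (retraining) once performance drifts past tolerance."""
    return (BASELINE_ACCURACY - current_accuracy) > DRIFT_TOLERANCE

# Simulated weekly evaluation scores as live data drifts away from the training data.
weekly_scores = [0.91, 0.90, 0.88, 0.85]
for week, score in enumerate(weekly_scores, start=1):
    if needs_retraining(score):
        print(f"Week {week}: accuracy {score:.2f} -> schedule retraining")
    else:
        print(f"Week {week}: accuracy {score:.2f} -> within tolerance")
```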
Designing robust, transparent, and adaptable systems is at the heart of reliability engineering. These qualities create systems that can handle uncertainty, detect errors, and be maintained over time - clear strengths that can be used to make Gen AI models more reliable.
Gen AI models are inherently probabilistic and data-driven, making their outputs statistical patterns rather than logically derived answers. Bias, hallucinations, and inconsistency all arise from this. Applying reliability engineering principles such as fault detection, fault tolerance, and preventive maintenance to Gen AI yields models that mitigate these risks and push the performance and boundaries of the field.
References:
Li, J., Yuan, Y., Zhang, Z., 2024. Enhancing LLM factual accuracy with RAG to counter hallucinations: A case study on domain-specific queries in private knowledge-bases. URL: https://arxiv.org/abs/2403.10446, arXiv:2403.10446.
Nicoletti, L., Bass, D., 2023. Humans are biased. Generative AI is even worse. URL: https://www.bloomberg.com/graphics/2023-
Noble, J., 2025. What is LLM temperature? URL: https://www.ibm.com/think/topics/llm-temperature.
Weise, K., Metz, C., 2023. When A.I. chatbots hallucinate. URL: https://www.nytimes.com/2023/05/01/business/ai-chatbots-hallucination.html.
Weiser, B., 2023. Here’s what happens when your lawyer uses ChatGPT. URL: https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html.
Acknowledgements: We would like to thank Ronak Tiwari for taking time to review this article. Photo credit goes to Logan Voss for the header photo and Igor Omilaev for the footer photo.