CYBER-TECH | Second opinions: Doctors vs algorithms: who decides?

As large language models move into medicine, disagreements between doctors and algorithms expose a deeper problem of accountability

Krishna Shah | 12th February, 12:56 am

A clinician reviews an AI-generated recommendation that contradicts her own assessment. She can walk through the chain of reasoning behind her judgment, step by step. Can the tool do the same? Work in that direction is under way, but nothing yet fully delivers.

Large language models (LLMs) are moving steadily into medicine, genomics, and drug discovery, not as replacements for clinicians but as advisers whose expressed confidence often exceeds what they can explain.

Their capabilities are now well-advertised: summarising clinical notes, scanning biomedical literature, annotating genes, and proposing molecular candidates for drug development. Their reasoning, however, remains largely opaque, by design.

Why are the systems opaque?

Modern LLMs are built on transformer architectures that model relationships between tokens across massive datasets. This design enables parallel processing, contextual awareness, and remarkable linguistic fluency. It also produces systems whose internal decision pathways resist interpretation even by their creators. While the broad architecture is publicly known, the weights, training data, and fine-tuning details of most leading LLMs are treated as trade secrets. This opacity is often defended as commercially necessary.
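To see why interpretation is hard even in principle, consider the transformer's core operation: scaled dot-product self-attention. The toy sketch below (plain NumPy, illustrative only, not any vendor's implementation) shows how every output vector is a dense, learned mixture of every input token; there is no discrete rule to point to when asking why the model favoured one answer over another.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each token attends to all others
    return weights @ V                               # every output mixes information from all tokens

rng = np.random.default_rng(0)
d = 8                                                # toy embedding dimension
X = rng.normal(size=(5, d))                          # five tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)
```

Stack dozens of such layers, each with billions of learned weights, and the question "which rule produced this output?" has no crisp answer.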

In biomedical contexts, LLMs are increasingly used to annotate the functions of newly identified genes, a necessary response to a pace of genomic discovery that has outstripped the capacity of human researchers to annotate function manually. By synthesising existing literature and databases, the models propose likely roles for new genes. This is indispensable work. It is also fragile. A confident but incorrect annotation, once accepted, can cascade through biological databases and downstream research. At scale, automation accelerates both discovery and error propagation.
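The cascade risk can be made concrete with a toy simulation (all numbers hypothetical): suppose most new genes inherit their functional label from a previously annotated relative, as in homology-based transfer, and a single seed annotation is wrong.

```python
import random

random.seed(1)
N, p_transfer = 200, 0.8        # assumption: 80% of labels copied from an annotated relative
labels = ["wrong"]              # one confident but incorrect seed annotation
for _ in range(N - 1):
    if random.random() < p_transfer:
        labels.append(random.choice(labels))   # inherit from a random annotated gene
    else:
        labels.append("correct")               # annotated de novo, assumed accurate
print(f"{labels.count('wrong')} of {N} genes now carry the original error")
```

One mistake, copied faithfully enough times, becomes consensus.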

Drug discovery follows a similar logic. Transformer-based models such as ChemBERTa encode chemical information directly from symbolic representations like SMILES strings, while platforms such as NVIDIA’s BioNeMo integrate multiple foundation models to predict protein structures and molecular interactions. These systems promise to compress years of exploratory work into months. They also relocate early scientific judgment into models whose internal logic is difficult to audit independently.
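As a rough illustration of that first step, here is a hedged sketch of embedding a molecule from its SMILES string with a ChemBERTa-style model, assuming the Hugging Face transformers library and the community checkpoint seyonec/ChemBERTa-zinc-base-v1 (one of several public variants):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "seyonec/ChemBERTa-zinc-base-v1"          # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

smiles = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin, written as a SMILES string
inputs = tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # one vector per SMILES token
embedding = hidden.mean(dim=1)                   # crude molecule-level embedding
print(embedding.shape)                           # e.g. torch.Size([1, 768])
```

Everything downstream, from similarity searches to activity predictions, rests on what that embedding has learned to encode, which is precisely the part no one can directly read.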

Opacity would be less troubling if confidence were carefully calibrated. Evidence suggests it is not. Recent studies show a significant gap between LLMs’ expressed certainty and their actual accuracy, particularly in medical reasoning tasks. A model that is right 80% of the time is still wrong often enough to matter, especially when its language conveys authority rather than hesitation. In clinical environments marked by time pressure and staff shortages, such fluency can subtly shift decision-making, even when human oversight remains formally intact.
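The gap those studies describe can be quantified with expected calibration error (ECE): bin predictions by stated confidence and compare each bin's average confidence with its actual accuracy. A minimal sketch, with invented numbers purely for illustration:

```python
import numpy as np

# Invented predictions, illustrative only: stated certainty vs. actual correctness.
confidence = np.array([0.95, 0.90, 0.90, 0.85, 0.80, 0.80, 0.75, 0.70])
correct    = np.array([1,    1,    0,    1,    0,    1,    0,    1   ])

ece = 0.0
bins = np.linspace(0.5, 1.0, 6)                  # five confidence bins from 0.5 to 1.0
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (confidence > lo) & (confidence <= hi)
    if in_bin.any():
        gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
        ece += in_bin.mean() * gap               # weight each bin by its share of predictions
print(f"accuracy {correct.mean():.2f}, mean confidence {confidence.mean():.2f}, ECE {ece:.2f}")
```

In this made-up example the model speaks with 83% average confidence while being right only 62% of the time, which is the shape of the problem the medical-reasoning studies report.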

What’s the fix?

Proposed technical fixes tend to focus on explainable AI. Methods such as SHAP and LIME have been used to highlight which features most influenced a model’s output in areas like cardiovascular risk prediction and oncology. These tools can expose patterns and biases, and they may increase user confidence. What they do not provide is causal reasoning. They explain correlations after the fact, not the internal logic by which conclusions are formed. Transparency, in this sense, is partial and interpretive rather than structural.
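For a sense of what these tools do and do not deliver, here is a minimal sketch of post-hoc attribution with the shap library on a toy tabular risk model (synthetic data, not a clinical system, and a tree model rather than an LLM). The output is a set of per-feature influence scores, not a causal account:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                    # four made-up "clinical" features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic risk label

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)            # post-hoc attribution, not causal reasoning
shap_values = explainer.shap_values(X[:5])       # per-feature influence for five "patients"
print(np.shape(shap_values))                     # e.g. (5, 4, 2): patients x features x classes
```

The scores say which inputs moved the prediction, after the fact. They do not say why the model treats those inputs as it does.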

Public attitudes suggest this tension is already visible. Surveys indicate that 57% of Americans are uncomfortable with healthcare providers relying on AI for diagnosis or treatment recommendations, alongside concerns that such technologies may weaken the patient–provider relationship and compromise data security. Trust, in healthcare, is not ancillary; it is part of the infrastructure itself.

When machines and doctors disagree, the decisive question is not which one is more accurate. It is who bears responsibility for the decision, and who explains it when confidence proves misplaced. The problem is that only one side can be asked to explain its reasoning, and only one can be held accountable when the explanation fails. That imbalance, left unaddressed, is what to watch most closely. Until that gap is closed, opacity will remain not a technical curiosity but a structural risk.
