Enterprise AI Analysis
Reproducibility in the TradingAgents Framework: Quantifying LLM Stochasticity in Financial AI
This study evaluates the reproducibility of the TradingAgents framework, a multi-agent system for stock trading, using GPT-4o and Qwen3:30B. We quantify how LLM stochasticity, driven by temperature, random seed, and sampling parameters, translates into variability in financial performance. The results show that tightening these parameters reduces variability, but complete determinism remains elusive without restrictive controls. Critically, the LLM-based agents, absent cherry-picking, fail to outperform a simple buy-and-hold strategy for Google stock over a three-month period (May-July 2025) and remain far below the theoretical maximum return. This highlights the substantial gap between current agent capabilities and optimal decision-making in financial AI.
Executive Impact: Unlocking Financial AI Potential
Explore the core benefits and quantifiable impact of integrating advanced AI into your financial decision-making, backed by research-driven insights.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Large Language Models (LLMs) inherently exhibit output variability, which, when integrated into multi-agent systems for financial decision-making, leads to variability in investment returns. This research aims to quantify this stochasticity in a practical financial AI framework to understand its impact on performance and reproducibility, challenging the assumption of deterministic outcomes typical of traditional trading systems.
The study employs the TradingAgents framework, a multi-agent system (Analyst, Researcher, Trader, Risk Manager, Fund Manager) for stock trading. It uses GPT-4o and Qwen3:30B models to trade Google stock (GOOGL) over three months (May-July 2025). Key inference parameters—temperature (0 and 1), random seed (variable vs. fixed at 42), top_k (1 and 40), and top_p (0 and 0.9)—are systematically varied. Performance is assessed via mean cumulative returns, standard deviation, and Shannon entropy of daily trading signals, compared against Google buy-and-hold, QQQ buy-and-hold, and a perfect-foresight strategy.
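The Shannon-entropy metric above can be sketched in a few lines. This is a minimal illustration, not the framework's actual implementation; the signal labels (`"BUY"`, `"SELL"`, `"HOLD"`) are assumptions for the example:

```python
import math
from collections import Counter

def signal_entropy(signals):
    """Shannon entropy (in bits) of a sequence of daily trading signals.

    Higher entropy means independent runs disagree more often,
    i.e. the agent's decisions are less reproducible.
    """
    counts = Counter(signals)
    total = len(signals)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A run spread evenly over three signals has maximal entropy (log2(3) ≈ 1.585);
# a run that always emits the same signal has entropy 0.
mixed = ["BUY", "SELL", "HOLD", "BUY", "SELL", "HOLD"]
steady = ["BUY", "BUY", "BUY", "BUY", "BUY", "BUY"]
print(signal_entropy(mixed))   # ≈ 1.585 bits
print(signal_entropy(steady))  # 0.0 bits
```

Under this measure, Qwen3:30B's reported entropy of 1.10 versus GPT-4o's 0.76 indicates noticeably less consistent day-to-day decisions.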
LLM stochasticity significantly impacts financial outcomes. Qwen3:30B shows higher decision variability (entropy 1.10) than GPT-4o (entropy 0.76) at T=1. Reducing temperature and fixing the seed decreases variability but does not eliminate it; full determinism requires restrictive sampling parameters. Crucially, stochastic LLM agents consistently underperform passive benchmarks (Google buy-and-hold 19.1%, QQQ 17.4%) and fall far short of the perfect-foresight maximum (97.3%). This suggests that, without cherry-picking, current LLM-based trading agents do not justify their complexity compared to simpler strategies.
The findings underscore the challenges of deploying LLM-based agents in high-stakes financial environments where deterministic and reliable outcomes are paramount. The inherent variability necessitates robust risk management and advanced strategies to mitigate unpredictable performance. For enterprises, this implies that while LLMs offer sophisticated reasoning capabilities, their stochastic nature demands a cautious, data-driven approach, potentially combining them with traditional, deterministic models or more advanced ensemble methods to improve consistency and outperformance.
With default stochastic settings (Temperature T=1), the Qwen3:30B-based system achieved a mean cumulative return of 18.1% ± 2.8% over three months, indicating significant variability in outcomes across independent runs.
Under comparable stochastic conditions (Temperature T=1), the GPT-4o-based system generated a mean cumulative return of 15.8% ± 4.2%. This shows similar performance trends but with slightly higher return variance for GPT-4o compared to Qwen3:30B.
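The "mean ± standard deviation" figures above summarize independent backtest runs of one configuration. A minimal sketch of that aggregation, using made-up per-run returns (the numbers below are illustrative, not the study's data):

```python
import statistics

def summarize_runs(cumulative_returns):
    """Mean and sample standard deviation of cumulative returns (%)
    across independent backtest runs of the same configuration."""
    mean = statistics.mean(cumulative_returns)
    sd = statistics.stdev(cumulative_returns) if len(cumulative_returns) > 1 else 0.0
    return mean, sd

# Hypothetical per-run returns (%) for one stochastic configuration.
runs = [14.2, 19.5, 17.8, 20.1, 18.9]
mean, sd = summarize_runs(runs)
print(f"{mean:.1f}% ± {sd:.1f}%")
```

A fully deterministic configuration would produce identical runs, collapsing the standard deviation to zero, which is exactly the reproducibility signature the study looks for.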
| Strategy | Mean Cumulative Return (%) | Standard Deviation (%) | Reproducibility |
|---|---|---|---|
| Qwen3:30B (T=1, Random Seed) | 18.1 | 9.0 | Non-deterministic |
| GPT-4o (T=1, Random Seed) | 15.8 | 9.4 | Non-deterministic |
| Qwen3:30B (T=0, Seed=42, Restrictive Sampling) | 28.2 | 0.0 | Deterministic |
| Google Buy-and-Hold | 19.1 | 0.0 | Deterministic |
| QQQ Buy-and-Hold | 17.4 | 0.0 | Deterministic |
| Perfect-Foresight (Theoretical Max) | 97.3 | 0.0 | Deterministic |
Enterprise Process Flow
LLM Stochasticity vs. Deterministic Benchmarks
Despite sophisticated multi-agent reasoning, the study found that LLM-based agents under default stochastic conditions consistently failed to outperform simple passive benchmarks like Google Buy-and-Hold (19.1%) or QQQ Buy-and-Hold (17.4%). The mean returns for Qwen3:30B (18.1%) and GPT-4o (15.8%) were statistically indistinguishable from these passive strategies, and both remained far below the perfect-foresight theoretical maximum of 97.3%, underscoring a significant performance gap. This highlights the inherent challenge of achieving reliable, outperforming returns with current stochastic LLM models in financial trading.
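The two benchmarks that bracket the agents' performance are straightforward to compute from a daily price series. A minimal sketch, using hypothetical closing prices rather than actual GOOGL data; the perfect-foresight variant shown here assumes the idealized strategy of being long only on up days:

```python
def buy_and_hold_return(prices):
    """Cumulative return (%) of buying at the first price and holding."""
    return (prices[-1] / prices[0] - 1.0) * 100.0

def perfect_foresight_return(prices):
    """Theoretical maximum: with perfect knowledge of tomorrow's close,
    hold the stock only on days the price rises, cash otherwise."""
    equity = 1.0
    for today, tomorrow in zip(prices, prices[1:]):
        if tomorrow > today:
            equity *= tomorrow / today
    return (equity - 1.0) * 100.0

# Hypothetical daily closes.
prices = [100.0, 103.0, 101.0, 106.0, 104.0, 110.0]
print(buy_and_hold_return(prices))       # ≈ 10.0
print(perfect_foresight_return(prices))  # larger: captures only up-days
```

The large spread between the two (19.1% vs. 97.3% in the study) shows how much headroom exists above buy-and-hold, and how little of it the agents captured.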
Advanced ROI Calculator
Estimate your potential savings and efficiency gains by deploying AI within your enterprise.
Your AI Implementation Roadmap
A structured approach to integrate AI seamlessly into your operations, designed for measurable success.
Phase 1: Initial Assessment & Data Integration
Evaluate existing data pipelines, identify relevant market data sources, and integrate them into the TradingAgents framework. Establish baseline performance metrics for current trading strategies.
Phase 2: Agent Configuration & Parameter Tuning
Configure LLM agents (e.g., GPT-4o, Qwen3:30B) with initial parameters. Conduct preliminary experiments to understand the impact of temperature, seed, and sampling on decision variability.
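A parameter-tuning sweep for this phase can be enumerated up front. The sketch below builds the full grid of the four inference knobs the study varied; the exact parameter names accepted by a given model backend may differ, so treat these keys as assumptions:

```python
from itertools import product

# The four inference knobs varied in the study. `seed=None` stands for
# a fresh random seed on every run; 42 is the fixed seed used in the paper.
temperatures = [0, 1]
seeds = [None, 42]
top_ks = [1, 40]
top_ps = [0, 0.9]

configs = [
    {"temperature": t, "seed": s, "top_k": k, "top_p": p}
    for t, s, k, p in product(temperatures, seeds, top_ks, top_ps)
]
print(len(configs))  # 16 combinations
```

Running several independent backtests per configuration, then comparing return variance and signal entropy across the grid, reproduces the study's reproducibility analysis.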
Phase 3: Extended Backtesting & Reproducibility Analysis
Perform comprehensive backtesting over diverse market conditions and longer time horizons. Quantify financial performance variance and decision entropy to assess reproducibility under various LLM configurations.
Phase 4: Strategy Refinement & Risk Mitigation
Based on reproducibility analysis, refine agent decision-making logic and implement robust risk management protocols to mitigate the impact of LLM stochasticity. Explore ensemble methods or hybrid models.
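One simple ensemble technique for taming stochastic signals is majority voting across independent agent runs. This is an illustrative sketch, not a method from the paper; the signal labels and the conservative tie-break are assumptions:

```python
from collections import Counter

def majority_vote(signals, fallback="HOLD"):
    """Aggregate per-run trading signals into a single decision.

    A tie between the most common signals falls back to the
    conservative default rather than picking arbitrarily.
    """
    counts = Counter(signals).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]

print(majority_vote(["BUY", "BUY", "SELL"]))  # BUY
print(majority_vote(["BUY", "SELL"]))         # HOLD (tie)
```

Averaging over runs in this way trades extra inference cost for lower decision variance, directly addressing the variability quantified in Phase 3.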
Phase 5: Production Deployment & Continuous Monitoring
Deploy the refined multi-agent system in a controlled production environment. Continuously monitor performance, conduct A/B testing, and iterate on models and parameters to optimize for consistent, outperforming returns.
Ready to Transform Your Enterprise with AI?
Schedule a personalized consultation with our AI strategists to explore how these insights can be tailored to your specific business needs.