
Reproducibility in the TradingAgents Framework: Quantifying LLM Stochasticity in Financial AI

This study evaluates the reproducibility of the TradingAgents framework, a multi-agent system for stock trading, using GPT-4o and Qwen3:30B. We quantify how LLM stochasticity, driven by temperature, seed, and sampling parameters, translates into variability in financial performance. The results show that while these parameters can reduce variability, complete determinism is elusive without restrictive controls. Critically, without cherry-picking, the LLM-based agents fail to outperform a simple buy-and-hold strategy for Google stock over a three-month period (May-July 2025) and remain far below the theoretical maximum return. This highlights the substantial gap between current agent capabilities and optimal decision-making in financial AI.

Executive Impact: Unlocking Financial AI Potential

Explore the core benefits and quantifiable impact of integrating advanced AI into your financial decision-making, backed by research-driven insights.

18.1% Mean Qwen3:30B Return (T=1)
15.8% Mean GPT-4o Return (T=1)
19.1% Google Buy-and-Hold
97.3% Perfect-Foresight Max

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Large Language Models (LLMs) inherently exhibit output variability, which, when integrated into multi-agent systems for financial decision-making, leads to variability in investment returns. This research aims to quantify this stochasticity in a practical financial AI framework to understand its impact on performance and reproducibility, challenging the assumption of deterministic outcomes typical of traditional trading systems.

The study employs the TradingAgents framework, a multi-agent system (Analyst, Researcher, Trader, Risk Manager, Fund Manager) for stock trading. It uses GPT-4o and Qwen3:30B models to trade Google stock (GOOGL) over three months (May-July 2025). Key inference parameters—temperature (0 and 1), random seed (variable vs. fixed at 42), top_k (1 and 40), and top_p (0 and 0.9)—are systematically varied. Performance is assessed via mean cumulative returns, standard deviation, and the Shannon entropy of daily trading signals, compared against Google buy-and-hold, QQQ buy-and-hold, and a perfect-foresight strategy.
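
The entropy metric can be illustrated with a short sketch. This is an illustrative reimplementation, not the paper's code, and the signal labels are assumed:

```python
from collections import Counter
from math import log2

def signal_entropy(daily_signals):
    """Shannon entropy (bits) of the trading signals issued for one day
    across independent runs; 0.0 means every run agreed."""
    counts = Counter(daily_signals)
    n = len(daily_signals)
    return sum((c / n) * log2(n / c) for c in counts.values())

print(signal_entropy(["BUY"] * 5))              # 0.0 — perfectly reproducible
print(signal_entropy(["BUY", "HOLD", "SELL"]))  # log2(3) ≈ 1.585 — maximal three-way split
```

An entropy near 1.10, as reported for Qwen3:30B at T=1, therefore indicates that repeated runs frequently disagree on the day's action.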

LLM stochasticity significantly impacts financial outcomes. Qwen3:30B shows higher decision variability (entropy 1.10) than GPT-4o (entropy 0.76) at T=1. Reducing temperature and fixing the seed decreases variability but doesn't eliminate it; full determinism requires restrictive sampling parameters. Crucially, stochastic LLM agents consistently underperform passive benchmarks (Google buy-and-hold 19.1%, QQQ 17.4%) and fall far short of the perfect-foresight maximum (97.3%). This suggests that, without cherry-picking, current LLM-based trading agents do not justify their complexity compared to simpler strategies.

The findings underscore the challenges of deploying LLM-based agents in high-stakes financial environments where deterministic and reliable outcomes are paramount. The inherent variability necessitates robust risk management and advanced strategies to mitigate unpredictable performance. For enterprises, this implies that while LLMs offer sophisticated reasoning capabilities, their stochastic nature demands a cautious, data-driven approach, potentially combining them with traditional, deterministic models or ensemble methods to improve consistency and the chances of outperforming passive benchmarks.

18.1% Qwen3:30B Mean Return (T=1)

With default stochastic settings (Temperature T=1), the Qwen3:30B-based system achieved a mean cumulative return of 18.1% ± 2.8% over three months, indicating significant variability in outcomes across independent runs.

15.8% GPT-4o Mean Return (T=1)

Under comparable stochastic conditions (Temperature T=1), the GPT-4o-based system generated a mean cumulative return of 15.8% ± 4.2%. This shows a similar performance trend but with slightly higher return variance than Qwen3:30B.

Strategy                                          Mean Cumulative Return (%)   Standard Deviation (%)   Reproducibility
Qwen3:30B (T=1, random seed)                      18.1                         9.0                      High variability (entropy 1.10)
GPT-4o (T=1, random seed)                         15.8                         9.4                      High variability (entropy 0.76)
Qwen3:30B (T=0, seed=42, restrictive sampling)    28.2                         0.0                      Perfect (entropy 0.0) — cherry-picked configuration
Google Buy-and-Hold                               19.1                         0.0                      Deterministic
QQQ Buy-and-Hold                                  17.4                         0.0                      Deterministic
Perfect-Foresight (theoretical maximum)           97.3                         0.0                      Deterministic

Enterprise Process Flow

1. Market Signals & News Analysis (Analyst Team)
2. Assessments Debate (Researcher Team)
3. Trading Proposal (Trader Agent)
4. Risk Management Review (Risk Management Team)
5. Final Trading Order (Fund Manager Agent)
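
The hand-off order above can be sketched as plain functions. Every function name and payload here is hypothetical — in the real framework an LLM backs each stage:

```python
# Minimal sketch of the five-stage agent hand-off (all names assumed).

def analyst(market_data):
    return {"assessment": f"signals extracted from {market_data}"}

def researchers(assessment):
    return {**assessment, "stance": "debated consensus"}

def trader(research):
    return {"proposal": "BUY", "basis": research}

def risk_manager(proposal):
    return {**proposal, "approved": True}

def fund_manager(reviewed):
    return reviewed["proposal"] if reviewed["approved"] else "HOLD"

def run_pipeline(market_data):
    # Analyst -> Researchers -> Trader -> Risk Manager -> Fund Manager
    return fund_manager(risk_manager(trader(researchers(analyst(market_data)))))

print(run_pipeline("GOOGL prices + news, 2025-05-01"))  # BUY
```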

LLM Stochasticity vs. Deterministic Benchmarks

Despite sophisticated multi-agent reasoning, the study found that LLM-based agents under default stochastic conditions consistently failed to outperform simple passive benchmarks like Google buy-and-hold (19.1%) or QQQ buy-and-hold (17.4%). The mean returns for Qwen3:30B (18.1%) and GPT-4o (15.8%) were statistically indistinguishable from these passive strategies, and both remained far below the perfect-foresight theoretical maximum of 97.3%. This underscores the difficulty of achieving reliable, market-beating returns with current stochastic LLM agents in financial trading.
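
The two deterministic reference points can be computed directly from a daily price series. A minimal sketch, assuming the perfect-foresight bound means being long exactly on the up days (the paper's exact construction may differ):

```python
def buy_and_hold_return(prices):
    """Cumulative return (%) of buying at the first close and holding."""
    return (prices[-1] / prices[0] - 1.0) * 100.0

def perfect_foresight_return(prices):
    """Upper bound: compound only the positive day-over-day moves."""
    total = 1.0
    for prev, cur in zip(prices, prices[1:]):
        daily = cur / prev - 1.0
        if daily > 0:
            total *= 1.0 + daily
    return (total - 1.0) * 100.0

closes = [100.0, 110.0, 99.0, 108.0]     # illustrative prices, not GOOGL data
print(buy_and_hold_return(closes))       # ≈ 8.0
print(perfect_foresight_return(closes))  # ≈ 20.0 (skips the down day)
```

The gap between these two numbers on real data (19.1% vs. 97.3% in the study) is the headroom the agents leave on the table.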


Your AI Implementation Roadmap

A structured approach to integrate AI seamlessly into your operations, designed for measurable success.

Phase 1: Initial Assessment & Data Integration

Evaluate existing data pipelines, identify relevant market data sources, and integrate them into the TradingAgents framework. Establish baseline performance metrics for current trading strategies.

Phase 2: Agent Configuration & Parameter Tuning

Configure LLM agents (e.g., GPT-4o, Qwen3:30B) with initial parameters. Conduct preliminary experiments to understand the impact of temperature, seed, and sampling on decision variability.
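
As a sketch, the two regimes compared in the study might be expressed as Ollama-style option dictionaries (the field names follow Ollama's documented `options` object; how they are wired into the framework is an assumption):

```python
# Default stochastic sampling: outcomes vary from run to run.
STOCHASTIC = {
    "temperature": 1.0,
    "top_k": 40,
    "top_p": 0.9,
    # seed left unset -> a fresh random seed on every run
}

# Restrictive controls the study needed to achieve full reproducibility.
DETERMINISTIC = {
    "temperature": 0.0,
    "seed": 42,
    "top_k": 1,    # greedy decoding: only the single most likely token
    "top_p": 0.0,
}
```

Even the deterministic settings only guarantee repeatability for a fixed model build and backend; a model or runtime upgrade can still change outputs.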

Phase 3: Extended Backtesting & Reproducibility Analysis

Perform comprehensive backtesting over diverse market conditions and longer time horizons. Quantify financial performance variance and decision entropy to assess reproducibility under various LLM configurations.

Phase 4: Strategy Refinement & Risk Mitigation

Based on reproducibility analysis, refine agent decision-making logic and implement robust risk management protocols to mitigate the impact of LLM stochasticity. Explore ensemble methods or hybrid models.
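
One simple ensemble technique of the kind mentioned above is majority voting across several stochastic runs. A minimal sketch — the function and its tie-breaking rule are assumptions, not taken from the paper:

```python
from collections import Counter

def majority_vote(signals):
    """Collapse the signals from several stochastic runs into one
    decision; break ties conservatively by holding."""
    ranked = Counter(signals).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "HOLD"
    return ranked[0][0]

print(majority_vote(["BUY", "BUY", "SELL"]))  # BUY
print(majority_vote(["BUY", "SELL"]))         # HOLD (tie)
```

Voting reduces run-to-run variance at the cost of extra inference calls per decision.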

Phase 5: Production Deployment & Continuous Monitoring

Deploy the refined multi-agent system in a controlled production environment. Continuously monitor performance, conduct A/B testing, and iterate on models and parameters to optimize for consistent, outperforming returns.

Ready to Transform Your Enterprise with AI?

Schedule a personalized consultation with our AI strategists to explore how these insights can be tailored to your specific business needs.
