
Enterprise AI Analysis

Developing a Multi-Dimensional Evaluation Framework for AI-Generated Reading Questions: A Case Study of Doubao

This study developed and validated a multi-dimensional evaluation framework for assessing AI-generated reading comprehension questions, using Doubao as a case study. The framework, constructed through literature analysis, Delphi expert consultation, and the Analytic Hierarchy Process, comprises five dimensions: Linguistic Quality, Content Relevance, Cognitive Level, Pedagogical Appropriateness, and Item Design Quality. As rated by six experts, Doubao's questions showed strong linguistic quality (M=4.02) and content relevance (M=3.85), moderate pedagogical appropriateness (M=3.56) and item design quality (M=3.42), and a clear weakness in higher-order cognitive questions (M=3.18). The framework achieved satisfactory inter-rater reliability (ICC=0.84) and construct validity.

Executive Impact: Key Findings at a Glance

Linguistic Quality Score: 4.02/5
Content Relevance Score: 3.85/5
Cognitive Level Score: 3.18/5
Inter-rater Reliability (ICC): 0.84

Deep Analysis & Enterprise Applications

This analysis covers the study's methodology, Doubao's measured performance, and the implications for educational practice.

The study adopted a mixed-methods approach, combining quantitative and qualitative research paradigms. Framework development involved literature analysis, Delphi expert consultation for consensus on dimensions and indicators, and Analytic Hierarchy Process (AHP) for weighting. Case study validation applied the framework to Doubao-generated questions, assessing 120 items with six expert raters using a standardized scoring rubric. Data analysis included descriptive statistics, inter-rater reliability (ICC), content validity (Delphi consensus), and construct validity (exploratory factor analysis).
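The Delphi consensus rule used in framework development (coefficient of variation below 0.25 on the 5-point Likert ratings, per the process outline below) reduces to a one-line check. A minimal sketch, assuming illustrative ratings rather than the study's actual data:

```python
import numpy as np

# Illustrative 5-point Likert ratings from the 12-expert Delphi panel
# for one candidate indicator (hypothetical values, not the study's data).
ratings = np.array([4, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4, 5])

# Consensus criterion from the study: coefficient of variation (CV) < 0.25.
cv = ratings.std(ddof=1) / ratings.mean()
print(f"CV = {cv:.3f} -> {'consensus reached' if cv < 0.25 else 'carry to next Delphi round'}")
```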

Evaluation Framework Development Process

Literature Analysis: Identify dimensions (Linguistic, Content, Cognitive, Pedagogical)
Delphi Expert Consultation: 12 experts, 2 rounds, 5-point Likert scale, CV < 0.25 consensus
Analytic Hierarchy Process (AHP): Establish weight coefficients for dimensions (worked sketch after this list)
Case Study Validation: Apply framework to Doubao (120 questions, 6 expert raters)
Data Analysis: Descriptive stats, ICC, content & construct validity
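As a worked illustration of the AHP weighting step above, the sketch below derives dimension weights from a pairwise comparison matrix via the principal eigenvector and checks Saaty's consistency ratio. The matrix entries are hypothetical; the study's actual expert judgments are not reproduced here.

```python
import numpy as np

DIMS = ["Linguistic Quality", "Content Relevance", "Cognitive Level",
        "Pedagogical Appropriateness", "Item Design Quality"]

# Hypothetical pairwise comparisons on Saaty's 1-9 scale (A[i][j] = how much
# more important dimension i is than dimension j; A[j][i] is its reciprocal).
A = np.array([
    [1,   1/2, 1/3, 1,   2],
    [2,   1,   1/2, 2,   3],
    [3,   2,   1,   2,   3],
    [1,   1/2, 1/2, 1,   2],
    [1/2, 1/3, 1/3, 1/2, 1],
])

# Weights = normalized principal eigenvector of the comparison matrix.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

# Consistency ratio CR = CI / RI; judgments are acceptable when CR < 0.10.
n = A.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)
ri = 1.12  # Saaty's random index for n = 5
cr = ci / ri

for dim, w in zip(DIMS, weights):
    print(f"{dim:28s} weight = {w:.3f}")
print(f"CR = {cr:.3f} ({'consistent' if cr < 0.10 else 'revise judgments'})")
```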

Evaluation Framework Dimensions and Key Indicators

Linguistic Quality
  • Grammatical accuracy
  • Clarity of expression
  • Appropriate vocabulary
Content Relevance
  • Alignment with source text
  • Accuracy of reflection
  • Verifiability of answers
Cognitive Level
  • Distribution across Bloom's Taxonomy
  • Higher-order thinking skills representation
Pedagogical Appropriateness
  • Suitability for target learners
  • Alignment with instructional objectives
Item Design Quality
  • Stem clarity
  • Distractor plausibility (for multiple-choice questions)
  • Scoring criteria specificity (for constructed-response questions)
Average inter-rater reliability for the framework: ICC = 0.84.
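The summary does not state which ICC form the study used; for a design where every item is scored by the same panel and reliability is reported for the averaged ratings, a common choice is the two-way random-effects ICC(2,k). A minimal numpy sketch of that form, with toy ratings standing in for the study's 120-item, six-rater matrix:

```python
import numpy as np

def icc_2k(scores: np.ndarray) -> float:
    """Two-way random-effects ICC for the average of k raters, ICC(2,k)."""
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()  # between items
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

# Toy example: 5 items x 3 raters on the 5-point rubric
# (the study used 120 items and 6 expert raters).
ratings = np.array([
    [4, 4, 5],
    [3, 3, 4],
    [5, 4, 5],
    [2, 3, 2],
    [4, 5, 4],
], dtype=float)
print(f"ICC(2,k) = {icc_2k(ratings):.2f}")
```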

Doubao's performance varied across dimensions: strong in linguistic quality (M=4.02) and content relevance (M=3.85), moderate in pedagogical appropriateness (M=3.56) and item design quality (M=3.42), and weakest in cognitive level (M=3.18), where questions concentrated at lower-order levels. Overall, 68.3% of questions met the basic quality standard (mean score ≥ 3.0).

Dimension-Specific Performance of Doubao-Generated Questions

As illustrated in the original paper's Figure 2, Doubao shows its strongest performance in Linguistic Quality (4.02/5) and Content Relevance (3.85/5). Pedagogical Appropriateness (3.56/5) and Item Design Quality (3.42/5) are moderate, while Cognitive Level is the weakest at 3.18/5, indicating a challenge in generating higher-order thinking questions.


Doubao's lowest mean score: Cognitive Level (M=3.18).

Case Study: Doubao's Cognitive Level Limitations

Doubao showed a pronounced concentration of questions at lower cognitive levels (58.3% at the remembering/understanding levels), with only 15.8% addressing higher-order skills (analyzing, evaluating, creating). This imbalance suggests limitations in generating questions that require complex cognitive processing, mirroring traditional examination practice, where assessments often overemphasize recall. The same skew appears across LLM evaluation studies, indicating an inherent challenge in current AI question generation rather than a platform-specific deficiency.
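This kind of imbalance can be flagged automatically during screening. The sketch below audits the cognitive-level labels of a question set; the totals reproduce the study's reported 58.3% lower-order and 15.8% higher-order shares, while the split within each band and the 30% target threshold are assumptions for illustration.

```python
from collections import Counter

LOWER = {"remember", "understand"}          # lower-order Bloom levels
HIGHER = {"analyze", "evaluate", "create"}  # higher-order Bloom levels

# Cognitive-level codings for 120 questions. Totals match the study's
# reported distribution (58.3% lower-order, 15.8% higher-order); the
# within-band split is assumed.
labels = (["remember"] * 35 + ["understand"] * 35 + ["apply"] * 31
          + ["analyze"] * 10 + ["evaluate"] * 6 + ["create"] * 3)

counts = Counter(labels)
total = len(labels)
lower = sum(counts[lvl] for lvl in LOWER) / total
higher = sum(counts[lvl] for lvl in HIGHER) / total
print(f"lower-order: {lower:.1%}, higher-order: {higher:.1%}")

if higher < 0.30:  # hypothetical target share for higher-order items
    print("Flag: supplement with manually crafted higher-order questions.")
```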

The findings highlight the need for human oversight when integrating AI-generated questions into educational practice. Teachers should scrutinize questions for cognitive-level appropriateness and supplement AI content with manually crafted higher-order questions. The framework also supports professional learning communities focused on AI literacy and informs institutional quality assurance protocols, safeguarding assessment validity and guiding targeted professional development.

Implications for Educational Practice

  • Provides teachers with a structured approach to evaluate and select AI-generated items.
  • Enables informed decisions about pedagogical standards and modification needs.
  • Highlights the necessity of human oversight for assessment quality.
  • Recommends screening processes for AI-generated questions prior to deployment.
  • Emphasizes scrutinizing questions for cognitive level appropriateness.
  • Suggests supplementing AI content with manually crafted higher-order questions for balanced coverage.
  • Supports professional learning communities for AI literacy among educators.
  • Informs institutional quality assurance protocols and targeted professional development.
Key takeaway: human oversight is critical when integrating AI-generated questions into education.

Advanced ROI Calculator: Unlock Your AI's Potential

Estimate the potential time and cost savings AI can bring to your assessment workflows based on the insights from this analysis.


Estimates of this kind are based on streamlining question generation, reducing manual grading, and improving assessment efficiency. Actual ROI will vary with implementation scope and existing infrastructure.
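For transparency, the arithmetic behind such an estimate is straightforward. A minimal sketch, with every parameter a hypothetical placeholder to be replaced by institution-specific figures:

```python
# All values below are hypothetical assumptions, not figures from the study.
QUESTIONS_PER_YEAR = 2000        # items authored annually
MINUTES_SAVED_PER_ITEM = 12      # drafting time saved per AI-assisted item
REVIEW_MINUTES_PER_ITEM = 4      # human-oversight screening time added back
HOURLY_RATE_USD = 45.0           # fully loaded educator cost per hour

net_minutes = (MINUTES_SAVED_PER_ITEM - REVIEW_MINUTES_PER_ITEM) * QUESTIONS_PER_YEAR
hours_reclaimed = net_minutes / 60
annual_savings = hours_reclaimed * HOURLY_RATE_USD

print(f"Estimated annual hours reclaimed: {hours_reclaimed:,.0f}")
print(f"Estimated annual savings: ${annual_savings:,.0f}")
```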

Your Strategic AI Implementation Roadmap

A phased approach to integrating AI into your educational assessment processes, maximizing efficiency and quality.

Phase 1: Pilot & Framework Integration (Weeks 1-4)

Integrate the evaluation framework into existing assessment workflows. Conduct pilot testing with AI-generated questions in a controlled environment. Train educators on using the framework for quality screening and refinement.

Phase 2: Targeted AI Deployment & Feedback Loop (Weeks 5-12)

Expand AI-generated question deployment to specific courses, focusing on identified strengths (e.g., linguistic quality, content relevance). Establish a continuous feedback mechanism for educators to report on question performance and quality. Refine prompt engineering strategies based on early outcomes.

Phase 3: Scaling & Advanced Integration (Months 4-6)

Scale AI question generation across more subject domains, with emphasis on balancing cognitive levels through human-AI collaboration. Develop AI literacy programs for all teaching staff. Monitor long-term impact on student learning outcomes and assessment efficiency. Explore integration with existing LMS platforms for seamless content delivery.

Ready to Transform Your Assessments with AI?

Leverage cutting-edge AI insights to optimize your educational content creation and evaluation strategies.

Ready to get started? Book a free consultation and let's discuss your AI strategy.