Enterprise AI Analysis: Software Engineering Education
Can Code Evaluation Metrics Detect Code Plagiarism?
A comparative empirical study on source code plagiarism detection
Fahad Ebrahim, Mike Joy • University of Warwick • April 28, 2026
Executive Impact Summary
Our research highlights critical performance benchmarks for plagiarism detection using Code Evaluation Metrics (CEMs) in software engineering. Key metrics demonstrate the potential of CEMs, especially with preprocessing, to rival or exceed traditional tools.
Deep Analysis & Enterprise Applications
The following modules present the specific findings of the research, reframed as enterprise-focused analyses for software engineering education.
Our Approach to Evaluating Code Evaluation Metrics
We performed a comparative empirical study using two open-source labelled datasets, ConPlag and IRPlag, to evaluate five CEMs: CodeBLEU, CrystalBLEU, RUBY, TSED, and CodeBERTScore. Performance was assessed using threshold-free, ranking-based measures (AUROC and AUPRC) at the overall, per-dataset, and per-plagiarism-level granularities. Results were compared against state-of-the-art source code plagiarism detection tools (SCPDTs), namely JPlag and Dolos, and the impact of preprocessing was also examined.
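The two ranking measures used in the evaluation can be sketched in a few lines of pure Python. The scores and labels below are toy values for illustration only, not figures from the study.

```python
def auroc(scores, labels):
    """Probability that a random plagiarised pair outscores a random
    non-plagiarised pair (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Area under the precision-recall curve via the ranked-list form:
    mean precision at each position holding a true (plagiarised) pair."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for i, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / i
    return total / hits

# Toy similarity scores for six code pairs (1 = plagiarised, 0 = not).
scores = [0.91, 0.35, 0.88, 0.72, 0.20, 0.65]
labels = [1,    1,    0,    1,    0,    0]
print(f"AUROC={auroc(scores, labels):.3f}  AP={average_precision(scores, labels):.3f}")
```

Because both measures depend only on the ordering of scores, no detection threshold has to be chosen, which is what makes the comparison across metrics and tools fair.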
Core Discoveries in Plagiarism Detection
Our analysis reveals that CEMs can achieve comparable ranking performance to dedicated SCPDTs, especially with preprocessing. CrystalBLEU and FusionTop3 demonstrated strong performance, often surpassing Dolos and JPlag on pooled preprocessed datasets. However, all methods struggled with complex plagiarism levels (L4-L6), indicating a need for metrics that capture deeper semantic changes.
Strategic Implications for Software Engineering Education
CEMs are valuable for screening and filtering potentially plagiarised pairs in educational settings, but final judgments still require instructor review due to false positives and limitations in complex cases. A recommended workflow involves preprocessing, using strong ranking methods like CrystalBLEU to flag suspicious pairs, and then confirming with dedicated plagiarism tools and manual inspection. Future efforts should focus on combining complementary metrics and addressing higher-level semantic plagiarism.
Pooled Ranking Performance (Preprocessed Datasets)
| Metric/Tool | AUROC (Pooled) | AP (Pooled) |
|---|---|---|
| FusionTop3 | 0.882 | 0.862 |
| CrystalBLEU | 0.879 | 0.865 |
| Dolos | 0.864 | 0.842 |
| CodeBLEU | 0.843 | 0.822 |
| RUBY | 0.839 | 0.826 |
| JPlag | 0.777 | 0.762 |
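One plausible reading of score-level fusion such as FusionTop3 is averaging the normalised per-pair scores of the three strongest metrics. The sketch below illustrates that idea with hypothetical scores; it is not the paper's exact fusion recipe, and the metric values are invented.

```python
def minmax(xs):
    """Rescale a list of scores to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def fuse(*metric_scores):
    """Average min-max-normalised scores across metrics, pair by pair."""
    normed = [minmax(m) for m in metric_scores]
    return [sum(col) / len(col) for col in zip(*normed)]

# Hypothetical per-pair scores from three metrics over three code pairs.
crystal  = [0.82, 0.40, 0.75]
codebleu = [0.78, 0.35, 0.70]
ruby     = [0.80, 0.42, 0.68]
fused = fuse(crystal, codebleu, ruby)
print(fused)
```

Averaging after normalisation lets metrics with different score ranges contribute equally, which is one way complementary metrics can outperform any single one.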
All evaluated methods, including both CEMs and dedicated SCPDTs, experienced a significant drop in detection performance from plagiarism level L4 onwards. This highlights the inherent difficulty in identifying structural and semantic modifications in source code.
CodeBERTScore consistently generated very high similarity scores (above 0.99) for both plagiarised and non-plagiarised pairs, making differentiation extremely difficult. This indicates a strong surface bias and limited ability to capture semantic nuances.
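A tiny illustration of why saturation matters: with hypothetical near-constant scores, sorting by score interleaves the two classes, so no threshold and no ranking can separate plagiarised from non-plagiarised pairs.

```python
# Hypothetical CodeBERTScore-like scores, all above 0.99, paired with labels
# (1 = plagiarised, 0 = not). Sorting by score interleaves the classes.
pairs = [(0.9979, 1), (0.9981, 0), (0.9980, 1), (0.9978, 0)]
ranked_labels = [label for _, label in sorted(pairs, reverse=True)]
print(ranked_labels)  # prints [0, 1, 1, 0]: classes interleave in the ranking
```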
Recommended Workflow for Academic Integrity
For educational institutions, a practical workflow for using CEMs involves:

1. Preprocessing code to normalize it.
2. Employing strong ranking methods like CrystalBLEU to flag suspicious pairs.
3. Final confirmation using dedicated plagiarism tools and manual instructor review for complex cases.

CEMs serve as an effective initial screening layer.
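The first two steps of this workflow can be sketched as below. The comment stripping and the Jaccard token-overlap similarity are illustrative stand-ins (not CrystalBLEU itself), and the function names, threshold, and demo submissions are hypothetical.

```python
import re

def preprocess(src: str) -> list[str]:
    """Normalise a source file: drop C-style comments, lowercase, tokenise."""
    src = re.sub(r"/\*.*?\*/", " ", src, flags=re.S)  # block comments
    src = re.sub(r"//[^\n]*", " ", src)               # line comments
    return re.findall(r"[A-Za-z_]\w*|\S", src.lower())

def similarity(a: list[str], b: list[str]) -> float:
    """Jaccard overlap of token sets (illustrative stand-in metric)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def flag_pairs(submissions: dict[str, str], threshold: float = 0.8):
    """Return submission pairs whose similarity meets the flagging threshold."""
    toks = {name: preprocess(code) for name, code in submissions.items()}
    names = sorted(toks)
    return [(x, y) for i, x in enumerate(names) for y in names[i + 1:]
            if similarity(toks[x], toks[y]) >= threshold]

demo = {
    "alice.c": "int main() { return 0; }  // my solution",
    "bob.c":   "int main() { return 0; }",
    "carol.c": "float twice(float x) { return x * 2.0f; }",
}
print(flag_pairs(demo))
```

Flagged pairs would then pass to step 3: confirmation with a dedicated SCPDT and manual instructor review.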
Impact: Enhance fairness and academic integrity by effectively identifying potential plagiarism, especially at lower modification levels, while acknowledging the need for human oversight for semantic changes.
Your AI Implementation Roadmap
Our phased approach ensures a smooth and effective integration of AI, maximizing your return on investment with minimal disruption.
Phase 1: Discovery & Strategy
In-depth analysis of your current workflows, identification of AI opportunities, and development of a tailored implementation strategy.
Phase 2: Pilot Program Development
Building and testing a small-scale AI pilot to validate functionality, gather feedback, and demonstrate initial value.
Phase 3: Full-Scale Integration
Seamless deployment of AI solutions across your enterprise, including data migration, system integration, and user training.
Phase 4: Optimization & Scaling
Continuous monitoring, performance optimization, and strategic scaling of AI capabilities to new areas of your business.
Ready to Transform Your Enterprise with AI?
Book a complimentary 30-minute consultation with our AI experts to discuss your specific challenges and how our tailored solutions can drive your business forward.