LEGALMIDM: USE-CASE-DRIVEN LEGAL DOMAIN SPECIALIZATION FOR KOREAN LARGE LANGUAGE MODEL
Revolutionizing Legal AI with LEGALMIDM
Discover how LEGALMIDM, a specialized Korean legal-domain LLM, sets new benchmarks in precision and utility for AI-assisted legal workflows, grounded in real-world use cases.
Quantifiable Impact of Domain Specialization
LEGALMIDM's innovative use-case-driven approach delivers superior performance across critical legal tasks and maintains robust general domain capabilities.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Domain Adaptation: Strategies for specializing general LLMs for the legal domain.
Strategies for specializing general LLMs for the legal domain.
| Model | AVG (R-L) |
|---|---|
| LEGALMIDM-11B | 41.06 |
| Qwen2.5-32B | 30.64 |
| Llama3.3-70B | 25.61 |
| Gemma-2-27b | 27.75 |
The Challenge of Practical Utility in Legal AI
Existing domain-specialized LLMs often lack alignment with nuanced real-world legal application requirements, especially where precision and reliability are essential. This limits their practical utility and necessitates a use-case-driven framework, which LEGALMIDM addresses directly.
"precision and reliability are essential, this lack of consideration limits practical utility."
Data Curation: Methods for constructing high-quality, use-case-driven legal datasets.
Methods for constructing high-quality, use-case-driven legal datasets.
Human-Curated Data Composition Process
Leveraging Written Law for Automatic QA Generation
A distinctive feature of the legal domain is the presence of clear statutory references. LEGALMIDM leverages GPT-4o to generate questions, answers, and specific references from Korean legal statutes, ensuring factual grounding through a verification step.
"Each generated QA pair is factually grounded in the provided legal text."
Training Pipeline: Optimized training protocols including CPT, IT, and prompt optimization.
Optimized training protocols including CPT, IT, and prompt optimization.
LEGALMIDM Training Pipeline Stages
| Training Stage | Data Composition | Legal Task Performance Impact |
|---|---|---|
| CPT | Legal Only | Lower Adaptability (Catastrophic Forgetting Risk) |
| CPT | Legal + General | Superior Performance & Generalization |
| IT | Legal Only | Lower Average Results |
| IT | Legal + General | Better Average Results On Average |
| Variation | Doc-based (R-L) | Open QA (R-L) | MC (Acc) |
|---|---|---|---|
| Q ⇒ A (No Ref) | 45.83 | 14.58 | 0.64 |
| Q ⇒ A + Ref (Ref in Output) | 47.53 | 16.80 | 0.56 |
| Q + Ref ⇒ A (Ref in Input) | 46.89 | 17.74 | 0.65 |
Base Model: Mi:dm-2.0-Base Foundation
LEGALMIDM leverages Mi:dm-2.0-Base, a proprietary Korean-English bilingual 11.5B language model from KT, which is pre-trained on high-quality Korean and English data, ensuring strong foundational understanding of cultural contexts and a substantial 32K context length.
"Korea-centric LLM, trained on high-quality Korean and English data to understand Korean cultural contexts, and features a 32K context length."
Evaluation & Results: Benchmarking LEGALMIDM against state-of-the-art LLMs.
Benchmarking LEGALMIDM against state-of-the-art LLMs.
| Model | Complaint | Summary | Petition | QA | MRC | MC | AVG |
|---|---|---|---|---|---|---|---|
| LEGALMIDM-11B | 67.67 | 47.94 | 14.46 | 17.74 | 57.50 | 0.65 | 41.06 |
| Qwen2.5-32B | 58.81 | 30.76 | 14.08 | 15.70 | 33.86 | 0.26 | 30.64 |
| Llama3.3-70B | 53.40 | 30.30 | 9.33 | 12.23 | 22.77 | 0.45 | 25.61 |
| Gemma-2-27b | 51.61 | 32.37 | 11.17 | 13.51 | 30.09 | 0.40 | 27.75 |
| EXAONE-3.5-32B | 54.29 | 25.47 | 11.28 | 14.98 | 30.60 | 0.27 | 27.32 |
Validation of Use-Case-Driven Methodology
Ablation studies robustly confirm the effectiveness of each component of LEGALMIDM's training strategy: integrating general domain data, formatting references in the input, and using system prompts only during inference. This approach leads to superior performance in legal tasks while maintaining strong general domain capabilities.
"confirming the effectiveness of our methodology."
Advanced AI ROI Calculator
Estimate the potential annual savings and hours reclaimed by implementing enterprise AI solutions tailored to your business.
Your AI Implementation Roadmap
A structured approach to integrate LEGALMIDM into your legal operations, from initial assessment to full-scale deployment.
Phase 1: Use-Case Definition & Data Curation
Collaborate with legal professionals to identify high-demand tasks and construct human-curated, use-case-driven datasets.
Phase 2: Automated Data Generation & Pre-training
Leverage written law for synthetic data creation and perform continual pre-training on a mix of legal and general domain data.
Phase 3: Instruction-Tuning & Prompt Optimization
Refine the model with instruction-tuning using mixed datasets and optimize system prompts for inference.
Phase 4: Comprehensive Evaluation & Deployment
Rigorously benchmark LEGALMIDM against state-of-the-art LLMs on both legal and general tasks, then prepare for deployment.
Ready to Transform Your Legal Operations?
Discuss how LEGALMIDM can be tailored to your specific enterprise needs and start building a more efficient, precise legal workflow.