What is the multi-agent LLM framework for HTS classification?

The system combines multi-agent retrieval, evidence grounding, and voting to assign HTS codes to products. It automatically escalates a prediction to human review when confidence drops below a threshold.

Why is this framework critical for smart port operations?

Precise classification reduces customs delays and compliance costs. Interpretable reasoning, backed by cited tariff text, lets compliance staff audit each decision and makes the system easier to trust in port operations.

What limitations should operators monitor during deployment?

Accuracy drops on fine-grained statistical suffixes. Operators should set confidence thresholds and keep humans in the loop for edge cases. The open-source codebase is available for inspecting and adapting the workflow.

A Multi-Agent LLM Framework Based on Consensus Mechanisms: A New Paradigm for HTS Code Classification in Smart Ports

This paper addresses the complex challenge of Harmonized Tariff Schedule (HTS) code classification in maritime logistics by proposing a multi-agent collaborative LLM framework. HTS classification is particularly difficult because product descriptions are short and ambiguous, while the tariff schedule imposes a strict hierarchical structure and binding legal notes. The framework integrates multi-agent information retrieval, semantic retrieval from official tariff documents, evidence-grounded reasoning, and a consensus verification mechanism to classify products into Canadian 10-digit HTS codes. Experiments on 3,300 domain-expert-annotated samples reveal that even with advanced LLMs, prediction performance degrades significantly from coarse-grained chapters to fine-grained statistical suffixes. The study demonstrates that fully autonomous single-step prediction falls short of compliance requirements, while incorporating uncertainty awareness, evidence grounding, and a human-AI consensus workflow substantially improves both interpretability and regulatory compliance, providing robust technical support for smart port operations.

Background and Context

In maritime logistics and smart port operations, accurate classification of Harmonized Tariff Schedule (HTS) codes underpins customs clearance, tariff assessment, and regulatory compliance. The process is not merely an administrative formality; it is also a critical input to global trade statistics and legal adherence. However, the task presents severe practical challenges that traditional automated systems have struggled to resolve. Product descriptions provided by shippers are often brief, incomplete, or inherently ambiguous, lacking the technical specificity required for precise categorization. Despite the vagueness of these inputs, the correct HTS code depends heavily on a complex hierarchical structure, obscure legal notes, and jurisdiction-specific rules that vary significantly across trade agreements. In the Canadian context, for instance, the 10-digit HTS code demands a level of granularity that goes beyond general product identification, requiring an understanding of statistical suffixes and specific material compositions.
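To make the hierarchy concrete, the sketch below splits a 10-digit code into its nested levels. It assumes the usual layout of Canadian classification numbers (six digits for the international HS subheading, two more for the tariff item, and two for the statistical suffix). The names `HTSCode` and `parse_hts` are illustrative and are not taken from the paper or its repository.

```python
from typing import NamedTuple


class HTSCode(NamedTuple):
    """Nested levels of a 10-digit code (hypothetical helper type)."""
    chapter: str             # digits 1-2
    heading: str             # digits 1-4
    subheading: str          # digits 1-6
    tariff_item: str         # digits 1-8
    statistical_suffix: str  # digits 9-10 only


def parse_hts(code: str) -> HTSCode:
    """Strip separators such as dots and split the code into its levels."""
    digits = "".join(ch for ch in code if ch.isascii() and ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}: {code!r}")
    return HTSCode(
        chapter=digits[:2],
        heading=digits[:4],
        subheading=digits[:6],
        tariff_item=digits[:8],
        statistical_suffix=digits[8:],
    )


if __name__ == "__main__":
    # The digit string is only a sample input for the parser.
    print(parse_hts("8471.30.00.10"))
```

Each level is a prefix of the next, so an error at the chapter level invalidates everything below it, while an error in the last two digits leaves the duty-relevant levels intact.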

Traditional methods for HTS classification have largely relied on rule-based systems or keyword matching, which fail to handle the semantic richness and contextual nuance of modern supply chain data. These legacy approaches are brittle when faced with non-standardized product descriptions or novel goods that do not fit neatly into predefined categories. The complexity arises from the intersection of natural language semantics and rigid legal frameworks. A single word in a product description can drastically alter the applicable tariff rate, yet the surrounding context may be missing or misleading. This gap between the ambiguity of human language and the precision of legal codes creates a significant bottleneck in port operations, leading to delays, increased compliance costs, and potential legal liabilities for importers and logistics providers.

To address these persistent challenges, recent research has introduced a Large Language Model (LLM) framework based on multi-agent collaboration, specifically designed for the classification of Canadian 10-digit HTS codes. This framework moves away from the conventional paradigm of single-model end-to-end prediction, which often suffers from hallucinations and a lack of transparency. Instead, it constructs a workflow that integrates multi-agent information retrieval, semantic search of official tariff documents, evidence-grounded reasoning, and a consensus verification mechanism. The core objective is to simulate the rigorous review process of human customs experts, thereby enhancing both the accuracy and interpretability of classifications in complex regulatory scenarios. By breaking the classification task into manageable, verifiable steps, the framework aims to provide a robust technical solution for handling long-tail and ambiguous product descriptions.

Deep Analysis

The technical architecture of the proposed framework is characterized by a sophisticated multi-agent collaboration structure that avoids the pitfalls of black-box prediction. The process begins with a multi-agent information retrieval phase, where various agents are deployed to extract relevant features from massive amounts of unstructured data associated with the product. This initial stage ensures that all available contextual information is gathered before any classification decision is made. Following this, the system employs semantic retrieval techniques to query an official tariff document repository. This step is crucial for locating precise legal notes and chapter explanations, ensuring that the basis for classification is authoritative and legally sound. By grounding the retrieval in official sources, the framework minimizes the risk of relying on outdated or incorrect external knowledge.
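The retrieval step can be illustrated with a minimal sketch. The article does not specify the embedding model or index the authors use, so this example swaps in plain token-count cosine similarity as a stand-in for semantic retrieval. The passages are invented toy strings, not real tariff text, and `retrieve` is a hypothetical function name.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    """Return the k passage ids most similar to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), pid) for pid, text in passages.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, id breaks ties
    return [(pid, score) for score, pid in scored[:k]]


# Invented placeholder passages standing in for an official tariff repository.
TOY_PASSAGES = {
    "note-A": "portable computers weighing not more than 10 kg with keyboard and display",
    "note-B": "cotton t-shirts knitted or crocheted",
    "note-C": "stainless steel kitchen knives with fixed blades",
}

if __name__ == "__main__":
    for pid, score in retrieve("portable laptop computer with keyboard", TOY_PASSAGES):
        print(pid, round(score, 3))
```

A production system would replace `cosine` over token counts with similarity over dense embeddings, but the interface stays the same: a product description goes in, and ranked passage identifiers that later steps can cite come out.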

A critical innovation in this framework is evidence-grounded reasoning. Unlike standard LLM applications that may generate plausible-sounding but factually incorrect outputs, this system forces the model to cite specific document snippets as support for its conclusions before generating a final classification. This mechanism significantly reduces hallucination by tethering the model's reasoning to verifiable textual evidence. Furthermore, the framework introduces a consensus verification mechanism that operates on the hierarchical components of the HTS code, such as chapters, headings, and subheadings. Instead of relying on a single prediction, the system aggregates judgments from multiple agents through element-level voting. This collective decision-making process enhances the stability of the output, particularly for fine-grained statistical suffixes, where individual model errors can have significant financial implications.
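The two mechanisms above can be sketched together. The article does not describe the exact voting rule, so this example assumes a top-down plurality vote: proposals whose quoted evidence cannot be found verbatim in the retrieved text are discarded, and the remaining agents vote level by level. `Proposal`, `is_grounded`, and `vote` are hypothetical names, and the corpus and codes are toy values.

```python
from collections import Counter
from dataclasses import dataclass

# Prefix length that defines each level of a 10-digit code.
LEVELS = (
    ("chapter", 2),
    ("heading", 4),
    ("subheading", 6),
    ("tariff_item", 8),
    ("statistical_suffix", 10),
)


@dataclass
class Proposal:
    agent: str
    code: str            # 10-digit string proposed by the agent
    evidence: list[str]  # snippets the agent claims to quote


def is_grounded(p: Proposal, corpus: str) -> bool:
    """Accept a proposal only if it cites evidence and every snippet is verbatim."""
    return bool(p.evidence) and all(snippet in corpus for snippet in p.evidence)


def vote(proposals: list[Proposal], corpus: str) -> list[tuple[str, str, float]]:
    """Plurality vote per level, restricted to agents agreeing with earlier levels.

    Returns (level, winning prefix, share of grounded agents supporting it).
    Ties go to the prefix proposed first.
    """
    grounded = [p for p in proposals if is_grounded(p, corpus)]
    result, prefix = [], ""
    for name, end in LEVELS:
        candidates = [p.code[:end] for p in grounded if p.code.startswith(prefix)]
        if not candidates:
            break
        winner, count = Counter(candidates).most_common(1)[0]
        result.append((name, winner, count / len(grounded)))
        prefix = winner
    return result


if __name__ == "__main__":
    corpus = "alpha beta gamma"  # stands in for retrieved tariff text
    proposals = [
        Proposal("agent-1", "8471300010", ["alpha"]),
        Proposal("agent-2", "8471300020", ["beta"]),
        Proposal("agent-3", "8471300010", ["gamma"]),
        Proposal("agent-4", "9999999999", ["not in corpus"]),  # ungrounded, dropped
    ]
    for level in vote(proposals, corpus):
        print(level)
```

In this toy run the agents agree fully down to the tariff item and split two to one on the statistical suffix, which is the pattern the text describes: disagreement concentrates at the finest level, and the per-level share exposes where it occurs.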

The framework also incorporates a confidence estimation module that continuously evaluates the certainty of the classification process. When the system's confidence in a predicted code falls below a predefined threshold, it automatically triggers an escalation protocol involving human intervention. This human-AI consensus workflow acknowledges the limitations of current AI capabilities in highly specialized domains. The inclusion of human oversight ensures that edge cases and high-risk predictions are reviewed by domain experts, combining the speed of AI with the nuanced judgment of humans. This layered approach, which combines hierarchical processing with collective decision-making, effectively compensates for the deficiencies of single LLMs in complex logical reasoning and fact-checking, ensuring the robustness of the final output.
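The escalation rule can be sketched as a simple router. The article does not say how the confidence estimator computes its score, so this example assumes per-level confidence values are already available and escalates whenever the weakest level falls below the threshold. The threshold of 0.8, the `Decision` type, and `route_prediction` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    code: str
    route: str          # "auto_accept" or "human_review"
    weakest_level: str  # level with the lowest confidence
    confidence: float   # confidence at that level


def route_prediction(
    code: str,
    level_confidence: dict[str, float],
    threshold: float = 0.8,  # assumed value; operators would tune this
) -> Decision:
    """Escalate to a human when any level's confidence is below the threshold."""
    if not level_confidence:
        raise ValueError("at least one level confidence is required")
    weakest = min(level_confidence, key=level_confidence.get)
    confidence = level_confidence[weakest]
    route = "human_review" if confidence < threshold else "auto_accept"
    return Decision(code, route, weakest, confidence)


if __name__ == "__main__":
    scores = {"chapter": 1.0, "heading": 0.95, "statistical_suffix": 0.6}
    print(route_prediction("8471300010", scores))
```

Reporting the weakest level alongside the route tells the reviewer which part of the code to check first, rather than asking them to re-derive the whole classification.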

Industry Impact

The empirical validation of this framework was conducted on a private dataset comprising 3,300 product records annotated by domain experts, primarily sourced from logistics and distribution scenarios. The experimental results provide critical insights into the current capabilities and limitations of advanced LLMs in regulatory compliance tasks. The analysis reveals a significant degradation in prediction performance as the granularity of the HTS code increases. While models can predict coarse-grained chapters with relatively high accuracy, their accuracy drops sharply at fine-grained tariff subdivisions and statistical suffixes. This finding underscores the difficulty of mapping ambiguous natural language descriptions to highly specific legal categories, even with state-of-the-art language models.
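One common way to measure this kind of degradation is prefix-level accuracy, where a prediction counts as correct at a level if it matches the gold code up to that level's digit position. The sketch below is an assumption about how such an evaluation could be written, not the authors' evaluation script. The codes are made-up digit strings and the printed numbers describe only this toy data, not the paper's results.

```python
# Prefix length that defines each evaluation level of a 10-digit code.
LEVELS = {
    "chapter": 2,
    "heading": 4,
    "subheading": 6,
    "tariff_item": 8,
    "full_code": 10,
}


def accuracy_by_level(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Share of predictions matching the gold code on the first n digits, per level."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and the same length")
    return {
        name: sum(p[:n] == g[:n] for p, g in zip(predicted, gold)) / len(gold)
        for name, n in LEVELS.items()
    }


if __name__ == "__main__":
    gold = ["8471300010", "6109100020", "8211910030", "8471300010"]
    pred = ["8471300010", "6109100099", "8211920030", "9403200000"]
    for level, acc in accuracy_by_level(pred, gold).items():
        print(f"{level:12s} {acc:.2f}")
```

Because each level is a prefix of the next, accuracy can only stay flat or fall as the level gets finer, which makes the size of each drop a direct measure of how much harder the additional digits are.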

Ablation studies further demonstrate the necessity of the proposed framework components. The introduction of evidence grounding and consensus verification mechanisms was shown to significantly improve the stability of fine-grained classifications. These components help mitigate the variance in model outputs and ensure that predictions are supported by concrete evidence. Additionally, the confidence estimation module proved effective in identifying high-risk prediction samples, allowing for targeted human review. These results strongly suggest that in highly specialized compliance domains, relying solely on the parametric memory of large models is insufficient to handle complex rule constraints. Instead, a hybrid approach that combines external knowledge retrieval with uncertainty management is essential for achieving reliable results.

The implications for the open-source community and industrial deployment are profound. The study provides empirical evidence that "human-AI collaboration" and "consensus mechanisms" outperform "fully autonomous" AI agents in complex compliance tasks. This supports the development of more trustworthy AI systems in regulated industries. Moreover, the strategies of evidence grounding and hierarchical voting proposed in this framework are transferable to other domains requiring strict adherence to legal or industry standards, such as financial compliance and medical diagnostic assistance. The open-sourcing of the code (https://github.com/Analytics-Everywhere-Lab/hts) facilitates technology sharing in the smart port and logistics automation sectors, encouraging further innovation and standardization in the application of AI for regulatory compliance.

Outlook

This research marks a significant shift in the application of AI within professional compliance fields, moving from "auxiliary tools" to "trusted partners." By providing a robust technical foundation for smart port operations, the framework enhances customs clearance efficiency and reduces legal risks through its interpretable reasoning processes. The ability to explain why a specific HTS code was chosen, backed by cited legal documents, is invaluable for auditors and compliance officers. This transparency builds trust in AI systems, encouraging wider adoption in critical infrastructure such as ports and logistics hubs.

Looking forward, the integration of such multi-agent frameworks into smart port ecosystems will likely accelerate the automation of trade compliance. As global trade continues to grow in complexity, the demand for real-time, accurate, and compliant classification services will increase. The framework's design, which emphasizes uncertainty awareness and human-in-the-loop workflows, provides a scalable model for handling this growing complexity. It offers a pathway to reduce the operational bottlenecks that currently hinder the speed and efficiency of global supply chains.

Furthermore, the success of this approach in the HTS classification domain suggests potential for broader applications in other areas of international trade and regulatory technology. As LLMs continue to evolve, the combination of these models with rigorous grounding and consensus mechanisms will become increasingly important for ensuring safety and reliability. The open-source nature of the project invites further research and development, potentially leading to even more sophisticated systems that can adapt to changing regulatory landscapes. Ultimately, this work contributes to the broader goal of creating more resilient, efficient, and transparent global trade systems through the responsible application of artificial intelligence.

Sources

arXiv