PLADA: A Dataset is Worth 1 MB - Pseudo-Labels for Extreme Data Compression
PLADA (Pseudo-Labels as Data) proposes a dataset transmission method that eliminates pixel transmission entirely. Assuming the receiver holds a pre-loaded, unlabeled reference dataset (e.g., ImageNet), the sender transmits only class labels for the target task (under 1 MB), enabling the receiver to train a high-accuracy model locally. A semantic-relevance pruning step selects the subset of reference images most useful for the task. Experiments on 10 diverse datasets show that a payload of less than 1 MB retains high classification accuracy, offering a novel approach to efficient dataset distribution.
Core Idea
Traditional dataset transmission requires sending complete image pixel data; ImageNet alone exceeds 100 GB. PLADA takes a different approach: if the receiver already holds abundant unlabeled images, the sender only needs to communicate which images belong to which categories.
Technical Approach
| Step | Operation | Transfer Size |
|------|------|--------|
| Prerequisite | Receiver pre-loaded with ImageNet-1K/21K (unlabeled) | 0 |
| Pruning | Select reference image subsets by semantic relevance | 0 |
| Transfer | Send only class labels for selected images | < 1 MB |
| Training | Receiver trains locally with pseudo-labels | 0 |
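The Transfer step in the table above can be sketched as a compact binary encoding. The `(uint32 index, uint16 class_id)` wire format here is an assumption for illustration, not the paper's actual encoding:

```python
import struct

# Hypothetical wire format for the Transfer step: pack (image_index, class_id)
# pairs into a compact binary payload the receiver can decode against its
# pre-loaded reference dataset.
PAIR = struct.Struct("<IH")  # uint32 index into reference set, uint16 class id

def encode(pairs):
    """Serialize (index, class_id) pairs into bytes."""
    return b"".join(PAIR.pack(i, c) for i, c in pairs)

def decode(blob):
    """Recover the (index, class_id) pairs from a payload."""
    return [PAIR.unpack(blob[o:o + PAIR.size]) for o in range(0, len(blob), PAIR.size)]

pairs = [(12_345, 7), (98_765, 42)]
blob = encode(pairs)
assert decode(blob) == pairs
print(len(blob))  # 12 bytes: 6 per labeled reference image
```

At 6 bytes per labeled image, this format fits well over 100,000 pseudo-labels inside a 1 MB budget.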
The semantic pruning mechanism is the key innovation: PLADA computes the semantic similarity between each reference image and the target task, retaining only the most relevant images. This keeps local training efficient while minimizing the transmitted payload.
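One plausible form of this pruning, sketched below under assumptions (cosine similarity between reference-image embeddings and a mean target-task embedding; the paper's actual scoring and embedding model may differ):

```python
import numpy as np

def prune(ref_emb: np.ndarray, task_emb: np.ndarray, keep: int) -> np.ndarray:
    """Return indices of the `keep` reference images most similar to the task.

    Assumed scoring: cosine similarity to the mean target-task embedding
    (an illustrative stand-in for the paper's semantic-relevance measure).
    """
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    proto = task_emb.mean(axis=0)
    proto = proto / np.linalg.norm(proto)
    scores = ref @ proto                     # cosine similarity per reference image
    return np.argsort(scores)[::-1][:keep]   # indices, most relevant first

rng = np.random.default_rng(0)
ref = rng.normal(size=(1000, 64))    # 1000 reference images, 64-d embeddings
task = rng.normal(size=(32, 64))     # 32 embedded target-task examples
idx = prune(ref, task, keep=100)
print(idx.shape)  # (100,)
```

Only the labels for the `keep` selected indices would then be transmitted; everything else on the receiver's side stays untouched.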
Experimental Results
Tested on 10 diverse datasets, PLADA achieves classification accuracy comparable to traditional methods that transmit hundreds of megabytes, with a payload of less than 1 MB. On fine-grained classification tasks, just 0.3 MB achieves over 89% accuracy.
Industry Trend Connection
PLADA offers new insights for Edge AI and model compression. In federated learning and privacy-preserving computation scenarios, avoiding the transmission of raw data inherently protects privacy, and the approach aligns with the broader trend toward efficient knowledge transfer in self-improving AI systems.
In-Depth Analysis and Industry Outlook
From a broader perspective, this development reflects the accelerating transition of AI technology from laboratories to industrial applications. Many industry analysts expect 2026 to be a pivotal year for AI commercialization. On the technical front, large-model inference efficiency continues to improve while deployment costs decline, enabling more SMEs to access advanced AI capabilities. On the market front, enterprise expectations for AI investment returns are shifting from long-term strategic value to short-term quantifiable gains.
However, the rapid proliferation of AI also brings new challenges: increasing complexity of data privacy protection, growing demands for AI decision transparency, and difficulties in cross-border AI governance coordination. Regulatory authorities across multiple countries are closely monitoring these developments, attempting to balance innovation promotion with risk prevention. For investors, identifying AI companies with truly sustainable competitive advantages has become increasingly critical as the market transitions from hype to value validation.