Abstract: To address the difficulty of recognizing subtle differences in facial biomarkers in children with autism, this work combines a Learnable Positional Encoding Enhancement (LPEE) module with an Adaptive Token Aggregation (ATA) module and proposes ViT-LPATA, a predictive model for autism. The LPEE module dynamically captures facial geometric deformation features, while the ATA module strengthens the feature representation of pathological regions, thereby establishing a precise mapping of biomarker differences. Experiments on a publicly available autism facial dataset show that ViT-LPATA achieved the best performance, with 99.2% accuracy and an AUC of 0.940.
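The abstract does not specify implementation details for the two modules. As one plausible reading, a learnable positional encoding is simply a trainable tensor added to the patch embeddings, and adaptive token aggregation can be realized as learned score-based weighted pooling over tokens. The following NumPy sketch illustrates these two ideas only; all shapes, weights, and names are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 16 patch tokens with 64-dim embeddings.
num_tokens, dim = 16, 64
tokens = rng.standard_normal((num_tokens, dim))

# Learnable positional encoding (LPEE-style idea): a trainable parameter
# added elementwise to token embeddings. Here it is randomly initialized;
# during training it would be updated by gradients like any other weight.
pos_embed = 0.02 * rng.standard_normal((num_tokens, dim))
tokens = tokens + pos_embed

# Adaptive token aggregation (ATA-style idea, one plausible form): score
# each token with a learned projection, softmax the scores across tokens,
# and pool by weight so more informative regions contribute more.
w_score = rng.standard_normal((dim, 1))
scores = tokens @ w_score                    # shape (num_tokens, 1)
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()            # softmax over tokens
pooled = (weights * tokens).sum(axis=0)      # shape (dim,)

print(pooled.shape)
```

In a full Vision Transformer the pooled representation would then feed a classification head; this sketch stops at the aggregation step.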