
Infoscience

 
doctoral thesis

Neuroscience-inspired computer vision and language modeling for behavior analysis

Ye, Shaokai  
2025

Measuring animal behaviors is a crucial task in a range of scientific applications. Modern approaches based on deep neural networks call for solutions that handle a wide range of vision and language tasks. This ubiquitous need for understanding behavior in both scientific and industrial applications, together with the difficulty of modeling behavior occurring in the wild, motivates building more generalizable and robust deep neural networks. Accordingly, this dissertation focuses on the following key problems: How do we train deep neural networks to generalize across different data domains? How do we train neural networks, or neural network-based systems, to perform a wide range of vision and language tasks? How do we enable a model to adapt and learn at inference time? More specifically, can we give models some level of adaptive intelligence, either by learning from interactions with users and data, or via dynamic computation such as in-context learning? Toward this goal, my aim was to draw inspiration from both progress in artificial intelligence and insights from neuroscience. In Chapter 2, we introduce methods to merge heterogeneous animal pose datasets, along with training algorithms that help the model counter domain shift and catastrophic forgetting. We show that the proposed methods give rise to the first animal pose foundation model whose zero-shot performance is comparable to that of a fully trained model on downstream tasks. In Chapter 3, we report the first keypoint-aware, unsupervised learning approach with transformers for re-identification of animals. To explore the growing utility of large language models (LLMs), in Chapter 4 we propose the first LLM-based agentic system that uses pre-trained deep neural networks and a Python API as tools and that dynamically learns and adapts at inference time. This system is built with principles inspired by neuroscience at its core, and helps advance the use of LLM-based agents to automate behavior analysis with frontier computer vision and large language models. Finally, in Chapter 5, we propose novel methods to evaluate and improve the ability of multi-modal large language models (MLLMs) to recognize challenging human actions that occur in daily life.

Type
doctoral thesis
DOI
10.5075/epfl-thesis-11066
Author(s)
Ye, Shaokai  

École Polytechnique Fédérale de Lausanne

Advisors
Mathis, Mackenzie  
Jury

Prof. Volkan Cevher (president); Prof. Mackenzie Mathis (thesis director); Prof. Alexandre Alahi, Prof. Siyu Tang, Prof. Sara Beery (examiners)

Date Issued

2025

Publisher

EPFL

Publisher place

Lausanne

Public defense date

2025-06-12

Thesis number

11066

Number of pages

258

Subjects

neuroscience • computer vision • large language models • multi-modal large language models • behavior analysis • pose estimation

EPFL units
UPMWMATHIS  
Faculty
SV  
School
BMI  
Doctoral School
EDEE  
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/250866

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved