AStats

AStats: an agentic-AI approach to applied statistical practitioner workflows

Mentors: Jonathan Morris suresh.krishna@mcgill.ca, Yohai-Eliel Berreby suresh.krishna@mcgill.ca, Suresh Krishna suresh.krishna@mcgill.ca

Skill level: Intermediate – Advanced

Required Skills: Familiarity with agentic AI workflows and the use of LLMs. Familiarity with statistical practice at a moderately advanced level is a plus. Familiarity with setting up and using open-weight LLMs and with fine-tuning LLMs is a plus. Familiarity with Slurm and working with clusters is preferred.

Time commitment: Full time (350 hours)

Forum for discussion

About: Informal use and much anecdotal evidence suggest that the most recent LLMs, accessed via agentic AI coding systems, are now capable of exploring large datasets under human supervision and guidance. Both exploratory and confirmatory analyses appear to be possible, with results presented to the practitioner for verification. The A in AStats could stand for autonomous, augmented, automatic, applied, etc.

Aims: This is a new project that the GSoC contributor will start from scratch, with help and mentorship from us. We have had good success in the past with such an approach, with successful projects going on to second and third years for additional development, and contributors from one year joining in as mentors for the following year. The project will explore and define good practices for robust workflows that incorporate agentic AI into statistical exploration and practice. Practitioners already often use recipe-driven methods (e.g. JASP, Jamovi) to guide their use of statistical tools in familiar contexts. A major focus will be the automatic exploration of large datasets, along with the possibility of fine-tuning workflows or even models, and of using open-weight models to reduce cost, customize usage, and make workflows more predictable.

Approach: Areas of enquiry include exploring suitable CLI front-ends and building harnesses, as well as pipelines that combine different styles of interacting with LLMs, search, and local file storage. Both commercial models and locally installed small and large models will be useful to work with. The emphasis will be on Python and R: R is more statistically sophisticated but the models know it less well, and Python is the opposite. The goal is to start with simple simulated datasets and build a harness that begins with data auto-discovery and summarization, then examines and validates the ability to perform increasingly complex analyses.
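As an illustration of the first step described in the approach, here is a minimal sketch of what "simulated dataset plus auto-discovery and summarization" could look like in plain Python. Everything in it is an assumption made for illustration: the function names (`simulate_dataset`, `infer_column_type`, `summarize`), the column names, and the size of the simulated effect are hypothetical and are not part of the AStats repository. No LLM is involved; the point is only to show the kind of compact, machine-readable summary that a harness might hand to an agent before it proposes an analysis.

```python
import random
import statistics


def simulate_dataset(n_rows=200, seed=0):
    """Simulate a two-group dataset with a known effect (hypothetical schema)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        group = "control" if i % 2 == 0 else "treatment"
        # Known ground truth (an assumption): treatment shifts the mean by +0.5.
        shift = 0.5 if group == "treatment" else 0.0
        rows.append({
            "subject_id": i,
            "group": group,
            "score": rng.gauss(10.0 + shift, 1.0),
        })
    return rows


def infer_column_type(values):
    """Classify a column as 'numeric' or 'categorical' from its non-missing values."""
    present = [v for v in values if v is not None]
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    if present and all(is_number(v) for v in present):
        return "numeric"
    return "categorical"


def summarize(rows):
    """Return a per-column summary that could be shown to an agent or a practitioner."""
    summary = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "type": infer_column_type(values),
            "n": len(values),
            "n_missing": len(values) - len(present),
            "n_unique": len(set(present)),
        }
        if info["type"] == "numeric" and len(present) >= 2:
            info["mean"] = statistics.fmean(present)
            info["sd"] = statistics.stdev(present)
            info["min"] = min(present)
            info["max"] = max(present)
        elif info["type"] == "categorical":
            # Count how often each level occurs.
            counts = {}
            for v in present:
                counts[v] = counts.get(v, 0) + 1
            info["levels"] = counts
        summary[col] = info
    return summary


data = simulate_dataset()
for name, info in summarize(data).items():
    print(name, info)
```

Because the simulated effect is known in advance, a later analysis step can be checked against it. Note also that this sketch summarizes `subject_id` as a numeric column even though its mean is meaningless; flagging identifier-like columns is one example of the validation a real harness would likely need.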

Project website: https://github.com/m2b3/AStats

Tech keywords: Agentic AI, Statistics, Data science, Python, PyTorch, Visual search, Saliency, Science portals, Vision AI, Vision-language models.
