DATA SCIENCE SESSIONS :: A Foundational Python Data Science Course

This code repository encompasses the material (data sets, notebooks, labs, task lists) for DataKolektiv's DATA SCIENCE SESSIONS VOL. 03 :: A Foundational Python Course in Data Science.

About the course

Welcome to the introductory course on Data Science with Python! In this course, we will delve into the fascinating world of data analysis, manipulation, and modeling using the powerful programming language Python. Originally developed in R, this course has been revamped and rewritten specifically for Python, catering to individuals who are eager to harness the potential of this versatile language in the field of data science.

The primary objective of this course is to equip you with a comprehensive set of skills and knowledge required to tackle real-world data challenges. Throughout the course, we will cover a wide range of supervised learning topics that will lay a strong foundation for your journey in data science. Our focus will be on practical applications, and you will gain hands-on experience by working on various datasets and projects.

The course is designed to achieve several key goals:

We will explore techniques for handling diverse data sources, such as relational databases and flat files like CSV.
You will learn how to effectively preprocess and clean datasets, ensuring that the data is in a suitable format for analysis.
Additionally, we will delve into data visualization, enabling you to present insights and patterns in a visually appealing and understandable manner.
Another vital aspect of data science is statistical analysis. You will acquire a solid understanding of both inferential and descriptive statistics, enabling you to draw meaningful conclusions from data and make informed decisions.
Moreover, we will delve into the art of reporting, as effective communication of findings and results is essential in data-driven decision-making processes.
Lastly, we will explore the exciting realm of machine learning. You will gain a firm grasp of various machine learning models and algorithms, understanding their strengths, weaknesses, and practical implementation.

By the end of the course, you will have the skills necessary to build and evaluate predictive models, allowing you to extract valuable insights and make accurate predictions from data.

Join us on this exhilarating journey into the world of Data Science with Python, and unlock the power to extract knowledge, uncover patterns, and make data-driven decisions that drive success.

Tech stack used for this course

Python: Python is a high-level programming language known for its simplicity and readability, making it an ideal choice for data analysis and manipulation tasks in the field of data science.

NumPy: NumPy is a fundamental library for scientific computing in Python, providing powerful tools for efficient numerical operations, array manipulation, and mathematical functions, making it a cornerstone for data analysis and machine learning workflows.

pandas: pandas is a versatile and user-friendly data manipulation library in Python, offering powerful data structures and functions for cleaning, transforming, and analyzing structured data, allowing for seamless data exploration and preprocessing.

scikit-learn (sklearn): sklearn is a widely used machine learning library in Python, offering a rich set of algorithms and tools for tasks such as classification, regression, clustering, and dimensionality reduction, empowering data scientists to build robust and scalable machine learning models.

scipy: scipy is a comprehensive library for scientific computing and advanced mathematics in Python, providing a wide range of functions and tools for optimization, integration, interpolation, signal processing, and more, making it an essential resource for data scientists working with complex data analysis tasks.

matplotlib: matplotlib is a popular plotting library in Python, enabling the creation of high-quality, customizable visualizations for data exploration and presentation, making it an invaluable tool for communicating insights and patterns in data.

seaborn: seaborn is a powerful data visualization library built on top of matplotlib, offering a higher-level interface and a variety of statistical visualizations, allowing for the creation of attractive and informative plots with minimal effort.

plotly: plotly is an interactive and dynamic visualization library for Python, providing a wide range of chart types and interactive features, enabling the creation of engaging and interactive visualizations that can be shared and explored in web-based environments.

Lecturers

Goran S. Milovanović, PhD.
He studied mathematics, philosophy, and psychology at the University of Belgrade and New York University (NYU), where he obtained a doctorate in psychology. A programmer since the age of ten, he published his first scientific paper at the age of twenty, collaborated with top cognitive scientists, and had papers cited in Stevens' Handbook of Experimental Psychology. With several years of experience in analytics and machine learning on some of the world's largest data systems (Wikidata), he served on the program committees of European conferences in Data Science. As an independent consultant, in collaboration with American educational startups and DataKolektiv, he has trained dozens of individuals for work in Data Science and ML.

Aleksandar Cvetković, PhD. He completed his doctoral studies in applied mathematics in Italy (GSSI - L'Aquila, SISSA - Trieste), conducting scientific research in the field of control and optimization. He is the author and co-author of several scientific papers published in leading international journals. With years of experience in the ML industry and education, focusing on machine learning, 3D computer vision, graph neural networks, and accelerating ML algorithms on hardware, he is currently working in the gaming industry.

Ilija Lazarevic, MSc. He completed his master's studies in computer science at the University of Kragujevac and was involved in teaching the subject of Operating Systems and Computer Networks. With years of experience as a software engineer and machine learning engineer, he specializes in (a) automatic pricing assessment in the real estate domain in the international market and (b) developing recommendation systems in the domain of casino games.

Curriculum

Session 00. Preparing for the course
- Installation of Python 3.8
- Creating local virtual environment + libraries installation
- Installing Visual Studio Code IDE
- Installing MariaDB
- Installing git on local machine
Session 01. Python for Data Science: Basics + Intuitive Understanding of Python
- Python as a calculator
- Elementary math in Python
- Brief intro into Python data types: numerics, sequences, sets, mappings
- Strings and basic string functions
- Brief Pandas DataFrame and Series intro
Session 02. Fundamental Data Structures and Classes. Subsetting a Pandas DataFrame
- Different ways of creating DataFrame and Series
- Slicing with loc and iloc
- Basic visualizations in Pandas
Session 03. Control Flow + Functions. Defensive programming. Pandas: I/O operations + apply, filter, groupby, agg
- Control flow in Python: if, else, for, while, continue, break
- List and dictionary comprehensions
- Pandas I/O, reading and writing files
- Pandas transformations and aggregations: apply, filter, groupby, agg
Session 04. Numpy: basic vector arithmetic, linear algebra, and broadcasting.
- Vectorization
- Array and Matrix subsetting
- Most used NumPy functions
- Algebraic array and matrix functions
- Broadcasting
- Simple Linear Regression model in NumPy
- NumPy and Pandas
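
To give a flavor of the Session 04 topics, here is a minimal sketch of broadcasting and a closed-form simple linear regression in NumPy. The data and variable names are invented for illustration and are not taken from the course notebooks.

```python
import numpy as np

# Toy data, invented for illustration: y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Broadcasting: the scalar means are stretched across the whole arrays
x_centered = x - x.mean()
y_centered = y - y.mean()

# Closed-form least-squares estimates for y = intercept + slope * x
slope = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Broadcasting a (3, 1) column against a (3,) row yields a (3, 3) grid
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
grid = col + row

print(slope, intercept)
print(grid)
```

The vectorized expressions replace explicit Python loops, which is the main point of the session.
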
Session 05. Strings in Python
- Python functions and arithmetic with Strings
- Most used String functions
- String encodings
- Formatting strings in Python
Session 06. Exploratory Data Analysis - EDA
- Cleaning data
- Working with missing values
- Parsing date and date/time data
- Data exploration
- Contingency tables
- Visualizations using Matplotlib and Seaborn
- Descriptive statistics and boxplot
- Heatmaps for different visualizations
Session 07. Relational Structure + Pivot
- Joining Pandas DataFrames
- Left and right joining
- Inner and outer joining
- Semi-join and anti-join
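
pandas has no dedicated semi-join or anti-join method, so these two are usually built from other operations. The sketch below shows one common way to do it, using hypothetical `customers` and `orders` tables invented for illustration; it is not necessarily how the session notebooks approach it.

```python
import pandas as pd

# Hypothetical tables, invented for illustration
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                          "name": ["Ana", "Boris", "Ceca", "Dejan"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [20.0, 35.5, 12.0]})

# Semi-join: customers with at least one order, no columns from orders added
has_order = customers["customer_id"].isin(orders["customer_id"])
semi = customers[has_order]

# Anti-join: customers with no orders at all
anti = customers[~has_order]

# The same anti-join through merge(): indicator=True adds a "_merge" column
merged = customers.merge(orders[["customer_id"]].drop_duplicates(),
                         on="customer_id", how="left", indicator=True)
anti_via_merge = merged.loc[merged["_merge"] == "left_only",
                            ["customer_id", "name"]]

print(semi)
print(anti)
```

The `isin` form keeps one row per customer even when a customer has several orders, which is what distinguishes a semi-join from an inner join.
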
Session 08. Intro to Probability Theory. Experimental and Theoretical Probability. Discrete Random Variables
- Experimental probability
- Sigma algebra of events and theoretical probability
- Kolmogorov axioms
- Concept of random variable
- Discrete random variables
- Bernoulli, Binomial, Geometric, discrete uniform and Poisson distributions
Session 09. Intro to Probability Theory. Continuous Distribution. Chi-square test
- Continuous random variables
- Continuous uniform, exponential and normal distribution
- Mathematical expectation and variance
- Chi-square distribution and Chi-square test
Session 10. Relational Databases and Pandas
- Connecting to relational database - MariaDB
- Learning most used SQL keywords and writing queries
- Working with math functions
- Working with strings
- Working with dates
- Joining tables
- Performing aggregations
- Visualizing query results
- Difference between HAVING and WHERE
- SQL query execution order
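
The difference between WHERE and HAVING follows from the execution order: WHERE filters rows before grouping, HAVING filters groups after aggregation. The sketch below illustrates this with Python's built-in `sqlite3` module instead of MariaDB, purely so that it runs without a database server; the `sales` table and its values are hypothetical.

```python
import sqlite3

# In-memory SQLite database (an assumption; the course itself uses MariaDB)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 50), ("north", 200), ("south", 30), ("south", 40), ("east", 500)],
)

# WHERE filters individual rows before they are grouped
where_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE amount >= 40 GROUP BY region ORDER BY region"
).fetchall()

# HAVING filters whole groups after the aggregate is computed
having_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region HAVING SUM(amount) >= 100 ORDER BY region"
).fetchall()
conn.close()

print(where_rows)
print(having_rows)
```

In the first query the south region survives with a reduced total because only one of its rows passes the filter; in the second it is dropped entirely because its full total is below the threshold.
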
Session 11. Conditional Probability. Multivariate Random Variables. Bias and Variance.
- Conditional probability
- Law of total probability
- Bayes theorem
- Discrete and continuous multivariate random variables
- Sampling mean and standard error
- Statistical bias and variance
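
The law of total probability and Bayes' theorem from Session 11 can be sketched in a few lines of plain Python. The probabilities below describe a hypothetical diagnostic test and are chosen only for illustration.

```python
# Hypothetical numbers for a diagnostic test, chosen for illustration
p_disease = 0.01             # prior P(D)
p_pos_given_disease = 0.95   # sensitivity P(+ | D)
p_pos_given_healthy = 0.05   # false positive rate P(+ | not D)

# Law of total probability: P(+) = P(+|D) P(D) + P(+|not D) P(not D)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(D | +) = P(+|D) P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_pos, 4), round(p_disease_given_pos, 4))
```

Even with a fairly accurate test, the posterior probability stays low here because the prior is small.
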
Session 12. Statistical Hypothesis Testing: Chi-square test + t-test. The Central Limit Theorem. Covariance and Correlation.
- Chi-square distribution and related test
- Student's t-distribution
- t-test sample mean vs. constant
- t-test and two sample means
- t-test for independent measures
- Central limit theorem
- Covariance and correlation
Session 13. Simple Linear Regression. Estimation Theory continued: the Parametric bootstrap.
- Simple linear regression
- Linear regression using statsmodels
- Linear regression using sklearn
- Parametric bootstrap
Session 14. Partial and Part Correlation. Multiple Linear Regression.
- Partial and part correlation
- Multiple linear regression
- Multicollinearity
- Variance inflation factor
- MLR using statsmodels and sklearn
Session 15. Regularization in MLR. The Maximum Likelihood Estimation (MLE).
- Problem of overfitting
- Regularization in multiple linear regression
- Ridge and Lasso regularization
- Maximum likelihood estimation - MLE
- MLE in linear regression
Session 16. Generalized Linear Models I. Binomial Logistic Regression and its MLE. Multinomial Regression.
- Binomial logistic regression - BLR
- Interpreting coefficients in BLR
- Akaike information criterion - AIC
- Binomial logistic regression using statsmodels and sklearn
- MLE for BLR
- Multinomial regression
Session 17. Generalized Linear Models II. Binomial Logistic Regression and ROC analysis. Regularization of BLR.
- Binomial logistic regression and ROC analysis
- Akaike information criterion
- Confusion matrix
- ROC AUC
- Regularization of BLR (ridge, lasso and elastic net)
Session 18. Generalized Linear Models III. Multinomial Logistic Regression. Regularization of MNR.
- Multinomial logistic regression
- Regularization of MNR (ridge, lasso and elastic net)
- Cross-validation in regularization
Session 19. Generalized Linear Models IV. Poisson Regression. Zero-Inflated Poisson Regression. Negative Binomial Regression.
- Poisson regression
- Regularization of Poisson regression
- Overdispersion
- Zero-inflated regression
- Negative binomial regression
Session 20. Introduction to Decision Trees for classification and regression problems.
- Information theory
- Information and probability
- Information content
- Information entropy
- Information gain
- Gini impurity
- Gini gain
- Regression and reduction of variance
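
The impurity measures listed for Session 20 can be sketched in plain Python. The function names and the toy split below are hypothetical and meant only to show how entropy, Gini impurity, and the corresponding gains relate to each other.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted


# Toy split, invented for illustration: 10 labels divided into two branches
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

info_gain = gain(parent, [left, right], entropy)
gini_gain = gain(parent, [left, right], gini)
print(round(info_gain, 4), round(gini_gain, 4))
```

Passing the impurity function as an argument lets the same `gain` helper serve for both information gain and Gini gain.
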
Session 21. Decision Trees regularization and cost complexity pruning
- Decision trees regularization
- Pre-pruning
- Post-pruning
- Feature importance
Session 22. Random Forest classification and regression
- Random Forests: the algorithm
- Bootstrap aggregating - bagging
- Out-of-bag (OOB) error
- Feature bagging
- Random forests classification
- Random forests regression
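
The idea behind bagging and the out-of-bag (OOB) error can be sketched with the standard library alone. This is only an illustration of how a bootstrap sample leaves some rows unseen; it does not train any model, and the sample size and seed are arbitrary assumptions.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

n = 1000
indices = range(n)

# Bootstrap sample: draw n row indices with replacement
in_bag = [random.randrange(n) for _ in indices]

# Out-of-bag rows: never drawn, so a tree trained on in_bag has not seen them
oob = sorted(set(indices) - set(in_bag))

# The OOB share tends toward 1/e, roughly 0.368, as n grows
oob_fraction = len(oob) / n
print(oob_fraction)
```

In a random forest each tree gets its own bootstrap sample, and its out-of-bag rows act as a built-in validation set for that tree.
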
