This code repository encompasses the material (data sets, notebooks, labs, task lists) for DataKolektiv's DATA SCIENCE SESSIONS VOL. 03 :: A Foundational Python Course in Data Science.
Welcome to the introductory course on Data Science with Python! In this course, we will delve into the fascinating world of data analysis, manipulation, and modeling using the powerful programming language Python. Originally developed in R, this course has been revamped and rewritten specifically for Python, catering to individuals who are eager to harness the potential of this versatile language in the field of data science.
The primary objective of this course is to equip you with a comprehensive set of skills and knowledge required to tackle real-world data challenges. Throughout the course, we will cover a wide range of supervised learning topics that will lay a strong foundation for your journey in data science. Our focus will be on practical applications, and you will gain hands-on experience by working on various datasets and projects.
The course is designed to achieve several key goals:
- We will explore techniques for handling diverse data sources, for example relational databases or files like CSV.
- You will learn how to effectively preprocess and clean datasets, ensuring that the data is in a suitable format for analysis.
- Additionally, we will delve into data visualization, enabling you to present insights and patterns in a visually appealing and understandable manner.
- Another vital aspect of data science is statistical analysis. You will acquire a solid understanding of both inferential and descriptive statistics, enabling you to draw meaningful conclusions from data and make informed decisions.
- Moreover, we will delve into the art of reporting, as effective communication of findings and results is essential in data-driven decision-making processes.
- Lastly, we will explore the exciting realm of machine learning. You will gain a firm grasp of various machine learning models and algorithms, understanding their strengths, weaknesses, and practical implementation.
By the end of the course, you will have the skills necessary to build and evaluate predictive models, allowing you to extract valuable insights and make accurate predictions from data.
Join us on this exhilarating journey into the world of Data Science with Python, and unlock the power to extract knowledge, uncover patterns, and make data-driven decisions that drive success.
Python: Python is a high-level programming language known for its simplicity and readability, making it an ideal choice for data analysis and manipulation tasks in the field of data science.
NumPy: NumPy is a fundamental library for scientific computing in Python, providing powerful tools for efficient numerical operations, array manipulation, and mathematical functions, making it a cornerstone for data analysis and machine learning workflows.
pandas: pandas is a versatile and user-friendly data manipulation library in Python, offering powerful data structures and functions for cleaning, transforming, and analyzing structured data, allowing for seamless data exploration and preprocessing.
scikit-learn (sklearn): sklearn is a widely-used machine learning library in Python, offering a rich set of algorithms and tools for tasks such as classification, regression, clustering, and dimensionality reduction, empowering data scientists to build robust and scalable machine learning models.
scipy: scipy is a comprehensive library for scientific computing and advanced mathematics in Python, providing a wide range of functions and tools for optimization, integration, interpolation, signal processing, and more, making it an essential resource for data scientists working with complex data analysis tasks.
matplotlib: matplotlib is a popular plotting library in Python, enabling the creation of high-quality, customizable visualizations for data exploration and presentation, making it an invaluable tool for communicating insights and patterns in data.
seaborn: seaborn is a powerful data visualization library built on top of matplotlib, offering a higher-level interface and a variety of statistical visualizations, allowing for the creation of attractive and informative plots with minimal effort.
plotly: plotly is an interactive and dynamic visualization library for Python, providing a wide range of chart types and interactive features, enabling the creation of engaging and interactive visualizations that can be shared and explored in web-based environments.
Goran S. Milovanović, PhD.
He studied mathematics, philosophy, and psychology at the University of Belgrade and New York University (NYU), where he obtained a doctorate in psychology. A programmer since the age of ten, he published his first scientific paper at the age of twenty, collaborated with top cognitive scientists, and had papers cited in Stevens' Handbook of Experimental Psychology. With several years of experience in analytics and machine learning on some of the world's largest data systems (Wikidata), he served on the program committees of European conferences in Data Science. As an independent consultant, in collaboration with American educational startups and DataKolektiv, he has trained dozens of individuals for work in Data Science and ML.
Aleksandar Cvetković, PhD. He completed his doctoral studies in applied mathematics in Italy (GSSI - L'Aquila, SISSA - Trieste), conducting scientific research in the field of control and optimization. He is the author and co-author of several scientific papers published in leading international journals. With years of experience in the ML industry and education, focusing on machine learning, 3D computer vision, graph neural networks, and accelerating ML algorithms on hardware, he is currently working in the gaming industry.
Ilija Lazarevic, MSc. He completed his master's studies in computer science at the University of Kragujevac and was involved in teaching the subject of Operating Systems and Computer Networks. With years of experience as a software engineer and machine learning engineer, he specializes in (a) automatic pricing assessment in the real estate domain in the international market and (b) developing recommendation systems in the domain of casino games.
-
Session 00. Preparing for the course
- Installation of Python 3.8
- Creating local virtual environment + libraries installation
- Installing Visual Studio Code IDE
- Installing MariaDB
- Installing git on local machine
-
Session 01. Python for Data Science: Basics + Intuitive Understanding of Python
- Python as a calculator
- Elementary math in Python
- Brief intro into Python data types: numerics, sequences, sets, mappings
- Strings and basic string functions
- Brief Pandas DataFrame and Series intro
-
Session 02. Fundamental Data Structures and Classes. Subsetting a Pandas DataFrame
- Different ways of creating DataFrame and Series
- Slicing with `loc` and `iloc`
- Basic visualizations in Pandas
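
As a rough illustration of the `loc` and `iloc` slicing topic in this session, the sketch below contrasts label-based and position-based selection. The DataFrame, its column names, and its index labels are hypothetical toy values, not course data.

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately non-numeric.
df = pd.DataFrame(
    {"item": ["apple", "pear", "plum"], "price": [3, 4, 5]},
    index=["a", "b", "c"],
)

# .loc selects by label; a label slice includes BOTH endpoints.
by_label = df.loc["a":"b", "item"]

# .iloc selects by integer position; the end of the slice is excluded.
by_position = df.iloc[0:2, 0]

print(by_label.tolist())
print(by_position.tolist())
```

Both selections return the first two items here, even though the label slice names its last element and the positional slice names the element after it.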
-
Session 03. Control Flow + Functions. Defensive programming. Pandas: I/O operations + `apply`, `filter`, `groupby`, `agg`
- Flow of control in Python: `if`, `else`, `for`, `while`, `continue`, `break`
- List and dictionary comprehensions
- Pandas I/O, reading and writing files
- Pandas transformations and aggregations: `apply`, `filter`, `groupby`, `agg`
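
The Pandas transformations named in this session can be sketched on a small, made-up DataFrame. The column names `group` and `value` are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 4, 5],
})

# groupby + agg: one row per group, one column per summary statistic.
summary = df.groupby("group")["value"].agg(["mean", "sum"])

# apply: run a function on every element of a column.
df["value_squared"] = df["value"].apply(lambda v: v ** 2)

# groupby + filter: keep only the rows of groups with more than two rows.
large_groups = df.groupby("group").filter(lambda g: len(g) > 2)

# Dictionary comprehension: map each group name to its total.
totals = {name: int(total) for name, total in summary["sum"].items()}

print(summary)
print(large_groups)
print(totals)
```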
-
Session 04. Numpy: basic vector arithmetic, linear algebra, and broadcasting.
- Vectorization
- Array and Matrix subsetting
- Most used NumPy functions
- Algebraic array and matrix functions
- Broadcasting
- Simple Linear Regression model in NumPy
- NumPy and Pandas
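
A minimal sketch of two topics from this session, broadcasting and a simple linear regression fitted with plain NumPy. The data are noise-free toy values, and the closed-form least-squares estimator shown is one common way to fit the model, not necessarily the one used in the course notebooks.

```python
import numpy as np

# Broadcasting: a (3, 1) column and a (3,) row combine into a (3, 3) array.
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
grid = col + row

# Simple linear regression y = b0 + b1 * x via closed-form least squares.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x  # noise-free toy data, so the fit recovers the line exactly

x_centered = x - x.mean()
y_centered = y - y.mean()
b1 = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                   # intercept

print(grid)
print(b0, b1)
```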
-
Session 05. Strings in Python
- Python functions and arithmetic with Strings
- Most used String functions
- String encodings
- Formatting strings in Python
-
Session 06. Exploratory Data Analysis - EDA
- Cleaning data
- Working with missing values
- Parsing date and date/time data
- Data exploration
- Contingency tables
- Visualizations using Matplotlib and Seaborn
- Descriptive statistics and boxplot
- Heatmaps for different visualizations
-
Session 07. Relational Structure + Pivot
- Joining Pandas DataFrames
- Left and right joining
- Inner and outer joining
- Semi-join and anti-join
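
The join types listed above can be sketched with `DataFrame.merge`. Pandas has no dedicated semi-join or anti-join method, so the `isin` filters below are one common way to express them. The frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables sharing a "key" column.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = left.merge(right, on="key", how="inner")      # keys present in both
outer = left.merge(right, on="key", how="outer")      # keys present in either
left_join = left.merge(right, on="key", how="left")   # every key from `left`

# Semi-join: rows of `left` that have a match in `right`, left columns only.
semi = left[left["key"].isin(right["key"])]

# Anti-join: rows of `left` that have no match in `right`.
anti = left[~left["key"].isin(right["key"])]

print(inner)
print(semi)
print(anti)
```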
-
Session 08. Intro to Probability Theory. Experimental and Theoretical Probability. Discrete Random Variables
- Experimental probability
- Sigma algebra of events and theoretical probability
- Kolmogorov axioms
- Concept of random variable
- Discrete random variables
- Bernoulli, Binomial, Geometric, discrete uniform and Poisson distributions
-
Session 09. Intro to Probability Theory. Continuous Distribution. Chi-square test
- Continuous random variables
- Continuous uniform, exponential and normal distribution
- Mathematical expectation and variance
- Chi-square distribution and Chi-square test
-
Session 10. Relational Databases and Pandas
- Connecting to relational database - MariaDB
- Learning most used SQL keywords and writing queries
- Working with math functions
- Working with strings
- Working with dates
- Joining tables
- Performing aggregations
- Visualizing query results
- Difference between `HAVING` and `WHERE`
- SQL query execution order
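
The difference between `WHERE` and `HAVING` can be sketched as follows. The course works with MariaDB; this example swaps in Python's built-in `sqlite3` module with an in-memory database so that it runs without a server. The `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MariaDB (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10), ("ann", 50), ("bob", 5), ("bob", 7), ("cat", 100)],
)

# WHERE filters individual rows BEFORE grouping and aggregation.
where_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount > 6 GROUP BY customer ORDER BY customer"
).fetchall()

# HAVING filters whole groups AFTER aggregation.
having_rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) > 20 ORDER BY customer"
).fetchall()
conn.close()

print(where_rows)
print(having_rows)
```

The first query drops the single row with `amount` 5 and still reports all three customers. The second query keeps every row but drops the customer whose total is not above 20.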
-
Session 11. Conditional Probability. Multivariate Random Variables. Bias and Variance.
- Conditional probability
- Law of total probability
- Bayes theorem
- Discrete and continuous multivariate random variables
- Sampling mean and standard error
- Statistical bias and variance
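
As an illustration of how Bayes' theorem and the law of total probability fit together, here is a minimal sketch. The function name `posterior`, its parameter names, and the input probabilities are all hypothetical.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(H | E) using Bayes' theorem.

    The denominator P(E) comes from the law of total probability:
    P(E) = P(E | H) * P(H) + P(E | not H) * P(not H).
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs: P(H) = 0.01, P(E | H) = 0.9, P(E | not H) = 0.05.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 4))
```

With a rare hypothesis, the posterior stays well below the likelihood because the false positives from the much larger "not H" group dominate the evidence term.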
-
Session 12. Statistical Hypothesis Testing: Chi-square test + t-test. The Central Limit Theorem. Covariance and Correlation.
- Chi-square distribution and related test
- Student's t-distribution
- t-test sample mean vs. constant
- t-test and two sample means
- t-test for independent measures
- Central limit theorem
- Covariance and correlation
-
Session 13. Simple Linear Regression. Estimation Theory continued: the Parametric bootstrap.
- Simple linear regression
- Linear regression using statsmodels
- Linear regression using sklearn
- Parametric bootstrap
-
Session 14. Partial and Part Correlation. Multiple Linear Regression.
- Partial and part correlation
- Multiple linear regression
- Multicollinearity
- Variance inflation factor
- MLR using statsmodels and sklearn
-
Session 15. Regularization in MLR. The Maximum Likelihood Estimation (MLE).
- Problem of overfitting
- Regularization in multiple linear regression
- Ridge and Lasso regularization
- Maximum likelihood estimation - MLE
- MLE in linear regression
-
Session 16. Generalized Linear Models I. Binomial Logistic Regression and its MLE. Multinomial Regression.
- Binomial logistic regression - BLR
- Interpreting coefficients in BLR
- Akaike information criterion - AIC
- Binomial logistic regression using statsmodels and sklearn
- MLE for BLR
- Multinomial regression
-
Session 17. Generalized Linear Models II. Binomial Logistic Regression and ROC analysis. Regularization of BLR.
- Binomial logistic regression and ROC analysis
- Akaike information criterion
- Confusion matrix
- ROC AUC
- Regularization of BLR (ridge, lasso and elastic net)
-
Session 18. Generalized Linear Models III. Multinomial Logistic Regression. Regularization of MNR.
- Multinomial logistic regression
- Regularization of MNR (ridge, lasso and elastic net)
- Cross-validation in regularization
-
Session 19. Generalized Linear Models IV. Poisson Regression. Zero-Inflated Poisson Regression. Negative Binomial Regression.
- Poisson regression
- Regularization of Poisson regression
- Overdispersion
- Zero-inflated regression
- Negative binomial regression
-
Session 20. Introduction to Decision Trees for classification and regression problems.
- Information theory
- Information and probability
- Information content
- Information entropy
- Information gain
- Gini impurity
- Gini gain
- Regression and reduction of variance
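
The impurity measures listed for this session can be sketched in plain Python. The helper names (`entropy`, `gini`, `information_gain`) and the toy labels are hypothetical, and the split shown is a deliberately perfect one.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]  # a perfect split into pure nodes

print(entropy(parent), gini(parent))
print(information_gain(parent, split))
print(information_gain(parent, split, impurity=gini))  # Gini gain
```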
-
Session 21. Decision Trees regularization and cost complexity pruning
- Decision trees regularization
- Pre-pruning
- Post-pruning
- Feature importance
-
Session 22. Random Forest classification and regression
- Random Forests: the algorithm
- Bootstrap aggregating - bagging
- Out of bag (OOB) error
- Feature bagging
- Random forests classification
- Random forests regression
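
To make the bagging and out-of-bag (OOB) ideas concrete, the sketch below draws one bootstrap sample of row indices and collects the rows that were never drawn. The function name `bootstrap_indices`, the sample size, and the seed are hypothetical. A full random forest would repeat this once per tree and fit a tree on each in-bag sample.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement and return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, oob = bootstrap_indices(1000, rng)

# For large n, about 1/e (roughly 37%) of the rows are expected to be out of bag.
print(len(oob) / 1000)
```

The OOB rows act as a built-in validation set for the tree trained on the corresponding in-bag sample, which is what the OOB error is computed from.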




