ANOVA (Analysis of Variance) – A statistical method to test whether there are significant differences between the means of two or more groups. The test gives two outputs: an F-test score and a p-value (the statistical significance of that score).
F-test score: ANOVA assumes the means of all groups are equal (the null hypothesis), measures how much the actual means deviate from that assumption, and reports the result as the F-test score. A larger score means a larger difference between the means.
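A minimal sketch of a one-way ANOVA with SciPy's f_oneway; the three groups below are invented illustration data:

```python
from scipy import stats

group_a = [22, 25, 27, 30, 24]
group_b = [31, 33, 29, 35, 32]
group_c = [23, 26, 28, 25, 27]

# f_oneway returns the F-test score and the p-value
f_score, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_score:.2f}, p = {p_value:.4f}")
```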
Binning – Grouping continuous values into a smaller number of categories (bins)
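A short sketch of binning with pandas.cut; the price values and the "Low"/"Medium"/"High" labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

prices = pd.Series([5_000, 10_500, 21_000, 33_000, 45_000])
bins = np.linspace(prices.min(), prices.max(), 4)  # 4 edges -> 3 equal-width bins
labels = ["Low", "Medium", "High"]
binned = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
print(binned)
```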
Chi-Square – A test for association between two categorical variables, based on comparing the observed frequency distribution with the distribution expected under independence. Null hypothesis: the variables are independent. Result: tells whether a relationship exists between the two variables, but not the strength or direction of that relationship
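A minimal sketch of a chi-square test of independence using SciPy's chi2_contingency; the 2×2 contingency table is invented for illustration:

```python
from scipy.stats import chi2_contingency

# Rows and columns are two categorical variables; counts are illustrative
observed = [[30, 10],
            [15, 25]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # small p -> reject independence
```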
Categorical variable: A variable with a finite number of possible values that are not continuous. Ex: hair color, blood type, academic courses
Continuous variable: A variable that can take any numeric value within a range. Ex: age, weight, income
Correlation – A relationship between 2 or more variables
Causation – A cause-and-effect relationship, where a change in one variable directly produces a change in another. Correlation does not imply causation
Pearson Correlation – Measures the strength of the linear correlation between two features, reported as a correlation coefficient and a p-value
Correlation coefficient: close to +1 (positive relationship); close to -1 (negative relationship); close to 0 (no relationship)
P-value: <0.001 (strong certainty in the result); <0.05 (moderate certainty, 95%); <0.1 (weak certainty); >0.1 (no certainty in the result)
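A minimal sketch computing the Pearson correlation coefficient and its p-value with SciPy; x and y are made-up data:

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 6, 7]

coef, p_value = pearsonr(x, y)
print(f"coefficient = {coef:.2f}, p-value = {p_value:.4f}")
```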
Polynomial Regression – Used when the regression model is a curve (e.g., a parabola or other higher-order curve), i.e., to describe curvilinear relationships
Curvilinear relationships – Captured by squaring or adding higher-order terms of the predictor variables
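A sketch of a second-order polynomial fit with NumPy, assuming illustrative, roughly quadratic data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.8, 9.9, 17.2, 26.0])  # roughly quadratic

coeffs = np.polyfit(x, y, deg=2)  # fit y = a*x^2 + b*x + c
model = np.poly1d(coeffs)
print(model(6))                   # predict at x = 6
```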
Data Normalization – Rescaling feature values to a common range, typically 0 to 1 (e.g., min-max scaling), so no single feature dominates because of its units
Data Standardization – Rescaling feature values to have a mean of 0 and a standard deviation of 1 (z-score)
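A sketch contrasting the two rescalings on an invented pandas column:

```python
import pandas as pd

df = pd.DataFrame({"income": [30_000, 45_000, 60_000, 120_000]})

# Normalization: rescale to the 0-1 range (min-max)
df["income_norm"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min())

# Standardization: mean 0, standard deviation 1 (z-score)
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()
print(df)
```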
Pipeline – Chains the steps Normalization + Transformation + Prediction into a single object that can be fit and used in one call
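A minimal scikit-learn Pipeline sketch chaining those three steps; X and y are invented data:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.2, 4.1, 9.3, 15.8]

pipe = Pipeline([
    ("scale", StandardScaler()),             # normalization step
    ("poly", PolynomialFeatures(degree=2)),  # transformation step
    ("model", LinearRegression()),           # prediction step
])
pipe.fit(X, y)
print(pipe.predict([[5.0]]))
```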
Pandas Series – One-dimensional labeled array; built from a single list: [ ]
Pandas DataFrame – Two-dimensional labeled table; built from a list of lists: [[ ]]
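A short sketch of the bracket shorthand above:

```python
import pandas as pd

s = pd.Series([10, 20, 30])                      # 1-D: [ ]
df = pd.DataFrame([[10, 20, 30], [40, 50, 60]],  # 2-D: [[ ]]
                  columns=["a", "b", "c"])
print(s)
print(df)
```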
Linear Regression – Determines the linear relationship between one independent variable and one dependent variable
Multiple Linear Regression – Determines the linear relationship between two or more independent variables and one dependent variable
Logistic Regression – Used when the target is binary (0 or 1); predicts the probability that an observation belongs to a class
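A hedged sketch of all three models in scikit-learn; every dataset below is invented for illustration:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Simple linear regression: one predictor
lr = LinearRegression().fit([[1], [2], [3]], [2.0, 4.1, 6.2])

# Multiple linear regression: two predictors
mlr = LinearRegression().fit([[1, 5], [2, 3], [3, 8]], [10.0, 9.5, 16.0])

# Logistic regression: binary (0 or 1) target
clf = LogisticRegression().fit([[0.5], [1.5], [3.0], [4.0]], [0, 0, 1, 1])

print(lr.predict([[4]]), mlr.predict([[2, 6]]), clf.predict([[2.5]]))
```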
Mean Squared Error (MSE) – Determines model fit: the average of the squared differences between actual and predicted values. MSE = Σ(yᵢ − ŷᵢ)² / n, where n is the total number of samples
R-Squared (R², Coefficient of Determination) – Determines model fit: measures how close the actual data points are to the fitted regression line. R² = 1 − (MSE of the regression line / MSE of the average of the data). R² close to 1 → good fit; close to 0 → poor fit
Does a lower MSE mean a better fit? – Not necessarily. MSE keeps decreasing as more variables are added to the model, even when the additional variables do not genuinely improve it
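A minimal sketch computing both metrics with scikit-learn on illustrative actual/predicted values:

```python
from sklearn.metrics import mean_squared_error, r2_score

y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.8, 5.3, 6.6, 9.4]

print("MSE:", mean_squared_error(y_actual, y_predicted))
print("R^2:", r2_score(y_actual, y_predicted))
```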
Residual Plot – A scatter plot of the residuals (y − ŷ) against the independent variable. Residuals randomly scattered around zero suggest a linear model is appropriate; a visible pattern or curvature suggests a nonlinear model
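A sketch of a residual plot using seaborn's residplot on invented data:

```python
import seaborn as sns
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.3, 5.9, 8.4, 9.8, 12.1]

sns.residplot(x=x, y=y)          # residuals of a simple linear fit
plt.axhline(0, linestyle="--")   # points should scatter randomly around 0
plt.show()
```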
Overfitting – The model is too flexible and fits the noise in the training data, so it generalizes poorly to new data
Underfitting – The model is too simple to capture the underlying trend in the data
Ridge Regression – Linear regression with a penalty term (controlled by the parameter alpha) that shrinks the coefficients, reducing the overfitting caused by large higher-order terms
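A minimal Ridge sketch in scikit-learn; the data and the alpha value are arbitrary examples:

```python
from sklearn.linear_model import Ridge

X = [[1, 2], [2, 1], [3, 4], [4, 3]]
y = [6.0, 5.5, 11.2, 10.8]

ridge = Ridge(alpha=0.1).fit(X, y)  # larger alpha -> stronger shrinkage
print(ridge.coef_, ridge.intercept_)
```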