Glossary

ANOVA – Analysis of Variance, a statistical method to test whether there are significant differences between the means of two or more groups. The test gives two outputs: an F-test score and a p-value (the statistical significance of the score)

F-test score: ANOVA assumes the means of all groups are equal (the null hypothesis), calculates how much the actual means deviate from that assumption, and reports the result as the F-test score. A larger score means a larger difference between the means.
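A minimal sketch of a one-way ANOVA using `scipy.stats.f_oneway` (the three groups are made-up sample values):

```python
from scipy.stats import f_oneway

# Three hypothetical groups (e.g. prices grouped by some category)
group_a = [10.2, 11.5, 9.8, 10.9]
group_b = [14.1, 13.7, 15.0, 14.4]  # clearly different mean
group_c = [10.5, 10.1, 11.0, 10.3]

# A large F-score and small p-value suggest the group means differ
f_score, p_value = f_oneway(group_a, group_b, group_c)
print(f_score, p_value)
```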

Binning – Grouping continuous data into a smaller number of discrete categories (bins)

Chi-Square – A test for association between two categorical variables, based on comparing observed and expected frequency distributions. Null hypothesis: the variables are independent. Result: tells whether there is a relationship between the two variables, but not the strength or direction of that relationship
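A chi-square test of independence can be sketched with `scipy.stats.chi2_contingency` (the contingency table is hypothetical):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table:
# rows = two groups, columns = two categorical outcomes
observed = [[30, 10],
            [20, 40]]

# Small p-value -> reject independence, i.e. the variables are related
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)
```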

Categorical variable: A variable with a finite number of possible values which are not continuous. Ex: hair color, blood type, academic courses

Continuous variable: A variable that can take any numeric value within a range. Ex: age, weight, income

Correlation – A statistical relationship between two or more variables; it does not by itself imply causation

Causation – A cause-and-effect relationship between two or more variables

Pearson Correlation – Measures the strength of the linear correlation between two continuous features, reported as a correlation coefficient and a p-value

Correlation coefficient: close to +1 (positive relationship); close to -1 (negative relationship); close to 0 (no linear relationship)

P-value: <0.001 (strong certainty in the result, 99.9%); <0.05 (moderate, 95%); <0.1 (weak); >0.1 (no certainty in the result)
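The coefficient and p-value can be computed with `scipy.stats.pearsonr` (the sample data, roughly y = 2x, are made up):

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # roughly y = 2x

# coef near +1 -> strong positive relationship; small p -> high certainty
coef, p_value = pearsonr(x, y)
print(coef, p_value)
```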

Polynomial Regression – Regression in which the fitted model is a curve (e.g. a parabola or cubic), used to describe curvilinear relationships

Curvilinear relationships – Relationships modeled by squaring or adding higher-order terms of the predictor variables
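A polynomial fit can be sketched with NumPy's `polyfit` (the quadratic data, y = x², are an assumption for illustration):

```python
import numpy as np

# Hypothetical curvilinear data: y = x^2
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = x ** 2

# Fit a 2nd-order polynomial; coeffs are [a, b, c] for a*x^2 + b*x + c
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)
print(coeffs)
```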

Data Normalization – Rescaling a feature to a common range, typically [0, 1], e.g. min-max scaling: (x − min) / (max − min)

Data Standardization – Rescaling a feature so it has mean 0 and standard deviation 1 (z-score): (x − mean) / std
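Min-max normalization maps values into [0, 1]; standardization (z-score) gives mean 0 and standard deviation 1. A NumPy sketch (the sample values are made up):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max scaling): values land in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()
print(x_norm, x_std)
```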

Pipeline – A sequence of processing steps applied as a single object: normalization + transformation + prediction
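A sketch of such a pipeline with scikit-learn's `Pipeline`, chaining a scaler, a polynomial transform, and a regression model (the data, y = x², and step names are assumptions):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])  # y = x^2

pipe = Pipeline([
    ("scale", StandardScaler()),             # normalization step
    ("poly", PolynomialFeatures(degree=2)),  # transformation step
    ("model", LinearRegression()),           # prediction step
])
pipe.fit(X, y)
pred = pipe.predict([[6.0]])
print(pred)
```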

Pandas Series – A one-dimensional labeled array; built from a 1-D list: [ ]

Pandas DataFrame – A two-dimensional labeled table; built from a nested list: [[ ]]
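In pandas the bracket mnemonic maps directly onto the constructors (the sample values are made up):

```python
import pandas as pd

# A Series is built from a one-dimensional list: [ ]
s = pd.Series([10, 20, 30])

# A DataFrame is built from a two-dimensional (nested) list: [[ ]]
df = pd.DataFrame([[1, "a"], [2, "b"]], columns=["num", "letter"])
print(s.ndim, df.ndim)
```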

Linear Regression – Models the relationship between one independent variable and one dependent variable with a straight line

Multiple Linear Regression – Models the relationship between two or more independent variables and one dependent variable
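Both cases can be sketched with scikit-learn's `LinearRegression` (the data are constructed so that y = 2x + 1 and y = 2*x1 + 2*x2 + 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: one predictor
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 2x + 1
simple = LinearRegression().fit(X, y)

# Multiple linear regression: two predictors
X2 = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y2 = np.array([3.0, 7.0, 7.0, 11.0])  # y = 2*x1 + 2*x2 + 1
multi = LinearRegression().fit(X2, y2)
print(simple.coef_, simple.intercept_, multi.coef_)
```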

Logistic Regression – Predicts the probability of a binary outcome (0 or 1)
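A logistic regression sketch with scikit-learn (the threshold-style data, 0 below 5 and 1 above, are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary outcome: 0 for small x, 1 for large x
X = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
pred = clf.predict([[2.5], [7.5]])  # classes on either side of the boundary
print(pred)
```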

Mean Square Error (MSE) – Determines model fit – the average squared difference between actual and predicted values: [(y − ŷ)² + (y − ŷ)² + …] / total number of samples

R-Squared (R²) or Coefficient of determination – Determines model fit – measures how close the actual data are to the fitted regression line. R² = 1 − (MSE of the regression line / MSE of the average of the data). R² close to 1 – good fit; close to 0 – bad fit

Does a lower MSE mean a better fit? – Not necessarily. Training error will keep decreasing as the number of variables increases, even when the model is not genuinely better (overfitting)
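Both metrics can be sketched with scikit-learn (the true and predicted values are made up; each prediction is off by 0.5):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])  # each prediction off by 0.5

mse = mean_squared_error(y_true, y_pred)  # mean of squared residuals
r2 = r2_score(y_true, y_pred)             # 1 - MSE / MSE of the mean
print(mse, r2)
```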

Residual Plot – A plot of the residuals (y − ŷ) against the predicted or independent variable. Randomly scattered residuals suggest a linear model is appropriate; a visible pattern suggests it is not

Overfitting – The model fits the noise in the training data rather than the underlying pattern; it performs well on training data but poorly on new data

Underfitting – The model is too simple to capture the underlying pattern, so it performs poorly even on the training data

Ridge Regression – Linear regression with an L2 penalty (controlled by the parameter alpha) that shrinks coefficients toward zero to reduce overfitting
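A ridge regression sketch with scikit-learn, compared to plain linear regression to show the shrinkage effect of the L2 penalty (alpha = 1.0 and the data are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])  # roughly y = 2x + 1 with noise

ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the penalty strength
ols = LinearRegression().fit(X, y)

# The L2 penalty shrinks the coefficient toward zero relative to plain OLS
print(ridge.coef_, ols.coef_)
```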