FEF3001 Introduction to Artificial Intelligence - Lecture 3 - Part 1
2024-10-17
No. | Profession | Marital Status | Income | Education Level |
---|---|---|---|---|
1 | Teacher | Single | 8000 | Postgraduate |
2 | Nurse | Single | 6000 | Bachelor’s |
3 | Worker | Married | 5000 | High School |
4 | Worker | Single | 7200 | High School |
5 | Police | Married | 8500 | Bachelor’s |
6 | Teacher | Married | 8500 | Bachelor’s |
7 | Doctor | Married | 12000 | Postgraduate |
8 | Worker | Single | 5500 | High School |
9 | Police | Married | 8250 | Bachelor’s |
10 | Lawyer | Married | 12500 | Bachelor’s |
Purpose: to gain a better understanding of the data.
The arithmetic mean is defined as
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n}\]
Where: x_i is the i-th data value and n is the number of values.
Sample data: 5 7 4 6 8 16 11 7
Arithmetic Mean: 8
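As a quick check, here is a minimal Python sketch that computes the same mean:

```python
# Arithmetic mean of the slide's sample data.
data = [5, 7, 4, 6, 8, 16, 11, 7]
mean = sum(data) / len(data)
print(mean)  # 8.0
```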
Median: the value that remains in the middle when the data is arranged in ascending or descending order.
Sample data: 5 7 4 6 8 16 11 7
Ordered: 4 5 6 7 7 8 11 16
Median: the two middle values are 7 and 7, so the median is (7 + 7) / 2 = 7.
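One way to compute the median in Python, mirroring the ordering step above:

```python
# Median: sort, then take the middle value
# (or the average of the two middle values when n is even).
data = [5, 7, 4, 6, 8, 16, 11, 7]
ordered = sorted(data)  # [4, 5, 6, 7, 7, 8, 11, 16]
n = len(ordered)
mid = n // 2
median = (ordered[mid - 1] + ordered[mid]) / 2 if n % 2 == 0 else ordered[mid]
print(median)  # 7.0
```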
Mode: the most frequently occurring value.
Sample data: 5 7 4 6 8 16 11 7
In the sample data, 7 is the mode, occurring twice
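A minimal sketch finding the mode with Python's standard library:

```python
from collections import Counter

# Mode: the most frequently occurring value and its count.
data = [5, 7, 4, 6, 8, 16, 11, 7]
mode, count = Counter(data).most_common(1)[0]
print(mode, count)  # 7 2
```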
Variance measures the spread of a set of data points around their mean. It quantifies how much the values differ from the average (mean). A higher variance indicates greater spread, while a lower variance suggests the data points are closer to the mean.
\[ \sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N} \]
Where: x_i is the i-th data value, μ is the mean, and N is the number of data points.
Standard deviation is a measure of how much individual data points deviate, on average, from the mean of the dataset. It provides a sense of the spread of the data in the same units as the original data, making it easier to interpret compared to variance (which is in squared units).
Relation with Variance:
The standard deviation is simply the square root of the variance. Mathematically:
\[ \sigma = \sqrt{\sigma^2} \]
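A minimal sketch computing both quantities for the earlier sample data, using the population formulas given above:

```python
# Population variance and standard deviation of the slide's sample data.
data = [5, 7, 4, 6, 8, 16, 11, 7]
mu = sum(data) / len(data)                               # mean = 8.0
variance = sum((x - mu) ** 2 for x in data) / len(data)  # 104 / 8 = 13.0
std_dev = variance ** 0.5                                # ≈ 3.61
print(variance, std_dev)
```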
Data cleaning refers to identifying, correcting, or deleting missing, noisy (erroneous), or inconsistent data records.
Causes of missing data records:
Noisy (erroneous) data is the most difficult type of error to identify and correct.
Causes of noisy (erroneous) data records:
Causes of inconsistent data records:
How to handle missing data?
How to correct noisy data?
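The slide's own lists of techniques are not preserved here, but as a minimal illustration of two common approaches, the pandas sketch below clips an implausible (noisy) value to a chosen range and imputes a missing value with the mean; the mini-dataset and clipping bounds are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: one missing Income and one implausibly large (noisy) Income.
df = pd.DataFrame({
    "Profession": ["Teacher", "Nurse", "Worker", "Police"],
    "Income": [8000, np.nan, 5000, 999999],
})

# Noisy data: clip values to a plausible range (bounds chosen for illustration only).
df["Income"] = df["Income"].clip(lower=3000, upper=20000)

# Missing data: impute with a summary statistic such as the mean
# (an alternative is simply dropping the record with df.dropna()).
df["Income"] = df["Income"].fillna(df["Income"].mean())
print(df)
```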
Attribute (feature) construction: creating new attributes from given attributes.
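As an illustration (the column names here are hypothetical, not from the lecture), a new attribute can be derived from existing ones:

```python
import pandas as pd

# Derive income per household member from two existing attributes.
df = pd.DataFrame({"income": [8000, 6000, 5000], "household_size": [2, 1, 4]})
df["income_per_person"] = df["income"] / df["household_size"]
print(df)
```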
Z-score normalization (or standardization) transforms data to have a mean of 0 and a standard deviation of 1. This technique is useful for comparing data points from different distributions or preparing data for machine learning algorithms sensitive to scale.
The Z-score of a data point x is calculated as:
\[ z = \frac{x - \mu}{\sigma} \]
Where: x is the original value, μ is the mean of the data, and σ is its standard deviation.
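A minimal sketch applying this formula to the earlier sample data, using the population standard deviation as written above:

```python
# Z-score normalization: the result has mean 0 and standard deviation 1.
data = [5, 7, 4, 6, 8, 16, 11, 7]
mu = sum(data) / len(data)
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5
z_scores = [(x - mu) / sigma for x in data]
print([round(z, 2) for z in z_scores])
```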
Min-Max Normalization scales data to a specified range, typically [0, 1], by adjusting the values proportionally within the given range. It is useful for ensuring that all features contribute equally to analyses or machine learning models.
The Min-Max Normalization formula is:
\[ x' = \frac{x - x_{min}}{x_{max} - x_{min}} \]
Min-max example: Sample data {10,20,30,40,50}
Min-max normalized data: {0.00, 0.25, 0.50, 0.75, 1.00}
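The same result can be reproduced with a short Python sketch:

```python
# Min-max normalization to [0, 1].
data = [10, 20, 30, 40, 50]
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```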
In the table below, we have two attributes with unrelated scales. We applied z-score normalization to the Experience values and min-max normalization to the Savings values.
Notice that the z-scored experience values now lie in [-0.97, 1.79], while the min-max normalized savings lie in [0, 1].
Experience (Years) | Savings (TL) | zscore_experience | minmax_savings |
---|---|---|---|
6 | 100,000 | -0.97 | 0.00 |
7 | 250,000 | -0.87 | 0.23 |
15 | 750,000 | -0.08 | 1.00 |
15 | 150,000 | -0.08 | 0.08 |
18 | 400,000 | 0.21 | 0.46 |
34 | 650,000 | 1.79 | 0.85 |
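The two normalized columns can be reproduced with pandas; note that matching the slide's z-scores requires the sample standard deviation (ddof = 1), which is the default for pandas' std():

```python
import pandas as pd

df = pd.DataFrame({
    "experience": [6, 7, 15, 15, 18, 34],
    "savings": [100_000, 250_000, 750_000, 150_000, 400_000, 650_000],
})

# Z-score normalization (pandas' std() uses ddof=1 by default).
df["zscore_experience"] = (df["experience"] - df["experience"].mean()) / df["experience"].std()

# Min-max normalization to [0, 1].
s = df["savings"]
df["minmax_savings"] = (s - s.min()) / (s.max() - s.min())

print(df.round(2))  # matches the table above
```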