Data Types and Preprocessing Data

FEF3001 Introduction to Artificial Intelligence - Lecture3 - Part1

Alper Yılmaz

2024-10-17

Sample data

No.  Profession  Marital Status  Income  Education Level
1    Teacher     Single           8000   Postgraduate
2    Nurse       Single           6000   Bachelor’s
3    Worker      Married          5000   High School
4    Worker      Single           7200   High School
5    Police      Married          8500   Bachelor’s
6    Teacher     Married          8500   Bachelor’s
7    Doctor      Married         12000   Postgraduate
8    Worker      Single           5500   High School
9    Police      Married          8250   Bachelor’s
10   Lawyer      Married         12500   Bachelor’s

Data Types

  • Nominal (Categorical) Data: Data consisting of categories with no inherent order; ‘more than’ / ‘less than’ comparisons are not meaningful.
    • Binary (Two-Category) Data: Marital Status {Married, Single}
    • Multi-Category Data: Profession {Teacher, Nurse, Worker, Police, Doctor, Lawyer}
  • Ordinal Data: Data consisting of categories that indicate a rank (importance, priority); ‘more than’ / ‘less than’ comparisons are meaningful. Example: Education Level {High School, Bachelor’s, Postgraduate}
  • Interval Data: Data measured on a scale divided into equal parts (equal intervals). Example: Income [5000, 12500]
  • Ratio Data: Data consisting of continuous values on a scale with a meaningful zero, so ratios of values can be compared. Example: Weight {65.2, 68.1, 73.5, …}
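As an illustration, a minimal sketch of how these data types could be declared in pandas (pandas itself and the shortened five-row table are assumptions for illustration, not part of the lecture): nominal attributes become unordered categories, the ordinal attribute an ordered category, and the interval/ratio attribute stays numeric.

    import pandas as pd

    df = pd.DataFrame({
        "Profession": ["Teacher", "Nurse", "Worker", "Worker", "Police"],
        "Marital Status": ["Single", "Single", "Married", "Single", "Married"],
        "Income": [8000, 6000, 5000, 7200, 8500],
        "Education Level": ["Postgraduate", "Bachelor's", "High School",
                            "High School", "Bachelor's"],
    })

    # Nominal: unordered categories ('more than' comparisons are not meaningful)
    df["Profession"] = df["Profession"].astype("category")
    df["Marital Status"] = df["Marital Status"].astype("category")

    # Ordinal: ordered categories (High School < Bachelor's < Postgraduate),
    # so comparisons such as df["Education Level"] > "High School" are meaningful
    df["Education Level"] = pd.Categorical(
        df["Education Level"],
        categories=["High School", "Bachelor's", "Postgraduate"],
        ordered=True,
    )

    # Interval/ratio: plain numeric column
    df["Income"] = df["Income"].astype(float)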

Data Descriptive Characteristics

Purpose: Better understanding of data.

  • Measures of Central Tendency (Arithmetic Mean, Median, Mode)
  • Measures of Dispersion (Variance, Standard Deviation, Quartiles)

Measures of Central Tendency

Arithmetic Mean

Arithmetic Mean is defined as

\[\frac{\sum_{i=1}^n x_i}{n}\]

Where:

  • \(x_i\) are the individual data points.
  • n is the total number of data points.

Sample data: 5 7 4 6 8 16 11 7
Arithmetic Mean: (5 + 7 + 4 + 6 + 8 + 16 + 11 + 7) / 8 = 64 / 8 = 8

Median

The value in the middle when the data are arranged in ascending or descending order. With an even number of values, the median is the average of the two middle values.

Sample data: 5 7 4 6 8 16 11 7
Ordered: 4 5 6 7 7 8 11 16
Median: the two middle values are 7 and 7, so the median is (7 + 7) / 2 = 7

Mode

The most frequently occurring value.

Sample data: 5 7 4 6 8 16 11 7

In the sample data, 7 is the mode, occurring twice
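These three measures can be checked with Python’s standard statistics module; a minimal sketch using the same sample data:

    import statistics

    data = [5, 7, 4, 6, 8, 16, 11, 7]

    print(statistics.mean(data))    # 8   -> arithmetic mean (64 / 8)
    print(statistics.median(data))  # 7.0 -> average of the two middle values, 7 and 7
    print(statistics.mode(data))    # 7   -> most frequent value (occurs twice)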

Measures of Dispersion

Variance

Variance measures the spread of a set of data points around their mean. It quantifies how much the values differ from the average (mean). A higher variance indicates greater spread, while a lower variance suggests the data points are closer to the mean.

\[ \sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N} \]

  • \(\mu\): the mean of the data points.
  • \(N\): the total number of data points.

Standard deviation

Standard deviation is a measure of how much individual data points deviate, on average, from the mean of the dataset. It provides a sense of the spread of the data in the same units as the original data, making it easier to interpret compared to variance (which is in squared units).

Relation with Variance:

The standard deviation is simply the square root of the variance. Mathematically:

\[ \sigma = \sqrt{\sigma^2} \]
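A small sketch computing both measures for the earlier sample data with Python’s standard statistics module; pvariance and pstdev divide by \(N\) (the population formula), matching the definition above:

    import statistics

    data = [5, 7, 4, 6, 8, 16, 11, 7]

    print(statistics.pvariance(data))  # 13 (sum of squared deviations is 104, divided by N = 8)
    print(statistics.pstdev(data))     # about 3.61, the square root of the variance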

Unclean data

  • Data collected in real applications:
    • Can be missing: Some attribute values may not be entered for some objects.
      • Profession = ’ ’
    • Can be noisy: May contain errors.
      • Salary = -10
    • Can be inconsistent: Attribute values or attribute names may be incompatible.
      • Age = 35 while Date of Birth = 03/10/2004

Data Preprocessing

  • Data Cleaning: Completing missing attribute values, correcting erroneous data, detecting and cleaning outliers, resolving inconsistencies
  • Data Integration: Combining data from different data sources
  • Data Reduction: Reducing the volume of data while producing (nearly) the same analytical results as the original data
  • Data Transformation: Normalization

Data Cleaning

Refers to identifying, correcting, or deleting missing, noisy (erroneous), or inconsistent data.

Causes of missing data records:

  • Inability to obtain or unknown attribute value when collecting data
  • Failure to recognize necessity of certain attributes during data collection
  • Human, software, or hardware problems



Noisy (erroneous) data is the most difficult type of error to identify and correct.

Causes of noisy (erroneous) data records:

  • Faulty data collection tools
  • Data entry problems
  • Data transmission problems
  • Technological limitations
  • Inconsistency in attribute names

Causes of inconsistent data records:

  • Data stored in different data sources
  • Non-compliance with functional dependency rules

Data Preprocessing

How to handle missing data? (a short code sketch follows this list)

  • Exclude data records with missing attribute values
  • Fill missing attribute values manually
  • Use a global variable for missing attribute values (Null, unknown, …)
  • Fill missing attribute values with the mean value of that attribute
  • Fill with the average of attribute values from records belonging to the same class
  • Fill with the most probable attribute values
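A minimal pandas sketch of three of the strategies above (excluding incomplete records, filling with a global constant, and filling with the attribute mean); the tiny DataFrame and its column names are made up for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "Profession": ["Teacher", None, "Worker", "Police"],
        "Income": [8000, 6000, None, 8500],
    })

    # Option 1: exclude records with missing attribute values
    complete = df.dropna()

    # Option 2: fill a missing categorical value with a global constant
    df["Profession"] = df["Profession"].fillna("unknown")

    # Option 3: fill a missing numeric value with the mean of that attribute
    df["Income"] = df["Income"].fillna(df["Income"].mean())  # 7500.0 here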

Data Preprocessing

How to correct noisy data?

  • Binning: The data are sorted and divided into equal-sized intervals (bins); the values in each bin are then smoothed by replacing them with the bin mean, median, or boundary values (see the sketch after this list).
  • Regression: The data are fitted to a regression function and smoothed toward the fitted values.
  • Clustering: The data are grouped by similarity; outliers and extreme values fall outside the clusters and are identified and deleted.
  • Manual detection of erroneous data: Suspicious values are found and checked by humans.
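A minimal sketch of smoothing by bin means, with made-up numbers and equal-sized bins of the sorted data (numpy is an assumption for illustration):

    import numpy as np

    data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

    # Partition the sorted data into 3 equal-sized bins
    bins = np.array_split(data, 3)  # [4 8 15], [21 21 24], [25 28 34]

    # Smoothing by bin means: replace every value with the mean of its bin
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]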

Data Transformation

Creating new attributes from given attributes.

  • Generalization: Summarizing the data.
  • Normalization (Statistical Normalization):
    • Useful when attributes have significantly different scales; it brings the data into a single format and reduces values to a smaller range.
    • Makes data from different scaling systems comparable by transforming them into a common scale using mathematical functions.

Types of Normalization

  • Z-Score Normalization
  • Min-Max Normalization

Z-score normalization

Z-score normalization (or standardization) transforms data to have a mean of 0 and a standard deviation of 1. This technique is useful for comparing data points from different distributions or preparing data for machine learning algorithms sensitive to scale.

The Z-score of a data point \(x\) is calculated as:

\[ z = \frac{x - \mu}{\sigma} \]

Where:

  • \(x\): The data point.
  • \(\mu\): The mean of the dataset.
  • \(\sigma\): The standard deviation of the dataset.
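A short sketch of z-score normalization on a small made-up sample; the population standard deviation is used, matching \(\sigma\) in the formula:

    import statistics

    data = [10, 20, 30, 40, 50]

    mu = statistics.mean(data)       # 30
    sigma = statistics.pstdev(data)  # population standard deviation, about 14.14

    z_scores = [(x - mu) / sigma for x in data]
    print(z_scores)  # approximately [-1.41, -0.71, 0.0, 0.71, 1.41]

After this transformation the values have mean 0 and standard deviation 1.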

Min-max normalization

Min-Max Normalization scales data to a specified range, typically [0, 1], by adjusting the values proportionally within the given range. It is useful for ensuring that all features contribute equally to analyses or machine learning models.

The Min-Max Normalization formula is:

\[ x' = \frac{x - x_{min}}{x_{max} - x_{min}} \]

Min-max example: Sample data {10,20,30,40,50}

  • Step 1: identify min and max, 10 and 50, respectively.
  • Step 2: Apply min-max formula to each value
    • 10: (10 - 10)/(50 -10) = 0
    • 20: (20 - 10)/(50 -10) = 0.25
    • 30: (30 - 10)/(50 -10) = 0.50
    • 40: (40 - 10)/(50 -10) = 0.75
    • 50: (50 - 10)/(50 -10) = 1

Min-max normalized data: {0.00, 0.25, 0.50, 0.75, 1.00}
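The same worked example as a few lines of plain Python:

    data = [10, 20, 30, 40, 50]
    x_min, x_max = min(data), max(data)

    # Apply x' = (x - x_min) / (x_max - x_min) to each value
    normalized = [(x - x_min) / (x_max - x_min) for x in data]
    print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]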

Example

In the table below, we have two attributes with unrelated scales. We applied z-score normalization to Experience and min-max normalization to Savings.

Notice that the normalized features now lie in the ranges [-0.97, 1.79] and [0, 1], respectively.

Experience (Years)  Savings (TL)  zscore_experience  minmax_savings
 6                   100,000      -0.97              0.00
 7                   250,000      -0.87              0.23
15                   750,000      -0.08              1.00
15                   150,000      -0.08              0.08
18                   400,000       0.21              0.46
34                   650,000       1.79              0.85
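A sketch reproducing the two normalized columns in Python. Note that the z-scores in the table correspond to the sample standard deviation (division by n - 1, statistics.stdev), a common variant of the population formula given earlier; the small difference in the last value (1.80 vs 1.79) is only rounding:

    import statistics

    experience = [6, 7, 15, 15, 18, 34]
    savings = [100_000, 250_000, 750_000, 150_000, 400_000, 650_000]

    # Z-score normalization of Experience (sample standard deviation, n - 1)
    mu = statistics.mean(experience)
    s = statistics.stdev(experience)
    zscore_experience = [(x - mu) / s for x in experience]
    print([round(z, 2) for z in zscore_experience])
    # [-0.97, -0.87, -0.08, -0.08, 0.21, 1.8]

    # Min-max normalization of Savings
    s_min, s_max = min(savings), max(savings)
    minmax_savings = [(x - s_min) / (s_max - s_min) for x in savings]
    print([round(v, 2) for v in minmax_savings])
    # [0.0, 0.23, 1.0, 0.08, 0.46, 0.85]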