FEF3001 Introduction to Artificial Intelligence - Lecture 8
2024-11-28
In the Zoom chat window, please write down your department and an example of a classification task related to your domain.
Pick one example and discuss the data.
Visit Kaggle and find a related dataset.
In case of class imbalance: down-sampling (the majority class) or up-sampling (the minority class); see the sketch below.
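A minimal sketch of both remedies using scikit-learn's `resample` utility; the feature values, class sizes, and labels below are invented for illustration:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 90 samples of class 0, 10 of class 1 (invented)
X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)

X_major, y_major = X[y == 0], y[y == 0]
X_minor, y_minor = X[y == 1], y[y == 1]

# Down-sampling: shrink the majority class to the minority size
X_major_down, y_major_down = resample(X_major, y_major,
                                      replace=False, n_samples=len(y_minor),
                                      random_state=42)

# Up-sampling: grow the minority class (sampling with replacement)
X_minor_up, y_minor_up = resample(X_minor, y_minor,
                                  replace=True, n_samples=len(y_major),
                                  random_state=42)
```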
Decision Trees are a classification method that uses a tree-like model of decisions and their possible consequences. The algorithm learns a series of if-then-else decision rules that split the data based on feature values, creating a structure that resembles a flowchart. Each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or decision.
Key terms: test (internal node), branch (outcome of a test), leaf (class label).
| Hours Studied | Previous Score | Attended Review | Pass? |
|---|---|---|---|
| 3 | 60 | No | ? |
| 4 | 75 | No | ? |
| 7 | 80 | Yes | ? |
Questions: Which feature should form the first branch? At what value should we create the split (5 hours, 70 points, etc.)? See the sketch below for how a library decides.
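A minimal sketch of how scikit-learn answers these questions. The table above only lists unlabeled query rows, so the training rows and pass/fail labels below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [hours studied, previous score, attended review (0/1)]
X = [[2, 50, 0], [3, 60, 0], [4, 75, 0], [5, 65, 1],
     [6, 70, 1], [7, 80, 1], [8, 85, 1], [1, 55, 0]]
y = [0, 0, 0, 1, 1, 1, 1, 0]  # 0 = fail, 1 = pass

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# The printed rules show which feature is tested first and at what threshold
print(export_text(clf, feature_names=["hours", "prev_score", "review"]))

# Predict the unlabeled rows from the table
print(clf.predict([[3, 60, 0], [4, 75, 0], [7, 80, 1]]))
```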
Entropy formula: \(H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)\)
Where \(S\) is the dataset, \(c\) is the number of classes, and \(p_i\) is the proportion of examples belonging to class \(i\).
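As a quick worked example (numbers chosen for illustration), a dataset with 9 positive and 5 negative examples has entropy
\[H(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940.\]
A pure node (all one class) has entropy 0; an even 50/50 split has entropy 1.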
Information gain formula: \(IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)\)
Where \(S\) is the dataset, \(A\) is the feature being considered for splitting, \(Values(A)\) are the possible values of feature \(A\), and \(S_v\) is the subset of \(S\) where feature \(A\) has value \(v\).
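Continuing the illustrative example above: suppose a binary feature \(A\) splits the 14 examples into one subset with 6 positive / 2 negative and another with 3 positive / 3 negative. Then \(H(S_1) \approx 0.811\), \(H(S_2) = 1.0\), and
\[IG(S, A) = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.0) \approx 0.048.\]
The algorithm computes this for every candidate split and chooses the one with the highest gain.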
Gini impurity formula: \(Gini(S) = 1 - \sum_{i=1}^{c} (p_i)^2\)
Where \(S\) is the dataset, \(c\) is the number of classes, and \(p_i\) is the proportion of examples belonging to class \(i\).
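For the same illustrative 9-positive / 5-negative dataset:
\[Gini(S) = 1 - \left(\tfrac{9}{14}\right)^2 - \left(\tfrac{5}{14}\right)^2 \approx 0.459.\]
Like entropy, Gini impurity is 0 for a pure node and maximal (0.5 for two classes) for an even split.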
The choice between using entropy (with information gain) or Gini impurity often depends on the specific implementation of the decision tree algorithm. In practice, they often yield similar results.
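In scikit-learn, for instance, switching between the two measures is a single parameter (a sketch, reusing `X` and `y` from the earlier decision-tree example):

```python
from sklearn.tree import DecisionTreeClassifier

# Same data, two impurity criteria; the resulting trees are often identical
tree_gini = DecisionTreeClassifier(criterion="gini").fit(X, y)
tree_entropy = DecisionTreeClassifier(criterion="entropy").fit(X, y)
```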
Please visit these links for details about the algorithms:
https://www.dataspoof.info/post/decision-tree-classification-in-r/
https://forum.posit.co/t/decision-tree-in-r/5561/5
Advantages of Decision Trees:
- Easy to understand, interpret, and visualize
- Require little data preparation (no feature scaling; handle numeric and categorical features)
- The learned decision rules can be inspected and explained
Disadvantages:
- Prone to overfitting, especially when grown deep
- Unstable: small changes in the data can produce a very different tree
- Greedy splitting does not guarantee a globally optimal tree
Random forest is a commonly-used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, that combines the output of multiple decision trees to reach a single result.
A random forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by the most trees.
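A minimal scikit-learn sketch, reusing the invented study data from the decision-tree example; `n_estimators` controls how many trees vote:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)  # X, y: the invented study data from the earlier sketch

# Majority vote across the trees
print(forest.predict([[7, 80, 1]]))
```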
Please visit: https://www.kaggle.com/code/lara311/diabetes-prediction-using-machine-learning
The Basic Idea
Imagine you’re trying to separate different types of objects, like apples and oranges, based on their characteristics, such as color, shape, and size. You want to find a way to draw a line (or a hyperplane in higher dimensions) that separates the two types of objects as accurately as possible.
A Support Vector Machine is a type of supervised learning algorithm that aims to find the best hyperplane separating the data into different classes.
Key Concepts
SVMs are powerful because they:
- Maximize the margin between classes, which tends to generalize well
- Remain effective in high-dimensional spaces
- Can model non-linear boundaries via the kernel trick
(Figure) H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.
(Figure) Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.
Please visit the SVM demo site for an online interactive demo of SVMs.
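A minimal scikit-learn sketch of a linear SVM on a toy two-feature dataset (points invented for illustration); the fitted model exposes its support vectors directly:

```python
from sklearn import svm

# Invented 2-D points for two classes (e.g., "apple" vs. "orange" features)
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)   # the samples lying on the margin
print(clf.predict([[4, 4]]))  # which side of the hyperplane?
```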
The Basic Idea
Logistic regression is a supervised machine learning algorithm that performs binary classification by predicting the probability of an outcome, event, or observation. The model delivers one of two possible outcomes: yes/no, 0/1, or true/false.
Logistic Regression is a type of supervised learning algorithm that models the probability of an event occurring (e.g., passing an exam) based on a set of input variables (e.g., scores). It works by passing a weighted sum of the inputs through the logistic (sigmoid) function \(\sigma(z) = \frac{1}{1 + e^{-z}}\), which squashes any real number into a probability between 0 and 1.
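A minimal scikit-learn sketch on invented exam data (hours studied vs. pass/fail); `predict_proba` exposes the modeled probability:

```python
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied vs. pass (1) / fail (0)
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

print(model.predict_proba([[4.5]]))  # [P(fail), P(pass)] near the boundary
print(model.predict([[2], [7]]))     # hard 0/1 decisions
```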
Logistic Regression is a popular algorithm because it:
- Outputs probabilities, not just hard class labels
- Is fast to train and serves as a strong baseline
- Has interpretable coefficients (each feature's effect on the log-odds)