What is the objective for each method?

Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition, 2006.
Tan, Steinbach, Kumar, "Introduction to Data Mining". (C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002.

The goal of the course is to introduce students to the current theories, practices, tools, and techniques in data mining. This introduction also covers how to understand the business objectives and needs behind a data mining project.

Data mining is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Task of inferring a model from labeled training data …

- Classification: Application 2 (Fraud Detection). Goal: Predict fraudulent cases in credit card transactions.
- Classification: Application 3 (Sky Survey Cataloging). Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).

Given a set of data points, each having a set of attributes and a similarity measure among them, find clusters such that …

- Clustering: Application 1 (Market Segmentation).
- Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

Precision: the closeness of repeated measurements to one another.

Question 1. Suppose that you are employed as a data mining consultant for an Internet search engine company.

The problem of finding hidden structure in unlabeled data is called …

- A set of columns in data that can be used for identifying each record uniquely
- C. Non-trivial extraction of possibly useful and previously unknown information in data

Steps of the knowledge discovery process:
- Data selection: to retrieve data from databases
- Pattern evaluation: to identify interesting patterns
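As a rough illustration of the document clustering goal stated in this section (grouping documents that are similar based on the terms appearing in them), the sketch below compares bag-of-words term counts with cosine similarity and groups documents greedily. This is not the method of any of the cited books: the function names, the sample documents, the greedy single-pass grouping, and the 0.5 threshold are all assumptions made for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_documents(docs, threshold=0.5):
    """Greedy grouping (an assumption, not k-means): each document joins the
    first group whose first member is at least `threshold` similar to it;
    otherwise it starts a new group. Returns lists of document indices."""
    groups = []
    for i, doc in enumerate(docs):
        vec = term_vector(doc)
        for group in groups:
            if cosine_similarity(vec, group["seed"]) >= threshold:
                group["members"].append(i)
                break
        else:
            groups.append({"seed": vec, "members": [i]})
    return [group["members"] for group in groups]


# Hypothetical sample documents, made up for this example.
docs = [
    "stock market trading prices",
    "market prices stock shares",
    "galaxy star telescope survey",
    "telescope survey star images",
]
print(group_documents(docs))  # [[0, 1], [2, 3]]
```

In practice the raw counts would usually be replaced by weighted terms (for example TF-IDF) and the greedy pass by a standard clustering algorithm; the sketch only shows how a similarity measure over terms leads to groups.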
In this course, we will begin with an exploration of the field and profession of data science, with a focus on the skills and ethical considerations required when working with data.

Goal (market segmentation): Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

Typically, a data set is split 70/30 into training and test data.

Clustering: Application 2 (Document Clustering).

Data integration: to combine multiple data sources.

zip codes, ID numbers, dates, colours, standard sizes, etc.

General characteristics of a data set:
- The dimensionality of a data set is the number of attributes that the objects in the data set possess.
- The sparsity of a data set is the frequency with which attributes appear in the descriptions of the objects.
- The resolution of a data set is the average "distance" between the measurements of the attributes of the data objects.

Types of data sets:
- No explicit relationship among records or data fields; every record has the same set of attributes.
- A set of records, where each record involves a set of items.
- All records have a fixed set of numeric attributes; data objects can be considered as "points" in a multidimensional space where each dimension represents a distinct attribute describing the object.
- A data matrix with missing or unavailable elements.
- An extension of record data where each record has a time moment associated with it.
- A data set that is a sequence of individual entities, such as a sequence of words or letters.
- A special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time.
- Record data that have spatial attributes, such as positions or areas, and other types of attributes.
- Relationships among the objects convey important information; the data is represented as a graph.
- If objects have internal structure, then the objects contain sub-objects that have relationships among them.
- Data with objects that are graphs and have relationships amongst objects.

Data Quality: Measurement and data collection errors
- Measurement error happens when a recorded value differs from the true value.
- Noise is the random component of a measurement error; it distorts a value or adds spurious objects.

Data Quality: Precision, bias, accuracy, and outliers.

Weka supports major data mining tasks, including preprocessing, visualization, regression, etc.

- Customer relationship management applications
- Medicine/Science/Engineering applications
- Understand the mapping relationship between the inter-individual variation in human DNA sequences …

Input data can be described by the following:
- A data set is a collection of data objects (records, points, vectors, graphs, observations, etc.).
- An attribute is a property or characteristic of an object that may vary either from one object to another or from one time to another.
- The attribute type is determined by the properties of its values that correspond to underlying properties of the attribute.
- Nominal: The values of a nominal attribute are just different names.
- Discrete attribute: has a finite or countably infinite set of values, e.g.