Understanding Data Analytics and Data Mining
An important aspect of the decision-making process is the ability to transform seemingly unrelated data into useful information that can inform a person’s decision. Understanding what data is needed to make effective decisions and where that data comes from is just one step in the process; the next step is mining or analyzing that data to draw useful conclusions that aid decision making.
The Understanding Data Analysis and Data Mining presentation is designed to explore the general principles behind this second step and to help the organization understand its options for using data effectively in its business.
Distinguishing Analysis and Mining
The terms “data analysis” and “data mining” are sometimes used interchangeably, but they are distinctly different in practice.
In data analysis, a hypothesis is formed and the data is analyzed to support or disprove the hypothesis.
In data mining, no hypothesis is formed initially but the data is analyzed to identify any interesting patterns from which a hypothesis can be drawn.
Despite their differences, the techniques and methods for both data analysis and data mining are similar.
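The difference between the two approaches can be sketched on a small data set. This is only an illustration: the records, the field names, and the "largest gap between group averages" measure of interestingness are assumptions, not part of the presentation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (day_type, region, sales)
records = [
    ("weekend", "north", 120), ("weekend", "south", 130),
    ("weekday", "north", 90), ("weekday", "south", 100),
    ("weekend", "north", 125), ("weekday", "south", 95),
]
FIELDS = {"day_type": 0, "region": 1}  # column index of each descriptive field


def group_means(field):
    """Average sales for each value of the given field."""
    groups = defaultdict(list)
    for row in records:
        groups[row[FIELDS[field]]].append(row[2])
    return {value: mean(sales) for value, sales in groups.items()}


# Data analysis: start from a hypothesis ("weekend sales are higher")
# and check whether the data supports it.
by_day = group_means("day_type")
hypothesis_supported = by_day["weekend"] > by_day["weekday"]


# Data mining: no hypothesis up front; scan every field and report the one
# whose groups differ the most, as a candidate for a new hypothesis.
def spread(field):
    means = group_means(field).values()
    return max(means) - min(means)


most_interesting_field = max(FIELDS, key=spread)
```

Both halves reuse the same grouping routine, which mirrors the point that the techniques are similar even though the starting question differs.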
Knowledge Discovery in Databases
The Knowledge Discovery in Databases process includes the following steps:
Data are a set of facts.
Facts are true or proven.
Data can come in a variety of types:
Define Data Entry
A data entry is a single instance or record in a database. Data entries are also called data objects.
A data entry establishes a relationship between data elements, such as:
person and address
customers and purchases
events and outcomes
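As a rough sketch, a data entry relating customers and purchases could be modeled as a record type. The class name, fields, and sample values below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    """One data entry (data object): relates a customer to a purchased item."""
    customer: str
    item: str
    amount: float


# Hypothetical entries; each row is a single record.
entries = [
    Purchase("alice", "laptop", 900.0),
    Purchase("bob", "mouse", 25.0),
    Purchase("alice", "monitor", 200.0),
]


def purchases_for(customer):
    """Follow the customer-to-purchase relationship stored in the entries."""
    return [entry.item for entry in entries if entry.customer == customer]
```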
A dimension is a collection of facts about a measurable situation.
Dimensions define the who, what, where, when, and how of a particular focus on the data.
Dimensions are used to construct how data patterns are identified and analyzed.
Dimensions – Cube Schema
The cube rendering is a product of online analytical processing (OLAP) and is used to show how the different dimensions of data can be viewed.
4 retail locations
2 age groups
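A cube can be approximated as totals keyed by one value per dimension. The sketch below assumes four store names, two age groups, and a third "quarter" dimension; all of these names and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical facts: (location, age_group, quarter, sales)
facts = [
    ("store_a", "under_40", "Q1", 10),
    ("store_a", "40_plus", "Q1", 7),
    ("store_b", "under_40", "Q1", 4),
    ("store_b", "under_40", "Q2", 6),
    ("store_c", "40_plus", "Q2", 9),
    ("store_d", "under_40", "Q2", 3),
    ("store_a", "under_40", "Q1", 5),
]

# Each cell of the cube is addressed by one value from every dimension.
cube = defaultdict(int)
for location, age_group, quarter, sales in facts:
    cube[(location, age_group, quarter)] += sales


def view(axis):
    """Total sales along one dimension (0=location, 1=age_group, 2=quarter)."""
    totals = defaultdict(int)
    for cell, sales in cube.items():
        totals[cell[axis]] += sales
    return dict(totals)
```

Calling view(0), view(1), or view(2) looks at the same data from a different face of the cube.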
Dimensions – Star Schema
Star schemas are used to design how data is organized in data warehouses.
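A minimal star schema can be sketched with an in-memory SQLite database: one central fact table that references surrounding dimension tables. The table names, columns, and rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive context (the points of the star)
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, age_group TEXT);

    -- Fact table: measurements, keyed by the dimensions (the center of the star)
    CREATE TABLE fact_sales (
        location_id INTEGER REFERENCES dim_location(location_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount REAL
    );

    INSERT INTO dim_location VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO dim_customer VALUES (1, 'under_40'), (2, '40_plus');
    INSERT INTO fact_sales VALUES (1, 1, 20.0), (1, 2, 30.0), (2, 1, 15.0);
""")

# Analysis joins the fact table to whichever dimension is of interest.
sales_by_city = conn.execute("""
    SELECT l.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_location AS l ON l.location_id = f.location_id
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```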
Online Analytical Processing
Online Analytical Processing is an approach for analyzing multidimensional data from multiple perspectives interactively.
The acronym for online analytical processing is OLAP.
A pattern is an expression of data which can be modeled.
Data analysis and data mining focus on identifying, understanding, and drawing conclusions about interesting patterns.
An interesting pattern has the following characteristics:
It can be understood easily by humans
It can be recreated, meaning it has some level of certainty to its validity
It can be potentially used by the organization
It is novel, innovative, and requires investigation
For data analysis, it validates and confirms the hypothesis
Queries are a mechanism for retrieving information from a database; each query poses a question to the database.
Standard queries are predefined questions to ask a database.
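One way to picture a standard query is as a question whose wording is fixed in advance and only its parameter changes. The sketch below uses SQLite with a hypothetical purchases table; the table, query name, and data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("alice", 900.0), ("bob", 25.0), ("alice", 200.0)],
)

# A standard query: the question is predefined, only the parameter varies.
TOTAL_SPENT_BY_CUSTOMER = "SELECT SUM(amount) FROM purchases WHERE customer = ?"


def total_spent(customer):
    """Ask the predefined question for one customer."""
    (total,) = conn.execute(TOTAL_SPENT_BY_CUSTOMER, (customer,)).fetchone()
    # SUM returns NULL when no rows match; report that as zero spending.
    return total or 0.0
```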
Data Mining Techniques
There are several techniques of note in data mining:
Characterization and Discrimination
Associations and Correlations
Classification and Regression
Characterization and Discrimination
Characterization will describe the data in summary or general terms.
Discrimination will describe the data, usually by means of comparison.
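A small sketch of the two ideas, assuming hypothetical order data split into a "member" and a "guest" segment: characterization summarizes one group, and discrimination compares that summary against a contrasting group.

```python
from statistics import mean

# Hypothetical data: (customer_segment, order_value)
orders = [
    ("member", 60), ("member", 80), ("member", 70),
    ("guest", 30), ("guest", 50),
]


def characterize(segment):
    """Describe one segment in summary terms."""
    values = [value for seg, value in orders if seg == segment]
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


def discriminate(target, contrast):
    """Describe the target segment by comparison with a contrasting segment."""
    t, c = characterize(target), characterize(contrast)
    return {key: t[key] - c[key] for key in t}
```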
Association and Correlation
Associations and correlations are pattern relationships made against data objects.
They are often used in frequent pattern mining.
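Frequent pattern mining commonly scores an association with two measures, support and confidence. The sketch below computes both over a handful of hypothetical shopping baskets; the items and baskets are invented for illustration.

```python
# Hypothetical transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)
```

For example, confidence({"bread"}, {"milk"}) asks how often milk appears in baskets that already contain bread.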
Classification and Regression
Classification attempts to find a model that describes the data set in terms of predefined classes, so that new data objects can be assigned to a class.
Regression attempts to find a model that predicts missing or unavailable numerical data values.
These are predictive approaches and utilize methods such as decision trees and neural networks.
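Both ideas can be sketched in a few lines. The regression half fits a straight line by ordinary least squares to estimate an unavailable numeric value; the classification half uses a single-threshold rule, which is a one-level decision tree (a "decision stump"). The data, the labels, and the choice of these two simple models are assumptions for illustration.

```python
from statistics import mean

# Regression: hypothetical (ad_spend, sales) pairs with known sales.
known = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(known)
predicted_sales = slope * 5.0 + intercept  # estimate for an unseen ad_spend

# Classification: hypothetical labeled samples with predefined classes.
labeled = [(1.0, "low"), (2.0, "low"), (3.0, "high"), (4.0, "high")]


def best_threshold(samples):
    """Pick the cut point that misclassifies the fewest training samples."""
    def errors(threshold):
        return sum((x > threshold) != (label == "high") for x, label in samples)
    return min(sorted(x for x, _ in samples), key=errors)


threshold = best_threshold(labeled)


def classify(x):
    """Assign a class label using the learned one-level decision rule."""
    return "high" if x > threshold else "low"
```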
In cluster analysis, data objects are analyzed without using class labels; the resulting groups can instead be used to generate class labels.
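Grouping data objects without class labels is commonly done by clustering. The sketch below assumes a simple one-dimensional k-means procedure with hand-picked starting centers; the values, the starting centers, and the fixed number of rounds are illustrative choices, not something the presentation prescribes.

```python
from statistics import mean


def k_means_1d(values, centers, rounds=10):
    """Repeatedly assign each value to its nearest center, then move each
    center to the average of the values assigned to it."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical unlabeled measurements.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(values, centers=[1.0, 12.0])

# The cluster index can now serve as a generated class label.
labels = {value: index for index, cluster in enumerate(clusters) for value in cluster}
```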
Outlier analysis looks at the abnormalities in data: data that does not behave as expected.
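One simple way to flag data that does not behave as expected is to measure how far each value lies from the average, in units of standard deviation. The sketch below assumes hypothetical daily order counts and a cutoff of 2 standard deviations; both are illustrative choices.

```python
from statistics import mean, pstdev

# Hypothetical daily order counts; one day looks unusual.
daily_orders = [10, 11, 9, 10, 12, 10, 50]


def find_outliers(values, cutoff=2.0):
    """Return values more than `cutoff` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - center) / spread > cutoff]


outliers = find_outliers(daily_orders)
```

Because a single extreme value inflates both the mean and the standard deviation, more robust measures such as the median may be preferable on real data.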
Cross Industry Standard Process for Data Mining (CRISP-DM) was developed under the European Strategic Program on Research in Information Technology (ESPRIT).
Sample, Explore, Modify, Model, and Assess (SEMMA) was developed by SAS Institute Inc.
The Toolkit is designed to enable an organization to improve its capabilities in data warehousing and data analysis, while maintaining a level of neutrality between specific technical solutions. The Toolkit comprises two parts: an introduction to the concepts and terms used in these areas, and usable templates to pursue and implement specific technical solutions.
The goal of the Data Warehouse and Data Analysis Toolkit is to define the contributing factors, major components, and their relationships, while providing the basic tools to take action based on the organization’s needs.
The presentations found within the Toolkit provide education about the different facets of Data Warehousing and Data Analysis: they can be used for self-edification or as the foundation for presenting a case to different levels of the organization.
The process document, Developing Data Analysis Capabilities, is intended to be a step-by-step guide to creating a Data Analysis foundation in your organization. Multiple templates have been created to support the process and aid organizations in their efforts to improve their Data Analysis capabilities.