store.theartofservice.com/itil.html
Understanding Data Analytics and Data Mining

Introduction

An important aspect of the decision-making process is the ability to transform seemingly unrelated data into useful information that can influence a person’s decision. Understanding what data is needed to make effective decisions and where that data comes from is just one step in the process: the next step is mining or analyzing that data to draw useful conclusions that aid decision making.

The Understanding Data Analysis and Data Mining presentation is designed to explore the general principles behind this second step and to support the organization in understanding its options for using data effectively in its business.

Distinguishing Analysis and Mining

The terms “data analysis” and “data mining” are sometimes used interchangeably, but they are distinctly different in practice.

In data analysis, a hypothesis is formed and the data is analyzed to support or disprove the hypothesis.

In data mining, no hypothesis is formed initially but the data is analyzed to identify any interesting patterns from which a hypothesis can be drawn.

Despite their differences, the techniques and methods for both data analysis and data mining are similar.
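
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the records, the field names, and the hypothesis about customers aged 40 and over are all invented for the example. The analysis half starts from a hypothesis and checks it; the mining half starts with no hypothesis and scans every pair of fields for the strongest correlation.

```python
from itertools import combinations
from statistics import mean

# Hypothetical records: (age, visits_per_month, spend).
records = [
    (23, 2, 40.0),
    (35, 4, 85.0),
    (41, 5, 110.0),
    (52, 7, 150.0),
    (29, 3, 60.0),
    (60, 8, 170.0),
]
FIELDS = ("age", "visits", "spend")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Data analysis: form a hypothesis first, then check it against the data.
# Assumed hypothesis: customers aged 40+ spend more on average.
older = [r[2] for r in records if r[0] >= 40]
younger = [r[2] for r in records if r[0] < 40]
hypothesis_supported = mean(older) > mean(younger)

# Data mining: no hypothesis; look at every pair of fields for a pattern.
columns = {name: [r[i] for r in records] for i, name in enumerate(FIELDS)}
correlations = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(FIELDS, 2)
}
strongest_pair = max(correlations, key=lambda pair: abs(correlations[pair]))
```

Both halves use the same arithmetic, which reflects the point that the techniques are similar even though the starting points differ.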

Knowledge Discovery in Databases

The Knowledge Discovery in Databases process includes the following steps:

Selection

Preprocessing

Transformation

Data Mining

Interpretation/Evaluation

Knowledge Presentation
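
As a rough sketch, the six steps can be read as a chain of small functions, each feeding the next. The sample rows, the function names, and the choice of a simple frequency count as the "mining" step are assumptions made for illustration; a real process would use far richer logic at each stage.

```python
from collections import Counter

# Hypothetical raw rows: (customer, product, amount); None marks a missing value.
raw = [
    ("ann", "Tea", 4.0),
    ("bob", "coffee", None),
    ("ann", "tea ", 5.0),
    ("cy", "Coffee", 3.0),
    ("dee", "tea", 6.0),
]


def select(rows):
    # Selection: keep only the fields relevant to the question.
    return [(product, amount) for _, product, amount in rows]


def preprocess(rows):
    # Preprocessing: drop rows with missing amounts.
    return [(p, a) for p, a in rows if a is not None]


def transform(rows):
    # Transformation: normalize product names to a common form.
    return [(p.strip().lower(), a) for p, a in rows]


def mine(rows):
    # Data mining: here, simply count how often each product occurs.
    return Counter(p for p, _ in rows)


def evaluate(counts, min_count=2):
    # Interpretation/evaluation: keep patterns seen at least min_count times.
    return {p: n for p, n in counts.items() if n >= min_count}


def present(patterns):
    # Knowledge presentation: render the findings as readable lines.
    return [f"{p}: purchased {n} times" for p, n in sorted(patterns.items())]


report = present(evaluate(mine(transform(preprocess(select(raw))))))
```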

Defining Data

Data are a set of facts.

Facts are true or proven.

Data can come in a variety of types:

Relational data

Operational data

Transactional data

Define Data Entry

A data entry is a single instance or record in a database. Data entries are also called data objects.

A data entry establishes a relationship between data elements, for example:

person and address

customers and purchases

events and outcomes
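
One way to picture a data entry is as a small record type that ties its elements together. The sketch below uses the person and address pairing; the class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    street: str
    city: str


@dataclass(frozen=True)
class Person:
    name: str
    address: Address  # the entry relates a person to an address


# A single data entry (data object) with made-up values.
entry = Person(name="Ann Example", address=Address("1 Main St", "Springfield"))
```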

Define Dimensions

A dimension is a collection of facts about a measurable situation.

Dimensions define the who, what, where, when, and how of a particular focus on the data.

Dimensions are used to construct how data patterns are identified and analyzed.

Dimensions – Cube Schema

The cube rendering is a product of online analytical processing (OLAP) and is used to show how the different dimensions of data can be viewed.

Retail Example:

4 retail locations

10 products

12 months

2 age groups
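
With these four dimensions, the cube holds 4 × 10 × 12 × 2 = 960 cells, one for each combination of location, product, month, and age group. A minimal sketch of that arithmetic, using placeholder member names that are not part of the original example:

```python
from itertools import product
from math import prod

# Dimension members; the names are placeholders, only the counts matter.
dimensions = {
    "location": [f"store_{i}" for i in range(1, 5)],    # 4 retail locations
    "product": [f"product_{i}" for i in range(1, 11)],  # 10 products
    "month": list(range(1, 13)),                        # 12 months
    "age_group": ["group_a", "group_b"],                # 2 age groups
}

# Number of cells is the product of the dimension sizes.
cell_count = prod(len(members) for members in dimensions.values())

# Each cell is addressed by one member from every dimension.
cells = list(product(*dimensions.values()))
```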

Dimensions – Star Schema

Star schemas are used to design how data is organized in data warehouses.
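
A star schema places a central fact table at the middle, with dimension tables around it that the fact rows point to. The sketch below builds a tiny one in an in-memory SQLite database; the table names, columns, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who, what, where".
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);

    -- The fact table holds the measurements and points at the dimensions.
    CREATE TABLE fact_sales (
        location_id INTEGER REFERENCES dim_location(location_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        month       INTEGER,
        units       INTEGER
    );

    INSERT INTO dim_location VALUES (1, 'North'), (2, 'South');
    INSERT INTO dim_product  VALUES (1, 'Tea'), (2, 'Coffee');
    INSERT INTO fact_sales VALUES
        (1, 1, 1, 10), (1, 2, 1, 5), (2, 1, 1, 7), (2, 1, 2, 3);
""")

# Join the fact table to one dimension and total the units per city.
units_by_city = dict(conn.execute("""
    SELECT l.city, SUM(f.units)
    FROM fact_sales AS f
    JOIN dim_location AS l ON l.location_id = f.location_id
    GROUP BY l.city
""").fetchall())
```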

Online Analytical Processing

Online Analytical Processing is an approach for analyzing multidimensional data from multiple perspectives interactively.

The acronym for online analytical processing is OLAP.
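
Two common ways of viewing a cube from different perspectives are rolling up (aggregating over some dimensions) and slicing (fixing one dimension to a single value). The following sketch shows both on a small dictionary-based cube; the cell values and the helper names `roll_up` and `slice_cube` are assumptions made for this example, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells: (location, product, month) -> units sold.
cube = {
    ("north", "tea", 1): 10,
    ("north", "tea", 2): 12,
    ("north", "coffee", 1): 5,
    ("south", "tea", 1): 7,
    ("south", "coffee", 2): 9,
}
AXES = ("location", "product", "month")


def roll_up(cells, keep):
    """Aggregate units over every dimension not named in keep."""
    idx = [AXES.index(name) for name in keep]
    totals = defaultdict(int)
    for key, units in cells.items():
        totals[tuple(key[i] for i in idx)] += units
    return dict(totals)


def slice_cube(cells, axis, value):
    """Keep only the cells where one dimension has a fixed value."""
    i = AXES.index(axis)
    return {key: units for key, units in cells.items() if key[i] == value}


by_location = roll_up(cube, ["location"])
tea_by_month = roll_up(slice_cube(cube, "product", "tea"), ["month"])
```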

Defining Patterns

A pattern is an expression of data which can be modeled.

Data analysis and data mining focus on identifying, understanding, and drawing conclusions about interesting patterns.

An interesting pattern has the following characteristics:

It can be understood easily by humans

It can be recreated, meaning it has some level of certainty to its validity

It can be potentially used by the organization

It is novel, innovative, and requires investigation

For data analysis, it validates and confirms the hypothesis

Queries

Queries are a mechanism for retrieving information from a database: each query poses a question to the database.

Standard queries are predefined questions to ask a database.
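
A standard query can be thought of as a question whose wording is fixed in advance, with only a parameter changing between runs. The sketch below uses an in-memory SQLite database; the table, the rows, and the `total_spent` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("ann", "tea", 4.0), ("bob", "coffee", 3.0), ("ann", "coffee", 3.5)],
)

# A standard query: the question is predefined, only the parameter varies.
TOTAL_SPENT_BY_CUSTOMER = "SELECT SUM(amount) FROM purchases WHERE customer = ?"


def total_spent(customer):
    """Return the total amount spent by one customer (0.0 if none)."""
    (total,) = conn.execute(TOTAL_SPENT_BY_CUSTOMER, (customer,)).fetchone()
    return total or 0.0


ann_total = total_spent("ann")
```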

Data Mining Techniques

There are several techniques of note in data mining:

Characterization and Discrimination

Associations and Correlations

Classification and Regression

Cluster Analysis

Outlier Analysis

Characterization and Discrimination

Characterization will describe the data in summary or general terms.

Discrimination will describe the data, usually by means of comparison.
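
A small sketch of the difference, assuming two invented customer groups: characterization summarizes one group on its own, while discrimination describes it by comparison with a contrasting group. The function names and the choice of summary statistics are illustrative only.

```python
from statistics import mean

# Hypothetical purchase amounts for two customer groups.
spend = {
    "members": [20.0, 30.0, 40.0],
    "non_members": [10.0, 15.0, 20.0],
}


def characterize(values):
    # Characterization: summarize one class of data in general terms.
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def discriminate(target, contrast):
    # Discrimination: describe the target class by comparing it with another.
    return {"mean_difference": mean(target) - mean(contrast)}


member_summary = characterize(spend["members"])
member_vs_non = discriminate(spend["members"], spend["non_members"])
```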

Association and Correlation

Associations and correlations are pattern relationships made against data objects.

They are often used in frequent pattern mining.
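
Frequent pattern mining is commonly described with two measures: support (how often an itemset appears) and confidence (how often a rule holds when its left side appears). The sketch below computes both over a handful of made-up shopping baskets; the 50% support threshold is an arbitrary assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (shopping baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction with consequent."""
    return support(antecedent | consequent) / support(antecedent)


# Pairs appearing in at least half of the baskets (assumed threshold).
frequent_pairs = {
    pair for pair, n in pair_counts.items() if n / len(transactions) >= 0.5
}
```

For instance, `confidence({"bread"}, {"milk"})` measures how often milk is bought when bread is bought.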

Classification and Regression

Classification attempts to find a data model that describes and distinguishes predefined classes in the data set, so that the class of new data objects can be predicted.

Regression attempts to find a data model that predicts missing or unavailable numerical values.

These are predictive approaches and utilize methods such as decision trees and neural networks.
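
To keep the example short, the sketch below deliberately swaps in two simpler methods than the decision trees and neural networks named above: a nearest-neighbor rule for classification and a least-squares straight line for regression. All data points and labels are invented.

```python
from statistics import mean

# Hypothetical labeled examples: (monthly_visits, label).
labeled = [(1, "occasional"), (2, "occasional"), (8, "frequent"), (9, "frequent")]


def classify(visits):
    # Classification: predict a class label from the nearest labeled example.
    _, label = min(labeled, key=lambda ex: abs(ex[0] - visits))
    return label


# Hypothetical numeric observations: (monthly_visits, monthly_spend).
observed = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]


def fit_line(points):
    # Regression: least-squares fit of spend = slope * visits + intercept.
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in points) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


slope, intercept = fit_line(observed)
predicted_spend = slope * 5 + intercept  # estimate an unavailable numeric value
```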

Cluster Analysis

Data objects are analyzed without using class labels; the resulting clusters can be used to generate class labels.
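
As an illustration, the sketch below groups unlabeled values with a simple one-dimensional k-means procedure, one of many possible clustering methods. The values and the choice of starting centers (the minimum and maximum) are assumptions for the example.

```python
from statistics import mean


def kmeans_1d(values, centers, iterations=10):
    """Simple 1-D k-means: assign to nearest center, then recompute centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical unlabeled spend values.
spend = [10.0, 12.0, 11.0, 50.0, 52.0, 49.0]
centers, clusters = kmeans_1d(spend, centers=[min(spend), max(spend)])
```

No labels go in, but the two resulting groups could afterwards be named, for example "low spend" and "high spend".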

Outlier Analysis

Outlier analysis looks at abnormalities in data: data that does not behave as expected.
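
One simple way to flag such data is a z-score test: values that sit more than a chosen number of standard deviations from the mean are treated as outliers. The sketch below assumes a threshold of 2 and made-up values; other methods and thresholds are equally valid.

```python
from statistics import mean, pstdev

# Hypothetical daily order counts with one unusual day.
values = [10.0, 11.0, 12.0, 11.0, 10.0, 50.0]


def z_score_outliers(data, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold std devs."""
    mu = mean(data)
    sigma = pstdev(data)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in data if abs(v - mu) / sigma > threshold]


outliers = z_score_outliers(values)
```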

Standards

Cross Industry Standard Process for Data Mining (CRISP-DM) was developed by an industry consortium with funding from the European Strategic Program on Research in Information Technology (ESPRIT).

Sample, Explore, Modify, Model, and Assess (SEMMA) was developed by SAS Institute Inc.

The Toolkit

The Toolkit is designed to enable an organization to improve its capabilities in data warehousing and data analysis, while maintaining a level of neutrality between specific technical solutions. The Toolkit consists of two parts: an introduction to the concepts and terms used in these areas, and usable templates to pursue and implement specific technical solutions.

The goal of the Data Warehouse and Data Analysis Toolkit is to define the contributing factors, major components, and their relationships, while providing the basic tools to take action based on the organization’s needs.

Moving Forward

The presentations found within the Toolkit provide education about the different facets of Data Warehousing and Data Analysis: they can be used for self-edification or as the foundation for presenting a case to different levels of the organization.

The process document, Developing Data Analysis Capabilities, is intended to be a step-by-step guide to creating a Data Analysis foundation in your organization. Multiple templates have been created to support the process and aid organizations in their efforts to improve their Data Analysis capabilities.
