
A Practical Guide to Advanced Analytics - EBook

By Wayne Eckerson

OUTLINE:

Part I – What is Analytics and Why Should You Care?

Part II – Advanced Analytics: Where Do You Start?

Part III – The Human Side of Advanced Analytics

Part IV – Architecting for Analytics

Part V – Tricks and Tips for Starting an Advanced Analytics Practice

Part VI – Crafting an Analytical Data Model

PART I - What is Analytics and Why Should You Care?

One of the hottest technology topics today is analytics. The problem with analytics is that few people agree on what it is. This often happens with commonly used terms because everyone attaches a slightly different meaning to them based on their needs and perspectives.

I prefer to assign two definitions to analytics to reflect the primary dimensions of the term: its industry context and its technology context. For simplicity’s sake, Analytics with a capital “A” is an umbrella term representing our industry, while analytics with a small “a” refers to technology used to analyze data.

Analytics With a Capital “A”

Analytics as an umbrella term refers to the processes, technologies, and techniques that turn data into information and knowledge that drive business decisions. The cool thing about such industry definitions is that you can reuse them every five years or so. For example, I used this same definition to describe “Data Warehousing” in 1995, “Business Intelligence” in 2000, and “Performance Management” in 2005. Our industry perpetually recreates itself under a new moniker with a slightly different emphasis to expand its visibility and reenergize its base. (See my blog “What’s in a Word? The Evolution of BI Semantics.”)

Today, many people use the term Analytics as a proxy for everything we do in this space, from data warehousing and data integration to reporting and advanced analytics. The most prominent person who defines Analytics this way is Tom Davenport, whose terrific Harvard Business Review articles and books on the subject have prompted many executives to pursue Analytics as a sustainable source of competitive advantage. Davenport is savvy enough to know that if he had called his book “Competing on Business Intelligence” instead of “Competing on Analytics”, he would not be the industry rock star that he is today. (I personally still prefer the term “Business Intelligence” because it perfectly describes what we do: use information to make the business run more intelligently.)

Analytics with a Small “a”

This leaves the term analytics with a small “a” to describe various technologies that business people use to analyze data. This is a broad category of tools that spans everything from Excel, OLAP, and visual analysis tools on one hand, to statistical modeling and optimization tools on the other.

One way to segment analytical tools is to show how they’ve evolved over time, along with reporting tools. Figure 1 shows that we’ve had four waves of business intelligence tools since the 1980s. Specifically, there have been two waves of reporting followed by two waves of analytics. The first wave of analytics took place in the 1990s when business analysts began using ad hoc query/reporting and OLAP tools to explore historical data in a data mart or data warehouse. The second wave of analytics, which just began, involves modeling historical data to optimize the present and predict the future. Most people who talk about analytical tools today refer to this latter type.

Figure 1. Waves of BI

http://www.b-eye-network.com/blogs/eckerson/archives/2011/02/whats_in_a_word.php


Interestingly, each wave of analytics follows a wave of reporting. This makes sense if you consider that reporting tools are primarily designed for casual users, who comprise 80% of all BI users, and analytical tools are primarily designed for power users, who constitute the remaining 20%. These are two separate, but inter-related markets, which BI vendors need to address.

Deductive and Inductive Analytics

The first wave of analytics—which addresses the question “Why did it happen?”—is deductive in nature, while the second wave of analytics—which addresses the question “What will happen?”—is primarily inductive.

With deductive analytics, business users use tools like Excel, OLAP, and visual analysis tools to explore a hypothesis. They take an educated guess about what might be at the root cause of some anomaly or performance alert and then use analytical tools to explore the data and either verify or negate the hypothesis. If the hypothesis proves false, they come up with a new idea and start looking in that direction.

Inductive analytics is the opposite. Business users don’t start with a hypothesis; they start with a business outcome or goal (e.g., “find the top 10% of our customers and prospects who are most likely to respond to this offer”) and then gather historical data that will help them discern the answer. They then use analytics to create statistical or machine learning models of the data to answer their question. In other words, they start with the data and let the analytical tools discover the patterns and anomalies for them.

Interestingly, our industry’s former umbrella terms now refer to categories of tools: data warehousing refers to analytical databases and ETL tools; business intelligence refers to query and reporting tools; and performance management refers to dashboard, scorecard, and planning tools. In time, analytics will be replaced as an umbrella term by some other industry buzzword, and the term will simply refer to deductive and inductive tools, or perhaps just one or the other.

The Value of Analytics

Now that you know what advanced analytics is, the next question is, why should you care?

Your chief financial officer will be glad to know that analytic applications have a higher return on investment than all other BI applications. A landmark study by IDC in 2003 showed that the median ROI for projects that incorporated predictive technologies was 145% compared to 89% for all other projects. This uplift is gained largely by optimizing business processes, making them more efficient and profitable, according to IDC. 1

But what kinds of questions does advanced analytics address? There are four major categories:

1. Analyze the past. Although we mainly use deductive tools to examine past trends, advanced analytical tools model the past. Some seemingly easy questions can be maddeningly difficult to answer because they involve the interaction of so many variables. These include, “Why did sales drop last quarter?”

2. Optimize the present. Once we model past activity and understand relationships among key variables, we can harness that information to optimize current processes. For instance, a market basket model can help retailers design store layouts to maximize profits.

3. Predict the future. By applying the model (i.e., mathematical equation) to each new record, we can guess with a reasonable degree of accuracy whether a customer will respond positively to a promotion or whether a transaction is fraudulent.

4. Test assumptions. Advanced analytics can also be used to test assumptions about what drives the business. For example, prior to spending millions on a marketing campaign, an online retailer might test an assumption that customers located within one square mile of a big box competitor are more likely to churn than others.
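To make the idea of "applying the model to each new record" concrete, here is a minimal sketch in Python. It assumes the model is a logistic equation; the coefficients, field names, and customer records are invented for illustration and do not come from any real model.

```python
import math

# Hypothetical model: coefficients a modeling tool might have produced.
# These numbers are invented for illustration only.
INTERCEPT = -3.0
WEIGHTS = {"prior_purchases": 0.4, "months_since_last_order": -0.1}


def response_probability(record):
    """Apply the (assumed) logistic equation to one record."""
    z = INTERCEPT + sum(WEIGHTS[field] * record[field] for field in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


# New records arriving after the model was built (made-up values).
new_records = [
    {"id": "A", "prior_purchases": 10, "months_since_last_order": 1},
    {"id": "B", "prior_purchases": 1, "months_since_last_order": 12},
]

# Score each record; a higher score means a more likely responder.
scores = {r["id"]: response_probability(r) for r in new_records}
for customer_id, score in scores.items():
    print(customer_id, round(score, 3))
```

The point is only that, once the model exists, scoring is cheap: each new record is plugged into the same equation.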

Although advanced analytics can be applied to almost any business function, marketing seems to attract the lion’s share of analytical work. Research I conducted in 2007 at The Data Warehousing Institute shows that five of the top seven applications for advanced analytics hail from the marketing department. These include cross sell/upsell (47%), campaign management (46%), customer acquisition (41%), attrition/churn/retention (40%), and promotions (31%). (See figure 1.)

Figure 1. Most Common Applications for Advanced Analytics

1 See “The Financial Impact of Business Analytics: Key Findings” IDC, January, 2003.

From Wayne Eckerson, “Predictive Analytics: Extending the Value of Your Data Warehousing Investment,” TDWI, 2007. Based on 166 respondents that had implemented predictive analytics.

In addition, each industry has a handful of applications that are traditional candidates for advanced analytics. (See figure 2.)

Retail: Promotions, replenishment, shelf management, demand forecasting, inventory replenishment, price and merchandising optimization

Manufacturing: Supply chain optimization, demand forecasting, inventory replenishment, warranty analysis, product customization, new product development

Financial services: Credit scoring, fraud detection, pricing, underwriting, claims, customer profitability

Transportation: Scheduling, routing, yield management

Healthcare: Drug interaction, preliminary diagnosis, disease management

Hospitality: Pricing, customer loyalty, yield management

Energy: Trading, supply, demand forecasting, compliance

Government: Fraud, case management, crime prevention

Online: Web metrics, site design, online recommendations

Adapted from “Analytics at Work” by Tom Davenport, Jeanne Harris, and Robert Morison. (Wiley, 2010.)

Figure 1 chart values: Cross-sell/upsell 47%; Campaign management 46%; Customer acquisition 41%; Budgeting and forecasting 41%; Attrition/churn/retention 40%; Fraud detection 32%; Promotions 31%; Pricing 30%; Demand planning 30%; Customer service 26%; Quality improvement 25%; Market research (surveys) 18%; Supply chain 17%; Other 12%.

Summary

Analytics is a hot technology these days. But like any hot technology, there are multiple definitions of what it means. Analytics with a capital “A” refers to the entire domain of using information to make smarter decisions, while analytics with a small “a” refers to tools and techniques to do analysis. On the technology front, there are two major categories of tools: deductive and inductive. The latter is getting a lot of attention since it’s required to optimize processes and predict future behaviors and activities.

Advanced analytics (which is more inductive in nature) offers significantly more value than other types of BI applications because it helps optimize business processes and answer questions that enable the business to analyze the past, optimize the present, predict the future, and test core assumptions. Today, marketing is the biggest user of advanced analytics technologies, although their use is spreading far and wide.

PART II

Advanced Analytics: Where Do You Start?

Perhaps your executives have read the articles and books that testify to the transformative power of analytics. Or maybe they have been impressed by IBM's "Smarter Planet" television ads that provide concrete examples of how companies can harness information to make smarter, faster decisions that dramatically improve operations and outcomes. As a result, your executives want to do analytics, and they've asked you to lead the initiative.

Your first question might be: Where do we start? Or, more specifically, where does it make sense to apply analytics in our organization?

To answer this question, the first step is to define analytics. This e-book defines analytics as the use of machine-learning tools to surface trends and patterns in large volumes of complex data. (For a more thorough definition of analytics, see Part I - "What is Analytics?") Some call this class of tools data mining, predictive analytics, knowledge discovery, or advanced analytics. The second, and most important, step is to recognize that there are three main reasons why organizations implement advanced analytics: 1) big data, 2) big constraints, and 3) big opportunities. Let's address each driver.

Big Data

If you have small amounts of data, you don't need sophisticated machine learning tools and algorithms to identify patterns, associations, trends, and outliers in the data. You can probably eyeball relevant trends by applying simple statistical functions (e.g., min/max, mean, and median) or graphing the data as histograms or simple charts. Taking it one step further, you might want to dimensionalize the data and use OLAP tools or in-memory visual analysis tools to navigate across dimensions and down hierarchies using various grids, graphs, and filters. All these techniques are largely deductive in nature--you first need to know where and how to look before you can find relevant trends and patterns.
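As a small illustration of "eyeballing" a small data set, the sketch below uses only Python's standard library to compute min/max, mean, and median and to print a crude text histogram. The order counts are made up for the example.

```python
import statistics
from collections import Counter

# Made-up daily order counts for a small data set.
daily_orders = [12, 15, 11, 14, 13, 15, 90, 12, 14, 13]

# Simple statistical functions of the kind mentioned above.
summary = {
    "min": min(daily_orders),
    "max": max(daily_orders),
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
}
print(summary)

# Crude text histogram: group values into buckets of width 10.
buckets = Counter((value // 10) * 10 for value in daily_orders)
for start in sorted(buckets):
    print(f"{start:>3}-{start + 9:<3} {'#' * buckets[start]}")
```

Even here, the gap between the mean and the median hints at an outlier (the day with 90 orders), which is the kind of thing you can spot without any modeling tool.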

But with massive amounts of data, just hunting for patterns using ad hoc analytical tools may prove fruitless or become too unwieldy. Plus, once you detect a pattern, you have no way of modeling it for reuse in other applications. This is where advanced analytical tools shine: you can give them a problem to solve and point them at a large data set; they then discover the patterns and relationships in the data, which they express as mathematical equations. You can use these equations to make strategic decisions or score new records to support just-in-time actions, such as online cross-sell offers, hourly sales forecasts, or event-driven maintenance.

Big data also is likely to have more patterns and relationships than small data. In other words, it offers the potential of a bountiful harvest that is worth reaping. Of course, big data also probably has more noise: apparent patterns and trends that are meaningless. For instance, an analytical tool might discover a strong correlation in a customer data set between age and date of birth fields. This is an accurate, but worthless, correlation. This example also illustrates the importance of transforming data prior to modeling it to increase the relevance of the results.
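The age and date-of-birth example is easy to reproduce. The sketch below computes a Pearson correlation by hand on a few invented customers; it simplifies date of birth to birth year and assumes a reference year of 2011, both of which are my assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


REFERENCE_YEAR = 2011  # assumed "today" for computing age
birth_years = [1950, 1962, 1975, 1983, 1990]  # invented customers
ages = [REFERENCE_YEAR - year for year in birth_years]

# Age is derived from birth year, so the correlation is perfect (-1)
# and tells us nothing about customer behavior.
r = pearson(ages, birth_years)
print(round(r, 6))
```

A common transformation is simply to keep one of the two fields before modeling, so the tool does not "discover" a relationship that exists only by construction.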

How Big is Big? Although it doesn't make sense to apply advanced analytics to small data sets, it's not the volume of data that ultimately counts; it's the complexity of data.

For example, you probably don't need advanced analytics to analyze a terabyte of data that contains just two fields; all you really need is a simple calculation and a lot of horsepower. In contrast, a much smaller data set with hundreds of fields makes a much better candidate for advanced analytics. The tools' algorithms calculate the relationships among all these fields, which is nearly impossible to do with traditional reporting and analysis tools. These small data sets are often created by merging together data from dozens of different systems into a wide flat table desired by analytical modelers.
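A rough way to see why field count matters more than row count is to count the pairwise relationships among fields. The sketch below does only that; the figure of 300 fields is an assumed stand-in for "hundreds of fields," and real algorithms may also consider interactions among more than two fields at a time.

```python
from math import comb


def pairwise_relationships(field_count):
    """Number of distinct field pairs: n * (n - 1) / 2."""
    return comb(field_count, 2)


two_fields = pairwise_relationships(2)     # the terabyte table with two fields
wide_table = pairwise_relationships(300)   # assumed wide, flat modeling table

print(two_fields, wide_table)
```

The number of rows never enters the calculation: adding rows adds horsepower requirements, while adding fields adds relationships to examine.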

Big Constraints

Although advanced analytics helps when examining big or complex data, it's even more valuable as a method for overcoming internal constraints that prevent you from optimizing a business process. Advanced analytics can help fill the gap when you don't have enough time, money, or people to achieve success. When facing such constraints, advanced analytics can optimize or automate data-intensive processes.

For instance, a social services agency wants to decrease the number of clients affected by delinquent child support payments but it only has two social workers to call 5,000 deadbeat Dads. The agency uses advanced analytics to rank the targeted fathers by their propensity to pay if they receive a call from a social worker. Here, advanced analytics overcomes a labor constraint.

Another common constraint is money. For example, a retailer has $1 million to spend on a direct mail campaign, which means it can only send its new catalog to 100,000 of its 500,000 customers. It uses advanced analytics to rank customers by their propensity to purchase an item from the new catalog so it can optimize the uplift of its campaign.
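The catalog example boils down to "score, rank, and cut at the budget." Here is a minimal sketch of that step. The propensity scores are random stand-ins for what a real model would produce, and the $10 cost per catalog is simply what the budget and mailing size above imply.

```python
import random

BUDGET = 1_000_000      # dollars available for the campaign
MAILABLE = 100_000      # catalogs the budget covers
TOTAL_CUSTOMERS = 500_000

# Implied cost per catalog: $1,000,000 / 100,000 = $10.
cost_per_catalog = BUDGET / MAILABLE

# Stand-in propensity scores; a real model would supply these.
rng = random.Random(42)
customers = [(customer_id, rng.random()) for customer_id in range(TOTAL_CUSTOMERS)]

# Rank by propensity (highest first) and keep as many as the budget allows.
ranked = sorted(customers, key=lambda c: c[1], reverse=True)
mailing_list = ranked[:MAILABLE]

print(cost_per_catalog, len(mailing_list))
```

The same pattern applies to the social services example: replace the budget cutoff with the number of calls two social workers can make.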

Time can also be a constraint. For example, a company that leases rail cars must fix them when they break. The longer the company takes to fix the rail cars, the more money it loses and the less satisfied customers become. But deciding which repair shop to send the rail cars to requires considering many variables, including distances to various repair shops, the current wait time at each shop, the expertise at each shop, distance to the exit destination, additional problems that should be fixed while the rail car is in the shop, and so on. An online application that embeds advanced analytics can consider all these variables and issue a recommendation to the dispatcher while he is still on the phone with the customer who called in the repair.
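To give a feel for what such a recommendation might look like in code, here is a deliberately simplified sketch. The shop names, distances, wait times, and average speed are all hypothetical, and the hand-written formula stands in for estimates that a real analytical model would produce.

```python
# All shops and numbers below are hypothetical, for illustration only.
shops = [
    {"name": "Shop A", "miles_to_shop": 120, "wait_hours": 48,
     "miles_to_destination": 300, "can_fix": True},
    {"name": "Shop B", "miles_to_shop": 260, "wait_hours": 6,
     "miles_to_destination": 150, "can_fix": True},
    {"name": "Shop C", "miles_to_shop": 40, "wait_hours": 2,
     "miles_to_destination": 500, "can_fix": False},  # lacks the expertise
]

AVERAGE_MPH = 20  # assumed average speed of a rail car in transit


def hours_out_of_service(shop):
    """Travel time to the shop and on to the destination, plus the wait."""
    miles = shop["miles_to_shop"] + shop["miles_to_destination"]
    return miles / AVERAGE_MPH + shop["wait_hours"]


# Only consider shops with the right expertise, then minimize downtime.
eligible = [shop for shop in shops if shop["can_fix"]]
recommended = min(eligible, key=hours_out_of_service)
print(recommended["name"], hours_out_of_service(recommended))
```

Note that the nearest shop is not the one recommended; once wait time and expertise are considered together, a more distant shop gets the rail car back in service sooner.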

Another common constraint is lack of management oversight. For instance, a bank wants to standardize how it evaluates and approves loans across its branches. It uses advanced analytics to evaluate each loan and generate an automated recommendation for loan officers. In the same way, a Web site can use advanced analytics to generate personalized cross-sell recommendations to every customer, based on their past purchases and what other customers like them have purchased.
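The cross-sell idea can be sketched with nothing more than purchase histories. The example below is one simple approach (counting what similar customers bought), not a description of how any particular Web site does it; the customers and products are invented.

```python
from collections import Counter

# Hypothetical purchase histories.
purchases = {
    "ann": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove"},
    "cat": {"tent", "lantern", "boots"},
    "dan": {"boots", "socks"},
}


def recommend(customer, purchases, top_n=2):
    """Suggest items bought by customers whose baskets overlap with ours."""
    owned = purchases[customer]
    votes = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(owned & items)  # similarity = number of shared items
        if overlap == 0:
            continue
        for item in items - owned:    # only items the customer lacks
            votes[item] += overlap
    # Highest vote first; ties broken alphabetically for stable output.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("bob", purchases))
```

Production recommenders are far more elaborate, but the principle is the same: the recommendation is generated automatically for every customer, with no manager in the loop.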

In short, organizations use advanced analytics to overcome built-in constraints that prevent them from optimizing data-intensive business processes.

Big Opportunity

Finally, it makes sense to apply advanced analytics when the business upside justifies the cost. Analytics requires hiring experts who have a strong working knowledge of statistics and know how to create analytical models. They also must be conversant in the business process that the organization wants to optimize and the data that supports that process. Obviously, these individuals aren't inexpensive. And the tools to support the modeling process and the hardware they run on cost money as well. So, before you undertake an analytics project, make sure that the business value justifies the upfront investment.

Fortunately, the cost of building analytical models is declining. A decade ago, you had to hire a PhD statistician who could also write C code or SQL to create analytical models. Today, that is not necessarily true. A good business analyst with some data mining training can create a majority of the analytical models that organizations might need.

However, you still need PhD statisticians when models must be continuously updated, the degree of model accuracy has a huge impact on costs or profits, or the core business runs on analytical models. For example, PhD statisticians are often used to create analytical models for credit card marketing campaigns, fraud detection, and government intelligence.

Costs? How much does it cost to set up an analytical center of excellence? Assuming you hire a handful of analysts and purchase the requisite software and hardware, it's likely to cost about $1 million a year at a minimum. Many companies start smaller by exploiting open source data mining tools and data mining extensions to BI tools and databases. They also might send a talented analyst to training and give him a one-time project to test the approach. If the project succeeds, the organization makes a bigger, more permanent investment in the people and technology. Or they may hire a consultancy to run the initial project and train internal analysts in the tools and techniques.

Summary

It's best to apply analytics to data-intensive business processes that are sub-optimized due to built-in constraints, such as lack of time, people, money, and oversight. Also, advanced analytics only makes sense when the business upside is big enough and the data complex enough to justify the costs.

PART III

The Human Side of Advanced Analytics

Advanced analytics promises to unlock hidden potential in organizational data. If that’s the case, why have so few organizations embraced advanced analytics in a serious way? Most organizations have dabbled with advanced analytics, but outside of credit card companies, online retailers, and government intelligence agencies, few have invested sufficient resources to turn analytics into a core competency.

Advanced analytics refers to the use of machine learning algorithms to unearth patterns and relationships in large volumes of complex data. It’s best applied to overcome various resource constraints (e.g., time, money, labor) where the output justifies the investment of time and money. (See “What is Analytics and Why Should You Care?” and “Advanced Analytics: Where Do You Start?”)

Once an organization decides to invest in advanced analytics, it faces many challenges. To succeed with advanced analytics, organizations must have the right culture, people, organization, architecture, and data. (See figure 1.) This is a tall task. This article examines the “soft stuff” required to implement analytics—the culture, people, and organization—the first three dimensions of the analytical framework in figure 1. A subsequent article examines the “hard stuff”—the architecture, tools, and data.

Figure 1. Framework for Implementing Advanced Analytics

The Right Culture

Culture refers to the rules—both written and unwritten—for how things get done in an organization. These rules emanate from two places: 1) the words and actions of top executives and 2) organizational inertia and behavioral norms of middle management and their subordinates (i.e., “the way we’ve always done it.”) Analytics, like any new information technology, requires executives and middle managers to make conscious choices about how work gets done.

Executives. For advanced analytics to succeed, top executives must first establish a fact-based decision making culture and then adhere to it themselves. Executives must consciously change the way they make decisions. Rather than rely on gut feel alone, executives must make decisions based on facts or intuition validated by data. They must designate authorized data sources for decision making and establish common metrics for measuring performance. They must also hold individuals accountable for outcomes at all levels of the organization.

Executives also need to evangelize the value and importance of fact-based decision making and the need for a performance-driven culture. They need to recruit like-minded executives and continuously reinforce the message that the organization “runs on data.” Most importantly, they not only must “talk the talk,” they must “walk the walk.” They need to hold themselves accountable for performance outcomes and use certifiable information sources, not resort to their trusted analyst to deliver the data view they desire. Executives who don’t follow their own rules send a cultural signal that this analytics fad will pass and so it’s “business as usual.”

Managers and Organizational Inertia. Mid-level managers often pose the biggest obstacles to implementing new information technologies because their authority and influence stem from their ability to control the flow of information, both up and down organizational ladders. Mid-level managers have to buy into new ways of capturing and using information for the program to succeed. If they don’t, they, too, will send the wrong signals to lower level workers. To overcome organizational inertia, executives need to establish new incentives for mid-level managers and hold them accountable for performance metrics aligned with strategic goals around decision making and the use of information.

The Right People

It’s impossible to do advanced analytics without analysts. That’s obvious. But hiring the right analysts and creating an environment for them to thrive is not easy.

Analysts are a rare breed. They are critical thinkers who need to understand a business process inside and out and the data that supports it. They also must be computer-literate and know how to use various data access, analysis, and presentation tools to do their jobs. Compared to other employees, they are generally more passionate about what they do, more committed to the success of the organization, more curious about how things work, and more eager to tackle new challenges.

But not all analysts do the same kind of work, and it’s important to know the differences. There are four major types of analysts:

• Super Users. These are tech-savvy business users who gravitate to reporting and analysis tools deployed by the business intelligence (BI) team. These analysts quickly become the “go to” people in each department to get an ad hoc report or dashboard, if you don’t want to wait for the BI team. While super users don’t normally do advanced analytics, they play an important role because they offload ad hoc reporting requirements from more skilled analysts.

• Business Analysts. These are Excel jockeys whom executives and managers call on to create and evaluate plans, crunch numbers, and generally answer any question an executive or manager might have that can’t be addressed by a standard report or dashboard. With training, they can also create analytical models.

• Analytical Modelers. These analysts have formal training in statistics and a data mining workbench, such as those from IBM (i.e., SPSS) or SAS. They build descriptive and predictive models that are the heart and soul of advanced analytics.

• Data Scientists. These analysts specialize in analyzing unstructured data, such as Web traffic and social media. They write Java and other programs to run against Hadoop and NoSQL databases and know how to write efficient MapReduce jobs that run in “big data” environments.

Where You Find Them. Most organizations struggle to find skilled analysts. Many super users and business analysts are self-taught Excel jockeys, essentially tech-savvy business people who aren’t afraid to learn new software tools to do their jobs. Many business school graduates fill this role, often as a stepping stone to management positions. Conversely, a few business-savvy technologists can grow into this role, including data analysts and report developers who have a proclivity toward business and working with business people.

Analytical modelers and data scientists require more training and skills. These analysts generally have a background in statistics or number crunching. Statisticians with business knowledge or social scientists with computer skills tend to excel in these roles. Given advances in data mining workbenches, it’s not critical that analytical modelers know how to write SQL or code in C, as in the past. However, data scientists aren’t so lucky. Since Hadoop is an early stage technology, data scientists need to know the basics of parallel processing and how to write Java and other programs in MapReduce. As such, they are in high demand right now.
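For readers who haven’t seen the MapReduce pattern, the sketch below imitates its three phases (map, shuffle, reduce) in plain Python on a handful of invented Web log lines. It runs on one machine and is not Hadoop code; it only shows the shape of the logic that a data scientist would express in a real MapReduce job.

```python
from collections import defaultdict

# Invented Web log lines: "<ip> <method> <page>".
log_lines = [
    "10.0.0.1 GET /home",
    "10.0.0.2 GET /products",
    "10.0.0.1 GET /products",
    "10.0.0.3 GET /home",
    "10.0.0.2 GET /home",
]


def map_phase(line):
    """Emit a (page, 1) pair for each log line."""
    _ip, _method, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts for one page."""
    return key, sum(values)


mapped = [pair for line in log_lines for pair in map_phase(line)]
page_hits = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(page_hits)
```

In a real cluster, the map and reduce functions run in parallel on many machines, which is why the basics of parallel processing matter for this role.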

The Right Organization

Business analysts play a key role in any advanced analytics initiative. Given the skills required to build predictive models, analysts are not cheap to hire or easy to retain. Thus, building the right analytical organization is key to attracting and retaining skilled analysts.

Today, most analysts are hired by department heads (e.g., finance, marketing, sales, or operations) and labor away in isolation at the departmental level. Unless given enough new challenges and opportunities for advancement, analysts are easy targets for recruiters.

Analytics Center of Excellence. The best way to attract and retain analysts is to create an Analytics Center of Excellence. This is a corporate group that oversees and manages all business analysts in an organization. The Center of Excellence provides a sense of community among analysts and enables them to regularly exchange ideas and knowledge. The Center also provides a career path for analysts so they are less tempted to look elsewhere to advance their careers. Finally, the Center pairs new analysts with veterans who can give them the mentoring and training they need to excel in their new position.

The key with an Analytics Center of Excellence is to balance central management with process expertise. Nearly all analysts should be embedded in departments and work side by side with business people on a daily basis. This enables analysts to learn business processes and data at a granular level while immersing the business in analytical techniques and approaches. At the same time, the analyst needs to work closely with other analysts in the organization to reinforce the notion that they are part of a larger analytical community.

The best way to accommodate these twin needs is by creating a matrixed analytical team. Analysts should report directly to department heads and indirectly to a corporate director of analytics or vice versa. In either case, the analyst should physically reside in his assigned department most or all days of the week, while participating in daily “stand up” meetings with other analysts so they can share ideas and issues as well as regular off-site meetings to build camaraderie and develop plans. The corporate director of analytics needs to work closely with department heads to balance local and enterprise analytical requirements.

Summary

Advanced analytics is a technical discipline. Yet, some of the keys to its success involve non-technical facets, such as culture, people, and organization. For an analytics initiative to thrive in an organization, executives must create a fact-based decision making culture, hire the right people, and create an analytics center of excellence that attracts, trains, and retains skilled analysts.

PART IV

Architecting for Analytics

The prior article in this series discussed the human side of analytics. It explained how companies need to have the right culture, people, and organization to succeed with analytics. The flip side is the “hard stuff” (the architecture, platforms, tools, and data) that makes analytics possible. Although analytical technology gets the lion’s share of attention in the trade press, perhaps more than it deserves for the value it delivers, it nonetheless forms the bedrock of all analytical initiatives. This article examines the architecture, platforms, tools, and data needed to deliver robust analytical solutions.

Architecture

The term “analytical architecture” is an oxymoron. In most organizations, business analysts are left to their own devices to access, integrate, and analyze data. By necessity, they create their own data sets and reports outside the purview and approval of corporate IT. By definition, there is no analytical architecture in most organizations—just a hodge-podge of analytical silos and spreadmarts, each with conflicting business rules and data definitions.

Analytical sandboxes. Fortunately, with the advent of specialized analytical platforms (discussed below), BI architects have more options for bringing business analysts into the corporate BI fold. They can use these high-powered database platforms to create analytical sandboxes for the explicit use of business analysts. These sandboxes, when designed properly, give analysts the flexibility they need to access corporate data at a granular level, combine it with data that they’ve sourced themselves, and conduct analyses to answer pressing business questions. With analytical sandboxes, BI teams can transform business analysts from data pariahs to full-fledged members of the BI community.

There are four types of analytical sandboxes:

• Staging Sandbox. This is a staging area for a data warehouse that contains raw, non-integrated data from multiple source systems. Analysts generally prefer to query a staging area that contains all the raw data rather than query each source system individually. Hadoop is a staging area for large volumes of unstructured data that a growing number of companies are adding to their BI ecosystems.

• Virtual Sandbox. A virtual sandbox is a set of tables inside a data warehouse assigned to individual analysts. Analysts can upload data into the sandbox and combine it with data from the data warehouse, giving them one place to go to do all their analyses. The BI team needs to carefully allocate compute resources so analysts have enough horsepower to run ad hoc queries without interfering with other workloads running on the data warehouse.

• Free-standing sandbox. A free-standing sandbox is a separate database server that sits alongside a data warehouse and contains its own data. It’s often used to offload complex, ad hoc queries from an enterprise data warehouse and give business analysts their own space to play. In some cases, these sandboxes contain a replica of data in the data warehouse, while in others, they support entirely new data sets that don’t fit in a data warehouse or run faster on an analytical platform.

• In-memory BI sandbox. Some desktop BI tools maintain a local data store, either in memory or on disk, to support interactive dashboards and queries. Analysts love these types of sandboxes because they connect to virtually any data source and enable analysts to model data, apply filters, and visually interact with the data without IT intervention.
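To make the virtual sandbox idea concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database standing in for the data warehouse, and the table names (dw_sales, sbx_jane_segments) and their contents are hypothetical. It only illustrates the pattern of an analyst uploading their own data and combining it with corporate data in one place.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")

# Corporate table maintained by the BI team (hypothetical schema and data).
conn.execute("CREATE TABLE dw_sales (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO dw_sales VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 45.0), (3, 300.0)],
)

# Sandbox table assigned to one analyst, holding data the analyst sourced.
conn.execute("CREATE TABLE sbx_jane_segments (customer_id INTEGER, segment TEXT)")
conn.executemany(
    "INSERT INTO sbx_jane_segments VALUES (?, ?)",
    [(1, "enterprise"), (2, "small business"), (3, "enterprise")],
)

# Combine corporate data with the analyst's own data in a single query.
rows = conn.execute(
    """
    SELECT s.segment, SUM(d.amount) AS total_sales
    FROM dw_sales AS d
    JOIN sbx_jane_segments AS s ON s.customer_id = d.customer_id
    GROUP BY s.segment
    ORDER BY s.segment
    """
).fetchall()

print(rows)
```

In a real warehouse, the BI team would also cap the compute resources available to sandbox queries, which this sketch does not attempt to show.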

Next-Generation BI Architecture. Figure 1 depicts a BI architecture with the four analytical sandboxes colored in green. The top half of the diagram represents a classic top-down, data warehousing architecture that primarily delivers interactive reports and dashboards to casual users (although the streaming/complex event processing (CEP) engine is new.) The bottom half of the diagram depicts a bottom-up analytical architecture with analytical sandboxes along with new types of data sources. This next-generation BI architecture better accommodates the needs of business analysts and data scientists, making them full-fledged members of the corporate BI ecosystem.

Figure 1. The New BI Architecture

The next-generation BI architecture is more analytical, giving power users greater options to access and mix corporate data with their own data via various types of analytical sandboxes. It also brings unstructured and semi-structured data fully into the mix using Hadoop and nonrelational databases.

[Figure 1 diagram. Labeled components: Operational Systems (structured data); Machine Data; Web Data; Documents & Text; External Data; Audio/video Data; Extract, Transform, Load (batch, near real-time, or real-time); Streaming/CEP Engine; Hadoop Cluster; Data Warehouse; Virtual Sandboxes; Dept Data Mart; Free-Standing Sandbox; Analytical platform or non-relational database; In-memory BI Sandbox; BI Server; Casual User; Power User; Top-down Architecture; Bottom-up Architecture.]

Analytical Platforms

Since the beginning of the data warehousing movement in the early 1990s, organizations have used general-purpose data management systems to implement data warehouses and, occasionally, multidimensional databases (i.e., “cubes”) to support subject-specific data marts, especially for financial analytics. General-purpose data management systems were designed for transaction processing (i.e., rapid, secure, synchronized updates against small data sets) and only later modified to handle analytical processing (i.e., complex queries against large data sets.) In contrast, analytical platforms focus entirely on analytical processing at the expense of transaction processing.2

The analytical platform movement. In 2002, Netezza (now owned by IBM) introduced a specialized analytical appliance, a tightly integrated, hardware-software database management system designed explicitly to run ad hoc queries against large volumes of data at blindingly fast speeds. Netezza’s success spawned a host of competitors, and there are now more than two dozen players in the market. (See Table 1.)

Table 1. Types of Analytical Platforms

Technology | Description | Vendor/Product

MPP analytical databases | Row-based databases designed to scale out on a cluster of commodity servers and run complex queries in parallel against large volumes of data. | Teradata Active Data Warehouse, Greenplum (EMC), Microsoft Parallel Data Warehouse, Aster Data (Teradata), Kognitio, Dataupia

Columnar databases | Database management systems that store data in columns, not rows, and support high data compression ratios. | ParAccel, Infobright, Sand Technology, Sybase IQ (SAP), Vertica (Hewlett-Packard), 1010data, Exasol, Calpont

Analytical appliances | Preconfigured hardware-software systems designed for query processing and analytics that require little tuning. | Netezza (IBM), Teradata Appliances, Oracle Exadata, Greenplum Data Computing Appliance (EMC)

Analytical bundles | Predefined hardware and software configurations that are certified to meet specific performance criteria, but the customer must purchase and configure themselves. | IBM SmartAnalytics, Microsoft FastTrack

In-memory databases | Systems that load data into memory to execute complex queries. | SAP HANA, Cognos TM1 (IBM), QlikView, Membase

Distributed file-based systems | Distributed file systems designed for storing, indexing, manipulating and querying large volumes of unstructured and semi-structured data. | Hadoop (Apache, Cloudera, MapR, IBM, Hortonworks), Apache Hive, Apache Pig

Analytical services | Analytical platforms delivered as a hosted or public-cloud-based service. | 1010data, Kognitio

Nonrelational | Nonrelational databases optimized for querying unstructured data as well as structured data. | MarkLogic Server, MongoDB, Splunk, Attivio, Endeca, Apache Cassandra, Apache HBase

CEP/streaming engines | Ingest, filter, calculate, and correlate large volumes of discrete events and apply rules that trigger alerts when conditions are met. | IBM, Tibco, Streambase, Sybase (Aleri), Opalma, Vitria, Informatica

2 Like most things, there are exceptions to this rule. For example, Oracle Exadata runs on Oracle 10g and, as such, it supports both transactional and analytical processing, often with superior performance in both realms compared with standard Oracle 10g installations.

Today, the technology behind analytical platforms is diverse: appliances, columnar databases, in-memory databases, massively parallel processing (MPP) databases, file-based systems, nonrelational databases and analytical services. What they all have in common, however, is that they provide significant improvements in price-performance, availability, load times and manageability compared with general-purpose relational database management systems. Every analytical platform customer I’ve interviewed has cited order-of-magnitude performance gains that most initially didn’t believe.

Moreover, many of these analytical platforms contain built-in analytical functions that make life easier for business analysts. These functions range from fuzzy matching algorithms and text analytics to data preparation and data mining functions. By putting functions in the database, analysts no longer have to craft complex, custom SQL or offboard data to analytical workstations, which limits the amount of data they can analyze and model.

Companies use analytical platforms to support free-standing sandboxes (described above) or to replace data warehouses running on MySQL and SQL Server, and occasionally on major OLTP databases from Oracle and IBM. Analytical platforms also improve query performance for ad hoc analytical tools, especially those that connect directly to databases to run queries (versus those that download data to a local cache).

Analytical Tools

In 2010, vendors turned their attention to meeting the needs of power users after ten years of enhancing reporting and dashboard solutions for casual users. As a result, the number of analytical tools on the market has exploded.

Analytical tools come in all shapes and sizes. Analysts generally need one of every type of tool. Just as you wouldn’t hire a carpenter to build an addition to your house with just one tool, you don’t want to restrict an analyst to just one analytical tool. Like a carpenter, an analyst needs a different tool for every type of job they do. For instance, a typical analyst might need the following tools:

• Excel to extract data from various sources, including local files, create reports, and share them with others via a corporate portal or server (managed Excel).

• BI Search tools to issue ad hoc queries against a BI tool’s metadata.

• Planning tools (including Excel) to create strategic and tactical plans, each containing multiple scenarios.

• Mashboards and ad hoc reporting tools to create ad hoc dashboards and reports on behalf of departmental colleagues.

• Visual discovery tools to explore data in one or more sources of data and create interactive dashboards on behalf of departmental colleagues.

• Multidimensional OLAP (MOLAP) tools to explore small and medium sets of data dimensionally at the speed of thought and run complex dimensional calculations.

• Relational OLAP tools to explore large sets of data dimensionally and run complex calculations.

• Text analytics tools to parse text data and put it in a relational structure for analysis.

• Data mining tools to create descriptive and predictive models.

• Hadoop and MapReduce to process large volumes of unstructured and semi-structured data in a parallel environment.

Figure 2. Types of Analytical Tools

Figure 2 plots these tools on a graph where the x axis represents calculation complexity and the y axis represents data volumes. Ad hoc analytical tools for casual users (or more realistically super users) are clustered in the bottom left corner of the graph, while ad hoc tools for power users are clustered slightly above and to the right. Planning and scenario modeling tools cluster further to the right, offering slightly more calculation complexity against small volumes of data. High-powered analytical tools, which generally rely on machine learning algorithms and specialized analytical databases, cluster in the upper right quadrant.

Data

Business analysts function like one-man IT shops. They must access, integrate, clean and analyze data, and then present it to other users. Figure 3 depicts the typical workflow of a business analyst. If an organization doesn’t have a mature data warehouse that contains cross-functional data at a granular level, analysts often spend an inordinate amount of time sourcing, cleaning, and integrating data (steps 1 and 2 in the analyst workflow). They then create a multiplicity of analytical silos (step 5) when they publish data, much to the chagrin of the IT department.

Figure 3. Analyst Workflow

In the absence of a data warehouse that contains all the data they need, business analysts must function as one-man IT shops where they spend an inordinate amount of time iterating between collecting, integrating, and analyzing data. They run into trouble when they distribute their hand-crafted data sets broadly.

Data Warehouse. The most important way that organizations can improve the productivity and effectiveness of business analysts is to maintain a robust data warehousing environment that contains most of the data that analysts need to perform their work. This can take many years. In a fast-moving market where the company adds new products and features continuously, the data warehouse may never catch up. Nonetheless, it’s important for organizations to continuously add new subject areas to the data warehouse; otherwise, business analysts have to spend hours or days gathering and integrating this data themselves.

Atomic Data. The data warehouse also needs to house atomic data, or data at the lowest level of transactional detail, not summary data. Analysts generally want the raw data because they can repurpose it in many different ways depending on the nature of the business questions they’re addressing. This is the reason that highly skilled analysts like to access data directly from source systems or a data warehouse staging area. At the same time, less skilled analysts appreciate the heavy lifting done by the IT group to clean and integrate disparate data sets using common metrics, dimensions, and attributes. This base level of data standardization expedites their work.

Once a BI team integrates a sufficient number of subject areas in a data warehouse at an atomic level of data, business analysts can have a field day. Instead of downloading data to an analytical workstation, which limits the amount of data they can analyze and process, they can now run calculations and models against the entire data warehouse using analytical functions built into the database or that they’ve created using database development toolkits. This improves the accuracy of their analyses and models and saves them considerable time.

Summary

The technical side of analytics is daunting. There are many moving parts that all have to work together. However, the most important part of the technical equation is the data. The old adage holds true: “garbage in, garbage out.” Analysts can’t deliver accurate insights if they don’t have access to good quality data. And it’s a waste of their time to spend days trying to prepare the data for analysis. A good analytics program is built on a solid data warehousing foundation that embeds analytical sandboxes tailored to the requirements of individual analysts.

PART V - Tricks and Tips for Starting an Advanced Analytics Practice

The previous two articles in this series covered the organizational and technical factors required to succeed with advanced analytics. But as with most things in life, the hardest part is getting started. This final article shows how to kickstart an analytics practice and rev it into high gear.

The problem with selling an analytics practice is that most business executives who would support and fund the initiative haven’t heard of the term. Some will think it’s another IT boondoggle in the making and will politely deny or put off your request. You’re caught in the chicken-or-egg riddle: it’s hard to sell the value of analytics until you’ve shown tangible results. But you can’t deliver tangible results until an executive buys into the program.

Of course, you may be fortunate to have enlightened executives who intuitively understand the value of analytics and are coming to you to build a practice. That’s a nice fairy tale. Even with enlightened executives, you still need to prove the value of the technology and, more importantly, your ability to harness it. Even in a best-case scenario, you get one chance to prove yourself.

So, here are ten steps you can take to jumpstart an analytics practice, whether you are working at the grassroots level or at the behest of an eager senior executive.

1. Find an Analyst. This seems too obvious to state, but it’s hard to do in practice. Good analysts are hard to come by. They combine a unique knowledge of business process, data, and analytical tools. As people, they are critical thinkers who are inquisitive, doggedly persistent, and passionate about what they do. Many analysts have M.B.A. degrees or trained as social scientists, statisticians, or Six Sigma practitioners. Occasionally, you’ll be able to elevate a precocious data analyst or BI report developer into the role.

2. Find an Executive. Good sponsors are almost as rare as good analysts. A good sponsor is someone who is willing to test long-held assumptions using data. For instance, event companies mail their brochures 12 weeks before every conference. Why? No one knows; it’s the way it’s always been done. But maybe they could get a bigger lift from their marketing investments if they mailed the brochures 11 or 13 weeks out, or shifted some of their marketing spend from direct mail to email and social media channels. A good sponsor is willing to test such assumptions.

3. Focus Your Efforts. If you’ve piqued an executive’s interest, then explain what resources you need, if any, to conduct a test. But don’t ask for much, because you don’t need much to get going. Ideally, you should be able to make do with people and tools you have in-house. A good analyst can work miracles with Excel and SQL, and there are many open source data mining packages on the market today as well as low-cost statistical add-ins to Excel and BI tools. Select a project that is interesting enough to be valuable to the company, but small enough to minimize risk.

4. Talk Profits. It’s very important to remember that your business sponsor won’t trust your computer model. They will go with their gut instinct rather than rely on a mathematical model to make a major decision. They will only trust the model if it shows either tangible lift (i.e., more revenues or profits), or it validates their own experience and knowledge. For example, the head of marketing for an online retailer will trust a market basket model if he realizes that the model has detected purchasing habits of corporate procurement officers who buy office items for new hires.

5. Act on Results. There is no point creating analytical models if the business doesn’t act on them. There are many ways to make models actionable. You can present the results to executives whose go-to-market strategies might be shaped by the findings. Or you can embed the models in a weekly churn report distributed to sales people that indicates which customers are likely to attrite in the near future. (See figure 1.) Or you can embed models in operational applications so they are triggered by new events (e.g., a customer transaction) and automatically spit out recommendations (e.g., cross-sell offers.)

Figure 1. An Actionable Report

6. Make it Useful. The models not only should be actionable, they should be proactive. The worst thing you can do is tell a salesperson something they already know. For instance, if the model says, “This customer is likely to churn because they haven’t purchased anything in 90 days”, a salesperson is likely to say, “Duh, tell me something I don’t already know.” A better model would be one that detects patterns not immediately obvious to the salesperson. For example, “This customer makes frequent purchases but their overall monthly expenditures have dropped ten percent since the beginning of the year.”

7. Consolidate Data. Too often, analysts play the role of IT manager by accessing, moving, and transforming data before they begin to analyze it. Although the DW team will never be able to identify and consolidate all the data that analysts might need, it can always do a better job understanding their requirements and making the right data available at the right level of granularity. This might require purchasing demographic data and creating specialized wide, flat tables preferred by modelers. It might also mean supporting specialized analytical functions inside the database that let the modelers profile, prepare, and model data.

8. Unlock Your Data. Unfortunately, most IT managers don’t provide analysts ready access to corporate data for fear that their SQL queries will grind an operational system or data warehouse to a halt. To balance access and performance, IT managers should create an analytical sandbox that enables modelers to upload their own data and mix it with corporate data in the warehouse. These sandboxes can be virtual table partitions inside the data warehouse or dedicated analytical machines that contain a replica of corporate data or an entirely new data set. In either case, the modelers get free and open access to data and IT managers get to worry less about resource contention.

9. Govern Your Data. Because analysts are so versatile with data, they often get pulled in multiple directions. The lowest value-added activity they perform is creating ad hoc queries for business colleagues. This type of work is better left to super users in each department. But to prevent super users from generating thousands of duplicate or conflicting reports, the BI team needs to establish a report governance committee that evaluates requests for new reports, maps them to an existing inventory, and decides which ones to build or roll into existing report structures. Ideally, the report governance committee consists of super users who are already creating most of the reports that users rely on.

10. Centralize Analysts. It’s imperative that analysts feel part of a team and not isolated in some departmental silo. An Analytics Center of Excellence can help build camaraderie among analysts, cross train them in different disciplines and business processes, and mentor new analysts. A director of analytics needs to prioritize analytics projects, cultivate an analytics mindset in the corporation, and maintain a close alliance with the data warehousing team. In fact, it’s best if the director of analytics also has responsibility for the data warehouse. Ideally, 80% to 90% of analysts are embedded in the departments where they work side by side with business users and the rest reside at corporate headquarters where they focus on cross-departmental initiatives.

Summary

Although some of the steps defined above are clearly for novices, even analytics teams that are more advanced still struggle with many of the items. To succeed with analytics ultimately requires a receptive culture, top-notch people (i.e., analysts), comprehensive and clean data, and the proper tools. Success will not come quickly but takes a sustained effort. But the payoff, when it comes, is usually substantial.

PART VI - Crafting Analytical Models

Model-making is at the heart of advanced analytics. Thankfully, few of us need to create analytical models or learn the statistical techniques upon which they’re based. However, any self-respecting business intelligence (BI) professional needs to understand the modeling process so he can better support the data requirements of analytical modelers.

Analytical Models

An analytical model is simply a mathematical equation that describes relationships among variables in a historical data set. The equation either estimates or classifies data values. In essence, a model draws a “line” through a set of data points that can be used to predict outcomes. For example, a linear regression draws a straight line through data points on a scatterplot that shows the impact of advertising spend on sales for various ad campaigns. The model’s formula—in this case, “Sales=17.813 + (.0897* advertising spend)”— enables executives to accurately estimate sales if they spend a specific amount on advertising. (See figure 1.)

Figure 1. Estimation Model (Linear Regression)
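The estimation formula quoted above can be applied directly, as the short Python sketch below shows. The intercept and slope come from the text; the units of sales and advertising spend are not stated there, so the inputs are purely illustrative. The fit_line helper is a hypothetical addition that shows how an ordinary least-squares line could be derived from sample points; it is not necessarily how the model in Figure 1 was built.

```python
# Coefficients quoted in the text: Sales = 17.813 + (0.0897 * advertising spend).
INTERCEPT = 17.813
SLOPE = 0.0897


def estimate_sales(ad_spend):
    """Apply the linear model to one advertising spend value."""
    return INTERCEPT + SLOPE * ad_spend


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x (illustrative helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Estimate sales for a few hypothetical spend levels.
for spend in (0, 100, 500):
    print(spend, round(estimate_sales(spend), 3))

# Fit a line to hypothetical points that lie exactly on y = 1 + 2x.
print(fit_line([1, 2, 3], [3, 5, 7]))
```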

Algorithms that create analytical models (or equations) come in all shapes and sizes. Classification algorithms, such as neural networks, decision trees, clustering, and logistic regression, use a variety of techniques to create formulas that segregate data values into groups. Online retailers often use these algorithms to create target market segments or determine which products to recommend to buyers based on their past and current purchases. (See figure 2.)

Figure 2. Classification Algorithms

Classification models separate data values into logical groups.

Trusting Models. Unfortunately, some models are more opaque than others; that is, it’s hard to understand the logic the model used to identify relevant patterns and relationships in the data. The problem with these “black box” models is that business people often have a hard time trusting them until they see quantitative results, such as reduced costs or higher revenues. Getting business users to understand and trust the output of analytical models is perhaps the biggest challenge in data mining.

To earn trust, analytical models have to validate a business person’s intuitive understanding of how the business operates. In reality, most models don’t uncover brand new insights; rather they unearth relationships that people understand as true but aren’t looking at or acting upon. The models simply refocus people’s attention on what is important and true and dispel assumptions (whether conscious or unconscious) that aren’t valid.

Modeling Process

Given the power of analytical models, it’s important that analytical modelers take a disciplined approach. Analytical modelers need to adhere to a methodology to work productively and generate accurate models. The modeling process consists of six distinct tasks:

1) Define the project

2) Explore the data

3) Prepare the data

4) Create the model

5) Deploy the model

6) Manage the model

Interestingly, preparing the data is the most time-consuming part of the process, and if not done right, can torpedo the analytical model and project. “[Data preparation] can easily be the difference between success and failure, between usable insights and incomprehensible murk, between worthwhile predictions and useless guesses,” writes Dorian Pyle in his book, “Data Preparation for Data Mining.”

Figure 3 shows a breakdown of the time required for each of these six steps. Data preparation consumes one-quarter (25%) of an analytical modeler’s time, followed by model creation (23%), data exploration (18%), project definition (13%), scoring and deployment (12%), and model management (9%). Thus, almost half of an analytical modeler’s time (43%) is spent exploring and preparing data, although this varies based on the condition and availability of data. Analytical modelers are like house painters who must spend lots of time preparing a paint surface to ensure a long-lasting paint finish.

Figure 3. Analytical Modeling Tasks

From Wayne Eckerson, “Predictive Analytics: Extending the Value of Your Data Warehousing Investment,” 2007. Based on 166 respondents who have a predictive modeling practice.

Project Definition. Although defining an analytical project doesn’t take as long as some of the other steps, it’s the most critical task in the process. Modelers that don’t know explicitly what they’re trying to accomplish won’t be able to create useful analytical models. Thus, before they start, good analytical modelers spend a lot of time defining objectives, impact, and scope.

Project objectives consist of the assumptions or hypotheses that a model will evaluate. Often, it helps to brainstorm hypotheses and then prioritize them based on business requirements. Project impact defines the model output (e.g., a report, a chart, or scoring program), how the business will use that output (e.g., embedded in a daily sales report or operational application or used in strategic planning), and the projected return on investment. Project scope defines who, what, where, when, why, and how of the project, including timelines and staff assignments.

For example, a project objective might be: “Reduce the number of false positives when scanning credit card transactions for fraud.” The output might be: “A computer model capable of running on a server and measuring 7,000 transactions per minute, scoring each with probability and confidence, and routing transactions above a certain threshold to an operator for manual intervention.”3

Data Exploration. Data exploration or data discovery involves sifting through various sources of data to find the data sets that best fit the project. During this phase, the analytical modeler will document each potential data set with the following items:

• Access methods: Source systems, data interfaces, machine formats (e.g. ASCII or EBCDIC), access rights, and data availability.

• Data characteristics: Field names, field lengths, content, format, granularity and statistics (e.g. counts, mean, mode, median, and min/max values)

• Business rules: Referential integrity rules, defaults, other business rules

• Data pollution: Data entry errors, misused fields, bogus data

• Data completeness: Empty or missing values, sparsity

• Data consistency: Labels and definitions

Typically, an analytical modeler will compile all this information into a document and use it to help prioritize which data sets to use for which variables. (See figure 4.) A data warehouse with well documented metadata can greatly accelerate the data exploration phase because it also maintains much of this information. However, analytical modelers often want to explore external data and other data sets that don’t exist in the data warehouse and must compile this information manually.
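Part of a data profile can be generated automatically. The sketch below, which uses only the Python standard library, computes the statistics and completeness measures listed above for a single field. The function name profile_field and the sample values are hypothetical, and None is assumed to mark an empty or missing value.

```python
import statistics


def profile_field(name, values):
    """Summarize one field; None marks an empty or missing value."""
    present = [v for v in values if v is not None]
    profile = {
        "field": name,
        "count": len(values),
        "missing": len(values) - len(present),
    }
    # Descriptive statistics only make sense for numeric fields.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(
            mean=statistics.mean(present),
            median=statistics.median(present),
            mode=statistics.mode(present),
            min=min(present),
            max=max(present),
        )
    return profile


# Hypothetical sample: five customer ages, one of them missing.
age_profile = profile_field("age", [34, 41, None, 29, 41])
print(age_profile)
```

Business rules, data pollution, and label consistency still require a person who knows the source systems; a script like this only covers the mechanical part of the profile.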

Figure 4. Data Profile Document

A data profile document describes the properties of a potential data set.

3 From Dorian Pyle, “Data Preparation for Data Mining.”

Data Preparation. Once analytical modelers document and select their data sets, they then must standardize and enrich the data. First, this means correcting any data errors that exist in the data and standardizing the machine format (e.g. ASCII vs EBCDIC). Then, it involves merging and flattening the data into a single wide table which may consist of hundreds of variables (i.e., columns). Finally, it means enriching the data with third party data, such as demographic, psychographic, or behavioral data that can enhance the models.

From there, analytical modelers transform the data so it’s in an optimal form to address project objectives and meet processing requirements for specific machine learning techniques. Common transformations include summarizing data using reverse pivoting (see figure 5), transforming categorical values into numerical values, normalizing numeric values so they range from 0 to 1, consolidating continuous data into a finite set of bins or categories, removing redundant variables, and filling in missing values. Modelers try to eliminate variables and values that aren’t relevant as well as fill in empty fields with estimated or default values. In some cases, modelers may want to increase the bias or skew in a data set by duplicating outliers, giving them more weight in the model output. These are just some of the many data preparation techniques that analytical modelers use.
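Several of these transformations are simple enough to sketch in a few lines of Python. The helper names, bin edges, and sample values below are hypothetical, and real modeling tools offer many variants of each step; this is only meant to show what filling, normalizing, binning, and encoding look like in practice.

```python
def fill_missing(values, default):
    """Replace empty fields (None) with a default or estimated value."""
    return [default if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale numeric values so they range from 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    Values above every edge receive the last label.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


# Hypothetical income field (in thousands) with one missing value.
incomes = fill_missing([40, None, 100, 70], default=70)
print(min_max_normalize(incomes))
print([bin_value(v, edges=[50, 80], labels=["low", "medium", "high"]) for v in incomes])
print(encode_categories(["gold", "silver", "gold"]))
```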

Figure 5. Reverse Pivoting

To model a banking “customer” rather than bank transactions, analytical modelers use a technique called reverse pivoting to summarize banking transactions to show customer activity by period.
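As a rough illustration of reverse pivoting, the Python sketch below collapses transaction rows into one row per customer with a total for each period. The transaction layout (customer, period, amount) and the sample data are assumptions made for the example; Figure 5 may summarize different measures.

```python
from collections import defaultdict

# Hypothetical transaction rows: (customer_id, period, amount).
transactions = [
    ("C1", "2011-01", 50.0),
    ("C1", "2011-01", 25.0),
    ("C1", "2011-02", 10.0),
    ("C2", "2011-02", 200.0),
]


def reverse_pivot(rows):
    """Collapse transaction rows into one row per customer with a total per period."""
    periods = sorted({period for _, period, _ in rows})
    # Every customer gets a column for every period, defaulting to zero activity.
    totals = defaultdict(lambda: dict.fromkeys(periods, 0.0))
    for customer, period, amount in rows:
        totals[customer][period] += amount
    return dict(totals)


customers = reverse_pivot(transactions)
for customer, activity in customers.items():
    print(customer, activity)
```

The result has one wide row per customer, which is the shape most modeling algorithms expect when the unit of analysis is the customer rather than the transaction.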

Analytical Modeling. Analytical modeling is as much art as science. Much of the craft involves knowing what data sets and variables to select and how to format and transform the data for specific data models. Often, a modeler will start with 100+ variables and then, through data transformation and experimentation, winnow them down to 12 to 20 variables that are most predictive of the desired outcome.

In addition, an analytical modeler needs to select historical data that has enough of the “answers” built into it with a minimal amount of noise. Noise consists of patterns and relationships that have no business value, such as a person’s birth date and age, which are 100 percent correlated. An analytical modeler will eliminate one of those variables to reduce noise. In addition, modelers validate their models by testing them against random subsets of the data that they set aside in advance. If the scores remain consistent across training, testing, and validation data sets, then they know they have a fairly accurate and relevant model.
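Both ideas in this paragraph, spotting a redundant variable and setting data aside for validation, can be sketched with the Python standard library. The sample birth years, the reference year 2011, the 30 percent holdout, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_holdout(records, holdout_fraction=0.3, seed=42):
    """Shuffle a copy of the records and set aside a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Age is derived entirely from birth year, so the two variables carry the same signal.
birth_years = [1950, 1965, 1980, 1990]
ages = [2011 - year for year in birth_years]
r = pearson(birth_years, ages)
if abs(r) > 0.99:
    print("birth year and age are redundant; keep only one of them")

# Set aside random records before training so the model can be tested on unseen data.
train, test = split_holdout(range(10))
print(len(train), len(test))
```

The correlation here is perfectly negative (age falls as birth year rises), which is why the check uses the absolute value.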

Finally, the modeler must choose the right analytical techniques and algorithms or combinations of techniques to apply to a given hypothesis. This is where a modeler’s knowledge of business processes, project objectives, corporate data, and analytical techniques comes into play. Modelers may need to try many combinations of variables and techniques before they generate a model with sufficient predictive value.

Every analytical technique and algorithm has its strengths and weaknesses, as summarized in the tables below. The goal is to pick the right modeling technique so you have to do as little preparation and transformation as possible, according to Michael Berry and Gordon Linoff in their book, “Data Mining Techniques: For Marketing, Sales, and Customer Support.”

Table 1. Analytical Models

Table 2. Analytical Techniques

Task | Use | Techniques

Classification | Assign new records to a predefined class based on their features; used to predict an outcome such as yes/no or high/medium/low. | Logistic Regression, Decision Trees, Neural Networks, Link Analysis

Forecasting | Technique for predicting a continuous numerical outcome. | Linear Regression, Neural Networks

Prediction | Uses estimation or classification to predict future behavior or value. | Neural Networks, Decision Trees, Link Analysis, Genetic Algorithms, Market Basket Analysis

Affinity Grouping | Finds rules that define which items go together; good for market basket, cross-selling, and root cause analysis. | Market Basket Analysis, Memory-Based Reasoning, Link Analysis

Clustering | Find natural groupings of things that are more like each other than members of another cluster. | Neural Networks, Decision Trees, Cluster Detection, Market Basket Analysis, Memory-Based Reasoning

Deploy the Model. Model deployment takes many forms, as mentioned above. Executives can simply look at the model, absorb its insights, and use it to guide their strategic or operational planning. But models can also be operationalized. The most basic way to operationalize a model is to embed it in an operational report. For example, a daily sales report for a telecommunications company might list each sales representative’s customers by their propensity to churn. Or a model might be applied at the point of customer interaction, whether at a branch office or at an online checkout counter.

To apply models, you first have to score all the relevant records in your database. This involves converting the model into SQL or some other program that can run inside the database that holds the records you want to score. Scoring involves running the model against each record and generating a numeric value, usually between 0 and 1, which is then appended to the record as an additional column. A higher score generally means a higher propensity to exhibit the desired or predicted behavior. Scoring is usually a batch process that runs at night or on the weekend, depending on the volume of records that need to be scored. However, scoring can also happen in real time, which is essentially what online retailers do when they make recommendations based on purchases a customer just made.
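The batch scoring step described above might look something like the following Python sketch, which uses the standard sqlite3 module as a stand-in for the operational database. The table, the column names, and the logistic coefficients are all hypothetical; a real model would supply its own fitted formula.

```python
import math
import sqlite3


def score_churn(months_as_customer, support_calls):
    """Logistic model returning a churn propensity between 0 and 1.

    The coefficients are invented for illustration, not fitted to real data.
    """
    z = -1.0 - 0.05 * months_as_customer + 0.6 * support_calls
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, "
    "months_as_customer INTEGER, support_calls INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 48, 0), (2, 6, 5), (3, 24, 2)],
)

# Expose the model to SQL so it can be applied to records in the database.
conn.create_function("score_churn", 2, score_churn)

# Batch scoring: append the score to every record as an additional column.
conn.execute("ALTER TABLE customers ADD COLUMN churn_score REAL")
conn.execute(
    "UPDATE customers "
    "SET churn_score = score_churn(months_as_customer, support_calls)"
)

# List customers by propensity to churn, highest first.
ranked = conn.execute(
    "SELECT id, churn_score FROM customers ORDER BY churn_score DESC"
).fetchall()
```

For real-time scoring, the same function could be called directly for a single record at the point of interaction instead of in a nightly batch update.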

| Technique | Strengths | Considerations |
|---|---|---|
| Neural Networks | Flexible; mimics interactions of neurons in the human brain; can handle time-based inputs; can model multiple variables at once. | Models aren’t easily explained; all values must be between 0 and 1 with no nulls. Not great for categorical variables or lots of variables. |
| Decision Trees | Models are easy to explain; good for categorical and numeric data; good for creating a subset of fields as input to another technique. | Models can get “bushy” with sparse data and have to be “pruned.” |
| Memory-Based Reasoning | Finds values that most resemble the variable to make a prediction. Little prep. Adapts to new inputs without training. Works with text. | Doesn’t work with numeric variables, only categorical variables. Doesn’t work well with lots of variables. |
| Market Basket Analysis | A form of clustering that creates rules about which items are purchased together. | Doesn’t work with numeric variables, only categorical variables. |
| Genetic Algorithms | Uses natural selection; tests each prediction against each other to determine the best one. | Not for classification. |
| Clustering | Undirected learning. Finds natural groups. Good way to start. | Not predictive. |

Model Management. Once the model is built and deployed, it must be maintained. Models become obsolete over time as the market or environment in which they operate changes. This is particularly true in volatile environments, such as customer marketing or risk management. Also, complex models that deliver high business value usually require a team of people to create, modify, update, and certify them. In such an environment, it’s critical to have a model repository that can track versions, audit usage, and manage a model through its lifecycle. Once an organization has more than one operational model, it’s imperative that it implement model management utilities, which most data mining vendors now support.
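To make the idea of a model repository concrete, here is a minimal in-memory Python sketch that tracks versions, logs usage, and moves a model through a simple lifecycle. The class names, the draft/certified/retired statuses, and the methods are assumptions made up for this example; they do not describe any particular vendor’s utilities.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelVersion:
    version: int
    description: str
    status: str = "draft"  # assumed lifecycle: draft -> certified -> retired
    usage_log: list = field(default_factory=list)


class ModelRepository:
    """In-memory registry; a real repository would persist this information."""

    def __init__(self):
        self._models = {}

    def register(self, name, description):
        """Add a new version of a model, numbered sequentially from 1."""
        versions = self._models.setdefault(name, [])
        entry = ModelVersion(version=len(versions) + 1, description=description)
        versions.append(entry)
        return entry

    def get(self, name, version):
        return self._models[name][version - 1]

    def certify(self, name, version):
        """Certify one version and retire any previously certified version."""
        for entry in self._models[name]:
            if entry.status == "certified":
                entry.status = "retired"
        self.get(name, version).status = "certified"

    def record_use(self, name, version, user):
        """Append an audit entry recording who used the model and when."""
        timestamp = datetime.now(timezone.utc)
        self.get(name, version).usage_log.append((user, timestamp))

    def current(self, name):
        """Return the certified version of a model, or None if there is none."""
        for entry in self._models[name]:
            if entry.status == "certified":
                return entry
        return None


repo = ModelRepository()
repo.register("churn", "initial model")
repo.register("churn", "refreshed model")
repo.certify("churn", 1)
repo.certify("churn", 2)  # version 1 is retired automatically
repo.record_use("churn", 2, "marketing_batch")
```

Keeping retired versions in the repository, rather than deleting them, is what allows earlier scores to be traced back to the model that produced them.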

Summary

Analytical models can be powerful. They can help organizations use information proactively instead of reactively. They can make predictions that streamline business processes, reduce costs, increase revenues, and improve customer satisfaction.

Creating analytical models is as much art as science. A well-trained modeler needs to step through a variety of data-oriented tasks to create accurate models. Much of the heavy lifting lies in exploring and preparing the data. A well-designed data warehouse or data mart can accelerate the modeling process by collecting and documenting a large portion of the data that modelers require and transforming that data into wide, flat tables conducive to the modeling process.
