
Unit IV: Cluster Analysis and Mining

Introduction to cluster analysis; Mining complex types of data: Multidimensional analysis and descriptive mining of complex data objects, Spatial databases, Multimedia databases, Mining time series and sequence data, Mining text databases, Mining the World Wide Web; Data Chunking Techniques.

Assignment: Mining time series and sequence data

Data Mining - Cluster Analysis

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

Clustering is the process of grouping a set of abstract objects into classes of similar objects.

Points to Remember

A cluster of data objects can be treated as one group.

While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.

The main advantage of clustering over classification is that it is adaptable to changes and helps single out the features that distinguish different groups.

Applications of Cluster Analysis

Cluster analysis is widely used in applications such as market research, pattern recognition, data analysis, and image processing.

Clustering can help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.

In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.

Clustering also helps identify areas of similar land use in an earth observation database, as well as groups of houses in a city according to house type, value, and geographic location.

Clustering also helps in classifying documents on the web for information discovery.

Clustering is also used in outlier detection applications such as detection of credit card fraud.

As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points explain why clustering is required in data mining:

Scalability − We need highly scalable clustering algorithms to deal with large databases.

Ability to deal with different kinds of attributes − Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.


Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find only small spherical clusters.

High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.

Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.

Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods

Clustering methods can be classified into the following categories −

Partitioning Method
Hierarchical Method
Density-based Method
Grid-based Method
Model-based Method
Constraint-based Method

Partitioning Method

Suppose we are given a database of n objects, and the partitioning method constructs k partitions of the data. Each partition represents a cluster, and k ≤ n. That is, the data are classified into k groups that satisfy the following requirements −

Each group contains at least one object.

Each object must belong to exactly one group.

Points to remember −

For a given number of partitions (say k), the partitioning method creates an initial partitioning.

Then it uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
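As a concrete illustration (not part of the original notes), here is a minimal sketch of k-means, the classic partitioning algorithm, using scikit-learn; the toy data and k = 2 are invented for the example.

# Minimal k-means sketch (partitioning method); data and k are toy values.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)),    # loose group around (0, 0)
                  rng.normal(5, 1, (50, 2))])   # loose group around (5, 5)

# Partition the n objects into k = 2 groups: each object belongs to
# exactly one group, and each group contains at least one object.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(km.labels_[:10])        # cluster assignment per object
print(km.cluster_centers_)    # one representative point per cluster

Each iteration of k-means is exactly the iterative relocation described above: objects are reassigned to the nearest center, then centers are recomputed.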

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here −

Agglomerative Approach
Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps merging the objects or groups that are close to one another. It keeps doing so until all of the groups are merged into one or until the termination condition holds.
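A minimal sketch of this bottom-up merging using SciPy's hierarchical clustering; the points and the distance threshold are invented for illustration.

# Agglomerative (bottom-up) clustering sketch with SciPy; toy points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

# Start with every object in its own group, then repeatedly merge the
# two closest groups ("single" linkage = nearest-neighbor distance).
merge_tree = linkage(points, method="single")

# Cut the merge tree so that objects closer than 1.0 end up together.
labels = fcluster(merge_tree, t=1.0, criterion="distance")
print(labels)   # e.g. [1 1 2 2]: two clusters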


Divisive Approach

This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. With each iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. Hierarchical methods are rigid, i.e., once a merge or split is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical clustering −

Perform careful analysis of object linkages at each hierarchical partitioning.

Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., the neighborhood of a given radius around each data point in a cluster must contain at least a minimum number of points.
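The notes don't name an algorithm, but DBSCAN is the best-known density-based method; this sketch uses scikit-learn's implementation, where eps and min_samples play the role of the radius and the minimum point count (data invented).

# Density-based clustering sketch with DBSCAN; toy data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense_blob = rng.normal(0, 0.3, (40, 2))        # one dense region
outliers = np.array([[4.0, 4.0], [-4.0, 3.0]])  # isolated points
data = np.vstack([dense_blob, outliers])

# eps = neighborhood radius; min_samples = minimum points the
# neighborhood must contain, matching the density threshold above.
db = DBSCAN(eps=0.5, min_samples=5).fit(data)

print(db.labels_)   # -1 marks noise points that fall in no dense region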

Grid-based Method

In this method, the object space is quantized into a finite number of cells that form a grid structure.

Advantages

The major advantage of this method is its fast processing time.

Processing time depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.

Model-based Method

In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model. This method locates clusters by clustering the density function, which reflects the spatial distribution of the data points.

This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
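One standard model-based method (not named in the notes) is a Gaussian mixture fitted with EM; the sketch below uses scikit-learn's GaussianMixture with invented data, and the BIC score echoes the point about choosing the number of clusters statistically.

# Model-based clustering sketch: one Gaussian model per cluster, fit by EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(6, 1, (60, 2))])

# Hypothesize 2 Gaussian components and find the best fit of the data.
gm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gm.means_)             # fitted cluster centers
print(gm.predict(data[:5]))  # most likely cluster per object
# Comparing BIC across different n_components is one standard way to
# "automatically determine the number of clusters" from statistics.
print(gm.bic(data))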

Constraint-based Method

In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A constraint refers to the user expectation or the properties of desired clustering results. Constraints provide us with an interactive way of communication with the clustering process. Constraints can be specified by the user or the application requirement.


Multidimensional Analysis and Descriptive Mining of Complex Data Objects

What are complex data objects?

1. Many advanced, data-intensive applications, such as scientific research and engineering design, need to store, access, and analyze complex but relatively structured data objects.

2. These objects cannot be represented as simple and uniformly structured records (i.e., tuples) in data relations.

3. These kinds of systems deal with the efficient storage and access of vast amounts of disk-based complex structured data objects.

4. These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies.

5. Each object in a class is associated with (1) an object identifier, (2) a set of attributes that may contain sophisticated data structures, set- or list-valued data, class composition hierarchies, and multimedia data, and (3) a set of methods that specify the computational routines or rules associated with the object class.

How can we Generalize Structured Data?

Typically, set-valued data can be generalized by (1) generalization of each value in the set to its corresponding higher-level concept, or (2) derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, the weighted average for numerical data, or the major clusters formed by the set.

Suppose that the hobby of a person is a set-valued attribute containing the set of values {tennis, hockey, soccer, violin, SimCity}. This set can be generalized to a set of higher-level concepts, such as {sports, music, computer games}.
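A toy sketch of this kind of set-valued generalization; the concept-hierarchy dictionary here is invented for the example.

# Set-valued generalization via an invented concept hierarchy.
concept_of = {
    "tennis": "sports", "hockey": "sports", "soccer": "sports",
    "violin": "music", "SimCity": "computer games",
}

def generalize(values):
    # Map each low-level value to its higher-level concept and
    # drop duplicates, as described in the text.
    return {concept_of.get(v, v) for v in values}

print(generalize({"tennis", "hockey", "soccer", "violin", "SimCity"}))
# -> {'sports', 'music', 'computer games'}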

How are Aggregation and Approximation done in Spatial and Multimedia Data Generalization?

1. Aggregation and approximation are another important means of generalization.

2. In a spatial merge, it is necessary not only to merge the regions of similar types within the same general class but also to compute the total areas, average density, or other aggregate functions, while ignoring some scattered regions of different types if they are unimportant to the study.

3. Spatial aggregation and approximation. Suppose that we have different pieces of land used for various agricultural purposes, such as the planting of vegetables, grains, and fruits. These pieces can be merged or aggregated into one large piece of agricultural land by a spatial merge. However, such a piece of agricultural land may contain highways, houses, and small stores. If the majority of the land is used for agriculture, the scattered regions for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation.

4. A multimedia database may contain complex texts, graphics, images, video fragments, maps, voice, music, and other forms of audio/video information.

5. Generalization on multimedia data can be performed by recognition and extraction of the essential features and/or general patterns of such data. There are many ways to extract such information. For an image, the size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions can be extracted by aggregation and/or approximation. For a segment of music, its melody can be summarized based on the approximate patterns that repeatedly occur in the segment, while its style can be summarized based on its tone, tempo, or the major musical instruments played.

What is Spatial Data Mining?

1. A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial databases have many features that distinguish them from relational databases.

2. Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands an integration of data mining with spatial database technologies.

3. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.

4. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used.

5. A crucial challenge to spatial data mining is the development of efficient spatial data mining techniques, due to the huge amount of spatial data and the complexity of spatial data types and spatial access methods.

“Can we construct a spatial data warehouse?” Yes, as with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.

There are three types of dimensions in a spatial data cube:

1. A nonspatial dimension contains only nonspatial data (such as “hot” for temperature and “wet” for precipitation).

2. A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but whose generalization, starting at a certain high level, becomes nonspatial (e.g., city).

3. A spatial-to-spatial dimension is a dimension whose primitive level and all of its high-level generalized data are spatial (e.g., equi-temperature regions).

There are two types of measures in a spatial data cube:

1. A numerical measure contains only numerical data. For example, one measure in a spatial data warehouse could be the monthly revenue of a region.

2. A spatial measure contains a collection of pointers to spatial objects (e.g., pointers to the regions that share the same temperature and precipitation).

There are several challenging issues regarding the construction and utilization of spatial data warehouses.

1. The first challenge is the integration of spatial data from heterogeneous sources and systems.

2. The second challenge is the realization of fast and flexible on-line analytical processing in spatial data warehouses.

Mining Spatial Association


A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of spatial or nonspatial predicates, s% is the support of the rule, and c% is the confidence of the rule. For example, the following is a spatial association rule:

    is_a(X, "school") ∧ close_to(X, "sports_center") ⇒ close_to(X, "park") [0.5%, 80%]

This rule states that 80% of schools that are close to sports centers are also close to parks, and 0.5% of the data belongs to such a case.
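A toy computation of support and confidence for such a rule; the objects and predicate names below are invented for illustration.

# Support and confidence for a rule A => B over toy predicate sets.
objects = [
    {"is_a_school", "close_to_sports_center", "close_to_park"},
    {"is_a_school", "close_to_sports_center"},
    {"is_a_school", "close_to_park"},
    {"is_a_house"},
]

A = {"is_a_school", "close_to_sports_center"}   # rule antecedent
B = {"close_to_park"}                            # rule consequent

n_A = sum(A <= o for o in objects)          # objects satisfying A
n_AB = sum((A | B) <= o for o in objects)   # objects satisfying A and B

support = n_AB / len(objects)    # the s% in the rule notation
confidence = n_AB / n_A          # the c% in the rule notation
print(f"support={support:.0%} confidence={confidence:.0%}")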

------------------------------------------------------------------------------------------------

What is Multimedia Data Mining?

A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages.

Similarity Search in Multimedia Data

For similarity searching in multimedia data, we consider two main families of multimedia indexing and retrieval systems: (1) description-based retrieval systems, which build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation; and (2) content-based retrieval systems, which support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.

In a content-based image retrieval system, there are often two kinds of queries: image-sample-based queries and image-feature-specification queries.

Image-sample-based queries find all of the images that are similar to the given image sample. This search compares the feature vector (or signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database. Based on this comparison, images that are close to the sample image are returned.

Image-feature-specification queries specify or sketch image features like color, texture, or shape, which are translated into a feature vector to be matched with the feature vectors of the images in the database.

Mining Associations in Multimedia Data

1. Associations between image content and non-image content features: A rule like “If at least 50% of the upper part of the picture is blue, then it is likely to represent sky” belongs to this category, since it links the image content to the keyword sky.

2. Associations among image contents that are not related to spatial relationships: A rule like “If a picture contains two blue squares, then it is likely to contain one red circle as well” belongs to this category, since the associations all concern image contents.

3. Associations among image contents related to spatial relationships: A rule like “If a red triangle is between two yellow squares, then it is likely a big oval-shaped object is underneath” belongs to this category, since it associates objects in the image through a spatial relationship.

Several approaches have been proposed and studied for similarity-based retrieval in image databases, based on image signatures:


1. Color histogram-based signature: In this approach, the signature of an image includes color histograms based on the color composition of an image, regardless of its scale or orientation. This method does not contain any information about shape, image topology, or texture.

2. Multifeature composed signature: In this approach, the signature of an image includes a composition of multiple features: color histogram, shape, image topology, and texture.
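A minimal sketch of a color-histogram signature and a nearest-signature search, using NumPy only; real systems use richer features, and the random "images" here are placeholders for real pixel data.

# Color-histogram signature and similarity search (toy data).
import numpy as np

def histogram_signature(image, bins=8):
    """image: HxWx3 uint8 array -> normalized per-channel color histogram."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    sig = np.concatenate(hists).astype(float)
    return sig / sig.sum()    # normalization makes it scale-independent

def most_similar(query_sig, indexed_sigs):
    # Smaller distance between signatures = more similar image.
    dists = np.linalg.norm(indexed_sigs - query_sig, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(3)
db_images = [rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)
             for _ in range(5)]                      # stand-ins for images
index = np.stack([histogram_signature(im) for im in db_images])
print(most_similar(histogram_signature(db_images[2]), index))  # -> 2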

“Can we construct a data cube for multimedia data analysis?” To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes from relational data. A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.

What is Text Mining?

Text databases (or document databases) consist of large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, and Web pages. Text databases are growing rapidly due to the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic documents, e-mail, and the World Wide Web.

What is IR (Information Retrieval)?

A typical information retrieval problem is to locate relevant documents in a document collection based on a user's query, which is often some keywords describing an information need, although it could also be an example relevant document. In such a search problem, a user takes the initiative to "pull" the relevant information out of the collection; this is most appropriate when a user has an ad hoc (i.e., short-term) information need, such as finding information to buy a used car. When a user has a long-term information need (e.g., a researcher's interests), a retrieval system may also take the initiative to "push" any newly arrived information item to the user if the item is judged relevant to that need. Such an information access process is called information filtering, and the corresponding systems are often called filtering systems or recommender systems.

Basic Measures for Text Retrieval: Precision and Recall

"Suppose that a text retrieval system has just retrieved a number of documents for me based on my input in the form of a query. How can we assess how accurate or correct the system was?"

Precision: This is the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses). It is formally defined as

    precision = |{relevant} ∩ {retrieved}| / |{retrieved}|

Recall: This is the percentage of documents that are relevant to the query and were, in fact, retrieved. It is formally defined as

    recall = |{relevant} ∩ {retrieved}| / |{relevant}|
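Both measures are easy to compute from document ID sets; a minimal sketch with invented IDs.

# Precision and recall from document ID sets (toy IDs).
def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)              # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 4, 6, 8}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")   # 0.40 and 0.50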


How is Mining the World Wide Web done?

The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services. The Web also contains a rich and dynamic collection of hyperlink information and Web page access and usage information, providing rich sources for data mining.

Challenges for effective resource and knowledge discovery on the Web:

1. The Web seems too huge for effective data warehousing and data mining. The size of the Web is on the order of hundreds of terabytes and is still growing rapidly. Many organizations and societies place most of their publicly accessible information on the Web. It is barely possible to set up a data warehouse to replicate, store, or integrate all of the data on the Web.

2. The complexity of Web pages is far greater than that of any traditional text document collection. Web pages lack a unifying structure.

3. The Web is a highly dynamic information source. Not only does the Web grow rapidly, but its information is also constantly updated.

4. The Web serves a broad diversity of user communities. The Internet currently connects more than 100 million workstations, and its user community is still rapidly expanding.

These challenges have promoted research into efficient and effective discovery and use of resources on the Internet.

Mining the WWW

1. Mining the Web Page Layout Structure

The basic structure of a Web page is its DOM (Document Object Model) structure. The DOM structure of a Web page is a tree, where every HTML tag in the page corresponds to a node in the tree. The Web page can be segmented using some predefined structural tags, so the DOM structure can be used to facilitate information extraction. Here, we introduce an algorithm called VIsion-based Page Segmentation (VIPS). VIPS aims to extract the semantic structure of a Web page based on its visual presentation.
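As a small illustration of the DOM-as-tree idea (VIPS itself is not shown here), Python's built-in html.parser can expose the tag nesting of a page; the HTML snippet is invented.

# Print a page's tag tree by nesting depth, using only the standard library.
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)   # each tag = a node in the DOM tree
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

TreePrinter().feed("<html><body><div><p>hi</p></div></body></html>")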

2. Mining the Web's Link Structures to Identify Authoritative Web Pages

The Web consists not only of pages, but also of hyperlinks pointing from one page to another. These hyperlinks contain an enormous amount of latent human annotation that can help automatically infer the notion of authority. These properties of Web link structures have led researchers to consider another important category of Web pages called hubs. A hub is one or a set of Web pages that provides collections of links to authorities.
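One standard way to turn this intuition into scores is Kleinberg's HITS algorithm (not named in the notes); a minimal sketch with a toy adjacency matrix.

# Minimal HITS sketch: mutually reinforcing hub and authority scores.
import numpy as np

# adj[i, j] = 1 means page i links to page j (toy 3-page web).
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)

hub = np.ones(3)
auth = np.ones(3)
for _ in range(50):
    auth = adj.T @ hub   # good authorities are pointed to by good hubs
    hub = adj @ auth     # good hubs point to good authorities
    auth /= np.linalg.norm(auth)
    hub /= np.linalg.norm(hub)

print("hub scores:      ", hub.round(3))   # page 0 is the strongest hub
print("authority scores:", auth.round(3))  # page 2 is the strongest authority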


What are the various Data Mining Applications?

Data Mining for Financial Data Analysis

Design and construction of data warehouses for multidimensional data analysis and data mining: Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality, which facilitates systematic data analysis and data mining. One may like to view debt and revenue changes by month, by region, by sector, and by other factors, along with maximum, minimum, total, average, trend, and other statistical information.

Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance and customer credit rating.

Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing. For example, we can use classification to identify the most crucial factors that may influence a customer's decision regarding banking. Customers with similar behaviors regarding loan payments may be identified by multidimensional clustering techniques.

Detection of money laundering and other financial crimes: To detect money laundering and other financial crimes, it is important to integrate information from multiple databases (such as bank transaction databases and federal or state crime history databases), as long as they are potentially related to the study.

Data Mining for the Retail Industry

Design and construction of data warehouses based on the benefits of data mining: Because retail data cover a wide spectrum (including sales, customers, employees, goods transportation, consumption, and services), there can be many ways to design a data warehouse for this industry.

Multidimensional analysis of sales, customers, products, time, and region: The retail industry requires timely information regarding customer needs, product sales, trends, and fashions, as well as the quality, cost, profit, and service of commodities.

Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns using advertisements, coupons, and various kinds of discounts and bonuses to promote products and attract customers.

Customer retention (analysis of customer loyalty): With customer loyalty card information, one can register sequences of purchases of particular customers. Customer loyalty and purchase trends can be analyzed systematically.

Product recommendation and cross-referencing of items: By mining associations from sales records, one may discover that a customer who buys a digital camera is likely to buy another set of items. Such information can be used to form product recommendations. Collaborative recommender systems use data mining techniques to make personalized product recommendations during live customer transactions, based on the opinions of other customers.


Data Mining for the Telecommunication Industry

Fraudulent pattern analysis and the identification of unusual patterns: Fraudulent activity costs the telecommunication industry millions of dollars per year. It is important to (1) identify potentially fraudulent users and their atypical usage patterns; (2) detect attempts to gain fraudulent entry to customer accounts; and (3) discover unusual patterns that may need special attention, such as busy-hour frustrated call attempts, switch and route congestion patterns, and periodic calls from automatic dial-out equipment (like fax machines) that has been improperly programmed.

Multidimensional association and sequential pattern analysis: The discovery of association and sequential patterns in multidimensional analysis can be used to promote telecommunication services. For example, suppose you would like to find usage patterns for a set of communication services by customer group, by month, and by time of day.

Mobile telecommunication services: Mobile telecommunication, Web and information services, and mobile computing are becoming increasingly integrated and common in our work and life.

Explain the Social Impacts of Data Mining

Ubiquitous data mining is the ever-presence of data mining in many aspects of our daily lives. It can influence how we shop, work, search for information, and use a computer, as well as our leisure time, health, and well-being. In invisible data mining, "smart" software, such as Web search engines, customer-adaptive Web services (e.g., using recommender algorithms), e-mail managers, and so on, incorporates data mining into its functional components, often unbeknownst to the user.

From grocery stores that print personalized coupons on customer receipts to on-line stores that recommend additional items based on customer interests, data mining has innovatively influenced what we buy, the way we shop, and our experience while shopping.

Data mining has shaped the on-line shopping experience. Many shoppers routinely turn to on-line stores to purchase books, music, movies, and toys.

Many companies increasingly use data mining for customer relationship management (CRM), which helps provide more customized, personal service addressing individual customers' needs, in lieu of mass marketing.

While you are viewing the results of your Google query, various ads pop up relating to your query. Google's strategy of tailoring advertising to match the user's interests is successful: it has increased the clicks for the companies involved by four to five times.

Web-wide tracking is a technology that tracks a user across each site she visits. So, while surfing the Web, information about every site you visit may be recorded, which can provide marketers with information reflecting your interests, lifestyle, and habits.

Finally, data mining can contribute toward our health and well-being. Several pharmaceutical companies use data mining software to analyze data when developing drugs and to find associations between patients, drugs, and outcomes. It is also being used to detect beneficial side effects of drugs.

What are the Major Concerns of Data Mining?

A major social concern of data mining is the issue of privacy and data security, particularly as the amount of data collected on individuals continues to grow. Fair information practices were established for privacy and data protection, and cover aspects regarding the collection and use of personal data. Data mining for counterterrorism can benefit homeland security and save lives, yet raises additional concerns for privacy due to the possible access of personal data. Efforts toward ensuring privacy and data security include the development of privacy-preserving data mining (which deals with obtaining valid data mining results without learning the underlying data values) and data security-enhancing techniques (such as encryption).

What are the Recent Trends in Data Mining?

Trends in data mining include further efforts toward the exploration of new application areas; improved scalable and interactive methods (including constraint-based mining); the integration of data mining with data warehousing and database systems; the standardization of data mining languages; visualization methods; and new methods for handling complex data types. Other trends include biological data mining, mining software bugs, Web mining, distributed and real-time mining, graph mining, social network analysis, multirelational and multidatabase data mining, data privacy protection, and data security.

Data Chunking

Data deduplication is an emerging technology that reduces storage utilization and offers an efficient way of handling data replication in the backup environment. In cloud data storage, deduplication plays a major role in the virtual machine framework, data sharing networks, structured and unstructured data handling by social media, and disaster recovery. In deduplication, data are broken down into multiple pieces called "chunks", and every chunk is identified with a unique hash identifier. These identifiers are used to compare the chunks with previously stored chunks and check for duplication. Since the chunking algorithm is the first step in achieving a good deduplication ratio and throughput, it is very important in the deduplication scenario.
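A minimal sketch of fixed-size chunking with SHA-256 chunk identifiers; the chunk size and in-memory store are illustrative only (production systems typically use content-defined chunking and persistent indexes).

# Fixed-size chunking + hash identifiers for deduplication (toy store).
import hashlib

CHUNK_SIZE = 4096
store = {}                        # hash identifier -> chunk bytes

def dedup_write(data: bytes) -> int:
    """Split data into chunks; store only chunks not seen before."""
    new_chunks = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()   # unique identifier
        if digest not in store:                      # compare with stored chunks
            store[digest] = chunk
            new_chunks += 1
    return new_chunks

print(dedup_write(b"A" * 4096 + b"B" * 4096))  # 2 distinct chunks stored
print(dedup_write(b"B" * 4096 + b"A" * 4096))  # same chunks reordered -> 0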

What is data chunking? How can chunking help to organize large multidimensional datasets for both fast and flexible data access? How should chunk shapes and sizes be chosen? Can software such as netCDF-4 or HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues they raise, read on ...

Is anyone still there? OK, let's start with a real-world example of the improvements possible with chunking in netCDF-4. You may be surprised, as I was, by the results. Maybe looking at examples in this post will help provide some guidance for effective use of chunking in other similar cases.

First let's consider a single large 3D variable from the NCEP North American Regional Reanalysis representing air temperature (if you must know, it's at the 200 millibars level, at every 3 hours, on a 32.463 km resolution grid, over 33 years from 1979-01-01 through 2012-07-31):

dimensions:
    y = 277 ;
    x = 349 ;
    time = 98128 ;
variables:
    float T(time, y, x) ;

Of course the file has lots of other metadata specifying units, coordinate system, and data provenance, but in terms of size it's mostly just that one big variable: 9.5 billion values comprising 38 GB of data.

This file is an example of PrettyBigData (PBD). Even though you can store it on a relatively cheap flash drive, it's too big to deal with quickly. Just copying it to a 7200 rpm spinning disk takes close to 20 minutes. Even copying to fast solid state disk (SSD) takes over 4 minutes. For a human-scale comparison, it's close to the storage used for a Blu-ray version of a typical movie, about 50 GB. (As an example of ReallyBigData (RBD), a data volume beyond the comprehension of ordinary humans, consider the 3D, 48 frame per second version of "The Hobbit, Director's Cut".)

Access Patterns make a difference

For now, let's ignore issues of compression, and just consider putting that file on a server and permitting remote access to small subsets of the data. Two common access patterns are:

 

1. Get a 1D time series of all the data from a specific spatial grid point.

2. Get a 2D spatial cross section of all the data at a specific time.

The first kind of access asks for a 1D array of values along the time axis at a single (x, y) grid point; the second asks for a 2D array of values over the whole spatial grid at a single time step.

With a conventional contiguous (index-order) storage layout, the time dimension varies most slowly, y varies faster, and x varies fastest. In this case, the spatial access is fast (0.013 sec) and the time series access is slow (180 sec, which is 14,000 times slower). If we instead want the time series to be quick, we can reorganize the data so x or y is the most slowly varying dimension and time varies fastest, resulting in fast time-series access (0.012 sec) and slow spatial access (200 sec, 17,000 times slower). In either case, the slow access is so slow that it makes the data essentially inaccessible for all practical purposes, e.g., in analysis or visualization.

But what if we want both kinds of access to be relatively fast? Well, we could punt and make two versions of the file, each organized appropriately for the kind of access desired. But that solution doesn't scale well. For N-dimensional data, you would need N copies of the data to support optimal access along any axis, and N! copies to support optimal access to any cross-section defined by some subset of the N dimensions.

A better solution, known for at least 30 years, is the use of chunking, storing multidimensional data in multi-dimensional rectangular chunks to speed up slow accesses at the cost of slowing down fast accesses. Programs that access chunked data can be oblivious to whether or how chunking is used. Chunking is supported in the HDF5 layer of netCDF-4 files, and is one of the features, along with per-chunk compression, that led to a proposal to use HDF5 as a storage layer for netCDF-4 in 2002.
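For example, with the netCDF4-python library a chunked variable can be created by passing chunksizes; this is only a sketch, and the chunk shape shown is one of those explored later in this post.

# Create a chunked netCDF-4 variable (sketch; chunk shape from the
# 8 KB row of the timing table below).
from netCDF4 import Dataset

ds = Dataset("T_chunked.nc", "w", format="NETCDF4")
ds.createDimension("time", 98128)
ds.createDimension("y", 277)
ds.createDimension("x", 349)

# chunksizes picks the shape of each rectangular chunk on disk;
# readers use the variable exactly as if it were stored contiguously.
t = ds.createVariable("T", "f4", ("time", "y", "x"),
                      chunksizes=(46, 6, 8))
ds.close()

For existing files, the nccopy and h5repack utilities mentioned later can rechunk data without any code.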


Benefits of Chunking

I think the benefits of chunking are under-appreciated.

Large performance gains are possible with good choices of chunk shapes and sizes. Chunking also supports efficiently extending multidimensional data along multiple axes (in netCDF-4, this is called "multiple unlimited dimensions") as well as efficient per-chunk compression, so reading a subset of a compressed variable doesn't require uncompressing the whole variable.

So why isn't chunking more widely used? I think reasons include at least the following:

1. Advice for how to choose chunk shapes and sizes for specific patterns of access is lacking.

2. Default chunk shapes and sizes for libraries such as netCDF-4 and HDF5 work poorly in some common cases.

3. It's costly to rewrite big datasets that use conventional contiguous layouts to use chunking instead. For example, even if you can fit the whole variable, uncompressed, in memory, chunking a 38 GB variable can take 20 or 30 minutes.

This series of posts and better guidance in software documentation will begin to address the first problem. HDF5 already has a start with a white paper on chunking.

The second reason for under-use of chunking is not so easily addressed. Unfortunately, there are no general-purpose chunking defaults that are optimal for all uses. Different patterns of access lead to different chunk shapes and sizes for optimum access. Optimizing for a single specific pattern of access can degrade performance for other access patterns.

Finally, the cost of chunking data means that you either need to get it right when the data is created, or the data must be important enough that the cost of rechunking for many read accesses is justified. In the latter case, you may want to consider acquiring a computing platform with lots of memory and SSD, just for the purpose of rechunking important datasets.

What a difference the shape makes

We have claimed that good choices of chunk shapes and sizes can make large datasets useful for access in multiple ways. For the specific example we've chosen, how well do the netCDF-4 library defaults work, and what's the best we can do by tailoring the chunking to the specific access patterns we've chosen, 1D time series at a point and 2D spatial access at a specific time?

Here's a table of timings for various shapes and sizes of chunks, using conventional local 7200 rpm spinning disk with 4096-byte physical disk blocks, the kind of storage that's still prevalent on desk-top and departmental scale platforms:

Storage layout, chunk shapes                      Read time series (sec)   Read spatial slice (sec)   Performance bias (slowest/fastest)
Contiguous favoring time range                    0.013                    180                        14000
Contiguous favoring spatial slice                 200                      0.012                      17000
Default (all axes equal) chunks, 4673 x 12 x 16   1.4                      34                         24
36 KB chunks, 92 x 9 x 11                         2.4                      1.7                        1.4
8 KB chunks, 46 x 6 x 8                           1.4                      1.1                        1.2

We've already seen the timings in the first two rows of this table, showing huge performance bias when using contiguous layouts. The third row shows the current netCDF-4 default for chunking this data, choosing chunk sizes close to 4 MB and trying to equalize the number of chunks along any axis. This turns out not to be particularly good for trying to balance 1D and 2D accesses. The fourth row shows results of smaller chunk sizes, using shapes that provide a better balance between time series and spatial slice accesses for this dataset.

I think the last row of this table supports the main point to be made in this first posting on chunking data. By creating or rewriting important large multidimensional datasets using appropriate chunking, you can tailor their access characteristics to make them more useful. Proper use of chunking can support more than one common query pattern.

That's enough for now. In part 2, we'll discuss how to determine good chunk shapes, present a general way to balance access times for 1D and 2D accesses in 3D variables, say something about generalizations to higher dimension variables, and provide examples of rechunking times using the nccopy and h5repack utilities.

In later posts, we plan to offer opinions on related issues, possibly including

chunk size tradeoffs: small chunks vs. large chunks
chunking support in tools
chunking and compression
complexity of the general rechunking problem
problems with big disk blocks
worst case performance
space-filling curves for improving access locality

In the meantime, Keep On Chunkin' ...


A note about timings

The times we quoted above are averaged from multiple consecutive runs on a desktop Linux system (2.27GHz Xeon CPU, 4 cores, 24 GB of memory, 7200 rpm disk on a SATA-3 bus). Each timing uses a different set of chunks, so we are not exploiting chunk caching. Before reading a subset of data from a file, we run a command to flush and clear all the disk caches in memory, so running the timing repeatedly yields nearly the same time. The command to clear disk caches varies from system to system. On Linux, it requires privileges to set the SUID bit on a shell script:

#!/bin/sh
# Flush and clear the disk caches.
sync
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

On OSX, the only privilege required is knowledge of the "purge" command, which we wrap in a shell script to make our benchmarks work the same on OSX as on Linux:

#!/bin/sh
# Flush and clear the disk caches.
purge