Standardization
- A collection of data can be converted to standardized, or unitless, form.
- This is performed by subtracting the mean of the data set from each observation and dividing by the standard deviation.
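As a minimal sketch of this computation (assuming NumPy; the function name `standardize` and the use of the sample standard deviation are our choices, not the slides'):

```python
import numpy as np

def standardize(x):
    # Subtract the mean of the data set and divide by the standard deviation.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)  # ddof=1: sample sd (our assumption)

# Example: measurements in millimetres become unitless z-scores.
lengths_mm = [12.0, 15.5, 9.8, 20.1, 14.2]
print(standardize(lengths_mm))  # values now have mean 0 and sd 1
```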
CLUSTER ANALYSIS
- First used by Tryon (1939).
- Cluster analysis methods are mostly used when we do not have any a priori hypotheses.
- The purpose is to classify complex multivariate data through some means of reducing their dimensionality.
- It actually encompasses a number of different classification algorithms.
- In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.
Joining (Tree Clustering)
- The purpose of this algorithm is to join together objects (e.g., animals) into successively larger clusters, using some measure of similarity or distance.
- A typical result of this type of clustering is the hierarchical tree (called a dendrogram).
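As an illustrative sketch (our construction, not from the slides), SciPy can perform this joining and draw the resulting dendrogram; the data here are hypothetical random measurements:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical data: 6 objects (e.g., animals) measured on 3 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 3))

# Join objects into successively larger clusters using Euclidean
# distance and average linkage; Z encodes the hierarchical tree.
Z = linkage(data, method="average", metric="euclidean")

dendrogram(Z, labels=[f"obj{i}" for i in range(6)])
plt.title("Hierarchical tree (dendrogram)")
plt.show()
```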
Distance Measures
- The joining or tree clustering method uses the similarities/dissimilarities or distances between objects when forming the clusters. These distances can be based on a single dimension or on multiple dimensions.
- Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space. It is computed as: distance(x, y) = {Σᵢ (xᵢ − yᵢ)²}^½
- Squared Euclidean distance. One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as: distance(x, y) = Σᵢ (xᵢ − yᵢ)²
- Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as: distance(x, y) = maxᵢ |xᵢ − yᵢ|
- Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance. The power distance is computed as: distance(x, y) = (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/r), where r and p are user-defined parameters.
- Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as: distance(x, y) = (number of xᵢ ≠ yᵢ) / i, where i is the total number of dimensions.
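A sketch of these distance measures in NumPy (the vector values and the categorical example are hypothetical):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 3.0, 9.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))      # {Σᵢ (xᵢ − yᵢ)²}^½
squared_euclidean = np.sum((x - y) ** 2)       # Σᵢ (xᵢ − yᵢ)²
chebychev = np.max(np.abs(x - y))              # maxᵢ |xᵢ − yᵢ|

p, r = 3, 2                                    # user-defined parameters
power = np.sum(np.abs(x - y) ** p) ** (1 / r)  # (Σᵢ |xᵢ − yᵢ|ᵖ)^(1/r)

# Percent disagreement, for categorical data.
a = np.array(["sand", "silt", "clay"])
b = np.array(["sand", "clay", "clay"])
percent_disagreement = np.mean(a != b)         # (number of xᵢ ≠ yᵢ) / i
```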
Similarity measures for presence/absence data
1. Jaccard similarity. A match is counted for all taxa with presences in both samples. Using M for the number of matches and N for the total number of taxa with presences in just one sample, we have:
Jaccard similarity = M / (M+N)
2. Dice (Sorensen) coefficient. Puts more weight on joint occurrences (M) than on mismatches.
Dice similarity = 2M / (2M+N)
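A sketch of both coefficients for presence/absence vectors (the sample data are hypothetical; NumPy assumed):

```python
import numpy as np

def jaccard(s1, s2):
    # M: taxa present in both samples; N: taxa present in exactly one.
    s1, s2 = np.asarray(s1, bool), np.asarray(s2, bool)
    M = np.sum(s1 & s2)
    N = np.sum(s1 ^ s2)
    return M / (M + N)

def dice(s1, s2):
    # Joint occurrences (M) are counted twice, so they weigh more.
    s1, s2 = np.asarray(s1, bool), np.asarray(s2, bool)
    M = np.sum(s1 & s2)
    N = np.sum(s1 ^ s2)
    return 2 * M / (2 * M + N)

sample1 = [1, 1, 0, 1, 0]          # presence/absence of five taxa
sample2 = [1, 0, 0, 1, 1]
print(jaccard(sample1, sample2))   # M=2, N=2 -> 0.5
print(dice(sample1, sample2))      # 2*2 / (2*2 + 2) ≈ 0.667
```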
Similarity measures for abundance data
1. The Euclidean distance (see above).
2. Correlation (of the variables along rows) using Pearson's r.
3. Correlation using Spearman's rho (i.e., r computed on the ranks of the values).
4. Bray-Curtis distance measure, sensitive to absolute abundances.
5. Chord distance for abundance data. This index is
sensitive to species proportions and not to absolute
abundances. It projects the two multivariate sample
vectors onto a hypersphere and measures the distance
between these points, thus normalizing abundances to 1.
6. Morisita’s similarity index for abundance data.
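The slides do not spell out the formulas for measures 4 and 5; as a hedged sketch, here are common definitions of the Bray-Curtis and chord distances (Morisita's index is omitted):

```python
import numpy as np

def bray_curtis(x, y):
    # Common definition: Σ|xᵢ − yᵢ| / Σ(xᵢ + yᵢ); reacts to absolute abundances.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y)) / np.sum(x + y)

def chord(x, y):
    # Project both sample vectors onto the unit hypersphere (normalize to
    # length 1), then take the Euclidean distance: only proportions matter.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.linalg.norm(x / np.linalg.norm(x) - y / np.linalg.norm(y))

counts1 = np.array([10, 0, 5, 35])    # hypothetical abundance counts
counts2 = counts1 * 2                 # same proportions, double the counts
print(bray_curtis(counts1, counts2))  # 1/3: sensitive to absolute abundance
print(chord(counts1, counts2))        # 0.0: proportions are identical
```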
Amalgamation or Linkage Rules
- Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain"-type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages. It is also called mean linkage.
- Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations, the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous method) should be used when the cluster sizes are suspected to be greatly uneven. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.
- Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the difference between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.
- Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.
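All four linkage rules above are available in SciPy under different names; a sketch (the random data are hypothetical, and the centroid-based methods assume Euclidean geometry):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
data = rng.normal(size=(8, 4))  # hypothetical: 8 objects, 4 variables

Z_upgma = linkage(data, method="average")   # unweighted pair-group average
Z_wpgma = linkage(data, method="weighted")  # weighted pair-group average
Z_upgmc = linkage(data, method="centroid")  # unweighted pair-group centroid
Z_wpgmc = linkage(data, method="median")    # weighted pair-group centroid (median)
```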
Two-way Joining (Block Clustering)
- In the other types of analyses, the research question of interest is usually expressed in terms of either cases (observations) or variables. Here we can cluster both observations and variables.
- Two-way joining is useful in the (relatively rare) circumstances when one expects that both cases and variables will simultaneously contribute to the uncovering of meaningful patterns of clusters.
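One rough way to realize this idea (our sketch, not a specific block-clustering algorithm from the slides) is to cluster rows and columns independently and reorder the data matrix so that similar cases and similar variables sit next to each other:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 6))    # hypothetical cases-by-variables matrix

row_order = leaves_list(linkage(X, method="average"))    # cluster the cases
col_order = leaves_list(linkage(X.T, method="average"))  # cluster the variables

# Reordered matrix: blocks of similar values become contiguous.
X_blocked = X[np.ix_(row_order, col_order)]
```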
K-means Clustering
- Computations
- Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse". The program starts with k random clusters, and then moves objects between those clusters with the goal of (1) minimizing variability within clusters and (2) maximizing variability between clusters. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.
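A sketch of this procedure with scikit-learn's KMeans (an assumption; the slides do not name an implementation), using two hypothetical point clouds:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Hypothetical data: two loose clouds of 50 cases each in two dimensions.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Start with k random clusters, then iteratively move objects between
# them to minimize within-cluster and maximize between-cluster variability.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])        # cluster membership of the first five cases
print(km.cluster_centers_)   # the k cluster means
```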
- Interpretation of results
- Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all, dimensions used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.
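Continuing the hypothetical k-means sketch above, the per-dimension cluster means and F values could be inspected like this (scipy.stats.f_oneway performs the one-way ANOVA):

```python
from scipy.stats import f_oneway

# X and km come from the k-means sketch above.
for dim in range(X.shape[1]):
    groups = [X[km.labels_ == k, dim] for k in range(2)]
    means = [round(g.mean(), 2) for g in groups]
    F, p = f_oneway(*groups)  # large F: dimension separates the clusters well
    print(f"dimension {dim}: cluster means {means}, F = {F:.1f}, p = {p:.3g}")
```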