Using R and python for colorful research of biological shape

Python Project and Explore

Since Procrustes superimposition introduces the data into a curved shape space which corresponds to a hyperhemisphere of radius 1 with numerous dimensions (as defined by Rohlf, 1999^{1}), usual multivariate statistical methods are not readily applicable to such data. This data must be projected to a flat, Euclidean tangent space, that is in contact with the hyperhemisphere in one point, taken as a mean shape of the whole dataset. This projection can be applied only if the variation in shapes in not overtly large so that their “shadows” on a flat space are not too far apart. In this post, data will be generated as usual and Procrustes procedures will be taken from previous posts, using defined functions centsiz, transScale, mshapr, pProc and pGP.

Importing libraries, generating data and utility functions

After Procrustes superimposition, data can be projected orthogonally and stereographically. More common method is the orthogonal projection which minimizes lagre shape differences between landmark configurations. The procedure is based on Rohlf, 1999, suggesting that the matrix of aligned centered preshapes should be multiplied by the mean shape of unit centroid size, subtracted from the identity matrix of comparable dimensionality (Claude, 2008).

Orthogonal projection of superimposed data, Procrustes superimposition and projection

12345678910111213141516171819202122

defortproj(A,numland,dim,numind):#where A is a pandas dataframe (tempCoords), numind = landmarks, numind = individualsmshape=A.iloc[:,0:2].groupby(A.iloc[:,2]).mean()mshStd=mshape/centsize(mshape)mshMel=pd.melt(mshStd).drop('variable',1).T#mshRep = pd.concat([mshMel]*numind, ignore_index = True) #calculate repeating size standardized meanShapemshRep=np.dot(np.ones([numind,1]),mshMel)simDia=np.identity(numland*dim)#identity matrix#lonWid = A[["x","y"]].values.reshape(numind,numland*dim) #must have column names "x" and "y" for coordinateslonWidX=A["x"].values.reshape(numind,numland)lonWidY=A["y"].values.reshape(numind,numland)lonWid=np.concatenate((lonWidX,lonWidY),axis=1)temp1=simDia-np.dot(mshMel.T,mshMel)Xi=np.dot(lonWid,temp1)#orthogonal according to RholfXproj=Xi+mshRepreturnXprojprocCoords=pGP(allCoords,200,2,10)#perform Procrustes superimpositionprocoordinates=np.tile(np.arange(0,10),200)procindividuals=allCoords['individuals']procCoords['coordinates']=procoordinatesprocCoords['individuals']=sort(procindividuals)projCoords=ortproj(procCoords,10,2,200)#perform orthogonal projection

Numpy array projCoords holds wide representation of projected data (rows are individuals, while columns are all landmarks, first all x and then y coordinates), ready to be used in PCA or any other multivariate method. In order to illustrate potential grouping in PCA morphospace, since the data is randomly generated, a kmeans clustering will be performed with 3 groups, just to effectively split the data. PCA can be done in python in numerous ways, but in this post a PCA decomposition from scikit-learn package is used.

PCA, cluster analysis and plots

123456789101112131415161718192021222324

pca=PCA(n_components=3)pca.fit(projCoords)print(pca.explained_variance_ratio_)projCoordsP=pca.transform(projCoords)#get the individual scores on PC axesclusters=pd.Series(pyc.kcluster(projCoordsP,nclusters=3,method='a',dist='e')[0])set2=brewer2mpl.get_map('RdYlBu','diverging',3).mpl_colors#generate nice colorsprojCoordsPa=np.column_stack((projCoordsP,clusters))#concatenate all for sequential plot-buildinggroup1PCA=projCoordsPa[projCoordsPa[:,3]==0]#split by group for sequential plot-buildinggroup2PCA=projCoordsPa[projCoordsPa[:,3]==1]group3PCA=projCoordsPa[projCoordsPa[:,3]==2]fig=plt.figure()ax=fig.add_subplot(111,projection='3d')ax.plot(group1PCA[:,0],group1PCA[:,1],group1PCA[:,2],'o',c=set2[0],alpha=1,label='Group1')ax.plot(group2PCA[:,0],group2PCA[:,1],group2PCA[:,2],'o',c=set2[1],alpha=1,label='Group2')ax.plot(group3PCA[:,0],group3PCA[:,1],group3PCA[:,2],'o',c=set2[2],alpha=1,label='Group3')ax.set_title("PCA with grouping according to kmeans 3 group solution")plt.legend(loc='upper left')ax.set_xlabel('PC1 (14.45%)')ax.set_ylabel('PC2 (13.29%)')ax.set_zlabel('PC3 (12.46%)')plt.show()

Since the data is randomly generated, there is not much structured variation in it, as three of the first PC axes describe 14.45%, 13.29% and 12.46% of total sample variability, respectively. On the other hand, kmeans forces the data into three groups, which is useful for demonstration of plotting in matplotlib with 3D axes, and a groupping structure in PCA scatterplots (Figure 1), similar to real-world data.

Rohlf, J.F. 1999. Shape statistics: Procrustes superimpositions and tangent spaces. Journal of Classification16: 197-223.↩