
Scientific and Large Data Visualization
29 November 2017

High Dimensional Data – Part II

Massimiliano Corsini
Visual Computing Lab, ISTI-CNR, Italy

Overview

•  Graphs Extensions
•  Glyphs
   –  Chernoff Faces
   –  Multi-dimensional Icons
•  Parallel Coordinates
•  Star Plots
•  Dimensionality Reduction
   –  Principal Component Analysis (PCA)
   –  Locally Linear Embedding (LLE)
   –  IsoMap
   –  Sammon Mapping
   –  t-SNE

Dimensionality Reduction

•  N-dimensional data are projected to 2 or 3 dimensions for better visualization/understanding.
•  Widely used strategy.
•  In general, it is a mapping, not a geometric transformation.
•  Different mappings have different properties.

Principal Component Analysis (PCA)

•  A classic multi-dimensional reduction technique is Principal Component Analysis (PCA).
•  It is a linear, non-parametric technique.
•  The core idea is to find a basis formed by the directions that maximize the variance of the data.

PCA as a Change of Basis

•  The idea is to express the data in a new basis that best expresses our dataset.
•  The new basis is a linear combination of the original basis.

PCA as a Change of Basis

The change of basis can be written as $Y = PX$, where the rows of $P$ are the new basis vectors and the columns of $X$ are the original data points.

Signal-to-noise Ratio (SNR)

•  Given a signal with noise, its quality can be expressed as the ratio of the signal variance to the noise variance:

$$\mathrm{SNR} = \frac{\sigma^2_{\mathrm{signal}}}{\sigma^2_{\mathrm{noise}}}$$

Redundancy

Redundant variables convey no relevant information!

Figure from Jonathon Shlens, "A Tutorial on Principal Component Analysis", arXiv preprint arXiv:1404.1100, 2015.

Covariance Matrix

•  Square symmetric matrix.
•  The diagonal terms are the variances of the single variables.
•  The off-diagonal terms are the covariances between the different variables (see the sketch below).
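As a concrete illustration, here is a minimal numpy sketch (the toy data and variable names are made up for this example) that builds the covariance matrix of mean-centered data and reads off both kinds of terms:

```python
import numpy as np

# Toy data: n = 200 samples of m = 3 variables (rows = samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]   # make two variables redundant

Xc = X - X.mean(axis=0)            # mean-center each variable
C = (Xc.T @ Xc) / (len(Xc) - 1)    # m x m covariance matrix

print(np.diag(C))   # diagonal terms: variances of the single variables
print(C[0, 1])      # large off-diagonal term -> redundancy
assert np.allclose(C, np.cov(X, rowvar=False))   # same as numpy's np.cov
```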

Goals

•  How to select the best P?
   –  Minimize redundancy
   –  Maximize the variance
•  Goal: diagonalize the covariance matrix of Y
   –  High values of the diagonal terms mean that the dynamics of the single variables have been maximized.
   –  Low values of the off-diagonal terms mean that the redundancy between variables is minimized.

Solving PCA

•  Remember that $Y = PX$, hence the covariance of the transformed data can be written as:

$$C_Y = \frac{1}{n} Y Y^T = P \left( \frac{1}{n} X X^T \right) P^T = P C_X P^T$$

Solving PCA

•  Theorem: a symmetric matrix A can be diagonalized by a matrix formed by its eigenvectors as $A = E D E^T$.
•  The columns of E are the eigenvectors of A.

PCA Computation

•  Organize the data as an m x n matrix.
•  Subtract the corresponding mean from each row.
•  Calculate the eigenvalues and eigenvectors of $XX^T$.
•  Organize them to form the matrix P (a numpy sketch follows).
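A minimal numpy sketch of these steps, assuming the slides' convention of an m x n matrix with rows = dimensions and columns = observations (the function name and the toy data are illustrative):

```python
import numpy as np

def pca(X, k):
    """PCA of data organized as an m x n matrix (rows = dimensions,
    columns = observations). Returns the top-k principal directions
    and the data projected onto them."""
    Xc = X - X.mean(axis=1, keepdims=True)   # subtract each row's mean
    C = (Xc @ Xc.T) / (X.shape[1] - 1)       # m x m covariance matrix
    evals, evecs = np.linalg.eigh(C)         # symmetric matrix -> eigh
    order = np.argsort(evals)[::-1]          # sort by decreasing variance
    P = evecs[:, order[:k]].T                # rows = top-k eigenvectors
    return P, P @ Xc                         # directions, k x n projection

# Example: 6 noisy measurements of an underlying 1-D motion
# (in the spirit of the ball example below).
rng = np.random.default_rng(1)
t = rng.normal(size=300)
X = np.outer(rng.normal(size=6), t) + 0.05 * rng.normal(size=(6, 300))
P, Y = pca(X, k=1)   # one component captures almost all the variance
```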

PCA for Dimensionality Reduction

•  The idea is to keep only the first k principal components (k < m).
•  Project the data onto these directions and use such data instead of the original ones.
•  These data are the best approximation w.r.t. the sum of the squared differences.

PCA as the Projection that Minimizes the Reconstruction Error

•  If we use only the first k < m components we obtain the best reconstruction in terms of squared error.

Data point projected on the first k components:

$$\tilde{x}_i = \sum_{j=1}^{k} (x_i \cdot e_j)\, e_j$$

Data point projected on all the components:

$$x_i = \sum_{j=1}^{m} (x_i \cdot e_j)\, e_j$$

PCA as the Projection that Minimizes the Reconstruction Error

Example

Figure from Jonathon Shlens, "A Tutorial on Principal Component Analysis", arXiv preprint arXiv:1404.1100, 2015.

PCA – Example

Each measurement has 6 dimensions (!), but the ball moves along the X-axis only..

Limits of PCA

•  It is non-parametric → this is a strength, but it can also be a weakness.
•  It fails for non-Gaussian distributed data.
•  It can be extended to account for non-linear transformations → kernel PCA.

Limits of PCA

•  ICA guarantees statistical independence.

Classic MDS

•  Find the linear mapping which minimizes:

$$\phi(Y) = \sum_{i,j} \left( \|x_i - x_j\| - \|y_i - y_j\| \right)^2$$

where $\|x_i - x_j\|$ is the Euclidean distance in the high-dimensional space and $\|y_i - y_j\|$ is the Euclidean distance in the low-dimensional space.
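For concreteness, a small numpy sketch of the textbook classical-scaling recipe (double-center the squared distances, then eigendecompose); this is one standard way to solve the problem above, not necessarily the exact formulation used in the original slides:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) MDS from an n x n matrix of pairwise
    Euclidean distances D; returns n x k low-dimensional points."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered points
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:k]   # keep the k largest eigenvalues
    L = np.sqrt(np.maximum(evals[order], 0))
    return evecs[:, order] * L
```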

PCA and MDS

•  We want to minimize the cost above; this corresponds to maximizing:

$$\sum_{i,j} \|y_i - y_j\|^2$$

that is, the variance of the low-dimensional points (the same goal as PCA).

PCA and MDS

•  The size of the covariance matrix is proportional to the dimension of the data.
•  MDS scales with the number of data points instead of the dimensions of the data.
•  Both PCA and MDS better preserve large pairwise distances.

Locally Linear Embedding (LLE)

•  LLE attempts to discover nonlinear structure in high dimension by exploiting local linear approximations.

(Figure: samples on a nonlinear manifold M and the discovered mapping.)

Locally Linear Embedding (LLE)

•  INTUITION → assuming that there is sufficient data (a well-sampled manifold), we expect each data point and its neighbors to be well approximated by a local linear patch.
•  The patch is represented by a weighted sum of the local data points.

Compute Local Patch

•  Choose a set of data points close to a given one (ball-radius or K-nearest neighbours).
•  Solve for the reconstruction weights:

$$\varepsilon(W) = \sum_i \Big\| x_i - \sum_j W_{ij}\, x_j \Big\|^2 \quad \text{subject to} \quad \sum_j W_{ij} = 1$$

LLE Mapping

•  Find the low-dimensional points $y_i$ which minimize the embedding cost function:

$$\Phi(Y) = \sum_i \Big\| y_i - \sum_j W_{ij}\, y_j \Big\|^2$$

Note that the weights $W_{ij}$ are fixed in this case!

LLE Algorithm

1.  Compute the neighbors of each data point $x_i$.
2.  Compute the weights $W_{ij}$ that best reconstruct each $x_i$ from its neighbors.
3.  Compute the vectors $y_i$ that minimize the embedding cost function (a usage sketch follows).
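A minimal usage sketch with scikit-learn's implementation, which performs the three steps internally; the Swiss-roll data and parameter values are illustrative, not from the slides:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A classic nonlinear manifold: 3-D points sampled from a rolled-up sheet.
X, _ = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Steps 1-3 (neighbors, weights, embedding) happen inside fit_transform.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)   # 1500 x 2 embedding, ready for plotting
```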

LLE – Example

PCA fails to preserve the neighborhood structure of the nearby images.

LLE – Example

ISOMAP

•  The core idea is to preserve the geodesic distances between data points.
•  A geodesic is the shortest path between two points on a curved space.

ISOMAP

(Figure: Euclidean distance vs. geodesic distance; graph building and geodesic distance approximation; geodesic distance vs. approximated geodesic.)

ISOMAP

•  Construct the neighborhood graph
   –  Define a graph G over all data points by connecting points (i, j) if and only if point i is a K-nearest neighbor of point j.
•  Compute the shortest paths
   –  Using Floyd's algorithm.
•  Construct the d-dimensional embedding (a step-by-step sketch follows).
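A step-by-step sketch of the three stages using numpy/SciPy building blocks (scikit-learn's sklearn.manifold.Isomap packages the same pipeline); parameter values are illustrative:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=8, d=2):
    # Step 1: neighborhood graph, edge weights = Euclidean distances.
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # Step 2: approximate geodesics via all-pairs shortest paths
    # ('FW' selects the Floyd-Warshall algorithm mentioned above).
    D = shortest_path(G, method='FW', directed=False)
    # Step 3: d-dimensional embedding by classical MDS on the geodesics.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:d]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0))
```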


Autoencoders

•  Machine learning is becoming ubiquitous in Computer Science.
•  A special type of neural network is called an autoencoder.
•  An autoencoder can be used to perform dimensionality reduction.
•  First, let me say something about neural networks..

Autoencoder

Low-dimensional Representation

Multi-layer Autoencoder
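As an illustration of the idea, a minimal PyTorch sketch (not from the slides; layer sizes, data and training loop are illustrative) of a multi-layer autoencoder whose 2-D bottleneck provides the low-dimensional representation:

```python
import torch
import torch.nn as nn

# Encoder compresses 784-D inputs to a 2-D code; decoder reconstructs.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                        nn.Linear(128, 2))
decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                        nn.Linear(128, 784))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.rand(256, 784)   # stand-in for a batch of 28x28 images
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)  # reconstruct the input
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = encoder(X)   # 256 x 2 points usable for visualization
```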

Sammon Mapping

•  Adaptation of MDS that weights the contribution of each (i, j) pair:

$$E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left( d_{ij} - \|y_i - y_j\| \right)^2}{d_{ij}}$$

where $d_{ij}$ is the distance between points i and j in the high-dimensional space.

•  This retains the local structure of the data better than classical scaling (the preservation of large distances is no longer privileged); a gradient-descent sketch follows.
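A small numpy sketch that minimizes the weighted stress above by plain gradient descent (learning rate, iteration count and initialization are illustrative; production implementations typically use smarter optimizers):

```python
import numpy as np

def sammon(X, k=2, iters=500, lr=0.3, eps=1e-12):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # high-dim distances
    np.fill_diagonal(D, 1.0)                              # avoid divide-by-zero
    c = D[np.triu_indices(len(X), 1)].sum()               # normalization constant
    Y = np.random.default_rng(0).normal(scale=1e-2, size=(len(X), k))
    for _ in range(iters):
        d = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
        np.fill_diagonal(d, 1.0)
        # Gradient of the Sammon stress w.r.t. the low-dimensional points.
        W = (D - d) / (D * d + eps)
        grad = -2.0 / c * (W[:, :, None] * (Y[:, None] - Y[None, :])).sum(axis=1)
        Y -= lr * grad
    return Y
```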

t-SNE

•  Most techniques for dimensionality reduction are not able to retain both the local and the global structure of the data in a single map.
•  Simple tests on handwritten digits demonstrate this (Song et al. 2007).

L. Song, A. J. Smola, K. Borgwardt and A. Gretton, "Colored Maximum Variance Unfolding", in Advances in Neural Information Processing Systems, Vol. 21, 2007.

Stochastic Neighbor Embedding (SNE)

•  Similarities between high- and low-dimensional data points are modeled with conditional probabilities.
•  The conditional probability that the point $x_i$ would pick $x_j$ as its neighbor is:

$$p_{j|i} = \frac{\exp\left( -\|x_i - x_j\|^2 / 2\sigma_i^2 \right)}{\sum_{k \neq i} \exp\left( -\|x_i - x_k\|^2 / 2\sigma_i^2 \right)}$$

Stochastic Neighbor Embedding (SNE)

•  We are interested only in pairwise similarities, so $p_{i|i}$ is set to zero.
•  For the low-dimensional points an analogous conditional probability is used:

$$q_{j|i} = \frac{\exp\left( -\|y_i - y_j\|^2 \right)}{\sum_{k \neq i} \exp\left( -\|y_i - y_k\|^2 \right)}$$
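A direct numpy transcription of $p_{j|i}$ (here with a single fixed $\sigma$ for all points, whereas SNE/t-SNE tunes each $\sigma_i$ to match a user-chosen perplexity):

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """Row i holds p_{j|i} for all j; each row sums to 1."""
    D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)  # squared pairwise distances
    logits = -D2 / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)              # a point never picks itself
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)
```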

Kullback-Leibler Divergence

•  Coding theory: the expected number of extra bits required to code samples from the distribution P if the current code is optimized for the distribution Q.
•  Bayesian view: a measure of the information gained when one revises one's beliefs from the prior distribution Q to the posterior distribution P.
•  It is also called relative entropy.

Kullback-Leibler Divergence

•  Definition for discrete distributions:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

•  Definition for continuous distributions:

$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{+\infty} p(x) \log \frac{p(x)}{q(x)}\, dx$$
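A direct numpy transcription of the discrete definition, also showing the asymmetry noted below:

```python
import numpy as np

def kl(P, Q, eps=1e-12):
    """Discrete KL divergence D_KL(P || Q); eps guards against log(0)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])
print(kl(P, Q), kl(Q, P))   # two different values: KL(P||Q) != KL(Q||P)
```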

Stochastic Neighbor Embedding (SNE)

•  The goal is to minimize the mismatch between $p_{j|i}$ and $q_{j|i}$.
•  Using the Kullback-Leibler divergence this goal can be achieved by minimizing the cost function:

$$C = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

Note that $KL(P \,\|\, Q)$ is not symmetric!

Problems of SNE

•  The cost function is difficult to optimize.
•  SNE suffers, as other dimensionality reduction techniques do, from the crowding problem.

t-SNE

•  SNE is made symmetric:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

•  It employs a Student-t distribution instead of a Gaussian distribution to evaluate the similarity between points in low dimension:

$$q_{ij} = \frac{\left( 1 + \|y_i - y_j\|^2 \right)^{-1}}{\sum_{k \neq l} \left( 1 + \|y_k - y_l\|^2 \right)^{-1}}$$
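A minimal usage sketch with scikit-learn's t-SNE implementation; the digits dataset and parameter values are illustrative, not those of the paper's experiments:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)     # 1797 8x8 digit images (64-D)
Y = TSNE(n_components=2, perplexity=30.0,
         random_state=0).fit_transform(X)   # 1797 x 2 map for plotting
```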

t-SNE Advantages

•  The crowding problem is alleviated.
•  Optimization is made simpler.

Experiments

•  Comparison with LLE, Isomap and Sammon Mapping.
•  Datasets:
   –  MNIST dataset
   –  Olivetti face dataset
   –  COIL-20 dataset

Comparison figures are from the paper L.J.P. van der Maaten and G.E. Hinton, “Visualizing High-Dimensional Data Using t-SNE”, Journal of Machine Learning Research, Vol. 9, pp. 2579-2605, 2008.

MNIST Dataset

•  60,000 images of handwritten digits.
•  Image resolution: 28x28 (784 dimensions).
•  A subset of 6,000 randomly selected images has been used.

MNIST – t-SNE

MNIST – Sammon Mapping

MNIST – LLE

MNIST – Isomap

COIL-20 Dataset

•  Images of 20 objects viewed from 72 different viewpoints (1440 images).
•  Image size: 32x32 (1024 dimensions).

COIL-20 Dataset

(Figure: comparison of the t-SNE, LLE, Isomap and Sammon Mapping embeddings.)

Objects Arrangement

Motivations

•  Multidimensional reduction can be used to arrange objects in 2D or 3D preserving pairwise distances (but the final placement is arbitrary).
•  Many applications require placing the objects in a set of pre-defined, discrete positions (e.g. on a grid).

Example – Images of Flowers

Random Order

Example – Images of Flowers

Isomap

Example – Images of Flowers

IsoMatch (computed on colors)

Problem Statement

Let $d_{ij}$ be the original pairwise distance between objects i and j, let $g_{kl}$ be the Euclidean distance between positions k and l in the grid, and let $\pi$ be the permutation that assigns objects to grid positions. The goal is to find the permutation $\pi$ that minimizes the following energy:

$$E(\pi) = \sum_{i,j} \left( d_{ij} - g_{\pi(i)\pi(j)} \right)^2$$

IsoMatch – Algorithm

•  Step I: Dimensionality Reduction (using Isomap)
•  Step II: Coarse Alignment (bounding box)
•  Step III: Bipartite Matching
•  Step IV (optional): Random Refinement (element swaps)

Algorithm – Step I: Dimensionality Reduction

Algorithm – Step II: Coarse Alignment

Bipartite Matching

•  A complete bipartite graph is built (one side with the starting locations, one with the target locations).
•  The arc (i, j) is weighted according to the corresponding pairwise distance.
•  A minimal bipartite matching is calculated using the Hungarian algorithm (see the sketch below).
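A sketch of this assignment step using SciPy's Hungarian-algorithm solver; the names `embedded` and `grid` and the toy sizes are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
embedded = rng.random((16, 2))                      # 2-D points after Step II
gx, gy = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)   # 4x4 target locations

cost = cdist(embedded, grid)                # arc (i, j) weights = distances
rows, cols = linear_sum_assignment(cost)    # minimal bipartite matching
# cols[i] is the grid slot assigned to object i.
```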

Algorithm – Step III: Bipartite Matching (graph built)

Algorithm – Step III: Bipartite Matching

Algorithm – Step III: Final Assignment

Average Colors

Word Similarity

PileBars

•  A new type of thumbnail bar.
•  Paradigm: focus + context.
•  Objects are arranged in a small space (images are subdivided into clusters to save space).
•  Supports any image-image distance.
•  PileBars are dynamic!

PileBars – Layouts

(Figure: slot layouts; slots hold 1, 2, 3, 4 or 12 images each.)

PileBars

•  Thumbnails are dynamically rearranged, resized and re-clustered adaptively during browsing.
•  This is done in a way that ensures smooth transitions.

PileBars – Application Example: Navigation of Registered Photographs

Take a look at http://vcg.isti.cnr.it/photocloud .

Questions?
