recorded future – a white paper on temporal analytics

Recorded Future
A White Paper on Temporal Analytics

Staffan Truvé, Ph.D.
Chief Scientist & Co-Founder, Recorded Future
[email protected]

Thy letters have transported me beyond
This ignorant present, and I feel now
The future in the instant.
(Macbeth, Act 1 Scene 5)

Introduction

Recorded Future is bringing a new category of analytics tools to market. Unlike traditional search engines, which focus on text retrieval and leave the analysis to the user, we strive to provide tools which assist in identifying and understanding historical developments, and which can also help formulate hypotheses about and give clues to likely future events. We have decided on the term “temporal analytics” to describe the time oriented analysis tasks supported by our systems.

This white paper describes the underlying philosophy and overall system architecture of Recorded Future and its products.

Search vs. Analytics

Although the focus of Recorded Future is on temporal analytics, a comparison with traditional search engines is inevitable since search is one important aspect of analytics.

The history of search goes back to at least 1945, when Vannevar Bush published his seminal article “As We May Think”, where among other things he pointed out that:

The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.

In the decades to follow, a lot of work was done on information management and text retrieval/search. With the emergence of the World Wide Web, both the need and the ability for almost everyone to use a search engine became obvious. An explosion of search engines followed, with names such as Excite, Lycos, Infoseek, and AltaVista. All these first generation search engines really focused


7/27/2019 Recorded Future – A White Paper on Temporal Analytics

http://slidepdf.com/reader/full/recorded-future-a-white-paper-on-temporal-analytics 1/19


on traditional text search, using various algorithms but really looking at individual documents in isolation.

Google changed that, with its public debut in 1998. Google’s second generation search engine is based on ideas from an experimental search engine called BackRub. At its heart is the PageRank algorithm, and this is the core of Google’s success (together with clever advertising based revenue models!). The main idea of the PageRank algorithm is to analyze links between web pages, and to rank a page based on the number of links pointing to it, and (recursively) the rank of the pages pointing to it. This use of explicit link analysis has proven to be tremendously useful and surprisingly robust (even though Google continuously have to tweak their algorithms to combat attempts to manipulate the ranking algorithm).
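The recursive ranking idea described above can be sketched in a few lines of Python. This is a minimal textbook-style power iteration, not Google's production algorithm; the function name `pagerank`, the damping value 0.85, and the fixed iteration count are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by the links pointing to them.

    links maps each page to the list of pages it links to.
    The damping factor and iteration count are illustrative defaults.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Pages "a" and "b" both link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this small graph "c" receives the most links and ends up with the highest rank, while "a" outranks "b" because it is linked from the highly ranked "c".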

Recorded Future is developing a third generation analytics engine, which goes beyond explicit link analysis and adds implicit link analysis, by looking at the “invisible links” between documents that talk about the same, or related, entities and events. We do this by separating the documents and their content from what they talk about – the “canonical” entities and events (yes, this model is heavily inspired by Plato and his distinction between the real world and the world of ideas).

Documents contain references to these canonical entities and events, and we use these references to rank canonical entities and events based on the number of references to them, the credibility of the documents (or document sources) containing these references, and several other factors (for example, co-occurrence of different events and entities in the same or in related documents is also used for ranking). This ranking measure – called momentum – is our aggregate judgment of how interesting or important an entity or event is at a certain point in time – note that over time, the momentum measure of course changes, reflecting a dynamic world.
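To make the ranking idea concrete, here is a minimal sketch of a momentum-like score. It is an illustration only and not Recorded Future's actual momentum formula: the `Reference` type, the exponential half-life decay, and the default credibility of 0.5 for unknown sources are all assumptions. It covers only two of the factors named above (number of references and source credibility) plus time dependence, and leaves out co-occurrence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Reference:
    """One mention of a canonical entity or event in a document (hypothetical type)."""
    entity: str
    source: str
    published: date


def momentum(references, credibility, as_of, half_life_days=7.0):
    """Sum credibility-weighted references, with older references counting less."""
    scores = {}
    for ref in references:
        age_days = (as_of - ref.published).days
        if age_days < 0:
            continue  # skip references published after the point in time of interest
        # Weight halves every half_life_days; unknown sources get a neutral 0.5.
        weight = credibility.get(ref.source, 0.5) * 0.5 ** (age_days / half_life_days)
        scores[ref.entity] = scores.get(ref.entity, 0.0) + weight
    return scores
```

Because the score depends on `as_of`, evaluating it at different dates gives different values for the same entity, which mirrors the point that momentum changes over time.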

In addition to extracting event and entity references, Recorded Future also analyzes the “time and space dimension” of documents – references to when and where an event has taken place, or even when and where it will take place – since many documents actually refer to events expected to take place in the future. We are also adding more components, e.g. sentiment analyses, which determine what attitude an author has towards his/her topic, and how strong that attitude is – the affective state of the author.

The semantic text analyses needed to extract entities, events, time, location, sentiment etc. can be seen as an example of a larger trend towards creating “the semantic web”.

The time and space analysis described above is the first way in which Recorded Future can make predictions about the future – by aggregating weighted opinions about the likely timing of future events using algorithmic crowd sourcing. In addition to this, we can use statistical models to predict future happenings based on historical records of chains of events of similar kinds.
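As a rough sketch of what aggregating weighted opinions about timing could look like, the following Python function adds up the weight behind each predicted date and reports the date with the largest share. The function name `most_likely_date` and the choice of a simple weighted vote are assumptions for illustration; the aggregation actually used by Recorded Future is not described in this paper.

```python
from collections import defaultdict


def most_likely_date(opinions):
    """Aggregate (predicted_date, weight) pairs into a weighted vote.

    Returns the date carrying the largest share of the total weight,
    together with the share of every date, or (None, {}) without usable input.
    """
    totals = defaultdict(float)
    for predicted_date, weight in opinions:
        totals[predicted_date] += weight
    total_weight = sum(totals.values())
    if total_weight <= 0:
        return None, {}
    shares = {d: w / total_weight for d, w in totals.items()}
    best = max(shares, key=shares.get)
    return best, shares
```

Each weight could, for example, be the credibility of the source making the statement, so that a date mentioned by several credible sources wins over a date mentioned once by a less credible one.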


The combination of automatic event/entity/time/location extraction, implicit link analysis for novel ranking algorithms, and statistical prediction models forms the basis for Recorded Future’s temporal analytics engine. Our mission is not to help our customers find documents, but to enable them to understand what is happening in the world.

Recorded Future and Business Intelligence

There has been a long path of innovation in systems for business intelligence – trying to help decision makers in companies and organizations make better, data driven, decisions. We’d like to think of these in three generations as well:

•  First generation business intelligence tools (BI) were all about reporting and OLAP cubes, typically taking historical financial, sales, and manufacturing information and organizing it for analysis. Very helpful – but very focused on providing a rear mirror view of the world.

•  Second generation business intelligence was all about real time – hooking into real time data sources as well as real time user interaction – allowing decision makers to both look at very timely data as well as adjust and interact with such views at high pace.

•  Third generation business intelligence, we would like to believe, will be all about looking outside corporations and generating data and analytics for decision making based on the world, not just old historical enterprise data. This is Recorded Future.


Recorded Future at Work

To illustrate these ideas, we’ll present a simple example. Assume we have a set of different sources from the net, as illustrated in this picture:

From these sources, we harvest documents, either from RSS feeds or other forms of web harvesting. An example data set might contain the following documents with short text snippets in them:

Our analysis first detects entities mentioned in the document, and decides which entity category they belong to (in this example, blue for Companies, orange for Persons, and green for Cities):


Next, events involving these entities are detected; in this example five different kinds of events:


These are the canonical events; we now add event references/instances derived from the different documents (and the same for entity instances, but for the sake of graphical clarity these are not included in these pictures):

Once this analysis is completed, we can actually dispose¹ of the original texts, since we have completed the transition from the text to the data domain:

¹ We do keep references to the original documents, but we do not store any copy of the actual text.


System Architecture

The Recorded Future system contains many components, which are summarized in the following diagram:

The system is centered round the database, which contains information about all canonical events and entities, together with information about event and entity references (sometimes also called instances), documents containing these references, and the sources from which these documents were obtained.

There are five major blocks of system components working with this database:

-  Harvesting – in which text documents are retrieved from various sources on the net and stored in the database (temporarily for analysis, longer term only if permitted by terms of use and IPR legislation).

-  Linguistic analysis – in which the retrieved texts are analyzed to detect event and entity instances, time and location, text sentiment etc. This is the step that takes us from the text domain to the data domain. This is also the only language dependent component of the system; as we are adding support for multiple languages new modules are introduced here. We are using industry leading linguistics platforms for some of the underlying analyses, and combine them with our own analysis tools when necessary.

-  Refinement – in which data is analyzed to obtain more information; this includes calculating the momentum of entities, events, documents and even sources (see next section), calculation of sentiment, synonym detection, and ontology analysis.

-  Data analysis – in which different statistical and AI based models are applied to the data to detect anomalies in the data and to generate predictions about the future, based either on actual statements in the texts or other models for generalizing trends or hypothesizing from previous examples.

-  User experience – the different user interfaces to the system, including the web interface, overview dashboard, alert mechanisms, and the API for interfacing to other systems.

Momentum

To find relevant information in the sea of data produced by our system, we need some relevance measure. To this end, we have developed “momentum” – a relevance measure for events and entities which takes into account the flow of information about an entity/event, the credibility of the sources from which that information is obtained, the co-occurrence with other events and entities, and so on. Momentum is for example used to present results in most relevant order, and can also be utilized to find similarities between different events and entities.

User Experience

End users interact with Recorded Future through a series of rich user experiences. The analytics query interface allows users to specify events (such as “Person Travel”), entities (such as “Hu Jintao”) and time intervals (such as “2009” or “Anytime in the Future”):


The results can then be analyzed in several different views (details, charts, timelines):


Videos showing the use of the system are available at: http://www.youtube.com/recordedfuture

Finally, end users can easily subscribe to email alerts (called Futures) corresponding to interesting queries. Live visualizations with up-to-date data from Recorded Future can also be embedded in blogs, etc.

Futures

Futures are a way of storing analytic questions and having Recorded Future monitor them with respect to the continuous flow of data from the world. Any query in Recorded Future can be turned into a Future at the click of a green button:


When a Future is defined, the frequency of updates can be specified (and of course changed later), and the Future can also be shared with others:

Futures are then delivered as they are detected by Recorded Future, in a rich email format which works well on both large and small screen devices:


API

Developers can access Recorded Future data and analytics through a web services API (documentation available at http://code.google.com/p/recordedfuture). Queries to the system are expressed using JSON (http://json.org/) and results are provided as JSON or CSV text. The API can be used to interface Recorded Future with statistics software such as R (http://www.r-project.org/) or visualization software such as Spotfire (http://spotfire.tibco.com/), as well as proprietary analytics applications.
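The general shape of such an exchange (a JSON query going in, CSV text coming back) can be sketched with the Python standard library alone. The field names `event`, `entity`, and `time` below are hypothetical and are not taken from the API documentation, and no request is actually sent; the sketch only shows how a client might serialize a query and read a CSV result.

```python
import csv
import io
import json


def build_query(event_type, entity, time_interval):
    """Serialize a query as JSON text.

    The field names used here are hypothetical placeholders, not the
    documented query schema of the Recorded Future API.
    """
    query = {
        "event": {"type": event_type},
        "entity": {"name": entity},
        "time": time_interval,
    }
    return json.dumps(query, sort_keys=True)


def parse_csv_result(text):
    """Turn CSV result text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Query values borrowed from the query interface example earlier in this paper.
query_text = build_query("PersonTravel", "Hu Jintao", "2009")
```

Rows parsed this way are plain dictionaries of strings, which makes them easy to hand over to statistics or visualization tools.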

Examples of applications of the Recorded Future API include:

•  Algorithmic trading – using the Recorded Future data stream to enhance automated trading/risk decision making, e.g. by monitoring momentum and sentiment development of companies in a portfolio.

•  Media monitoring – building new applications that monitor social as well as traditional media coverage of a company, industry sector, organization, or country.

•  Dashboards – using the Recorded Future data stream to display novel, externally oriented, indicators of the world, like the following very simple example:

•  Geographical information accessed through the API can easily be used to present results in 3rd party applications such as Google Maps and Google Earth:


A Final Word

Recorded Future brings a paradigm shift to analytics, by focusing on time as an essential aspect of the analyst’s work. Sophisticated linguistic and statistical analyses combined with innovative user interfaces and a powerful API bring new opportunities to both human analysts and developers of 3rd party analytics systems. We continuously develop all these aspects of our system to bring new tools into the analysts’ hands – the future has only just begun!

"Thus, what enables the wise sovereign and the good general to strike and conquer, and achieve things beyond the reach of ordinary men, is foreknowledge."

(from The Art of War by Sun Tzu, Section 13)


WHITE PAPER ADDENDUM

Plato, the Cave, and Recorded Future

Staffan Truvé, Ph.D.

To understand the philosophy behind Recorded Future, it is helpful to consider the famous “cave allegory” by Plato:

Plato imagines a group of people who have lived chained in a cave all of their lives, facing a blank wall. The people watch shadows projected on the wall by things passing in front of a fire behind them, and begin to ascribe forms to these shadows. According to Plato, the shadows are as close as the prisoners get to seeing reality. He then explains how the philosopher is like a prisoner who is freed from the cave and comes to understand that the shadows on the wall are not constitutive of reality at all, as he can perceive the true form of reality rather than the mere shadows seen by the prisoners. (en.wikipedia.org/wiki/Allegory_of_the_Cave)

(image from www.thatmarcusfamily.org/philosophy/Course_Websites/Phil_Math/Photos/Cave.jpg)

What we read in newspapers, blogs etc. is not unlike the shadows on the wall of the cave – we get reports about events in the real world, and attempt to use that information to get an idea about what is really happening. As good analysts, we naturally consult several sources, and weigh together the information obtained from them – always keeping in mind that some sources are more credible than others, and thus should be given higher weight. We call the evidence we get from different reports “event instances”, and the real world events they report on we refer to as “canonical events”.

A canonical event, in our system, is a representation of a particular happening in the real world. For example, assume we read the following statement in the New York Times:

“Barack Obama said yesterday that Hillary Clinton will be travelling to Haiti next week”

This statement describes two events: a canonical “Quotation” event and a canonical “Person Travel” event.

The quotation event refers to a canonical entity, “Barack Obama”, and a statement “Hillary Clinton will be travelling to Haiti next week”. It has an associated time, “yesterday”.

The “Person Travel” event includes references to two canonical entities, “Hillary Clinton” and “Haiti”, and has an associated time “next week”.

Note that “yesterday” and “next week” are relative times, and to place them on an absolute time axis we need to know when the entire statement was uttered. Let us assume that the statement was uttered on Wednesday, March 17th. Then we might represent the statement pictorially in the following way²:

² Note that “next week” is culturally dependent – in the US, weeks begin on Sundays whereas in many other countries they begin on Mondays!
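The step from relative to absolute times can be sketched with Python's standard `datetime` module. The helper names `resolve_yesterday` and `resolve_next_week` are made up for this illustration, and the year 2010 is an assumption: the text only says “Wednesday, March 17th”, and March 17th fell on a Wednesday in 2010.

```python
from datetime import date, timedelta


def resolve_yesterday(uttered):
    """The day before the statement was uttered."""
    return uttered - timedelta(days=1)


def resolve_next_week(uttered, week_starts_on=0):
    """First and last day of the week after the one containing `uttered`.

    week_starts_on follows date.weekday() numbering: 0 = Monday, 6 = Sunday.
    """
    days_into_week = (uttered.weekday() - week_starts_on) % 7
    start_of_this_week = uttered - timedelta(days=days_into_week)
    start = start_of_this_week + timedelta(days=7)
    return start, start + timedelta(days=6)


uttered = date(2010, 3, 17)  # assumed year; a Wednesday
quotation_time = resolve_yesterday(uttered)                      # March 16th
travel_monday_weeks = resolve_next_week(uttered)                  # March 22nd to 28th
travel_sunday_weeks = resolve_next_week(uttered, week_starts_on=6)  # March 21st to 27th
```

The two calls to `resolve_next_week` show the cultural dependence mentioned in the footnote: the same phrase maps to intervals that are shifted by one day depending on which weekday starts the week.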


In our system, this statement will be represented in the following way:

We have three canonical entities: Barack Obama and Hillary Clinton, which are Person entities [blue rectangles], and Haiti, a Location entity [green rectangle].

There are two canonical events [red ovals] – “Quotation by Barack Obama” and “Person Travel of Hillary Clinton to Haiti”.

Furthermore, there are instances of these events [pink ovals], which are tagged by the time or time interval during which they are expected to have occurred or will occur.

The Quotation instance also has a reference to the text of the quote and to the instance of the event referenced in the quote.

Finally, both instances refer to the text fragment representing the original statement, and the fragment refers to its source – the New York Times.
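The representation just described can be written down as a small data model. The sketch below uses Python dataclasses; the class and field names are invented for illustration and do not reflect the actual Recorded Future database schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entity:
    """A canonical entity, e.g. a Person or a Location."""
    name: str
    kind: str


@dataclass(frozen=True)
class CanonicalEvent:
    """A happening in the real world, tied to the canonical entities involved."""
    kind: str
    entities: Tuple[Entity, ...]


@dataclass(frozen=True)
class TextFragment:
    """The original statement and the source it was found in."""
    text: str
    source: str


@dataclass(frozen=True)
class EventInstance:
    """One report of a canonical event, tagged with the time given in the text."""
    event: CanonicalEvent
    time: str
    fragment: TextFragment
    quote_text: Optional[str] = None
    quoted_instance: Optional["EventInstance"] = None


obama = Entity("Barack Obama", "Person")
clinton = Entity("Hillary Clinton", "Person")
haiti = Entity("Haiti", "Location")

quotation = CanonicalEvent("Quotation", (obama,))
travel = CanonicalEvent("PersonTravel", (clinton, haiti))

fragment = TextFragment(
    "Barack Obama said yesterday that Hillary Clinton will be "
    "travelling to Haiti next week",
    "New York Times",
)

travel_instance = EventInstance(travel, "next week", fragment)
quotation_instance = EventInstance(
    quotation,
    "yesterday",
    fragment,
    quote_text="Hillary Clinton will be travelling to Haiti next week",
    quoted_instance=travel_instance,
)
```

Keeping canonical events separate from their instances means that a further report of the same trip, from another source, would add one more `EventInstance` pointing to the same `CanonicalEvent`.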


Multiple text documents, retrieved from different sources, can of course be used to gather evidence of the same canonical event, i.e., to provide different instances of the canonical event. Several different canonical events – and instances – will also refer to the same entities. To extend our example, let’s add the statement:

“Hillary Clinton to meet with Ban Ki-Moon in Port au Prince on March 23rd”

The representation of our “world knowledge” will then be updated to:

Is this all we know? Not really! Recorded Future also maintains an ontology³, with additional information about canonical entities and their relationships. In this particular example, the following information can be found in our database:

³ Ontology is the philosophical study of the nature of being, existence or reality in general, as well as the basic categories of being and their relations. Traditionally listed as a part of the major branch of philosophy known as metaphysics, ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences. (http://en.wikipedia.org/wiki/Ontology)


Combining the information derived from analyzed text and the ontology gives us the following picture for this minimal example. In the real Recorded Future database, there are millions of event instances. This should give you an idea about how the richness of Recorded Future data can help you in analyzing events in the real world!

Additional reading on our blogs:

Company Updates: http://blog.recordedfuture.com

Government & Intelligence examples: http://www.AnalysisIntelligence.com

Finance & Statistics examples: http://www.PredictiveSignals.com