
David Louis Hollembaek

Copyright David Louis Hollembaek 2019


From Crawl to Consume

Objectives = AppInfo, Gathering, Transformation

For Objective in Objectives:
    Lessons Learned
    To-Do's

I learned a lot going through these experiments, and the goal here is to share those learnings with you.

Roughly half of the talk is slides and discussion. The other half is a demonstration: a walkthrough of the configuration, then running the various components to show that all of this works.


The App

Azure Search: Name, Geocode, whatever else…

Xamarin, Facebook Authentication

I made a simple app with a friend that helps me find places to buy cigars while I travel, which is often. Sometimes I have only an hour or two of free time on these trips, and I don't want to waste it.

This is the third time I have used this app as the basis of a talk. I talked about geo search and usage analytics at previous community events, when there was less data in the app. In this round I am focused on automating data processing.


LocationURL LocationName Cigars Inside Drinks View wifi outdoor heat Lockers
http://www.gekkos-bar.com/ Gekkos X X X X
https://www.steigenberger.com/en/hotels/all-hotels/germany/hamburg/steigenberger-hotel-hamburg Steigenberger Hotel Hamburg x x x x
https://www.kempinski.com/en/munich/hotel-vier-jahreszeiten/ Kempinkski X X X X
http://www.lestroisrois.com/en/restaurants/salon-du-cigare Salon du Cigare X X X X X X
https://www.steigenberger.com/en/hotels/all-hotels/germany/hamburg/steigenberger-hotel-hamburg Pusser’s Bar X X
http://www.altetabakstube.de/ Alte Tabakstube X X
https://www.zechbauer.de/en/en/ Max Zechbauer Tabakwaren GmbH & Co. KG X X
http://www.themayfairhotel.co.uk/en/bars-and-restaurants/may-fair-terrace.html May Fair Terrace X X X X X
https://www.corinthia.com/en/hotels/london/dining/the-outdoors/garden-lounge Garden Lounge X X X X X
http://www.casadelhabano-berlin.de/ La Casa del Habano Berlin X X X X X
https://www.bellevue-palace.ch/ Bellevue Palace Hotel X X X X X X X
https://www.capellahotels.com/dussel

For the prototype, I just made a list of places that I have found while traveling. We used the Bing Maps Locations API to add geocodes for those locations. Easy!
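As a rough illustration of the geocoding step, here is a minimal Python sketch. It assumes the find-by-query form of the Bing Maps REST Locations API and the usual `resourceSets` / `resources` / `point.coordinates` response layout; the function names and the `api_key` parameter are hypothetical, and the code actually used for the app may look quite different. Check the endpoint and response shape against the current documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumption: endpoint of the Bing Maps REST Locations API (find by query).
BING_LOCATIONS_URL = "https://dev.virtualearth.net/REST/v1/Locations"


def build_geocode_url(query: str, api_key: str) -> str:
    """Build a request URL for a free-form place description."""
    params = urllib.parse.urlencode({"query": query, "key": api_key})
    return f"{BING_LOCATIONS_URL}?{params}"


def extract_geocode(response: dict) -> Optional[str]:
    """Return 'lat,lon' for the first match, or None if nothing was found."""
    for resource_set in response.get("resourceSets", []):
        for resource in resource_set.get("resources", []):
            lat, lon = resource["point"]["coordinates"]
            return f"{lat},{lon}"
    return None


def geocode(query: str, api_key: str) -> Optional[str]:
    """Look up one location over the network (needs a valid key)."""
    url = build_geocode_url(query, api_key)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_geocode(json.load(resp))
```

The `lat,lon` string matches the way the Geo column is written in the location data, so the result could be stored in that column directly.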


Name,Address,City, Zip,State,Country,Phone,Geo,Tags,website,Created,SourceText,EnglishText,Entities,KeyPhrases
Salon du Cigare,,Basel,,,Switzerland,,,"smoking lounge, luxury, view",https://www.lestroisrois.com/en/restaurants/salon
Kempinski Cigar Lounge by Zechbauer,,Muenchen,,,Germany,,,"smoking lounge, luxury",https://www.kempinski.com/en/munich/
Herzogs Zigarrenlager am Hafen,Stralauer Allee 9,Berlin,10245,,Germany,030-29047015,"52.4990784,13.4575716",,http://www.zigarrenherzog
Davidoff Regent,"1F No.2-2. Ln.39, Sec.2, Chung Shan N. Rd.",Taipei,104,,Taiwan,+886 2 2523 5715,"25.0400201,121.5129813",
Davidoff of Geneva,"2, rue de Rive",Genève,1204,,Switzerland,+41 22 310 90 41,"46.2023093,6.1500262",davidoff,,Robot,,,,
Davidoff of Geneva,1390 6th Ave.,New York,10019,NY,USA,+1 212-757-3167,"40.7640181,-73.9774084",davidoff,,Robot,,,,
Brobergs Deposit. ,P.O.Box 111 10,Göteborg ,40423,,Sweden,+46 31151260,"57.7064995,11.9695908",davidoff,,Robot,,,,
Brobergs Deposit.,Sturegallerian 39,Stockholm ,21143,,Sweden,,"59.3361436,18.0759595",davidoff,,Robot,,,,
Davidoff of Geneva Ginza,8-5-6 Ginza Chuo-ku,Tokyo,,,Japan,+813 5537 5585,"35.6694951,139.7602963",davidoff,,Robot,,,,
Davidoff Shanghai,No. 381 Middle Huai Hai Road,Shanghai,200020,,China,+86 21 54656548,"31.222197,121.472953",davidoff
Davidoff Lounge at Bahnhofsplatz Zürich,Bahnhofsplatz 6,Zürich,8001,,Switzerland,+41 442116323,"47.3767475,8.5401224",smoking
Marriott,Calea 13 Sptembrie nr 90,Bucharest,,,Romania,,"44.4256866,26.0768286",,,Robot,,,,
Allegro Papagayo Britt Shop;Hotel Occidental Allegro Papagayo;Guanacaste;;;Costa Rica;506-2277-1614;"9.748917, -83.753428";Buy
Reserva Conchal (Hotel Westin) Britt Shop;Hotel Westin Reserva Conchal;Guanacaste;;;Costa Rica;506-2277-1614;"10.4957971,
Casa del Habano Del Valle;"Av. División del Norte, No. 18, Del Valle Norte";Distrito Federal;3103;;Mexico;55-5282-11-84;"19.3427568,
Casa del Habano Mazaryk;"Presidente Mazaryk, No. 393, Loc. 28, Polanco";Distrito Federal;11560;;Mexico;55-5282-1046;"19.4340199,
Casa del Habano Loreto;"Altamirano No. 46, Loc. 119 entre Av. Revolución y RÃo Magdalena, Tizapán San Angel";Distrito
Habano Club;"Heróico Colegio Militar, 1110, Las Fuentes";Piedras Negras;26010;;Mexico;01( 878 ) 78 34304;"28.7028691,
Havana Express;"Shop No. 1, Lobby of the Westin No. 1 Martin Place, Sydney NSW 2000, Australia";New South Wales;2000;;
"Cigar Emporium, Wynn";"Shop No. 29, Wynn Macau, Rua Cidade De Sintra, NAPE, Macau";Nape;;;Macau SAR China;(853) 2875 7368;"22.188408, 113.546051";"Buy
"INTERCONTINENTAL DOHA ,BELGIAN CAFÉ";WEST BAY;DOHA;29922;;Qatar;97444849897;"25.3521436, 51.4939853";",smoking

What is wrong with this picture? I started to experiment first with web scraping to gather more locations, but of course this also creates a host of classical data cleansing problems:

1. The sites are all over the world, in different languages
2. The encoding is not consistent
3. Some sites publish geocodes, some do not
4. Etc.
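Two of the problems in the data are mechanical enough to sketch in code: records that mix `,` and `;` as delimiters, and text whose UTF-8 bytes were decoded as Latin-1 somewhere along the way. The following is a minimal Python sketch using only the standard library. The function names are hypothetical, the delimiter guess is a simple heuristic, and this is not the cleansing logic used in the demo.

```python
import csv
from typing import List


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Latin-1.

    If the round trip fails, the text was probably fine already,
    so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def parse_record(line: str) -> List[str]:
    """Parse one record, guessing between ',' and ';' as the delimiter.

    Heuristic: parse with both delimiters and keep the result that
    yields more fields (ties go to the comma).
    """
    candidates = [
        next(csv.reader([line], delimiter=delimiter), [])
        for delimiter in (",", ";")
    ]
    fields = max(candidates, key=len)
    return [repair_mojibake(field.strip()) for field in fields]
```

For example, `parse_record('"Cigar Emporium, Wynn";"Shop No. 29, Wynn Macau";Nape')` keeps the quoted commas inside their fields because the semicolon parse produces more columns. The repair step only helps when the damaged bytes are all still present; text that lost characters during extraction cannot be recovered this way.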


Diagram labels: Starting Data, Search, Transformation, Pipeline, Crawler, Data Factory, Cognitive Services, Databricks

There were patterns for using tools to accomplish these tasks, but not everything fit perfectly for this kind of application or the type of data that I already had. Like any seasoned technology professional, I took all the parts that I thought I needed off the shelf and started assembling! Something that started simple was starting to get complicated.


From another angle, you could even call it a tornado: a roughly structured, fast-moving collection of objects. I contributed to this complexity. My data does not meet the Volume or Velocity requirements of big data, but I still used an OLAP engine, Databricks/Spark, because it's a tool that I have used for other applications in this space.


Find a file, send a file

• Check for a file, if exists, run a process
• Send a file between stages or…
• Read the data from shared storage
• Easy to share storage, so Option 1

1 CSV in Blob Storage
ADF Checks and Runs
Databricks Reads and Writes
Azure Search Reads and Indexes

I started simple with a Data Factory pipeline that checks that a CSV with locations exists, then triggers a Databricks job that runs a Python notebook; finally, the results are indexed automatically by Azure Search. There is a template for this pipeline in Azure Data Factory.

From here, we look at the configuration in ADF and the notebook in Databricks, run both jobs, browse Blob Storage to see that the output file was created, and check Search to see that we now have lots of metadata for each entry.
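The "check for a file, if it exists, run a process" pattern can be sketched in plain Python. This is only an illustration of the control flow: local paths stand in for Blob Storage, the function call stands in for the ADF check and the Databricks job, and `run_stage` and `enrich` are hypothetical names rather than anything from the actual pipeline.

```python
import csv
from pathlib import Path
from typing import Callable, Dict


def run_stage(
    input_csv: Path,
    output_csv: Path,
    enrich: Callable[[Dict[str, str]], Dict[str, str]],
) -> bool:
    """Check for the input file; if it exists, process it and write the output.

    Returns False, and does nothing, when the input file is missing.
    """
    if not input_csv.exists():
        return False

    # Read every row and let the caller-supplied function add metadata.
    with input_csv.open(newline="", encoding="utf-8") as src:
        rows = [enrich(dict(row)) for row in csv.DictReader(src)]

    # Write the enriched rows where the next stage (the indexer) reads them.
    if rows:
        with output_csv.open("w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return True
```

Because each stage only reads from and writes to shared storage, the stages stay loosely coupled: the indexer never needs to know which job produced the output file.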


Some Learnings and Some To-Do's

• Databricks was fast, but the websites being retrieved were not
• OLAP is probably not the right place for retrieving site content
• Lots of connection errors
• CSV started as a simple format, but "simple" quickly created challenges
• Hard to prep messy CSV data
• Probably better to start with tables and proper ETL
• Everything is pulling
• This model works, but push would be better
• I have samples, but did not have time for classification
• Many scraped references are missing key info: websites, geo codes
• Do it all over with tables instead of a CSV
• Trigger feeding to Search

I hope the learnings are clear. In the next round of updates, I will migrate the data to a table where I can more easily use ADF features for things like the site retrieval, and I will limit the Databricks jobs to things like classification, which really benefit from OLAP capabilities.
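Wherever the site retrieval ends up running, slow sites and connection errors need some handling. The following is a minimal Python sketch of a fetch with a timeout and a few retries, using only the standard library. The function names, the retry count, and the backoff values are illustrative assumptions, not settings from the demo.

```python
import time
import urllib.request
from typing import Callable, Optional


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Download a page as text; network failures raise OSError subclasses."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def fetch_with_retries(
    url: str,
    fetch: Callable[[str], str] = fetch_url,
    attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> Optional[str]:
    """Try a few times with growing pauses; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # URLError, timeouts and connection resets all derive from OSError.
            if attempt < attempts - 1:
                time.sleep(backoff_seconds * 2 ** attempt)
    return None
```

Returning `None` instead of raising lets a batch job record the failure for that location and move on to the next site rather than stopping the whole run.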


https://github.com/hollembaek/AzureSaturday2019demo

The notebook code viewed in the demo is here. You can use it to add Cognitive Services to your own notebooks, or just to see how I handled the file-based input and output. The notebook retrieves site text, then uses Cognitive Services to:

1. Check the language and translate to English if the content is not already in English
2. Extract Key Phrases
3. Extract Named Entities
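The order of the three steps above can be sketched as follows. This is not the notebook's code: the four callables are hypothetical stand-ins for whatever Cognitive Services clients are used, and running the extraction on the English text rather than the source text is an assumption. The output keys reuse the SourceText, EnglishText, Entities, and KeyPhrases column names from the location data.

```python
from typing import Callable, Dict, List


def enrich_text(
    source_text: str,
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str], str],
    extract_key_phrases: Callable[[str], List[str]],
    extract_entities: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Run the three enrichment steps on one site's text."""
    # Step 1: translate only when the content is not already in English.
    language = detect_language(source_text)
    if language == "en":
        english_text = source_text
    else:
        english_text = translate_to_english(source_text)

    # Steps 2 and 3: extract from the English text (an assumption).
    return {
        "SourceText": source_text,
        "EnglishText": english_text,
        "KeyPhrases": extract_key_phrases(english_text),
        "Entities": extract_entities(english_text),
    }
```

Passing the service calls in as functions keeps the flow testable without keys or network access, and makes it easy to swap one service for another later.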

The readme also contains simple instructions for use and links to other examples and templates that informed the creation of this demo.
