TRANSCRIPT
David Louis Hollembaek
Copyright David Louis Hollembaek 2019
From Crawl to Consume
Objectives = AppInfo, Gathering, Transformation
For Objective in Objectives
Lessons Learned
To-Do's
I learned a lot going through these experiments, and the goal here is to share those learnings with you.
Roughly half of the talk is slides and discussion. The other half is a demonstration: a walk-through of configuring and running the various components to show that all of this works.
The App
Azure Search:
Name, Geocode,
whatever else…
Xamarin,
Authentication
I made a simple app with a friend that helps me find places to buy cigars while I travel, which is often. Sometimes I have only an hour or two of free time on these trips, and I don't want to waste it.
This is the third time I have used this app as the basis of a talk. I talked about geo search and usage analytics at previous community events, when there was less data in the app. In this round I am focused on automating data processing.
LocationURL LocationName Cigars Inside Drinks View wifi outdoor heat Lockers
http://www.gekkos-bar.com/ Gekkos X X X X
https://www.steigenberger.com/en/hotels/all-hotels/germany/hamburg/steigenberger-hotel-hamburg Steigenberger Hotel Hamburg x x x x
https://www.kempinski.com/en/munich/hotel-vier-jahreszeiten/ Kempinkski X X X X
http://www.lestroisrois.com/en/restaurants/salon-du-cigare Salon du Cigare X X X X X X
https://www.steigenberger.com/en/hotels/all-hotels/germany/hamburg/steigenberger-hotel-hamburg Pusser’s Bar X X
http://www.altetabakstube.de/ Alte Tabakstube X X
https://www.zechbauer.de/en/en/ Max Zechbauer Tabakwaren GmbH & Co. KG X X
http://www.themayfairhotel.co.uk/en/bars-and-restaurants/may-fair-terrace.html May Fair Terrace X X X X X
https://www.corinthia.com/en/hotels/london/dining/the-outdoors/garden-lounge Garden Lounge X X X X X
http://www.casadelhabano-berlin.de/ La Casa del Habano Berlin X X X X X
https://www.bellevue-palace.ch/ Bellevue Palace Hotel X X X X X X X
https://www.capellahotels.com/dussel
For the prototype, I just made a list of places that I have found while traveling. We used the Bing Locations API to add geocodes for those locations. Easy!
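The geocoding step can be sketched roughly as follows. This is a minimal sketch, not the code used for the app: it assumes the "Bing Locations API" mentioned above is the Bing Maps REST Locations service, the function names are hypothetical, the key is a placeholder, and the sample response is hand-written in the documented shape rather than real output.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: Bing Maps REST Locations API (find a location by query).
LOCATIONS_ENDPOINT = "https://dev.virtualearth.net/REST/v1/Locations"


def build_geocode_url(query, api_key):
    """Build the request URL for a free-text location query (hypothetical helper)."""
    return LOCATIONS_ENDPOINT + "?" + urlencode({"q": query, "key": api_key})


def extract_geocode(response_text):
    """Return (latitude, longitude) from a Locations response, or None if absent."""
    payload = json.loads(response_text)
    for resource_set in payload.get("resourceSets", []):
        for resource in resource_set.get("resources", []):
            coordinates = resource.get("point", {}).get("coordinates")
            if coordinates and len(coordinates) == 2:
                return float(coordinates[0]), float(coordinates[1])
    return None


# Hand-written sample in the documented response shape (illustrative values only).
sample_response = json.dumps({
    "resourceSets": [
        {"resources": [{"name": "Basel", "point": {"type": "Point", "coordinates": [47.56, 7.59]}}]}
    ]
})

url = build_geocode_url("Salon du Cigare, Basel, Switzerland", "YOUR_BING_MAPS_KEY")
geo = extract_geocode(sample_response)
```

Keeping the URL builder and the response parser separate from the HTTP call makes it possible to check both without a key or a network connection.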
Name,Address,City, Zip,State,Country,Phone,Geo,Tags,website,Created,SourceText,EnglishText,Entities,KeyPhrases
Salon du Cigare,,Basel,,,Switzerland,,,"smoking lounge, luxury, view",https://www.lestroisrois.com/en/restaurants/salon
Kempinski Cigar Lounge by Zechbauer,,Muenchen,,,Germany,,,"smoking lounge, luxury",https://www.kempinski.com/en/munich/
Herzogs Zigarrenlager am Hafen,Stralauer Allee 9,Berlin,10245,,Germany,030-29047015,"52.4990784,13.4575716",,http://www.zigarrenherzog
Davidoff Regent,"1F No.2-2. Ln.39, Sec.2, Chung Shan N. Rd.",Taipei,104,,Taiwan,+886 2 2523 5715,"25.0400201,121.5129813",
Davidoff of Geneva,"2, rue de Rive",Genève,1204,,Switzerland,+41 22 310 90 41,"46.2023093,6.1500262",davidoff,,Robot,,,,
Davidoff of Geneva,1390 6th Ave.,New York,10019,NY,USA,+1 212-757-3167,"40.7640181,-73.9774084",davidoff,,Robot,,,,
Brobergs Deposit. ,P.O.Box 111 10,Göteborg ,40423,,Sweden,+46 31151260,"57.7064995,11.9695908",davidoff,,Robot,,,,
Brobergs Deposit.,Sturegallerian 39,Stockholm ,21143,,Sweden,,"59.3361436,18.0759595",davidoff,,Robot,,,,
Davidoff of Geneva Ginza,8-5-6 Ginza Chuo-ku,Tokyo,,,Japan,+813 5537 5585,"35.6694951,139.7602963",davidoff,,Robot,,,,
Davidoff Shanghai,No. 381 Middle Huai Hai Road,Shanghai,200020,,China,+86 21 54656548,"31.222197,121.472953",davidoff
Davidoff Lounge at Bahnhofsplatz Zürich,Bahnhofsplatz 6,Zürich,8001,,Switzerland,+41 442116323,"47.3767475,8.5401224",smoking
Marriott,Calea 13 Sptembrie nr 90,Bucharest,,,Romania,,"44.4256866,26.0768286",,,Robot,,,,
Allegro Papagayo Britt Shop;Hotel Occidental Allegro Papagayo;Guanacaste;;;Costa Rica;506-2277-1614;"9.748917, -83.753428";Buy
Reserva Conchal (Hotel Westin) Britt Shop;Hotel Westin Reserva Conchal;Guanacaste;;;Costa Rica;506-2277-1614;"10.4957971,
Casa del Habano Del Valle;"Av. División del Norte, No. 18, Del Valle Norte";Distrito Federal;3103;;Mexico;55-5282-11-84;"19.3427568,
Casa del Habano Mazaryk;"Presidente Mazaryk, No. 393, Loc. 28, Polanco";Distrito Federal;11560;;Mexico;55-5282-1046;"19.4340199,
Casa del Habano Loreto;"Altamirano No. 46, Loc. 119 entre Av. Revolución y RÃo Magdalena, Tizapán San Angel";Distrito
Habano Club;"Heróico Colegio Militar, 1110, Las Fuentes";Piedras Negras;26010;;Mexico;01( 878 ) 78 34304;"28.7028691,
Havana Express;"Shop No. 1, Lobby of the Westin No. 1 Martin Place, Sydney NSW 2000, Australia";New South Wales;2000;;
"Cigar Emporium, Wynn";"Shop No. 29, Wynn Macau, Rua Cidade De Sintra, NAPE, Macau";Nape;;;Macau SAR China;(853) 2875 7368;"22.188408, 113.546051";"Buy
"INTERCONTINENTAL DOHA ,BELGIAN CAFÉ";WEST BAY;DOHA;29922;;Qatar;97444849897;"25.3521436, 51.4939853";",smoking
What is wrong with this picture? I started by experimenting with web scraping to gather more locations, but of course this also creates a host of classic data cleansing problems:
1. The sites are all over the world, in different languages
2. The encoding is not consistent
3. Some sites publish geocodes, some do not
4. Etc.
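To make these problems concrete, here is a minimal Python sketch of how the scraped rows could be normalized. It is an illustration under assumptions, not the cleanup used in the demo: the helper names are hypothetical, the expected width of 15 comes from the header row on the slide, and falling back to Windows-1252 is just one common guess for text that is not valid UTF-8.

```python
import csv

EXPECTED_FIELDS = 15  # number of columns in the header row on the slide


def decode_bytes(raw):
    """Decode as UTF-8 first, then fall back to Windows-1252 (an assumed guess)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def split_row(line):
    """Parse one row with ',' and with ';' and keep the result closest to the expected width."""
    candidates = [
        next(csv.reader([line], delimiter=delimiter, skipinitialspace=True))
        for delimiter in ",;"
    ]
    return min(candidates, key=lambda fields: abs(len(fields) - EXPECTED_FIELDS))


def parse_geo(value):
    """Turn '52.49,13.45' into (52.49, 13.45); return None when the value is incomplete."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None


# One semicolon-delimited row taken from the data above.
row = split_row(
    'Allegro Papagayo Britt Shop;Hotel Occidental Allegro Papagayo;Guanacaste;;;'
    'Costa Rica;506-2277-1614;"9.748917, -83.753428";Buy'
)
geo = parse_geo(row[7])
```

Rows that still come out with the wrong number of fields, or with a geocode of `None`, are the ones that need manual attention or another lookup.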
Starting Data
Search
Transformation
Pipeline
Crawler
Data Factory
Cognitive Services
Databricks
There were patterns for using tools to accomplish these tasks, but not everything fit perfectly for this kind of application or the type of data that I already had. Like any seasoned technology professional, I took all the parts that I thought I needed off the shelf and started assembling! Something that started simple was starting to get complicated.
From another angle, you could even call it a tornado: a roughly structured, fast-moving collection of objects. I contributed to this complexity: my data does not meet the Volume or Velocity requirements of big data, but I still used an OLAP engine, Databricks/Spark, because it's a tool that I have used for other applications in this space.
Find a file, send a file
• Check for a file, if it exists, run a process
• Send a file between stages or…
• Read the data from shared storage
• Easy to share storage, so Option 1
1 CSV in Blob Storage
ADF Checks and Runs
Databricks Reads and Writes
Azure Search Reads and Indexes
I started simple with a Data Factory pipeline that checks that a CSV with locations exists, then triggers a Databricks job that runs a Python notebook; finally, the results are indexed automatically by Azure Search. There is a template for this pipeline in Azure Data Factory. From here, we look at the configuration in ADF and the notebook in Databricks, run both jobs, browse Blob Storage to see that the output file was created, and check Search to see that we now have lots of metadata for each entry.
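The "find a file, send a file" pattern itself is small enough to sketch in plain Python. This is only a local stand-in under assumptions: a temporary directory plays the role of Blob Storage, the function names and file names are hypothetical, and the real pipeline uses the Data Factory template and a Databricks notebook rather than this code.

```python
import csv
import tempfile
from pathlib import Path


def run_stage_if_input_exists(input_csv, output_csv, enrich_row):
    """Run one stage only when its input exists; return rows written, or None if skipped."""
    input_csv, output_csv = Path(input_csv), Path(output_csv)
    if not input_csv.exists():
        return None  # nothing to process, like the pipeline's file check failing
    with input_csv.open(newline="", encoding="utf-8") as src:
        rows = [enrich_row(dict(row)) for row in csv.DictReader(src)]
    if not rows:
        return 0
    with output_csv.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def add_placeholder_metadata(row):
    """Hypothetical stand-in for the notebook's enrichment step."""
    row["KeyPhrases"] = ""
    return row


with tempfile.TemporaryDirectory() as shared:  # stand-in for shared storage
    source = Path(shared) / "locations.csv"
    target = Path(shared) / "locations_enriched.csv"
    skipped = run_stage_if_input_exists(source, target, add_placeholder_metadata)
    source.write_text("Name,City\nSalon du Cigare,Basel\n", encoding="utf-8")
    written = run_stage_if_input_exists(source, target, add_placeholder_metadata)
    output_header = target.read_text(encoding="utf-8").splitlines()[0]
```

Each stage reads from and writes to the same shared location, so the next stage (here, the search indexer) only needs to know where the output file lands.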
Some Learnings and Some To-Do's
• Databricks was fast, but not the websites which were retrieved
• OLAP probably not the right place for retrieving site content
• Lots of connection errors
• CSV started as a simple format, but "simple" quickly made challenges
• Hard to prep messy CSV data
• Probably better to start with tables and proper ETL
• Everything is pulling
• This model works but Push would be better
• I have samples, but did not have time for Classification
• Many scraped references are missing key info
• Websites
• Geo Codes
• Do it all over with tables instead of a CSV
• Trigger feeding to Search
I hope the learnings are clear. In the next round of updates, I will migrate the data to a table, where I can more easily use ADF features for things like the site retrieval, and I will limit the Databricks jobs to things like classification, which really benefit from OLAP capabilities.
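Wherever the site retrieval ends up running, the connection errors noted above can be softened with a retry wrapper. The following is a generic sketch and not something taken from the talk or the demo: `fetch_with_retry` and the flaky stub are hypothetical, and the caller supplies whatever function actually downloads a page.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError-based failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as error:  # covers ConnectionError, TimeoutError, urllib's URLError
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... the base delay
    raise last_error


# Offline demonstration: a stub that fails twice and then succeeds.
calls = {"count": 0}
delays = []


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated connection reset")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch, "http://www.altetabakstube.de/", sleep=delays.append)
```

Passing `sleep` in as a parameter keeps the example fast to run; in real use the default `time.sleep` applies the delays.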
https://github.com/hollembaek/AzureSaturday2019demo
The notebook code viewed in the demo is here. You can use it to add Cognitive Services to your own notebooks, or just to see how I handled the file-based input and output. The notebook retrieves site text, then uses Cognitive Services to:
1. Check the language and translate to English if the content is not already in English
2. Extract Key Phrases
3. Extract Named Entities
The readme also contains simple instructions for use and links to other examples and templates that informed the creation of this demo.
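The order of the notebook steps can be sketched as a small Python function. This is not the notebook from the repository: the function names are hypothetical, the four service calls are passed in as plain callables so the flow runs offline, and the `documents` payload shape is my understanding of what the Text Analytics REST API expects. The output keys reuse the column names from the CSV header shown earlier.

```python
def build_documents_payload(texts):
    """Wrap texts in the {"documents": [{"id", "text"}]} request shape (assumed)."""
    return {"documents": [{"id": str(i), "text": text} for i, text in enumerate(texts, start=1)]}


def enrich_site_text(source_text, detect_language, translate_to_english,
                     extract_key_phrases, extract_entities):
    """Apply the three steps to one site's text; the callables wrap the service calls."""
    language = detect_language(source_text)
    # Step 1: translate only when the content is not already in English.
    english_text = source_text if language == "en" else translate_to_english(source_text)
    return {
        "SourceText": source_text,
        "EnglishText": english_text,
        "KeyPhrases": extract_key_phrases(english_text),  # step 2
        "Entities": extract_entities(english_text),       # step 3
    }


# Stubs standing in for real service calls, with made-up return values.
result = enrich_site_text(
    "Zigarren und Whisky am Hafen",
    detect_language=lambda text: "de",
    translate_to_english=lambda text: "Cigars and whisky at the harbour",
    extract_key_phrases=lambda text: ["Cigars", "whisky", "harbour"],
    extract_entities=lambda text: [],
)
payload = build_documents_payload(["Zigarren und Whisky am Hafen"])
```

Running key phrase and entity extraction on the English text keeps the metadata in one language, which matters when everything ends up in a single search index.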