

Page 1: Metia CEP Powerful Price Search Engine driven by TPLdownload.microsoft.com/documents/customerevidence/...  · Web viewLibrary (TPL), a set of public types and APIs in the System.Threading

Microsoft .NET
Customer Solution Case Study

Powerful Price Search Engine Driven by TPL

Overview
Country or Region: USA
Industry: Retail Industry

Customer Profile
PriceSpider is an online engine featuring real-time price updates, real-time local store inventory, aggregated product reviews and social integration so consumers can make better buying decisions.

Business Situation
To solve mission-critical bandwidth and resource efficiency challenges, PriceSpider needed to leverage parallelization to create advanced data compilation and search technology.

Solution
The company converted its manual parallelization process to the Task Parallel Library (TPL), a set of public types and APIs in the Microsoft .NET Framework 4, gaining improved business agility.

Benefits
• Real-time product information
• Time and energy savings
• Powerful data collection
• Maintenance and expansion
• Improved user experience

“We’ve been very impressed with Task Parallel Library, which is there to aid you with doing things on multiple threads... [It] lets us optimally make use of resources.”

Chadd Nervig, Senior Software Developer, PriceSpider

What differentiates the PriceSpider search engine is its real-time product and pricing information. Each day, the PriceSpider data crawl fetches dozens of terabytes of up-to-the-minute product data, including images, for hundreds of thousands of products. To retrieve and process all that information faster and more efficiently, PriceSpider took advantage of the parallel-programming capabilities provided in Microsoft Visual Studio 2010 Premium with Team Foundation Server and Microsoft .NET Framework 4, including the Task Parallel Library (TPL). By converting from its manual parallelization process to the more efficient TPL, PriceSpider minimized bottlenecks in the crawling process, saved time and energy, and ensured that customers have continuous access to up-to-the-minute product images and information.


Situation
PriceSpider, a website property based in Irvine, California, and managed by enterprise solutions provider Neudesic, enables consumers to search online retailer websites for product pricing and related information such as descriptions, pictures, reviews, and stock information. Using the latest in website technology, PriceSpider is an active, real-time web crawler that searches thousands of online retailers to provide consumers with the best prices on the web, supported by product information, images, and availability. It combines the latest in social networking and web-crawling technology, all with the end goal of helping consumers save money and be more informed about the products that they are looking to purchase. In addition to providing consumers with this service, PriceSpider also delivers its data in reports for business-to-business exchanges, and licenses its technology to companies in the e-commerce industry, such as Samsung and Sony. PriceSpider handles between 8,000 and 10,000 unique users daily, and indexes hundreds of thousands of electronics products.

PriceSpider faced a mission-critical challenge in using network bandwidth and other infrastructure resources efficiently. Real-time results are the heart of the company’s business, so its data compilation and search technology must run quickly to give users the most up-to-date information. This need for speed could only be met with parallelization techniques.

Solution
PriceSpider has always used the Microsoft .NET Framework to drive its processes because of the framework’s reliability and ease of use. Over time, the company decided to convert its manual parallelization process to the Task Parallel Library (TPL), a set of public types and APIs in the System.Threading and System.Threading.Tasks namespaces in Microsoft .NET Framework 4, in order to gain new business agility. What PriceSpider experienced as a result was a new level of speed and parallelization. According to Jon Pfortmiller, Director of Technology at PriceSpider and Neudesic, the company uses TPL to operate in an “absurdly parallel” manner, taking best advantage of resources and minimizing bottlenecks.

Pfortmiller saw that the canonical dependency on product feeds from sellers was a highly inefficient method for constructing a price-comparison database. Datasets from feeds were both massive and frequently out-of-date, and 90 percent of the data provided was unchanged from previous feeds. Pfortmiller saw that the process could be made far more efficient and up-to-date by crawling sellers’ sites rather than depending on product feeds.

“More and more, what is important is having real-time information to make real-time decisions,” says Pfortmiller. “The more information you have, the better decisions you can make and obviously the fresher that information is, the better decisions you can make. It goes from companies trying to make decisions on their prices all the way back to the consumer, who’s trying to make a buying decision. And the latest thing out there now, what’s really hot, is having the real-time inventory at these local stores.”

Pfortmiller worked with Microsoft National Systems Integrator and Gold Certified Partner Neudesic to develop PriceSpider and play the role of parent company. His team used .NET Framework 4 to create the search engine and web crawler, and it has continued to refine the service based on Microsoft technologies. In the course of developing its solution, PriceSpider massively parallelized the code it used, relying on Task Parallel Library as well as Windows Presentation Foundation, part of the Microsoft .NET Framework. PriceSpider also plans to take advantage of new asynchronous programming language and library features that are being developed for Microsoft Visual Studio.

The technical problems that PriceSpider faces generally involve minimizing bottlenecks due to network bandwidth, latency, and system resources while gathering data. In particular, the system must minimize the time spent processing images into formats and sizes needed for presentation, because processing images synchronously can result in a large bottleneck. Maintainability is also important to PriceSpider, and the company has used Task Parallel Library to streamline the code for the project in order to accommodate maintenance and future enhancement.

Says Chadd Nervig, Senior Software Developer at PriceSpider, “We’ve been very impressed with Task Parallel Library, which is there to aid you with doing things on multiple threads. It’s doing multiple concurrent things, so that when one task is waiting on network I/O or waiting for something to download, another task can be at a different stage in the process, or also waiting for something to download, or waiting for a file to copy, or waiting for some image to process. So fundamentally you want to be doing multiple things at once that aren’t blocking each other. We’re able to easily keep track of all these threads that are off doing different things and sync them back up when they’re done thanks to the Task Parallel Library, which lets us optimally make use of resources.”

“We also rely heavily on Windows Presentation Foundation,” Nervig notes. “We use that for processing images. There is a whole lot of cataloging and processing involved with each one of these images that we find, and there are a lot of them. The system deals with a lot of data nonstop for months on end.”
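Nervig's description of tasks that wait on network I/O independently and are then synced back up can be illustrated with a minimal C# sketch. It is not PriceSpider's code: the class `CrawlSketch`, the method names, and the simulated `Download` helper are hypothetical, and `Thread.Sleep` merely stands in for network latency. Only `Task.Factory.StartNew` and `Task.WaitAll` are actual Task Parallel Library calls from .NET Framework 4.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class CrawlSketch
{
    // Hypothetical stand-in for a download; the sleep simulates network latency.
    static string Download(string url)
    {
        Thread.Sleep(100);
        return "content of " + url;
    }

    // Starts one task per URL so the waits can overlap on thread-pool threads,
    // then blocks until every task has finished.
    public static string[] DownloadAll(string[] urls)
    {
        Task<string>[] tasks = urls
            .Select(url => Task.Factory.StartNew(() => Download(url)))
            .ToArray();

        Task.WaitAll(tasks); // the "sync them back up" step

        // Results come back in the same order as the input URLs.
        return tasks.Select(t => t.Result).ToArray();
    }
}
```

Because each wait runs in its own task, the elapsed time for a batch tends toward the slowest single download rather than the sum of all downloads, subject to how many thread-pool threads are available.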

“There are a whole lot of resources involved in performing the processing of these images, from network bandwidth and Internet latency to disk access, processor speed, and memory,” continues Nervig. “All those things are all very important in doing this processing, and they all come in different steps in the process. And so it just screams out to parallelize it so we can fully optimize and maximize the use of all of these resources at the same time.”

Adds Pfortmiller, “I think ultimately we can show more stuff to consumers. If we can process more images more efficiently, then we have better data ourselves, we have better content. The more we process, the more likely we are to find a better image.”

Nervig anticipates even further refinements based on new features in C# and the Microsoft Visual Basic development system that significantly simplify asynchronous programming, such as the Visual Studio Async CTP. “With the Visual Studio Async CTP, you can be doing some sort of processing, and you can spin off a thread, which means doing something in a second thread without waiting for the first thread to finish,” says Nervig. “When it finishes, you can get the results and continue processing as normal. You can do that already, but the Async CTP features just make it a whole lot easier.”

The commented-out ‘foreach’ lines are replaced with ‘Parallel.ForEach’ calls, to quickly and easily make the file copies run in parallel.
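The code figure that the caption above refers to did not survive in this transcript. As a rough reconstruction of the idea only, assuming a simple directory-to-directory copy, the change from ‘foreach’ to ‘Parallel.ForEach’ might look like the sketch below. The class `CopySketch`, the method `CopyAll`, and both directory parameters are hypothetical; `Parallel.ForEach`, `File.Copy`, and `Directory.GetFiles` are standard .NET calls.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class CopySketch
{
    // Copies every file in sourceDir into destDir (both paths are hypothetical).
    public static void CopyAll(string sourceDir, string destDir)
    {
        Directory.CreateDirectory(destDir);
        string[] files = Directory.GetFiles(sourceDir);

        // Sequential version, kept as a comment for comparison:
        // foreach (string file in files)
        // {
        //     File.Copy(file, Path.Combine(destDir, Path.GetFileName(file)), true);
        // }

        // Parallel version: the loop body is unchanged, but iterations
        // may run concurrently on thread-pool threads.
        Parallel.ForEach(files, file =>
        {
            File.Copy(file, Path.Combine(destDir, Path.GetFileName(file)), true);
        });
    }
}
```

The loop body stays the same in both versions, which is the maintainability point the case study makes. The sketch is only safe because each iteration writes to a different destination file.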


Nervig adds, “You could get the same performance doing it without the Async CTP. It just would be a lot more code, it’d be a lot more complex code, and it would be harder to maintain.”
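The trade-off Nervig describes can be seen by writing the same operation twice. The Async CTP features later shipped as the `async` and `await` keywords in C# 5 with .NET Framework 4.5, so the sketch below assumes that version or later. Everything else is hypothetical: the class `AsyncSketch`, its method names, and the trivial `ProcessImage` step that stands in for real image processing.

```csharp
using System;
using System.Threading.Tasks;

public static class AsyncSketch
{
    // Hypothetical CPU-bound step standing in for image processing.
    static int ProcessImage(int imageId)
    {
        return imageId * 2;
    }

    // Explicit continuation style, available with the TPL in .NET Framework 4.
    public static Task<string> WithContinuation(int imageId)
    {
        return Task.Factory.StartNew(() => ProcessImage(imageId))
                   .ContinueWith(t => "processed " + t.Result);
    }

    // async/await style: same result, but the code reads sequentially.
    public static async Task<string> WithAwait(int imageId)
    {
        int result = await Task.Run(() => ProcessImage(imageId));
        return "processed " + result;
    }
}
```

Both methods return a task that completes with the same string. With a single step the difference is small, but chains of several dependent steps, error handling, and loops are where the continuation style tends to grow into the "lot more code" that Nervig mentions.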

Benefits
PriceSpider is thriving on the benefits of parallelization. By using Task Parallel Library, PriceSpider can provide real-time product information that is verifiably identical to what users would see if they were to browse the seller’s site, along with a highly tunable user experience and access to all of the benefits of social media, from product photos to reviews. From a development standpoint, the efficiencies provided by the Task Parallel Library have translated into energy and time savings in creating the architecture to do this powerful data collection and computation.

Maintainability and Expandability
Regardless of how well a software project works, if it is not maintainable or expandable, it will quickly fall into ruin. By using Microsoft development tools, PriceSpider has built in provisions for both maintenance and expansion.

Looking to the future, PriceSpider plans to grow its offering and provide additional services and product types on its site. “I think one way that we could improve this in the future is through expanding into other products,” says Nervig. “Right now we focus on just consumer electronics, but one of the aspirations we have is to spread to a wider demographic and a wider selection of products. More products mean more images to process.” PriceSpider will be better able to expand its site and efficiently provide more services to its customers because of the efficiencies it gains from using the Task Parallel Library.

Improved Consumer Experience
The product research process can be difficult and prone to misinformation, which is ultimately why the PriceSpider approach is so important. The company’s technique of crawling sites rather than depending on product data feeds enables its information to stay unbiased. When search engines depend solely on data feeds from product manufacturers or retailers, consumers and businesses often lose out by receiving information that is both outdated and incomplete. PriceSpider uses parallelization to minimize bottlenecks in its crawling process so that consumers get the latest possible information, minimizing the research phase and improving the consumer’s experience. PriceSpider improves the user’s experience in three important ways:

• Through immediacy, by providing up-to-the-minute product information and images
• Through location, by providing the consumer with local points to purchase along with inventory data
• Through social media, by providing consumer reviews from multiple places that are not tied to any specific retailer

By using Windows Presentation Foundation for in-memory image processing and Task Parallel Library to parallelize the entire process, PriceSpider is able to give consumers and businesses access to data that actually helps them make beneficial purchase decisions.

Microsoft .NET
Microsoft .NET is software that connects people, information, systems, and devices through the use of web services. Web services are a combination of protocols that enable computers to work together by exchanging messages. Web services are based on the standard protocols of XML, SOAP, and WSDL, which allow them to interoperate across platforms and programming languages.

.NET is integrated across Microsoft products and services, providing the ability to quickly build, deploy, manage, and use connected, secure solutions with web services. These solutions provide agile business integration and the promise of information anytime, anywhere, on any device.

For more information about Microsoft .NET and web services, please visit these websites:
www.microsoft.com/net
msdn.microsoft.com/webservices


For More Information
For more information about Microsoft products and services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (877) 568-2495. Customers in the United States and Canada who are deaf or hard-of-hearing can reach Microsoft text telephone (TTY/TDD) services at (800) 892-5234. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information using the World Wide Web, go to: www.microsoft.com

For more information about products and services, call or visit the website at:

For more information about PriceSpider products and services, call or visit the website at: www.pricespider.com

This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.

Document published March 2011

Software and Services
• Microsoft Office
  − Microsoft Office SharePoint Server 2007
• Microsoft Server Product Portfolio
  − Microsoft SQL Server 2008 R2 Enterprise
• Microsoft Visual Studio
  − Microsoft Visual Studio 2010
• Other Products
  − Windows 7

Hardware
• Intel Xeon E5506, 64-bit