Improving Amazon Data Quality
Data Quality Analysis and Reporting
Data Record Science
Derek Pappas
Detecting data quality problems
❖ amazon.com product data quality can be improved
❖ DRS software will detect the problems
Filtering
There are so many opportunities to improve the user experience on amazon.com by identifying the problematic data and then fixing it or filtering it out.
Suggestions for Fixing Data
❖ Product matching
❖ Variant elimination
❖ Identify bad data
❖ Identify duplicate products with different names from the same vendor
❖ Identify missing data
❖ Suggest fixes for data
❖ Identify over/underpriced items at third-party stores (in my opinion, significantly overpriced items on amazon.com make Amazon look bad)
❖ Find and correct bad product classification
❖ Wrong product images
❖ Wrong specifications
❖ Google SEO violations
Data Processing Pipeline
❖ Our pipeline was built with Hadoop MapReduce, which scales. The pipeline processed 200 million records last week and can process billions.
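As a rough illustration of the map/reduce style described above, the sketch below runs a map step, a group-by-key step, and a reduce step in plain Python. It does not use Hadoop and is not the DRS pipeline; the record fields (`vendor`, `name`) and the functions `map_record`, `reduce_group`, and `run_job` are hypothetical names chosen for this example.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit (key, value) pairs; here, one count per vendor."""
    yield record["vendor"], 1

def reduce_group(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_group(k, v) for k, v in groups.items())

# Illustrative records only, not real catalog data.
records = [
    {"vendor": "A", "name": "ab straps"},
    {"vendor": "A", "name": "hanging ab straps"},
    {"vendor": "B", "name": "massage ball"},
]
counts = run_job(records)
```

Because each key is reduced independently, the same map and reduce functions could in principle be distributed across many machines, which is what lets this style of job scale.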
Detecting problems
The following are just a few examples of problems that the DRS pipeline can detect.
Overpricing
See the attached image of the massage balls. We can group those product variants and we can identify the overpricing.
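One possible way to flag overpricing once variants are grouped is to compare each offer against the median price of its group. The sketch below assumes the grouping has already been done and is stored in a hypothetical `group` field; the `ratio` threshold, the field names, and the sample prices are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def find_overpriced(offers, ratio=2.0):
    """Return offers priced above `ratio` times their group's median price."""
    groups = defaultdict(list)
    for offer in offers:
        groups[offer["group"]].append(offer)

    flagged = []
    for group_offers in groups.values():
        typical = median(o["price"] for o in group_offers)
        flagged.extend(o for o in group_offers if o["price"] > ratio * typical)
    return flagged

# Made-up prices for one variant group.
offers = [
    {"group": "massage-ball", "seller": "s1", "price": 9.99},
    {"group": "massage-ball", "seller": "s2", "price": 10.49},
    {"group": "massage-ball", "seller": "s3", "price": 11.00},
    {"group": "massage-ball", "seller": "s4", "price": 49.95},
]
overpriced = find_overpriced(offers)
```

The median is used rather than the mean so that a single extreme price does not shift the baseline it is compared against.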
Overpricing Example
"Jamming" the Amazon Index
❖ The link below shows the same product over and over with different product names; these are not variants. The vendor is "jamming" the Amazon index so that their product shows up under different search terms. Google algorithmically reduces the number of a site's links in the Google index when the site is "spammy", and manually excludes a site from the index, or reduces its number of links there, when the site uses black hat SEO tactics. See the image below.
❖ https://www.amazon.com/s/ref=sr_st_price-asc-rank?keywords=ab+straps+hanging&rh=i%3Aaps%2Ck%3Aab+straps+hanging&qid=1480277091&sort=price-asc-rank
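A simple heuristic for spotting this kind of jamming is to count how many distinct names one vendor uses for what looks like the same item. The sketch below treats listings with the same vendor, image, and price as the same product; that fingerprint, the field names (`vendor`, `image_id`, `price`, `name`), and the `min_names` threshold are assumptions for this example, not a description of how DRS does it.

```python
from collections import defaultdict

def find_jammed_listings(listings, min_names=3):
    """Group listings by a product fingerprint and report fingerprints
    that appear under at least `min_names` distinct names."""
    names_by_product = defaultdict(set)
    for item in listings:
        key = (item["vendor"], item["image_id"], item["price"])
        names_by_product[key].add(item["name"].strip().lower())
    return {
        key: sorted(names)
        for key, names in names_by_product.items()
        if len(names) >= min_names
    }

# Illustrative listings only.
listings = [
    {"vendor": "V1", "image_id": "img-1", "price": 12.99, "name": "Ab Straps"},
    {"vendor": "V1", "image_id": "img-1", "price": 12.99, "name": "Hanging Ab Straps"},
    {"vendor": "V1", "image_id": "img-1", "price": 12.99, "name": "Ab Sling for Pull Up Bar"},
    {"vendor": "V2", "image_id": "img-9", "price": 15.00, "name": "Ab Straps"},
]
jammed = find_jammed_listings(listings)
```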
“Jamming” the Amazon Index
Bad Classification
❖ In other instances on amazon.com, items are misclassified. In most cases we can already identify these classification problems.
Bad Classification
There are biking and racing helmets mixed together.
https://www.amazon.com/s/ref=sr_nr_p_36_2?srs=2592626011&fst=as%3Aoff&rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A706814011%2Cn%3A3403201%2Cn%3A6389202011%2Cn%3A3404571%2Ck%3ARACING%2Cp_36%3A1253557011&bbn=3404571&sort=price-asc-rank&keywords=RACING&ie=UTF8&qid=1480301345&rnid=386589011
Wrong Product Image
❖ The search does not know who the manufacturer is. Searching for "racing" inside of Giro returns Fox and Bell at the top of the search results.
Wrong Product Image (Socks)
Bad Specifications
❖ Name value pairs do not match
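A mismatch between a specification name and its value can sometimes be caught with per-field format rules. The sketch below is a minimal example under that assumption: the `RULES` table, its two patterns, and the sample specification are hypothetical and would need to be much broader in practice.

```python
import re

# Hypothetical format rules, keyed by lowercased specification name.
RULES = {
    "weight": re.compile(r"^\d+(\.\d+)?\s*(oz|lb|lbs|g|kg)$", re.IGNORECASE),
    "color": re.compile(r"^[A-Za-z][A-Za-z /-]*$"),
}

def mismatched_specs(specs):
    """Return (name, value) pairs whose value fails the rule for that name.
    Names without a rule are not checked."""
    bad = []
    for name, value in specs.items():
        rule = RULES.get(name.strip().lower())
        if rule and not rule.match(value.strip()):
            bad.append((name, value))
    return bad

# Illustrative record where the weight and color values were swapped.
specs = {"Weight": "Blue", "Color": "2.5 lbs", "Brand": "Giro"}
problems = mismatched_specs(specs)
```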
Bad Specifications
Mining Reviews
❖ Product Quality Issues (including AmazonBasics)
❖ Store customer service issues
❖ Graph ratings vs. number of reviews (validity: is one 5-star review better than fifty 4-star reviews?)
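One common way to weigh a rating against the number of reviews behind it is a damped (Bayesian-style) average, which pulls products with few reviews toward a prior mean. This is a technique swapped in here for illustration, not something the slides specify; `prior_mean` and `prior_weight` are assumed values.

```python
def damped_rating(star_ratings, prior_mean=3.5, prior_weight=10):
    """Average the ratings together with `prior_weight` imaginary
    reviews that all have the value `prior_mean`."""
    n = len(star_ratings)
    return (prior_mean * prior_weight + sum(star_ratings)) / (prior_weight + n)

one_five_star = damped_rating([5])          # a single 5-star review
fifty_four_star = damped_rating([4] * 50)   # fifty 4-star reviews
```

Under these assumed parameters the product with fifty 4-star reviews scores higher than the product with a single 5-star review, because one review moves the score only slightly away from the prior.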
Sort by Price Does Not Include Shipping
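A sort that reflects what the buyer actually pays would order offers by item price plus shipping. The sketch below shows the idea; the `price` and `shipping` fields and the sample offers are assumptions for this example.

```python
def sort_by_total_price(offers):
    """Sort offers by item price plus shipping (missing shipping counts as 0)."""
    return sorted(offers, key=lambda o: o["price"] + o.get("shipping", 0.0))

# Illustrative offers: the lowest item price is not the lowest total.
offers = [
    {"seller": "s1", "price": 5.00, "shipping": 7.99},
    {"seller": "s2", "price": 9.99, "shipping": 0.0},
]
ranked = sort_by_total_price(offers)
```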
Product Quality Issues
❖ https://www.amazon.com/AmazonBasics-Micro-USB-USB-Cable-2-Pack/dp/B00NH13O7K/ref=pd_sim_147_5?_encoding=UTF8&psc=1&refRID=7QRCXVWVQB9F4J9EVGV7
Reporting and Analysis
❖ Our data analysis and reporting can find the good/bad records and the good/bad/missing fields/images.
❖ Moreover, our software can often suggest fixes on the data analysis website.
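As a minimal sketch of what a good/bad/missing report could look like, the code below counts records with all required fields present and tallies which fields are missing elsewhere. The `REQUIRED_FIELDS` list, the field names, and the report layout are hypothetical and are not taken from the DRS software.

```python
# Hypothetical list of fields every product record should have.
REQUIRED_FIELDS = ("name", "price", "image_url", "category")

def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]

def quality_report(records):
    """Count good and bad records and tally missing fields by name."""
    report = {"good": 0, "bad": 0, "missing": {}}
    for record in records:
        missing = missing_fields(record)
        if missing:
            report["bad"] += 1
            for field in missing:
                report["missing"][field] = report["missing"].get(field, 0) + 1
        else:
            report["good"] += 1
    return report

# Illustrative records only.
records = [
    {"name": "Helmet", "price": 59.0, "image_url": "h.jpg", "category": "Cycling"},
    {"name": "Socks", "price": 8.0, "image_url": "", "category": "Apparel"},
    {"name": "Cable", "price": 6.5},
]
report = quality_report(records)
```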