
Is scraping the correct solution for importing reviews?

I was thinking about making a simple review consolidation service like gatherup.com, endorsal.io, or repuso.com, where you provide links to your Yelp page, Facebook Reviews page, Trustpilot page, etc. up to like 100 sources, and it pulls all your reviews and displays them in a centralized place.

I was trying to determine what the easiest way to do this would be. Since these sites are just asking for links to the review page, not having you actually sign in to an integration of some sort, I assume they're just scraping the reviews from the page?

So my solution would need a cron job of some sort that fires once a day to fetch new reviews. At that time, for each of my customers, for each of their review sources (Google, FB, Yelp..), I send out a web scraper (Puppeteer?), which has to find each review on the page, progress through each page if there is more than one, and save everything back to my database.

This seems doable, but it also seems pretty complex and easy to break: if Trustpilot changes an element on the page it could break my scraper, my scraper could get blocked easily, or a review could easily be missed. Am I thinking about this correctly, or is there a simpler, more obvious route to go?
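A rough sketch of the daily loop described above, in Python rather than Puppeteer so the control flow stays short. Everything here is an assumption for illustration: the `Review` shape, the `fetch_page` callable (which would wrap whatever scraper or API client is actually used), and the idea that each review exposes some stable ID that can be used to skip ones already stored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Review:
    source_url: str
    review_id: str  # assumed: each review has some stable identifier
    text: str


# Hypothetical fetcher: given a source URL and a 1-based page number, return
# the reviews on that page, or an empty list once there are no more pages.
FetchPage = Callable[[str, int], List[Review]]


def collect_new_reviews(
    sources_by_customer: Dict[str, List[str]],
    fetch_page: FetchPage,
    seen_ids: Set[Tuple[str, str]],
    max_pages: int = 50,
) -> Dict[str, List[Review]]:
    """Walk every customer, every source, and every page; return unseen reviews."""
    new_reviews: Dict[str, List[Review]] = {}
    for customer, urls in sources_by_customer.items():
        fresh: List[Review] = []
        for url in urls:
            # max_pages is a safety cap so a misbehaving source cannot loop forever.
            for page in range(1, max_pages + 1):
                reviews = fetch_page(url, page)
                if not reviews:
                    break  # ran out of pages for this source
                for review in reviews:
                    key = (url, review.review_id)
                    if key not in seen_ids:
                        seen_ids.add(key)
                        fresh.append(review)
        new_reviews[customer] = fresh
    return new_reviews
```

In a real job, `seen_ids` would come from the database and the returned reviews would be written back to it; the cron entry would simply invoke this once a day.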

posted to Developers on April 12, 2022
  1. 2

    All review consolidation products achieve this by scraping. There isn't any better way to do this...

    Instead, to stay consistent, you should scrape a particular page every day and compare the result with a predefined expected output.

    You only need to do this for one review. If the outcome doesn't match, fire an email to yourself. This helps you keep an eye on changes...
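The check suggested in this comment could look roughly like the sketch below: scrape one known review, compare it field by field with a stored expected value, and build an alert email if anything differs. The `scrape_one` callable, the field names, and the addresses are all hypothetical; only the standard-library `email.message.EmailMessage` class is real.

```python
from email.message import EmailMessage
from typing import Any, Callable, Dict, Optional, Tuple


def check_canary(
    scrape_one: Callable[[str], Dict[str, Any]],
    url: str,
    expected: Dict[str, Any],
) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (expected, actual)} for every field that no longer matches."""
    actual = scrape_one(url)
    mismatches: Dict[str, Tuple[Any, Any]] = {}
    for field, want in expected.items():
        got = actual.get(field)  # a missing field shows up as None
        if got != want:
            mismatches[field] = (want, got)
    return mismatches


def build_alert(
    url: str,
    mismatches: Dict[str, Tuple[Any, Any]],
    to_addr: str,
) -> Optional[EmailMessage]:
    """Build the alert email, or return None when the canary passed."""
    if not mismatches:
        return None
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"Scraper canary failed for {url}"
    lines = [
        f"{field}: expected {want!r}, got {got!r}"
        for field, (want, got) in sorted(mismatches.items())
    ]
    msg.set_content("\n".join(lines))
    return msg
```

The returned message can be handed to `smtplib.SMTP(...).send_message(msg)` or to whatever mail service is already in use. Choosing an old review that is unlikely to be edited or deleted keeps false alarms down.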

  2. 1

    This is the technique to get the data; however, pay attention to the duplicate content penalty in SEO. Resharing unedited content without canonical links can get you in trouble.

  3. 1

    3.2 The content on the Website, including but not limited to the intellectual property rights, text, characteristics, graphics, icons, photos, calculations, references and software is or will be our property or the property of a third party (other than the Registered User) and is protected by U.S. and applicable international legislation, including without limitation applicable copyright and trademark laws.

    3.3 Unauthorized copying, distribution, presentation or other use of the Website or part hereof is a violation of U.S. law and may thus result in civil and/or criminal penalties.
    https://legal.trustpilot.com/for-reviewers/end-user-terms-and-conditions

    It's going to be a risky business. Scaling it is going to be another issue. I don't believe they'll do anything about it as long as you don't hurt their business.

    About scraping: first I'd look for their API, next inspect their private APIs, and if those don't work out, use the headless solution.

    It's doable, as others have already done it. Things break; your test cases for each review site will alert you when that happens.
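The order suggested in this comment (official API, then private API, then a headless browser) can be expressed as a simple fallback chain. This is only a sketch: the strategy functions are placeholders supplied by the caller, and the convention that a strategy returns `None` when it does not apply to a site is an assumption made here, not something any of these review sites define.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A strategy is a (name, function) pair. By the convention assumed here, the
# function returns a list of reviews, or None if it cannot handle this source.
Strategy = Tuple[str, Callable[[str], Optional[List[dict]]]]


def fetch_with_fallback(
    url: str, strategies: Sequence[Strategy]
) -> Tuple[str, List[dict]]:
    """Try each strategy in order; return (strategy name, reviews) from the first that works."""
    errors: List[str] = []
    for name, fetch in strategies:
        try:
            reviews = fetch(url)
        except Exception as exc:  # a broken strategy should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if reviews is not None:
            return name, reviews
        errors.append(f"{name}: not available")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Recording which strategy succeeded for each site makes it easier to notice when a source silently drops from the cheap path (an API) to the expensive one (a headless browser).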

    1. 1

      Thanks for the info! Yeah, I figured there may be some ToS issues too, so that's yet another issue. I'll dig into the APIs to see if I can leverage them, but I'm thinking this has turned out to be more trouble than it's worth.

      1. 1

        I don't have any accounts on these review sites, so I'm not familiar with their logic, but after a quick look I noticed some of them have "Save" buttons on business profiles, potentially allowing regular users to follow those businesses' feeds/reviews. If that's the case, that could be another way to fetch data.

        I can't say anything about the trouble/worth. So good luck with it.
