
Querying large amounts of data in MySQL

I am building this platform https://newsengine.co/ that aggregates news and allows access via an API. The database becomes quite large in a short time: it has collected over 2 million news articles in less than a month, the number is increasing every day, and so querying has become a problem. I constantly get 504 errors while trying to access the API, and in general the API is slow because of the time needed to read data from the database.

Currently I am using two separate droplets from DigitalOcean ($20/4GB RAM for the database and web app, and $10/2GB RAM to run the crawlers). I am thinking of increasing the size to 8 GB or 16 GB of memory with more CPU cores. Before that I want to know whether that is the only solution. I have also created indexes and tweaked the MySQL config to see if that would make querying more efficient, but without much success. There will be around 2-3 million news articles per month, and I will probably archive news older than one month, which won't be accessible via the API but only on custom request.

  1. 3

    For $30/month you can run both on a dedicated server, and a lot more.

    2M rows is not a lot. How big is each row? Are you using InnoDB? Have you split the data that you don't need to query into a separate table? Do you have too many indexes? If you don't need them, and the machine is isolated, you can turn off networking and the grant tables in MySQL.
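
    A minimal sketch of what that could look like in my.cnf (only do this on a box that nothing else needs to reach over the network, since skip-grant-tables disables authentication entirely):

    [mysqld]
    skip-networking      # no TCP listener, connections only via the local socket
    skip-grant-tables    # skip privilege checks; only sensible on an isolated machine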

    1. 1

      Didn't find a $30 plan on DO. I am thinking of moving to the $40 one, but before that I wanted to see if everything else is working fine.

        1. 2

          No kidding, are those the real prices for Contabo and Hetzner? Looking at these prices, Linode and DO seem way too expensive. And more importantly, can I trust these sites?

          I would love to have a VPS at this price, but what about downtime, disk speed, and reliability?

          1. 2

            I use Hetzner. The only problem I have is that occasionally one RAID disk will fail (which they replace pretty fast). Everything else works great, and you also get unlimited 1 Gbit traffic.

            Yeah, most VPS and cloud providers target the clueless.

          2. 2

            Hetzner is reliable; we have been using it for the last five years with TB-size SQL databases with a few billion rows.

  2. 3

    Hi,
    Last week I optimized a MySQL query from 19 seconds to 0.14 seconds.
    Put EXPLAIN in front of the query to see what is happening and what can be optimized.
    Can you post the query here?
    Bogdan

    1. 1

      There are two tables to consider: Source(id, name, language, category) and News(id, url, title, author, source_id, content, published_date). One typical query could be selecting news from British newspapers:

      SELECT news.title, news.url, news.author, news.source_id, source.name AS source_name, source.url AS source_url, news.content, news.published_date, news.image
      FROM news
      JOIN source ON news.source_id = source.id
      WHERE source.country = 'gb'

      Similarly, we can query based on category (politics, sports), language (en, fr, de), the title containing a certain substring (I am using INSTR instead of LIKE here), or the published date falling between certain dates. We can also combine two or more of these filters in the same query. By default the result is sorted descending by published date.
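
      A rough sketch of indexes that could support those filters, assuming the column names from the query above (adjust to the actual schema, and check with EXPLAIN which ones the optimizer actually uses):

      -- join column plus the default sort order
      ALTER TABLE news ADD INDEX idx_news_source_published (source_id, published_date);

      -- date-range filtering / ORDER BY published_date DESC
      ALTER TABLE news ADD INDEX idx_news_published (published_date);

      -- filter columns on the source table
      ALTER TABLE source ADD INDEX idx_source_country (country);
      ALTER TABLE source ADD INDEX idx_source_language (language);
      ALTER TABLE source ADD INDEX idx_source_category (category);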

      1. 3

        Put "EXPLAIN" in front of the query:

        So in your case "EXPLAIN SELECT news.title, news.url..."

        You are almost definitely missing an index somewhere, but without seeing the EXPLAIN everyone here is just guessing.
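
        For example (column list shortened), the columns to look at in the output are type, key, and rows; type = ALL with key = NULL on the news table would mean a full table scan:

        EXPLAIN
        SELECT news.title, news.url, news.published_date
        FROM news
        JOIN source ON news.source_id = source.id
        WHERE source.country = 'gb'
        ORDER BY news.published_date DESC;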

        1. 1

          Thanks to the EXPLAIN command, I managed to increase the speed at least noticeably. I will be experimenting with it for a while to see if I can improve it further.

      2. 2

        ~2 million rows per month is not a lot, you should be fine.

        Some suggestions:

        Selecting blob/text columns is usually the source of most of the problems. In your example query, do you really need to return the whole news.content column? You could probably just select the first 50 characters of the content (SELECT LEFT(news.content, 50)...). Then if the user wants the whole content, they have to pass a flag on the API call requesting the full content, or load it one by one by ID, or something similar.
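
        Something along these lines (the 50-character preview and the separate detail lookup are just an illustration):

        -- list endpoint: only a short preview of the content
        SELECT news.id, news.title, news.url, LEFT(news.content, 50) AS content_preview,
               news.published_date, source.name AS source_name
        FROM news
        JOIN source ON news.source_id = source.id
        WHERE source.country = 'gb'
        ORDER BY news.published_date DESC;

        -- detail endpoint: fetch the full text for one article by primary key
        SELECT content FROM news WHERE id = ?;  -- id requested by the client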

        Other ideas:

        • run EXPLAIN and see if you're getting any full table scans and check your indices

        • make sure you have a foreign key on news.source_id

        • add an index to source.country, or maybe normalize source.country into its own table (countries table) and add a foreign key (see the sketch after this list)

        • depending on the usage, you could have a partial index on published_date so your query only hits the latest news, or a partial index on the most popular languages, etc.

        • if we're talking about MySQL, it might be better to move the text content into a different table

        • in the future, you could look into table partitioning by date or adding a read replica, but you don't need it now
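
        A rough sketch of the countries table and the separate content table mentioned above (names and column types here are illustrative assumptions, not the real schema; match news.id's actual type for the foreign key):

        -- normalize country into its own table and reference it by id
        CREATE TABLE countries (
          id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
          code CHAR(2) NOT NULL UNIQUE
        ) ENGINE=InnoDB;

        ALTER TABLE source
          ADD COLUMN country_id SMALLINT UNSIGNED,
          ADD CONSTRAINT fk_source_country FOREIGN KEY (country_id) REFERENCES countries (id);

        -- move the large text out of the main news table
        CREATE TABLE news_content (
          news_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
          content MEDIUMTEXT NOT NULL,
          CONSTRAINT fk_news_content_news FOREIGN KEY (news_id) REFERENCES news (id)
        ) ENGINE=InnoDB;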

        In case you need more powerful search, Postgres has nice full text search capabilities, or you could check Elasticsearch, but only use ES as a "search cache" that you can delete and recreate from scratch when needed, because it's very unreliable.

        For pricing concerns, you could take a look at AWS RDS, you can buy Reserved DB Instances and get a sweet discount if you're going to be using the DB for a year or more -> https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithReservedDBInstances.html

        Hope that helps!

      3. 2

        A couple of ideas:

        • split the query into 2 queries: make the search query first, then a second SELECT to get the relevant news rows (see the sketch after this list). Currently, depending on the search query, MySQL may be scanning the whole table to find matches. You can experiment with EXPLAIN to see how many rows are scanned for each type of search

        • use numeric ids instead of strings for country/language

        • since you don't search in the "content" field (and it's probably the largest), use a separate table just for the content

        • have you tried LIKE instead of INSTR?
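
        A sketch of the two-step version of the earlier example (the IN list in step 2 is built from the ids returned by step 1):

        -- step 1: cheap search that only needs indexed columns
        SELECT news.id
        FROM news
        JOIN source ON news.source_id = source.id
        WHERE source.country = 'gb'
        ORDER BY news.published_date DESC
        LIMIT 50;

        -- step 2: fetch the wide rows (including content) only for those ids
        SELECT id, title, url, author, content, published_date
        FROM news
        WHERE id IN (101, 102, 103);  -- ids from step 1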

      4. 2

        Are you considering some paging?
        Usually you don't have to pull all the rows from the table at once, so maybe you should consider LIMIT and OFFSET if that's not already the case.
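
        For example, page 3 with 50 results per page would look something like this (keep in mind that MySQL still reads and discards the skipped rows, so very large offsets stay expensive):

        SELECT news.id, news.title, news.url, news.published_date
        FROM news
        JOIN source ON news.source_id = source.id
        WHERE source.country = 'gb'
        ORDER BY news.published_date DESC
        LIMIT 50 OFFSET 100;  -- page 3: skip the first two pages of 50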

        Then maybe you should add a table containing all the countries, link the source table to a country_id rather than just the country name, and filter on the country_id instead.

        Hope it helps!

        1. 1

          Yes, definitely there will be LIMIT and OFFSET. Sorry that I forgot to mention that part here.

      5. 1

        With larger datasets, avoid any string parsing at all costs. Using LIKE or INSTR both require full table scans in most cases. Instead, build an index of tags and link the News table to the Tag table through a relational table, News_Tag. This allows the query optimizer to avoid table scans of anything except the Tag table to find the matching tag_ids, which can then be joined through News_Tag to get the news_ids.
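
        A sketch of that layout, using the table names from this comment (column types are assumptions; match them to the real news.id):

        CREATE TABLE Tag (
          id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
          name VARCHAR(100) NOT NULL UNIQUE
        ) ENGINE=InnoDB;

        CREATE TABLE News_Tag (
          news_id BIGINT UNSIGNED NOT NULL,
          tag_id  INT UNSIGNED NOT NULL,
          PRIMARY KEY (news_id, tag_id),
          KEY idx_tag_news (tag_id, news_id)  -- lets the optimizer go tag -> news without scanning News
        ) ENGINE=InnoDB;

        -- "news mentioning bushfires" without touching news.content at all
        SELECT n.id, n.title, n.published_date
        FROM Tag t
        JOIN News_Tag nt ON nt.tag_id = t.id
        JOIN News n ON n.id = nt.news_id
        WHERE t.name = 'bushfire'
        ORDER BY n.published_date DESC;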

        You then develop statistics to find common words across articles, filter out words that appear in too many articles (like prepositions, pronouns and such), and use this to dynamically update your Tags table. You can also keep stats to derive trending topics over time by tracking tag frequency from run to run.

      6. 1

        Does source_id have an index?

  3. 2

    I would definitely look at dedicated hardware from Hetzner. I have just moved from spending thousands of pounds on AWS to just a few hundred on Hetzner.

    Only thing you will need to understand as you scale is how to scale the hardware.

    As for your slow queries, I think everyone has answered that below, although I would add some APM software such as AppOptics (free dev edition now) that will show not only your poorly performing queries but your code as well.

    DB design is definitely a big factor. Separate out any data that doesn't need to be returned in the first set of results.

  4. 2

    I think Elasticsearch would be a much better tool for the job. (Alternatively, set up read-only replicas for MySQL.)

    1. 1

      Can you suggest some tutorials/videos/articles on migrating from MySQL to Elasticsearch? I would love to try Elasticsearch, but I am fairly new to this topic, so if you can point me in a direction, that would be helpful. Thanks.

      1. 1

        Use Elasticsearch if your queries are based on full-text search (otherwise, MySQL should be fine). The migration is pretty simple, and if you simply type it into Google it should give you some ideas.

        In the worst case, you can simply insert into Elasticsearch through the REST API (as Elastic has one).
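
        For example, indexing one article over the REST API could look roughly like this (index name and field names are just placeholders):

        curl -X POST "localhost:9200/news/_doc" -H 'Content-Type: application/json' -d '
        {
          "title": "Example headline",
          "content": "Full article text...",
          "published_date": "2020-01-15T10:30:00Z",
          "source": "example-source",
          "country": "gb"
        }'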

        Also, there are plenty of managed Elastic services, for example https://www.elastic.co/pricing.

        There are also other engines for full-text search, like Sphinx; I think for Sphinx you can create the index directly from MySQL queries, and it is very simple.
        Another alternative is Solr. But Elastic is, in my opinion, the most powerful.

  5. 2

    The MySQL engine can impact performance quite heavily. For large datasets the InnoDB engine is best; MyISAM can become quite slow with very large database tables.

    For the queries that take time, you can extract the query that is being run and put EXPLAIN at the start; it will tell you how MySQL is fetching the data and confirm that your indexes are being used.
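
    To check which engine a table is using, and to convert it if needed (the conversion rebuilds the table, so expect it to take a while on a big one):

    -- show the engine (and row counts, sizes) for each table in the current database
    SHOW TABLE STATUS;

    -- convert a MyISAM table to InnoDB
    ALTER TABLE news ENGINE = InnoDB;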

  6. 1

    I would suggest adding a FULLTEXT index on each column with your text data in it. You would need to change the query syntax to make use of the fulltext index, using MATCH(column) AGAINST ('searchterms'). For example, if your table is called documents and the column is called content:

    -- to add the index on the "content" column
    ALTER TABLE documents
    ADD FULLTEXT INDEX IX_documents_content_fulltext (content);

    -- to use the index in a query:
    SELECT id
    FROM documents
    WHERE MATCH(content) AGAINST ('bushfire' IN BOOLEAN MODE);
    

    You would also want to prevent people from making queries that have too many results, e.g. if they are searching for a short, common word.

  7. 1

    I am not a MySQL expert. However, I have a good deal of experience with storing large amounts of data in an RDBMS.

    If you are storing LOBs (Large Objects, as Oracle DB calls them, i.e. large chunks of text or binary data), you can suffer performance issues if the row cannot fit into the page. The easy way to handle this is to store the LOBs in a separate table. I generally use the same ID as the table with the metadata as the primary key. This allows you to have reasonable query performance on the main table, and you can select the LOBs only when you actually need to fetch the data (via the primary key).

    Secondly, for a vast quantity of rows, a partitioned table is a must. Each RDBMS does partitioning a bit differently, but basically you define a partition key, and the rows are separated into separate tables (partitions) under the hood by a range that you define. If you can query via the partition key, you narrow down the data significantly. Generally, I use a date as the partition key with a day as the range. If there isn't enough data per day, a month can work as well. As for row count, 100k records per partition worked well on Oracle, YMMV. Additionally, this makes it easier to drop data as well: instead of deleting rows, you can just drop the partition (very similar to dropping a table).
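
    In MySQL, monthly range partitions could look roughly like this (a hypothetical cut-down table, not the real one; note MySQL requires the partitioning column to be part of every unique key, including the primary key):

    CREATE TABLE news_partitioned (
      id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      source_id      INT UNSIGNED NOT NULL,
      title          VARCHAR(500),
      url            VARCHAR(1000),
      published_date DATETIME NOT NULL,
      PRIMARY KEY (id, published_date)              -- partition column must be in the PK
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(published_date)) (
      PARTITION p202001 VALUES LESS THAN (TO_DAYS('2020-02-01')),
      PARTITION p202002 VALUES LESS THAN (TO_DAYS('2020-03-01')),
      PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- dropping old data is then a metadata operation, not a huge DELETE
    ALTER TABLE news_partitioned DROP PARTITION p202001;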

    Finally, indexes are a must. However, you can't just create indexes all willy-nilly. You want to run an explain plan to see where the cost is high and figure out which columns are being used the most. Obviously, you want to store dates as dates and not varchars, so you don't have to do string comparisons. If you have to run a function on a column to get your results, you can index the function (at least in Oracle and Postgres, you can).
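
    For what it's worth, MySQL 8.0.13+ supports this too, via functional key parts. A tiny sketch (the expression in the WHERE clause has to match the indexed expression):

    -- index an expression so queries on LOWER(title) can use it
    CREATE INDEX idx_news_title_lower ON news ((LOWER(title)));

    -- a query using the same expression can be satisfied by the index
    SELECT id FROM news WHERE LOWER(title) = 'budget 2020';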

    Happy querying!

  8. 1

    Our product, EverSQL, may be of help in optimizing your MySQL queries automatically. Feel free to reach out to us via a support ticket and mention this post, and we'll be happy to get you on a one-month free subscription to help out.
    Disclaimer: I am a co-founder of EverSQL.

  9. 1

    My experience is to use partitioned tables, and to use 3NF and foreign keys to separate data into other tables. Currently my data is more than 250M rows. So you can use, for example, "SELECT * FROM myTable PARTITION (p20191231) WHERE ...."

  10. 1

    2-3 million records is not a lot. We have tables with 250+ million records.

    But it also depends on what you have as a record. If you save full text in one field, then those 2-3 million are like 20-30 million for simpler tables. One factor is the number of records and another is the size of the records.

    Are you doing full-text search on the records?

    1. Try to optimize the DB (indexes, EXPLAIN ...).
    2. Try to upgrade to a better server (OVH, Hetzner ...); they are way cheaper than DO.
    3. If you are searching the DB for text, then you should use Elasticsearch.
  11. 1

    MySQL might not be the best option here. Have you looked at Elasticsearch?

    1. 1

      I've heard of Elasticsearch but don't know much about it. I will check it out as an alternative.

  12. 1

    This sounds like a great scenario for read-replica databases. How often is the data updated?

    1. 1

      Currently I am running the crawlers every 30 minutes, but in the future it will be around every 10-15 minutes.

  13. 1

    Did you index the fields that you query on?

    1. 1

      Yes, I have created multiple indexes for the fields that appear in the WHERE part.

  14. 1

    Can you post the query that takes time? Are you doing full-text search on it?

    1. 1

      Hey @m00dy, see the comment above for the query.

  15. 1

    Maybe you can do this asynchronously?
    You send a request via the API to start the big query, it runs in the background, and then you get the results in another request?

    1. 1

      You mean dividing a larger query into chunks and sending the data to the browser?

      1. 1

        Maybe he means something like RabbitMQ to queue your requests.
