Right now I am building an application that has 2 million rows of data, centred around analytics on team productivity. I would like to make it faster. I am already using Redis as a cache, my database is Postgres, I select only the columns I need using Flask-SQLAlchemy, and I have already cut down a lot of the processing time. But I am certain this can be faster. Where do I learn the design patterns for this? What do I use to figure out how to improve this application? How do I measure performance? How do I learn to build apps that achieve sub-300ms response times?
It will also be helpful to know what kind of hardware you are using for the database. How much RAM does the machine have? Does the whole dataset fit in RAM? If not, at least the indexes should fit in RAM.
Disk type: HDD/SSD?
Are you using indexes?
Have you run "explain analyze"[1] on your queries to confirm that the indexes are actually being used? (See the sketch below for one way to do this from Python.)
There are many things that affect database performance.
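For a concrete starting point, here is a minimal sketch of answering those questions from Python. It assumes a Flask-SQLAlchemy `db` object and a hypothetical `events` table with made-up columns; adapt the names to your schema.

# Sketch: check how big the table and its indexes are, then look at a plan.
# `events`, `team_id`, `duration_ms`, `created_at` are placeholder names.
from sqlalchemy import text

with db.engine.connect() as conn:
    # Does the hot data fit in RAM? Compare these sizes to the machine's memory.
    sizes = conn.execute(text("""
        SELECT pg_size_pretty(pg_total_relation_size('events')) AS total,
               pg_size_pretty(pg_indexes_size('events'))        AS indexes
    """)).fetchone()
    print(sizes)

    # EXPLAIN ANALYZE actually runs the query and reports the real plan,
    # so you can see whether Postgres used an index scan or a seq scan.
    plan = conn.execute(text("""
        EXPLAIN ANALYZE
        SELECT team_id, avg(duration_ms)
        FROM events
        WHERE created_at >= now() - interval '30 days'
        GROUP BY team_id
    """))
    for row in plan:
        print(row[0])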
>> I want to build something that involves billions of rows of data. I want to know how to speed it up.
The key to this is a shared-nothing architecture along with a sharded database. I have tried to explain this architecture here [2]. To understand database sharding, I would recommend this DigitalOcean article [3].
To learn more about highly scalable architectures I would suggest reading the real world architectures section [4] of the High Scalability Blog.
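Not a substitute for those articles, but the core routing idea is small enough to sketch. This is a toy illustration only; the connection strings and the `team_id` shard key are made up.

# Toy sketch of hash-based shard routing (illustration, not production code).
from hashlib import sha1

SHARDS = [
    "postgresql://app@db-shard-0/analytics",
    "postgresql://app@db-shard-1/analytics",
    "postgresql://app@db-shard-2/analytics",
    "postgresql://app@db-shard-3/analytics",
]

def shard_for(team_id):
    # Shared nothing: every team's rows live on exactly one shard,
    # and each node owns its own slice of data, CPU, RAM, and disk.
    digest = sha1(str(team_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

A real setup would typically use consistent hashing or a lookup table so that adding a shard does not relocate most keys.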
[1] https://thoughtbot.com/blog/reading-an-explain-analyze-query...
[2] https://mobisoftinfotech.com/resources/mguide/shared-nothing...
[3] https://www.digitalocean.com/community/tutorials/understandi...
Also, in general, query less data rather than more, and then request more as needed. For example, let's say you are loading a table of users. You could have 2 million users in the database. Do you really want to query all 2 million at once, even if you have optimized the code and hardware for it? I would say no. Query maybe a few hundred at a time and fetch more as needed.
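A minimal sketch of the "a few hundred at a time" idea, assuming a Flask-SQLAlchemy `User` model (a hypothetical name). Keyset pagination is usually the better fit here, since large OFFSETs still have to walk past all the skipped rows.

# Sketch: fetch users a page at a time instead of all 2M rows.
# `User` and its columns are assumed names for illustration.
PAGE_SIZE = 200

def fetch_users_page(last_seen_id=0):
    # Keyset ("seek") pagination: WHERE id > :last_seen_id uses the
    # primary-key index instead of scanning past skipped rows.
    return (User.query
            .filter(User.id > last_seen_id)
            .order_by(User.id)
            .limit(PAGE_SIZE)
            .all())

# Usage: the client passes back the last id it saw to get the next page.
first_page = fetch_users_page()
next_page = fetch_users_page(last_seen_id=first_page[-1].id)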
That said, 2M rows is not really a lot of data (unless the rows are huge). If you don't need to worry about updates, you can just load everything into memory, e.g. into a Pandas DataFrame (large overhead, slow, many features) or a more efficient implementation like `datatable` (lower overhead, faster, fewer features).
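If that read-mostly assumption holds, the Pandas version is roughly this (the table and column names are placeholders, and `db.engine` assumes the Flask-SQLAlchemy object from earlier):

# Sketch: load the whole table once and answer analytics from memory.
# Pandas adds memory overhead, so compare df.memory_usage(deep=True)
# against the RAM you actually have before committing to this.
import pandas as pd

df = pd.read_sql("SELECT team_id, duration_ms, created_at FROM events", db.engine)

# Typical analytics queries become in-memory group-bys:
per_team = df.groupby("team_id")["duration_ms"].mean()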
Also, I recently discovered BI tools (or rather, realized that despite the name they don't have to apply to business data). E.g. Metabase provides a nice UI for uncomplicated analytical queries (like SELECT avg(...) GROUP BY ...). So if it fits 90% of your queries, you may get the frontend for free and only need to work on the backend (and the remaining 10% of queries).
- Profiling and refactoring Python code in general:
Using yappi[^1] to profile and generate profile data, and KCachegrind[^2] to visualize that data in the form of call graphs, number of cycles, etc., can yield great results. You can find which functions in your code base are taking too long, which gives great pointers to where the bottlenecks are.
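For reference, the yappi side of that workflow looks roughly like this; the saved callgrind file is what you open in KCachegrind. The profiled function here is just a stand-in.

# Rough sketch of profiling with yappi and exporting for KCachegrind.
import yappi

def run_report():
    # Stand-in for the code path you actually want to inspect.
    sum(i * i for i in range(1_000_000))

yappi.set_clock_type("wall")   # or "cpu"; wall time includes I/O waits
yappi.start()
run_report()
yappi.stop()

# Save in callgrind format and open the file in KCachegrind to browse
# call graphs, per-function times, and call counts.
yappi.get_func_stats().save("profile.callgrind", type="callgrind")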
Using pyreverse[^3], now integrated into pylint[^4], to generate, say, a PNG image of the class hierarchy and "UML diagrams" is extremely helpful. When I used it and saw the arrows going all over the place, it helped me eke out better abstractions, remove a lot of code, write cleaner interfaces, and frankly write code I and others could actually read.
After installing pylint, you can run it at the package level. Say the package name is foo and it follows the standard hierarchy with `foo/foo`:
cd foo
pyreverse -o png .
# generates classes.png and packages.png
# You can also run: pyreverse -o png foo
- Profiling in the context of Flask: Using Werkzeug's ProfilerMiddleware[^5] helps you see what's going on with each request: which functions are called, the number of calls, total time, time per call, which line, etc.
If the example in the documentation does not work, try the following:
...
try:
    from werkzeug.middleware.profiler import ProfilerMiddleware
except ModuleNotFoundError:
    # Older Werkzeug versions ship it under contrib
    from werkzeug.contrib.profiler import ProfilerMiddleware
...
# Assuming you have an app object
app.config['PROFILE'] = True
# restrictions=[50] limits the printed stats to the top 50 entries per request
app.wsgi_app = ProfilerMiddleware(app.wsgi_app, restrictions=[50])
General things: it is very helpful to extract as much code as possible from the routes. This makes the functions usable elsewhere and means you don't have to rely on Flask's test client, which can be pretty frustrating when dealing with the app context, especially in test suites involving database actions and weird connections in setUp and tearDown if you're using unittest. As I said, this is general and not very specific to "big data" or "billions of rows", but these small things lead to bigger things in my opinion: making the code easier to read and extend, easier to test and cover, and easier to profile and improve compounds to the point where you may be able to postpone more involved approaches.
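As a tiny illustration of pulling logic out of a route (the blueprint, table, and column names are made up, and `db` is assumed to be your Flask-SQLAlchemy object):

# Sketch: keep the route thin and put the logic in a plain function.
from flask import Blueprint, jsonify
from sqlalchemy import text

metrics = Blueprint("metrics", __name__)

def compute_team_averages(rows):
    # Plain function over plain data: unit-testable and profilable
    # without the Flask test client or an app context.
    totals = {}
    for team_id, duration_ms in rows:
        bucket = totals.setdefault(team_id, [0, 0])
        bucket[0] += duration_ms
        bucket[1] += 1
    return {team: total / count for team, (total, count) in totals.items()}

@metrics.route("/teams/averages")
def team_averages():
    rows = db.session.execute(text("SELECT team_id, duration_ms FROM events"))
    return jsonify(compute_team_averages(rows))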
[^1]: https://github.com/sumerc/yappi
[^2]: https://kcachegrind.github.io/
[^3]: https://www.logilab.org/blogentry/6883
[^4]: https://github.com/PyCQA/pylint
[^5]: https://werkzeug.palletsprojects.com/en/1.0.x/middleware/pro...
- If using an ORM, dump all the queries it is creating on your behalf. Check how they compare to what you would write by hand, and look at the execution plans (see the SQL-logging sketch after this list).
- Tune the queries, reduce joins where possible, add/remove indexes, etc. Sometimes removing an index helps more than adding a new one if inserts/updates are a bottleneck. Don't forget to check these.
- In the application layer, measure all of your translations/transformations and tune them. These can add a lot of time if they are called repeatedly for many records.
- At the UI level, find places where there is a lot more going on than needs to be. I very often find places where far more data is being requested than needed; then you have to filter through it all, and if nothing else it adds memory pressure, which can lead to hard-to-find performance issues.
- Caches are awesome, but they sometimes just add complexity and delays where you don't need them. Small datasets usually do really well with just some materialized views, data partitions, and query tuning (see the materialized-view sketch after this list). I have actually removed caches before and improved performance after a little query work, so don't ignore this possibility. I wouldn't say that is the norm, but it does happen.
- Denormalize in places to remove joins that are not strictly necessary. When working on a system with large marketing datasets, I found that by caching a few fields on a heavily used table we could eliminate many joins in the application across numerous pages (common queries), in some cases giving us 25x improvements. Of course, that adds a little extra time and consideration on inserts/updates, but the tradeoff is usually worth it if you are read/query heavy.
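On the ORM point above, a couple of low-effort ways to surface the SQL that SQLAlchemy generates. `app` and the `User` query are assumed/hypothetical; the config key and logger name are the standard SQLAlchemy/Flask-SQLAlchemy ones.

# Ways to see the SQL your ORM emits.
import logging

# Option 1: Flask-SQLAlchemy setting that echoes every statement.
app.config["SQLALCHEMY_ECHO"] = True

# Option 2: standard logging; also works outside Flask.
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Option 3: print the SQL for a single query object, then paste it into
# psql with EXPLAIN ANALYZE in front to inspect the plan.
print(User.query.filter(User.is_active.is_(True)).statement)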
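And on the materialized-view point: for analytics that can tolerate slightly stale numbers, Postgres can precompute the aggregate and the app just reads it. A sketch with placeholder table/column names:

# Sketch: precompute a per-team daily aggregate as a materialized view
# instead of caching query results in the application.
from sqlalchemy import text

with db.engine.begin() as conn:
    conn.execute(text("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS team_daily_stats AS
        SELECT team_id,
               date_trunc('day', created_at) AS day,
               count(*)                      AS events,
               avg(duration_ms)              AS avg_duration_ms
        FROM events
        GROUP BY team_id, date_trunc('day', created_at)
    """))

# Refresh on whatever schedule the data can tolerate (cron, Celery beat, ...):
with db.engine.begin() as conn:
    conn.execute(text("REFRESH MATERIALIZED VIEW team_daily_stats"))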
As for learning to deal with large datasets: it starts with all of the above; you need to really tune the data and queries at the database level first. Postgres is awesome and relational data is great for a ton of things, but there may be a reason to store data in non-relational ways for some things (a cache is a great example). This means you might have replicated data in places, which adds a little complexity but aids performance. Always consider how the data is going to be accessed and used by the application(s). Joins are an example of a place where small changes can result in orders-of-magnitude better performance.
Large data means small details are way more important, and what I have often seen is that people don't pay attention to the little details when working with smaller datasets. That's pretty fair most of the time, because it really doesn't affect performance on small datasets, but once you get to large data all those details matter. Even little things like picking the right data type are more critical with large datasets: you don't want to store a number as a string and constantly convert it, for example, since that adds a lot of wasted cycles. Of course, on small data you would rarely see a problem with it, which often makes people complacent about their choices.