
How to speed up your data synchronization by a factor of 50



It's essential for every online shop to offer buyable products, but where does the data come from? Usually the data is (at least once) entered manually into some kind of database. There are many ways to transfer this data, e.g. manual data input, file-based data transfer, or APIs accessed via middleware or direct communication. Depending on the size of the company, the number of offered products and the frequency of data changes, it's a good idea to automate this process. In any case, it's recommended to avoid manual data transfer to reduce mistakes and invalid data. This article covers file-based data transfers of a large number of products (> 100k).

The mission

Assume we have an ERP system like SAP or Dynamics NAV and an e-commerce platform like Shopware. The ERP system outputs XML files which contain the product information. In this context it doesn't matter which software is used; it could be a PIM as well. The goal is to import/update more than 100,000 products every day.


First attempt

We want to reuse existing components and standard tools. The first implementation uses the XML reader to read the files; this way we can easily use XSD schema validation. The writer component uses the shop's built-in models and ORM, which in the Shopware context means Doctrine. By using existing components we were able to implement the first version very quickly. But during implementation we used small sample files, and when we executed the first test run with a dataset equivalent to the real amount of products, we realized that the import job was slow and had a high memory consumption.
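
To give an idea of this first version, here is a minimal sketch of what it might look like. The schema file, the Product entity and its setters are placeholders; the real Shopware models and XSD are more complex.

<?php
// Minimal sketch of the first, ORM-based version (names are placeholders):
// validate the whole file, then persist product by product via Doctrine.
use Doctrine\ORM\EntityManagerInterface;

function importProducts(string $xmlFile, EntityManagerInterface $em): void
{
    $dom = new DOMDocument();
    $dom->load($xmlFile);                    // loads the complete file into memory
    $dom->schemaValidate('products.xsd');    // easy XSD validation, but needs the full DOM

    foreach ($dom->getElementsByTagName('product') as $node) {
        $product = new Product();            // placeholder Doctrine entity
        $product->setNumber($node->getElementsByTagName('number')->item(0)->nodeValue);
        $product->setName($node->getElementsByTagName('name')->item(0)->nodeValue);
        $em->persist($product);              // ORM change tracking for every single product
    }

    $em->flush();                            // one big unit of work: slow and memory-hungry
}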


Rethink it

There are two components which have a huge impact on speed: the data reading process and the data writing process. XML files offer easy validation via schema files, but for that the whole file has to be loaded into memory, which costs memory and time. The fast alternative is stream processing: we consume the files in small pieces and process the products one by one. The Shopware built-in way of storing products in the database, including validation and creating and saving all needed associations, works, but it is meant for creating or editing a single product. These functions use an ORM abstraction layer, which is slow by design. Alternatively, you can use SQL statements which are executed directly. This is a lot faster, but lacks validation and other convenient functions. Using the fast options, it's possible to import 100,000 (100k) products within 2 hours. The slow options combined take more than 4 days for the same data.
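
The following is a minimal sketch of the fast combination: streaming the XML with PHP's XMLReader and writing via prepared SQL statements. The XML structure, the table and the column names are simplified assumptions, not the real Shopware schema.

<?php
// Sketch: stream the XML with XMLReader and write via prepared SQL statements.
// The XML structure, table name and column names are simplified assumptions.
$reader = new XMLReader();
$reader->open('products.xml');

$pdo  = new PDO('mysql:host=localhost;dbname=shop', 'user', 'secret');
$stmt = $pdo->prepare(
    'INSERT INTO products (number, name, price) VALUES (:number, :name, :price)
     ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)'
);

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        // Expand only the current <product> node instead of the whole document.
        $node = new SimpleXMLElement($reader->readOuterXml());
        $stmt->execute([
            ':number' => (string) $node->number,
            ':name'   => (string) $node->name,
            ':price'  => (float) $node->price,
        ]);
    }
}
$reader->close();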



General implementation thoughts


import tasks

Depending on how the data is delivered, we recommend performing each import task separately. This way it's easy to rerun a failed step, and some tasks might be used more often, e.g. price imports. Be aware of the order of the single steps: usually it's import categories first, then add products (be careful with variants), assign products to categories, and finally import prices.
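
As a rough sketch, the separate steps could be wired together like this; the command names are just examples, not real Shopware commands.

<?php
// Sketch: run the (hypothetical) import commands as separate steps in order,
// so a failed step can simply be rerun on its own.
use Symfony\Component\Process\Process;

$steps = [
    ['bin/console', 'import:categories'],
    ['bin/console', 'import:products'],             // careful with variants
    ['bin/console', 'import:category-assignments'],
    ['bin/console', 'import:prices'],               // also reusable on its own, e.g. daily
];

foreach ($steps as $step) {
    $process = new Process($step);
    $process->setTimeout(null);                     // imports can run for a long time
    $process->run();

    if (!$process->isSuccessful()) {
        // Stop here; this step can be rerun without repeating the earlier ones.
        throw new RuntimeException(sprintf('Step "%s" failed.', implode(' ', $step)));
    }
}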


language

It's no problem to use PHP for this kind of task, but if you write directly into the database it's also possible to use completely different languages and tools. Since it's usually the shop's dev team that has to implement the import, it might make sense to use the same language the e-commerce software is written in.


libraries

We use the Symfony Console component for the import scripts to add some CLI convenience. Don't forget to add error handling and logging. Since we wanted to know how long these tasks run, we added a simple stopwatch to measure the runtime of our scripts. There is a nice implementation of XML stream processing which we adapted. To make sure there is only one import job running at a time, it's also useful to use lock files. By writing SQL statements we could not use the Shopware model validation, but Shopware offers a free migration and an import/export plugin for this way of writing product data into the database, including validation. Not all of the used libraries worked without adjusting them to our needs or using just the parts we needed.
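
A stripped-down command might look like this. The command name and the actual import logic are placeholders; the point is the combination of Symfony Console, a lock and a stopwatch.

<?php
// Sketch of an import command: Symfony Console for CLI convenience,
// LockableTrait so only one import runs at a time, Stopwatch for the runtime.
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Command\LockableTrait;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Stopwatch\Stopwatch;

class ImportProductsCommand extends Command
{
    use LockableTrait;

    protected static $defaultName = 'import:products';   // placeholder name

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        if (!$this->lock()) {
            $output->writeln('Another import is already running.');
            return Command::SUCCESS;
        }

        $stopwatch = new Stopwatch();
        $stopwatch->start('import');

        try {
            // ... read the XML stream and write the products here ...
        } catch (\Throwable $e) {
            $output->writeln('Import failed: ' . $e->getMessage()); // plus proper logging
            return Command::FAILURE;
        } finally {
            $event = $stopwatch->stop('import');
            $output->writeln(sprintf('Runtime: %d ms', $event->getDuration()));
            $this->release();
        }

        return Command::SUCCESS;
    }
}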


What's next

It's still possible to improve the performance. It might speed up the process if the writer component uses batch processing to write e.g. 100 products at once. Once batch writing is implemented, you should try different batch sizes to find the best setting. Since we write the data directly into the database, it might improve the performance if another language is used. Especially when the different import tasks are separated, parallel processing could speed up the job.
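
A possible shape of such batch writing, again with an assumed table layout; trying different batch sizes is then just a matter of changing one value.

<?php
// Sketch of batched writing (assumed table/columns): collect rows and flush
// them as one multi-row INSERT, e.g. every 100 products.
function flushBatch(PDO $pdo, array $rows): void
{
    if ($rows === []) {
        return;
    }
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO products (number, name, price) VALUES $placeholders");
    $stmt->execute(array_merge(...array_map('array_values', $rows)));
}

$batchSize = 100;   // tune this value; the sweet spot depends on row size and DB
$batch = [];

// $products stands for the rows coming out of the stream reader,
// $pdo for the database connection from the earlier sketch.
foreach ($products as $product) {
    $batch[] = $product;
    if (count($batch) >= $batchSize) {
        flushBatch($pdo, $batch);
        $batch = [];
    }
}
flushBatch($pdo, $batch);   // don't forget the remainder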


tl;dr

Do not read complete files into memory, and use direct DB statements to persist data to speed up your import job. The downside is that you usually have to write a little more code and can't use existing features like XML schema validation or ORM model validation.


