How to Use Data Lakes in Manufacturing

The Industry 4.0 revolution has been going on for over 11 years. We’ve gone from a time when manufacturing companies barely recognized the usefulness of Process Historians to now, when Industry 4.0 talking heads all over the world have delivered their hot takes on the movement.

An Industry 4.0 concept that gets a lot of attention is data lakes. Discussions of Industry 4.0, digitalization, digital transformation, IT/OT convergence, etc. speak of data lakes as if they are the greatest thing to happen to manufacturing since the advent of the PLC.

At Corso Systems, we believe there can be value in data lakes for manufacturing operations. But we also want to demystify the term, explain when a data lake is the perfect tool for the job, and when you might be better off with a different solution.

What is a Data Lake?

Example of a data lake architecture with databases, reports, dashboards, data transformation, data science, and machine learning as integrations above the overall data lake.

Basically, a data lake is a very large-scale data store that does not impose a schema on the data it stores. A data lake is conceptually similar to a database, but it’s more like Dropbox, Google Drive, or SharePoint: you can store files, raw data, time series data, work orders, maintenance data, customer orders, ERP information (and just about anything else you could ever want or need for your business) in a single place.

Unlike a typical SQL database, a data lake has no requirement to create a schema beforehand. So, if you suddenly need to add some fields to a form, or create an entirely new system, you can capture that data in a data lake without configuring anything new. Simply connect the pipe to the data lake and data can start flowing.
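As a rough sketch of what “no schema up front” means in practice, here a local directory stands in for cloud object storage, and the source names and field names are all hypothetical:

```python
import json
from pathlib import Path

# A local directory stands in for cloud object storage (S3, ADLS, etc.).
lake = Path("lake/raw")
lake.mkdir(parents=True, exist_ok=True)

def ingest(source: str, record: dict) -> None:
    """Append a record as a JSON line -- no schema is enforced on write."""
    with open(lake / f"{source}.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

# Records with completely different shapes land in the same store.
ingest("historian", {"tag": "TT-101", "ts": "2024-01-01T00:00:00Z", "value": 72.4})
ingest("maintenance", {"work_order": 5512, "asset": "Pump-3", "notes": "replaced seal"})

# Need a new field later? Just send it -- there is nothing to reconfigure.
ingest("historian", {"tag": "TT-101", "ts": "2024-01-01T00:01:00Z",
                     "value": 72.6, "quality": "GOOD"})
```

Notice there is no `CREATE TABLE` or `ALTER TABLE` step anywhere; the flexibility (and, as we discuss below, the risk) comes from deferring all structure decisions until read time.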

You may have heard terms like “data silos”, “rigid architectures”, and “data security” cited as reasons to avoid typical databases. Some claim these are all “problems” solved by putting all of the data into one place. Many data lakes are driven by Software as a Service (SaaS) platforms in the cloud, or can be hosted in a cloud-based environment in a data center; AWS, Azure, Snowflake, and a handful of startups operate in this space. With SaaS, you’re paying for someone to host and manage the data, provide access to your analytics pipelines and tools, and make the whole thing work. But as the amount of data grows, your costs increase, just as with any other technology.

Who Should Use a Data Lake and Why?

To be completely transparent: if you are operating your business with fewer than ten databases, don’t have a dedicated development team focused on data analysis tools, have no MES functionality (OEE, Track and Trace, SPC, etc.), or your entire operation is contained within one manufacturing facility without the word “Giga-” in front of its name, you likely do not need a data lake. If you fall into any of these categories, a data lake will require additional effort to get data out of your systems, add costs to store the data, and will not provide much value in the long run compared to using SQL databases alone.

We would also venture to say that if more than 60% of the data you rely on for decision making, dashboards, etc. is time series data (process historian values, downtime events, setpoint changes, SPC readings), you will not get the most possible value from a data lake implementation.

How Data Lakes Can Provide Value

Outside of those general constraints, data lakes can provide value to many companies. For example, in automotive manufacturing when it’s necessary to track hundreds of thousands of data points with many different formats on every single car coming through the line, data lakes can be great. You can store time series data about the process equipment right alongside real-time torque measurements from on-line sensors, QA/QC issues on a per VIN basis, and even tie the data from maintenance operations after the car is sold to a customer back into the process data for a full lifetime view of each vehicle.

Similarly, food and beverage manufacturers can use data lakes to track process data for all batches, tie it into customer orders, shipping/receiving data, and even pull in customer sales and recall information for a quick resolution to any potential regulatory issues.

Companies with many facilities (ESPECIALLY enterprises that grow by acquiring existing facilities with different automation platforms), operational requirements, and data needs can use a data lake to integrate the entire organization around a single source of truth.

You will need a development staff or partners capable of handling a data lake integration. Because the lake does not impose a schema on the data the way a SQL database does, someone will need to translate what is in the data lake into the structures your systems need to consume that data.
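A minimal sketch of that translation layer, often called “schema on read”: the field names from the two hypothetical source systems below are invented for illustration, but the pattern is what your team would be building and maintaining.

```python
import json

# Hypothetical raw lines from two plants whose systems name fields differently.
raw_lines = [
    '{"tag": "TT-101", "ts": "2024-01-01T00:00:00Z", "value": 72.4}',
    '{"point_name": "TT-101", "timestamp": "2024-01-01T00:01:00Z", "val": 72.6}',
]

def to_canonical(record: dict) -> dict:
    """Schema on read: map whichever field names arrived into one shape."""
    return {
        "tag": record.get("tag", record.get("point_name")),
        "timestamp": record.get("ts", record.get("timestamp")),
        "value": record.get("value", record.get("val")),
    }

# Downstream systems only ever see the canonical shape.
canonical = [to_canonical(json.loads(line)) for line in raw_lines]
```

Every new source system added to the lake means another mapping like this, which is why a capable development team is a prerequisite rather than a nice-to-have.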

What Are the Benefits of a Data Lake?

No Data Silos

The first thing you will see in any marketing presentation about data lakes is that all SQL and filesystem-based data management solutions are classified as data silos, which limit your ability to understand the interactions between these various systems because the data is stored in separate places. The pitch is that storing all of the data for your operation in one place eliminates data silos entirely.

Data lakes can also address a potential hurdle with integrating plant floor data from PLCs and SCADA systems with IIoT systems, where data might be coming from many different devices or sources. While this is a moot point if you are using a platform like Ignition with MQTT-enabled devices and the Cirrus Link MQTT modules, it is a valid concern when using legacy SCADA platforms like FactoryTalk View or Wonderware, where MQTT connectivity is not built into the platform.

Security

When storing the data in one place, you only have one system to secure. Contrast this with different databases, each with slightly different security needs and implementations, various APIs and integrations with third party systems, and all of the devices on the plant floor. Securing most manufacturing facilities can be a headache. This is supposed to be made easier by consolidating everything into a single place with a data lake.

Scale

Marketing informs us that data lakes can operate at extremely large scales with minimal oversight. You do not need to enforce any sort of data structure, so you can throw literally all of your data into the data lake. This holds true for a small system or for an extremely large enterprise spanning the globe. You can store anything and everything in a data lake. So if a year later you want to apply a novel algorithm to your data and include some new metrics you have not analyzed before, you can, as long as you stored the data you might find useful in the future.

What Are the Risks of Using a Data Lake?

Structure Issues

Data silos aren’t bad because data is stored in different databases. Data silos are “bad” because people may not be aware of all of the data available to them when designing a system to analyze that data. Data lakes actually do nothing to combat this issue, and in many cases make it worse. At least with traditional databases, you can see the schema and very quickly understand what data is available and where it is. In a data lake, the data is still in the same silos it was before. However, it is all stored in one place rather than many—and now you have to go wading deep into the lake to figure out what is there and what you can use.

Compared to traditional SQL databases, data lakes become much harder to optimize as they grow because of their lack of structure. In SQL databases, you can use indexes to speed up queries. With a data lake’s complete lack of structure, optimizing queries can be a massive development undertaking. While this task can be accomplished using some metadata analysis tools, industry-standard best practices don’t exist yet, so you will often need to roll some of your own code.
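One common homegrown approach is a metadata index: scan the lake once, record which files contain which tags, and consult that index at query time so most files are never touched. A toy sketch, with hypothetical paths and field names:

```python
import json
from pathlib import Path

# Stand-in lake with one file per production line (names are hypothetical).
lake = Path("lake/by_file")
lake.mkdir(parents=True, exist_ok=True)
(lake / "line1.jsonl").write_text(json.dumps({"tag": "TT-101", "value": 72.4}) + "\n")
(lake / "line2.jsonl").write_text(json.dumps({"tag": "PT-200", "value": 14.7}) + "\n")

def build_index(lake_dir: Path) -> dict:
    """One full pass over the lake: note which files contain each tag."""
    index = {}
    for path in sorted(lake_dir.glob("*.jsonl")):
        for line in path.read_text().splitlines():
            tag = json.loads(line).get("tag")
            if tag:
                index.setdefault(tag, set()).add(path.name)
    return index

index = build_index(lake)
# A query for PT-200 now scans one file instead of the whole lake.
files_to_scan = index["PT-200"]
```

This is exactly the kind of bookkeeping a SQL database does for you automatically; in a data lake, someone on your team owns building it, keeping it current, and deciding when to rebuild it.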

Large Scale Data, Large Scale Costs

The potential scale of data lakes can be astounding. Some companies claim to have dozens of petabytes of data on hand at any given time. Not only is that scale difficult to comprehend, it requires massive resources to store and manage.

As with traditional databases, you will find that as your system grows it can take longer and longer to retrieve data. Since SQL databases have been around for decades, there are tried-and-true ways to optimize queries to reduce performance hits as the system grows. Since data lakes are a newer technology, these methods are developed on a case-by-case basis and can require significant resources and development time to implement and test.

The other issue with scale is that you need dedicated data scientists and analysts to even hope to understand all of the data available to you, as well as more complicated and robust tools to analyze the data in the first place. Unlike trending time series data alone, you need to be able to correlate time series data with many other formats to gain useful insights.
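The kind of correlation those tools perform is often an “as-of” join: match each event to the most recent time series reading before it. A stdlib-only sketch with made-up numbers:

```python
from bisect import bisect_right

# Sorted process readings: (epoch seconds, temperature). Values are made up.
readings = [(0, 71.8), (60, 72.4), (120, 73.1), (180, 72.9)]
times = [t for t, _ in readings]

# QA events arrive in a different format with their own timestamps.
events = [{"ts": 95, "defect": "scratch"}, {"ts": 170, "defect": "dent"}]

def reading_as_of(ts: int):
    """Return the most recent reading at or before ts (an 'as-of' join)."""
    i = bisect_right(times, ts) - 1
    return readings[i][1] if i >= 0 else None

# Each defect now carries the process condition it occurred under.
for event in events:
    event["temp_at_event"] = reading_as_of(event["ts"])
```

Analytics platforms do this kind of alignment at far larger scale, but the logic is the same: without it, time series data and event data sit next to each other in the lake without ever telling you anything together.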

Stagnation: Data Swamps

Data Lakes are commonly referred to by industry veterans as “Data Swamps”. This is because you can quickly find yourself throwing a lot of data into the data lake and not ever seeing any value come back out. Data Lakes can also run into issues where “all of the data, all of the time” means you end up with data quality issues that can go unchecked for weeks, months, or even years. Storing low quality data adds to the issues of scale and structure, not to mention increased technology costs.

Stagnation isn’t inevitable, nor is it an issue only with data lakes. Regardless of the systems and technology you use, data must be validated before it’s stored, and periodically verified throughout its lifecycle.
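A sketch of a validation gate in front of the lake; the field names and sanity range here are hypothetical, and a real gate would also quarantine rejects for review rather than silently drop them:

```python
def validate(record: dict) -> bool:
    """Reject obviously bad records before they ever reach the lake."""
    if not isinstance(record.get("tag"), str):
        return False
    value = record.get("value")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    # Hypothetical sanity range for this sensor's engineering units.
    return -50.0 <= value <= 500.0

good = {"tag": "TT-101", "value": 72.4}
bad = {"tag": "TT-101", "value": 9999.0}   # out of range: quarantine, don't store
accepted = [r for r in (good, bad) if validate(r)]
```

A few lines of checking at ingest time is far cheaper than discovering, years later, that a sensor was reporting garbage for months and every analysis built on it is suspect.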

Data lakes tend to promote a “store everything you can; you might need it later” approach. While you can also do this with a traditional database, the instinct to push as much data as possible into a data lake is very enticing. In most cases this leads to storing a lot of data “we might need in the future,” further cluttering the system until (or unless) you find a use for it.

How to Mitigate the Risks of Using Data Lakes

A Culture That Embraces Change

As with every technology project, building a solid culture around the tool is the most important initiative. As we have seen time and time again with things like OEE implementations, there will be resistance to change, and a culture of embracing change needs to be in place for a project to be successful.

For data lake implementations this means not only a culture of making sure what goes into the data lake is valid data, but also a culture of using the data: building tools to analyze, view, and act on it to make decisions that improve the overall operation.

Simply implementing a data lake to store data and then never using it to improve things is not a good use of anyone’s time or resources, and building the culture up front is very important for these projects to be a success.

Team Communication

A strong part of any culture is communication within and between teams. This is especially true on data lake projects. You need to communicate to all members of the team the motivations behind the project, how people will get data into and out of the system, and the needs and benefits for the people who will use it.

Communication is very important on data lake implementations given the scale of the project and the number of people who will ultimately use the system. This even extends to development efforts. You will want to understand the overall use cases for what you build and find ways to optimize development, so you avoid building the same thing, slightly differently, for many different people in the organization. Using templates can help reduce the sheer number of hours required to implement useful analytics tools.

Solid Documentation

Documentation is a critical component of data lake implementations. How to interact with and manage the data lake itself, what data is going in, how it is structured, and how to get it back out are all key pieces of information to document.

With the number of people typically involved at companies with successful data lake projects, it could easily become a full-time job for multiple people to field all of the questions that come up from developers, users of the system, and even IT. Document, document, document. Keep the documentation up to date as well.

Remember to USE the data!

USE THE DATA! If you are storing the data, you will want to use it and get some value out of the system. If you aren’t using it, your lake is turning into a swamp.

Don't let your data lake turn into a swamp like the duck-weed choked one in this photo!

Data Lake Alternatives

If you aren’t ready to do a full blown data lake implementation, but you want to get some of the value out of advanced analytics tools, you’re in luck!

From an end user perspective, the value of a data lake is access to better analytics tools and the ability to visualize multiple data sources in one place. Using SCADA platforms like Ignition, or analytics-focused packages like Seeq or Flow Software, you can easily pull in multiple databases, correlate and manage data in multiple formats, and display the data in useful ways. Many companies also use tools like Power BI or Tableau to pull data from many sources into dashboards and reports.

From a development perspective, you can stitch together various systems using any number of integration tools to access data and, with tools like Ignition’s scripting engine, get all of the data into the right format for your needs. Depending on the scope of your operation, it will likely be faster to do this development than to implement and manage a data lake project.

What’s Next For Manufacturing Data?

Given that the manufacturing world is usually 5-10 years behind the rest of the world in terms of technology, it remains to be seen what is on the horizon for how we interact with data. To hazard a guess: better analytics tools, and an understanding of how to leverage them to get the most value out of all of the data you are tracking. The major hurdle in our experience isn’t the technology, though. We could do everything Google does with its platforms and data mining operations using most SCADA systems already in place, with data already being collected.

The major hurdle preventing manufacturing companies from leveraging this technology (as with any new technology) is the culture around technology, data, and information in manufacturing. Manufacturing needs more stability than the general technology world so it is slow to adopt the latest and greatest tools on the market.

It isn’t enough to simply track all of the data you possibly can. You also need tools you can use to understand what the data is telling you. To determine the right tools, you need to know what questions to ask of the data to get value out of it. To know what questions to ask requires doing things in new and interesting ways given the technology at your fingertips.

Data lakes in manufacturing are no different than the movement to Process Historians in the late 2000s, or the push into MES and OEE in the mid-2010s. As more companies adopt, use, and abuse these tools, they will get more robust and useful, more companies will use them, and the cycle of improvement will continue until these tools either become industry standard and are adopted by the major technology platforms, or fall by the wayside when the next hot thing comes out.


Considering a Data Lake?

Corso Systems can help you decide on the best solution for your manufacturing enterprise. Schedule an intro call with sales to learn about all the options, and what might be best for your business.
