DataOps: A Unique Moment in Time for Next Generation Data Engineering

Check out the original LinkedIn post here

Judging from the engagement with my latest post – from business leaders, data engineers, data scientists and thought leaders alike – DataOps is clearly picking up steam. There were too many comments for me to respond to them all on LinkedIn, so I decided to post this follow-up.

My overall takeaway? The sense of optimism that people are feeling about next-generation data engineering in the enterprise. It finally feels like large organizations have embraced their “data debt” and are figuring out how to monetize their data. It helps when they realize that great analytics depend on great data.

We’re at a unique point in time when fundamental changes in data management in the enterprise, the low barrier to cloud migration and the volume and value of enterprise data combine into a massive opportunity to create next generation data engineering pathways.


Figure 1: A unique moment in time for Data as an Asset

One of the foundational ideas behind DataOps, and the reason Mike and I started Tamr, is that enterprise technology and the people responsible for data have finally reached a place where companies that are not data-native can start achieving the types of analytical outcomes that have allowed Google, Facebook and Amazon to sprint so far ahead of the competition. These companies achieved their primacy because they could devote vast amounts of technical resources to solving the problems of software and data engineering at scale when it was a seemingly impossible task. DataOps technologies and processes have the potential to bring Google/Amazon-level data engineering to the rest of us in the Global 2000.

Below, I’ve taken a few threads from comments I received and addressed them. I love the discussion — so let’s keep it up!

Starting with outcomes

A common point was expertly summed up: “putting [this] into practice will be a long journey, and will need to bring cultural change.” This is right on the nose. While developing a data-driven engineering infrastructure will take organizations a long time, getting started won’t. Building a foundation for the repeatable delivery of production-ready data starts with a single use case, and is often preceded by shockingly simple but shockingly hard-to-answer questions from business users: “How many customers do we have?” “How much do we pay each of our suppliers?” Answering these questions well is the best way to get started on DataOps. By attacking the question not as a one-and-done project, but as an analytic that should be readily available to many users at any point in time, the principles of DataOps necessarily come into focus. You can’t answer that question for many users over time without integrating data from many different sources, without involving users and without automation. DataOps will be successful when driven by use cases with clear business value — whether it’s generating more revenue through cross-sell and upsell, preparing for GDPR or finding savings in the supply chain. The success of these analytics is addictive; once an organization succeeds with one project, the next most important use case is immediately apparent.
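
To make the contrast concrete, here is a minimal Python sketch of “How many customers do we have?” treated as a repeatable analytic over all sources rather than a one-off query. The source names and the crude match key are illustrative assumptions, not a production design:

```python
# A minimal sketch (not a production design): answering "How many
# customers do we have?" as a repeatable analytic over all sources,
# rather than a one-off query against a single system.
from typing import Dict, Iterable, List


def match_key(record: Dict[str, str]) -> str:
    """Reduce a raw customer record to a crude match key.

    A real pipeline would use machine-assisted entity resolution here;
    lowercased name plus postal code stands in for that step.
    """
    name = record.get("name", "").strip().lower()
    postal = record.get("postal_code", "").strip()
    return f"{name}|{postal}"


def count_customers(sources: Iterable[List[Dict[str, str]]]) -> int:
    """Integrate records from every source, then count distinct customers."""
    keys = set()
    for source in sources:
        for record in source:
            keys.add(match_key(record))
    return len(keys)


# Because the answer is a function over *all* sources, any user can
# re-ask the question at any time and get a current answer.
crm = [{"name": "Acme Corp", "postal_code": "02138"}]
billing = [{"name": "ACME CORP ", "postal_code": "02138"},
           {"name": "Globex", "postal_code": "10001"}]
print(count_customers([crm, billing]))  # -> 2
```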

Isn’t DataOps just a new marketing name for an old trend?

This was a common and valid comment. As someone who has been on both sides of the CIO/vendor table for over three decades, I’ve been in the midst of every new attempt at solving this problem. My partner Mike gives a great talk on the generations of data curation.

DataOps is a discipline that allows organizations to repeatedly create production-ready data from a wide variety of sources for a wide variety of use cases. There are two reasons why DataOps is materially different from the generations of data curation that came before it. The first differentiator is bi-directionality: the key to repeatability is the tight coupling of feedback from end users at the point of consumption into the creation of the data models themselves. The quintessential example, in my mind, is Google Maps. Google keeps its information incredibly up-to-date simply by asking a few users to add a few data points. Integrating this feedback into its architecture is at the core of why Google Maps is so good and constantly getting better. When we’re setting up next-generation data engineering processes, we have to keep this bi-directionality in mind; it’s the only way to scale data curation. The technology is there to allow closely-coupled integration between data production and consumption, and coupling them allows repeatability and scale.
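
For illustration, here is a minimal sketch of what that bi-directional loop might look like in code, in the spirit of the Google Maps example. The class and field names are hypothetical; a real system would add identity, auditing and much more:

```python
# A minimal sketch of bi-directional curation. Class and field names
# are hypothetical, for illustration only.
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feedback:
    attribute: str        # which field the consumer flagged
    suggested_value: str  # what the consumer believes it should be
    user: str             # who said so


@dataclass
class UnifiedEntity:
    entity_id: str
    attributes: Dict[str, str]
    feedback: List[Feedback] = field(default_factory=list)

    def receive_feedback(self, fb: Feedback) -> None:
        # Feedback is captured at the point of consumption,
        # not through a ticket queue back to the source owners.
        self.feedback.append(fb)

    def fold_feedback_in(self, quorum: int = 2) -> None:
        # Once enough independent users agree, the correction flows
        # back into the data model itself.
        votes = Counter((fb.attribute, fb.suggested_value)
                        for fb in self.feedback)
        for (attribute, value), n in votes.items():
            if n >= quorum:
                self.attributes[attribute] = value


customer = UnifiedEntity("cust-42", {"name": "Acme Corp", "city": "Bston"})
customer.receive_feedback(Feedback("city", "Boston", "alice"))
customer.receive_feedback(Feedback("city", "Boston", "bob"))
customer.fold_feedback_in()
print(customer.attributes["city"])  # -> Boston
```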

The second differentiator is automation. The use of technologies like machine learning on very tactical but predictable chores like transformations or schema-mapping allows DataOps-driven data engineering pipelines to do the vast majority of work — upwards of 95% — without human intervention, saving the tough work for humans. Combined with scale-out computing and the easy availability of cloud resources, we are at a place where the rest of us can get a handle on the crazy amounts of siloed enterprise data.
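
As a toy illustration of this division of labor, the sketch below maps source columns onto a target schema, keeping confident matches and routing the tough cases to a human. Simple string similarity stands in for a trained model here, and the column names and threshold are assumptions for illustration:

```python
# A toy stand-in for ML-driven schema mapping: string similarity plays
# the role of a trained matcher. Column names and the confidence
# threshold are illustrative assumptions.
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def best_match(source_col: str, target_cols: List[str]) -> Tuple[str, float]:
    """Score the source column against every target column."""
    scored = [(t, SequenceMatcher(None, source_col.lower(), t.lower()).ratio())
              for t in target_cols]
    return max(scored, key=lambda pair: pair[1])


def map_schema(source_cols: List[str], target_cols: List[str],
               threshold: float = 0.8) -> Tuple[Dict[str, str], List[str]]:
    """Automate the confident matches; queue the rest for humans."""
    mapping: Dict[str, str] = {}
    needs_human: List[str] = []
    for col in source_cols:
        target, score = best_match(col, target_cols)
        if score >= threshold:
            mapping[col] = target      # the predictable chores
        else:
            needs_human.append(col)    # the tough work, saved for people
    return mapping, needs_human


mapping, review = map_schema(
    ["CustomerName", "postal_cd", "xfld_07"],
    ["customer_name", "postal_code", "account_status"],
)
print(mapping)  # {'CustomerName': 'customer_name', 'postal_cd': 'postal_code'}
print(review)   # ['xfld_07'] -- routed to a human expert
```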

Data Quality: Cold War or Overt Conflict within your organization?

At the heart of the problem is an age-old battle — the desire of IT folks to not let systems and data get out of hand, and the desire of business users to have useful, efficient operations with the data they need to get the job done. Business process automation has been spewing out data for decades; until recently, we thought of this data as a type of exhaust. As business users realize how valuable that data is, the desire to treat data as an asset has highlighted just how messy enterprise data systems are. DataOps projects need to acknowledge that business users need some freedom to operate — but that data needs to be handled and treated like an asset. Engineering for these realities, rather than ignoring them, is a key component of DataOps. A great example comes from my friends at the Novartis Institute for Biomedical Research, who realized that while they wanted to glean insights from tens of thousands of experiments, forcing data entry standards on hundreds of scientists was a non-starter. Instead, they had to adjust their data engineering to accommodate how scientists actually produce data, and use machines to prepare the data the way the business wants to use it.

In some companies, this tension is overt; in others, it’s a cold war. In both cases, technological changes should be deployed to facilitate cultural changes. In every DataOps project I’ve been a part of, the people side of things is much harder than the technology side. Giving business users a taste of the possible analytical outcomes from having unified, trusted data gives them a foundation of trust to give up data hoarding and political games. It’s tough, but integrating users into the process as experts, and seeing the results of their feedback improve the data they have access to, is a powerful tool for detente.

Isn’t another system for copying data just adding to the problem?

Data isn’t code; as one commenter mentioned, the instant you copy data it becomes corrupted. This is why Master Data Management, for example, failed to reach its promise. Again, DataOps is about acknowledging reality. Data is going to multiply, duplicate and live where it’s not supposed to. A common system of reference — a unified data hub — tackles this problem by allowing the data models to be flexible enough to accommodate duplication, and by resolving uncertainties through incorporating experts in data curation. Only processes that acknowledge and account for how data is produced and used have any hope of success. The alternatives have all been tried: imposing standards, mastering data, system consolidations, etc. They are necessary but insufficient, because as soon as they become cumbersome to business users, those users ignore them, circumvent them or procure their own, new system.
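
Here is a minimal sketch of that idea: a system of reference that links source records rather than overwriting them, and flags uncertain clusters for expert curation. All class and field names are hypothetical:

```python
# A minimal sketch of a common system of reference: source records are
# linked, never overwritten, and uncertain clusters are flagged for
# expert curation. All names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SourceRecord:
    source: str                 # the system the record came from
    record_id: str              # its key in that system
    attributes: Dict[str, str]


@dataclass
class ReferenceEntry:
    cluster_id: str
    members: List[SourceRecord] = field(default_factory=list)
    curator_verdict: Optional[bool] = None  # None until an expert weighs in


class SystemOfReference:
    """Links copies of the same real-world entity across systems
    without consolidating or deleting the originals."""

    def __init__(self) -> None:
        self.entries: Dict[str, ReferenceEntry] = {}

    def link(self, cluster_id: str, record: SourceRecord) -> None:
        entry = self.entries.setdefault(cluster_id,
                                        ReferenceEntry(cluster_id))
        entry.members.append(record)

    def uncertain(self) -> List[ReferenceEntry]:
        # Cross-system clusters without expert sign-off are the ones
        # worth a curator's time.
        return [e for e in self.entries.values()
                if len({m.source for m in e.members}) > 1
                and e.curator_verdict is None]


hub = SystemOfReference()
hub.link("supplier-7", SourceRecord("ERP", "S-100", {"name": "Initech GmbH"}))
hub.link("supplier-7", SourceRecord("Procurement", "8842", {"name": "Initech"}))
print(len(hub.uncertain()))  # -> 1: duplication acknowledged, not erased
```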

DataOps done poorly will just result in another competing golden record. This happens when tools and process take primacy over results. When DataOps projects are focused on driving transformational and repeatable analytical outcomes, users will clamor to gain access for their own use cases. I’ve seen this virtuous cycle over and over again.


DataOps: Building A Next Generation Data Engineering Organization

See the original post here

Large enterprises are experiencing a foundational shift in how they view their data and structure their data engineering teams. As organizations capture more data than ever before, and store it in an ever-increasing variety of data stores, the prospect of competing on analytics becomes more tantalizing than ever. The challenge of using data at scale, however, is rooted in the “data debt” accumulated over time by enterprises struggling to manage the extreme volume and variety of their data. Paying down this data debt is the proverbial long pole in the tent for competing on analytics.

Ask most employees and they’ll likely tell you their company’s data is neatly organized and easily accessible. As enterprise data professionals know, the reality is that a typical data environment resembles a “random data salad.” For decades, companies have been idiosyncratically deploying systems for business process automation, with the data generated from these deployments treated mainly as “exhaust” to the business processes. The resulting data environment is deeply fragmented and virtually impossible to integrate at scale — crushing the hopes of companies that want to develop an analytical advantage.

Indeed, even basic questions about the business, like “Who are my customers?” can’t be answered consistently and completely. The realities of doing business, like mergers and acquisitions, further complicate data management. Companies need to start managing their data as an asset, much like they would their own money, if they hope to compete on analytics. It’s time to rethink priorities and start putting the “data horse in front of the analytics cart.”

Legacy Approaches to Data Engineering Are Failing

Legacy approaches to data management, such as ETL and MDM, are acceptable tools for small-scale or static environments; however, their rules-based approach can’t handle a large-scale, highly dynamic data environment (see this whitepaper by my co-founder Mike Stonebraker). Each question asked by the business would result in a long, complicated process of acquiring and manually preparing the data for consumption — providing little repeatability or scalability. Standardization is another approach that isn’t practical for large enterprises. Forcing top-down naming conventions for customers, suppliers, or other entities is hard to monitor and harder still to enforce. These complicated data management challenges are the reason so many potentially game-changing analytic projects fail.

Data Operations (DataOps) As An Engineering Methodology 

We envision a world where enterprise data customers have ready access to high-quality, cross-silo, unified enterprise data for all of their core logical entities. Data Operations (DataOps) is a methodology consisting of people, processes, tools, and services for enterprises to rapidly, repeatedly, and reliably deliver production-ready data from the vast array of enterprise data sources.

DataOps was inspired by the concept of DevOps, a term pioneered by internet startups to describe one of the main ways they were trying to beat their much better-resourced established competitors. The concept is that the best way to increase the velocity of new features being delivered in software is a continuous build, test, and release process with a strong emphasis on QA automation. If you’re Google working on Maps, a sustained faster build, test, and release cycle creates a competitive advantage over Yahoo!.

Similarly, enterprises expecting to compete on the basis of analytics need to empower analysts with easy access to up-to-date, unified data organized logically for them. Only by implementing an integrated, repeatable, scalable data curation process is it possible for a business to achieve the analytic velocity necessary to create a competitive advantage.

DataOps Capabilities & Principles

DataOps methodologies promote rapid, logical organization of all enterprise data, consumable downstream by a wide variety of methods. Its capabilities and principles are built for data engineering organizations that need to manage an architecture of constant change in data sources and uses.


(Figure 1. Open “DataOps” Ecosystem: Catalogs and Hubs)

Capabilities

DataOps teams need to consider the following capabilities (See Figure 1.):

  • Raw Source Catalog — Catalog all critical data sources, including raw physical attributes and records from the operational sources across an enterprise. Before you serve up insight, you need to know what you have.
  • Movement / Logging / Provenance — These capabilities are required to move the data efficiently from Point A to Point B. Today’s ETL platforms, having been refined and hardened over the past two decades, offer highly optimized movement and manipulation of data. Core features, such as end-to-end data lineage and dependency analysis as well as data quality and operational metadata capture, make your raw data sources and feeds highly accessible and reliable.
  • Logical Models — Well-designed logical models deliver on three fronts:
  1. Flexible Format: Fundamentally, a single logical entity can be completely described by a collection of simple key-value pairs. A flexible format must hold up to both RDBMS designs and document-oriented search engines (see the sketch after this list).
  2. Abstract: The model must successfully abstract the many varied physical data sources and feeds. Consumers of data care about analyzing entities like customers, suppliers, products or purchases, not a collection of source attributes.
  3. Semantically Meaningful: A logical model speaks the language of the consumer, allowing them to understand the underlying physical data, consume it, and provide feedback.
  • Unified Data Hub — An analytical system of record for key enterprise entities that is accessed and curated by any employee in an organization. Within the data hub, you gain a holistic view of your data landscape. This includes the entities that have been organized, the sources contributing to each entity, the applications utilizing the unified data, and the individuals who are curating and consuming the data.
  • Feedback — Provide users with a feedback mechanism to identify where problems in the data exist. This improves the upstream systems dynamically and continuously, and eliminates the need for consumers to track down the people who physically manage the data in order to give feedback or file requests. Critically, feedback is enabled in the unified data hub in the language of the logical model.
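
To ground the “Logical Models” and “Feedback” points above, here is a minimal sketch of a logical entity represented as key-value pairs with provenance. The source systems and attribute names are illustrative assumptions:

```python
# A minimal sketch of a logical model: physical source attributes are
# mapped into key-value pairs that speak the consumer's language, with
# provenance preserved. All names are illustrative assumptions.
from typing import Dict, List, Tuple

# Physical rows, keyed by whatever each source system calls things.
erp_row = {"VENDOR_NM": "Initech GmbH", "VENDOR_CTRY": "DE"}
crm_row = {"acct_name": "Initech", "country_iso": "DE"}

# Per-source mapping from physical attribute to logical attribute.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "ERP": {"VENDOR_NM": "supplier_name", "VENDOR_CTRY": "country"},
    "CRM": {"acct_name": "supplier_name", "country_iso": "country"},
}


def to_logical(source: str, row: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Emit (logical_key, value, provenance) triples -- simple key-value
    pairs that fit a relational row or a search document equally well."""
    return [(MAPPINGS[source][k], v, source)
            for k, v in row.items() if k in MAPPINGS[source]]


supplier = to_logical("ERP", erp_row) + to_logical("CRM", crm_row)
for key, value, provenance in supplier:
    print(f"{key} = {value!r}  (from {provenance})")
# A consumer sees 'supplier_name', not VENDOR_NM -- and gives feedback
# in those same logical terms.
```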

These capabilities deliver clean, unified views of data organized in the way consumers want it — while respecting the extreme variety in both sources and uses. Users can consume much higher-quality data faster, while eliminating the need to prepare the data idiosyncratically themselves.

Principles

The DataOps principles underpinning these capabilities are equally important to consider. These principles don’t just apply to tools and technology, but also to people, process and services. DataOps is ultimately an ecosystem, and changing organizational behavior is just as important as altering technological choices.

  • Interoperable: Open, Best-of-Breed, FOSS & Proprietary, Table In / Table Out — An ideal data engineering architecture should include technologies that are best-of-breed and open. Allowing mega-vendors “guided tours of your wallet” needs to be a thing of the past. Delivering clean, complete data to consumers when they want it and how they want it requires piecing together multiple technologies from different vendors, whether they be large tech companies or startups. Moreover, there can’t be an aversion to open source if it can best solve the problem. Technologies and processes need to interoperate and follow the basic premise that they should be able to easily accept a table in and produce a table out (see the sketch after this list).
  • Social: Bi-Directional, Collaborative, Extreme Distributed Curation — Consumers want to use a wide variety of tools to interact with their data. They want to visualize information in their favorite analytic tool or wiki page and provide feedback directly, so physical interaction or request tickets aren’t necessary. The communication paradigm needs to shift: information flow shouldn’t always be from source to consumption. Rather, it needs to be bi-directional. It’s equally important that source owners receive feedback from users as it is that users receive data from source owners. Moreover, like the modern internet experience, this collaboration needs to be enterprise-wide. Users are continually being conditioned to take responsibility for data curation in their personal lives, and data engineering teams should leverage this.
  • Modern: Hybrid, Service-Oriented, Scale-Out Architecture — Flexibility is critical for next-generation data engineering teams. When creating a repeatable architecture to service data consumers, ensure the backend has the capabilities to scale out as projects broaden in scope. Also, using a microservices-based architecture when developing key capabilities is critical, as changes often need to occur in some specific functionality and re-architecting the entire system isn’t desirable. Finally, make use of the cloud for scalability when needed but consider that a hybrid approach of cloud and on-premise capabilities is likely the way to go.
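
As referenced in the “Interoperable” principle above, here is a minimal sketch of the table in / table out premise, with pandas DataFrames standing in for tables and hypothetical stage functions standing in for best-of-breed tools:

```python
# A minimal sketch of the "table in / table out" premise, using pandas.
# The stage functions are hypothetical stand-ins for best-of-breed
# tools, FOSS or proprietary.
import pandas as pd


def standardize_names(table: pd.DataFrame) -> pd.DataFrame:
    """One stage: normalize the 'name' column."""
    out = table.copy()
    out["name"] = out["name"].str.strip().str.title()
    return out


def drop_exact_duplicates(table: pd.DataFrame) -> pd.DataFrame:
    """Another stage: remove rows that are now identical."""
    return table.drop_duplicates()


def pipeline(table: pd.DataFrame, *stages) -> pd.DataFrame:
    # Any tool that accepts a table and produces a table can slot in.
    for stage in stages:
        table = stage(table)
    return table


raw = pd.DataFrame({"name": ["  acme corp", "Acme Corp", "globex"]})
clean = pipeline(raw, standardize_names, drop_exact_duplicates)
print(clean)  # two rows remain: 'Acme Corp' and 'Globex'
```

Because every stage has the same shape, swapping one vendor’s tool for another’s is a local change, not a re-architecture.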

DataOps Wave Will Transform Data Engineering Environments

Applying a DataOps methodology to your data engineering organization will allow you to achieve transformational analytic outcomes and “capture low-hanging analytical fruit” that will save your company money and drive additional revenue. Quickly delivering clean data from the vast array of enterprise data sources to the vast array of consumption use cases in a repeatable manner will build your “company IQ.” DataOps is the transformational change data engineering teams have been waiting for to fulfill their aspirations of enabling their business to gain analytic advantage through the use of clean, complete, current data.


Think3 – $1B Fund for SaaS Founders who want to move onto next project

What happens when the SaaS company you started has grown beyond the startup phase and you’re ready to go do your next startup? If your company has been successful – but is not going to end up being a unicorn and doesn’t have an obvious strategic buyer – you can sometimes stall for years. A ton of my fellow founders end up in this state. As posted last week on TechCrunch here, the team at Think3 has $1B and – perhaps more importantly – they are actual software operating people who will provide you with a fair exit that enables you to take on your next project, knowing that what you built will continue in good hands. If you are interested, ping Andy Tryba @ Think3 – andy.tryba@think3.com.


Twine Health acquired by Fitbit

Twine Health has been acquired by Fitbit. As a user, investor & Founding BOD Member @ Twine Health, I’m thrilled about the potential of this deal for patients.

The mission at Twine is powerful, and the impact of Twine @ scale could be transformational. I was one of the first users of Twine and – as posted here – Twine enabled me to reduce my A1C by 7.1 in less than 4 months. The Twine Health team should be able to bring dramatic health benefits to the Fitbit user community. John Moore, Scott Gilroy & Frank Moss have been great partners over the past 5 years – it’s been an honor to be part of such a great team.

Following the lead of leaders like Eric Topol, MD and Dave Chase, I believe that the time is right for a significant change in healthcare. We can dramatically improve outcomes and reduce costs if we embrace the patient as consumer and information liquidity as the key lever.


LZR in Austin is AWESOME!

Had a chance to swing by LZR – the rebranded event space that was formerly the music venue La Zona Rosa – reference article here.

John Price has done an incredible job of building the type of space and community that creates great startups. Not only is the space cool, comfortable and flexible – but the diversity of the people & interests is inspirational. I think I know where I’ll be hanging out during the day now that Amy and I have a pied-à-terre in downtown Austin just a few blocks from LZR.


Kinsa – 10M+ real temperature readings to deliver world’s best real-time flu tracking

Lots of people (including myself) talk about the potential impact of information technology in healthcare. As published in the New York Times, Inder Singh and the Kinsa team are delivering real health benefits to real people every day – helping fight the flu.
More than 4 years ago I met Inder, and I was truly inspired by his mission. He was coming off of a tour of duty with the Clinton Health Access Initiative and was driven to help improve human health by creating a global map of human temperature. He believed that a commercial, consumer-driven approach was the right path, and I couldn’t help but lean in to support Inder and his mission as a seed investor in Kinsa.

Inder and me at the Apple Store in the Meatpacking District, NYC, on Nov 19, 2014 – the first day that Kinsa was available in the Apple Store.

I’m blown away by the progress @ Kinsa over the past four years as they’ve created the best and least expensive smartphone-connected thermometer, which has generated 10M+ temperature readings. This quantity of bottom-up temperature readings has enabled them to build the world’s most advanced model for tracking flu trends. Kinsa’s model is being hailed as the next-generation capability for tracking flu trends – radically enhancing tools such as the CDC’s models, which are driven by data collected from hospitals, and Google Flu Trends, which analyzes web search data.

The benefit of the Kinsa approach is being supercharged by the Kinsa FLUency schools program, which gives away Kinsa thermometers to schools across the country – 400+ schools in the past 12 months. Doing good while building a great company – truly awesome.

The press is beginning to cover Kinsa.

I’m truly honored to be a small part of the Kinsa story and to consider Inder a friend.


Koa Labs : The Next Chapter

Over the past 5+ years I’ve been working hard to help develop founder culture in the Cambridge/Boston ecosystem. Thanks to my wife Amy for supporting my seemingly never-ending startup addiction – she’s an amazing partner and soul mate.

As I outlined in my first Koa post back in the spring of 2012, we have amazing startup resources in Cambridge/Boston, and I consider it an honor to work with so many smart, talented entrepreneurs, engineers and scientists. At Koa we have expressed our enthusiasm for founding great companies via three methods:

  • a seed fund primarily focused on backing first-time entrepreneurs with technical backgrounds
  • “fierce networking” to connect the best people with the best opportunities across our ecosystem
  • a “startup club” in Harvard Square – a physical place for entrepreneurs to hang out, work on their projects and support each other in both success and failure. This started as a place to hang my hat and, as it turns out, there were a ton of people who wanted co-working space in Harvard Square to start their companies

The Koa Labs seed fund is alive and well – our portfolio is here. Thus far we’ve invested $8M+ directly into startups and institutional seed funds, with very healthy returns (despite the expectation that it would be worth $0 😉). I’m not making new direct investments at the moment, but I am sure that I’ll go back to writing new checks at some point. As posted previously, I am working hard to focus 100% on Tamr, and I am channeling my startup investing through Founder Collective and other great new seed funds in the Boston/Cambridge ecosystem such as The Engine, Pillar and others.

I continue to do a ton of networking with all the fantastic people in our community – I get mojo from seeing great people work on big opportunities in our ecosystem. I continue to believe that if we prosecuted intellectual content as aggressively as those in the Bay Area, we’d have 4X+ the number of startups in Boston/Cambridge.

I am continuously amazed at the artificial gap between the great academic talent/content in our ecosystem and the great commercial talent.  Most of these people are merely 1-2 T-stops away from each other – bridging these gaps is a lifelong mission for me.

As a result of my focus on Tamr and the emergence of other great co-working spaces on the Red Line, I’ve decided to get out of the “landlord” business in Harvard Square. I’m hopeful that Tim Rowe and/or others will open additional co-working space in Harvard Square to complement those in Kendall, Central and Davis – there is tremendous demand in Harvard Square based on my experience over the past 5 years. The Harvard iLab is also doing a great job and will be even more active once the new engineering school is finished across the river (btw, I’m a huge fan of Frank Doyle, the new Dean of the Harvard School of Engineering and Applied Sciences – his work on the artificial pancreas is inspirational, and he’s a great guy and a welcome addition to our ecosystem in Boston).

We’ve had the privilege to host some great startups and founders in our space @ 1430 Mass Ave over the past 5 years. I thank all of them for the opportunity to work together. Some of the great startups who have hung their hats @ Koa include: Recorded Future, Madaket, Tamr, Resurety, Firecracker, Twine Health, Matter.io and many others.

I’d also like to thank all of the individual people who have helped me @ Koa over the past 5 years – especially Rich Miner, Christopher Ahlberg, Katie Rae, Frank Moss, Remy Evard, Jim Dougherty, Zen Chu, Ellen Rubin, Scott Kirsner, Jody Rose, Kelsey Cole, Steve Bell, Joe McPherson, Hugo VanVuuren, Janice Brown, Tony Purpura, Kim Murphy, Sean Treacy & the team at Grafton Street Studio and of course Paul Melone. I’d also like to thank the many interns that we’ve had at Koa, including Sam Roberts, Katie Kaufman, Ian Lee, Sean Clemens, Bryan Holtzman and Qin Li. I’m psyched to continue to work with many of you at Tamr and on other projects.

Over the past decade, my favorite change in the startup ecosystem in Boston/Cambridge has been the migration from Route 128 into the city/red line.  This change has also come with a refocus on great Founders as the core of our startup ecosystem – the team at Founder Collective has done a great job curating founder role models by sponsoring the “Founder Dialog” series.

At the end of the day – incubators, service providers and funds are all necessary – but great founders are first and foremost the core of our ecosystem in Boston. Over the past few years there is no greater example of the power of great founders in Boston than Langley Steinert @ CarGurus. I was following Langley’s lead when I moved to Harvard Square 5+ years ago to start Koa – CarGurus’ original offices are 3 blocks from Tamr. If Tamr can be a fraction as successful as CarGurus, I’ll be thrilled – although I’m not planning to move out of Harvard Square anytime soon 😉


ps – special thanks to Aaron Kelsey Cole for all his effort – no one appears to work as hard as you, big guy 😉
