Use Deep Learning to Model Database Performance

Concept: diagnose, optimize, and anticipate performance in analytic platforms such as databases through deep learning models.

The complexity of database and NoSQL systems can exceed even talented engineers’ abilities to mentally model and understand cause and effect across entire platforms. Responding to performance incidents devolves into trial and error, while predicting the impact of changes relies on intuitive estimates.

Consider a prestigious new client starting to use a cross-channel digital marketing platform. Their high-volume campaigns analyze hundreds of millions of data points to segment, target, and reach consumers across display ads, email, and app notifications. This imposes new pressures on the databases supporting the system. Late on a Friday before a holiday weekend, a critical query begins to perform slowly. A DBA analyzes the situation, creates new indices, and goes home. That should solve the problem. Instead, performance slows down for many clients across the whole platform. This is counter-intuitive; indices require some compute power to generate, but should not negatively affect the whole platform. After two days of analysis, a senior engineer discovers that implementing the new indices caused the database to generate suboptimal new query plans for many queries unrelated to the initial issue.

A solution would show how changes to an analytic platform affect performance. If a solution can show cause and effect as changes are made, then it can also predict performance issues by extrapolating trends or performing sensitivity analyses to understand which patterns constrain performance.
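As a rough illustration of the sensitivity-analysis idea, here is a minimal sketch in Python. It assumes a hypothetical `predict_latency` function standing in for whatever performance model the product would train; the feature names are invented purely for illustration.

```python
# Hypothetical sketch: one-at-a-time sensitivity analysis against a trained
# performance model. `predict_latency` and the feature names are illustrative
# placeholders, not part of any existing product.
from typing import Callable, Dict

def sensitivity(predict_latency: Callable[[Dict[str, float]], float],
                baseline: Dict[str, float],
                bump: float = 0.10) -> Dict[str, float]:
    """Perturb each workload feature by +10% and report the change in
    predicted query latency, highlighting which factors constrain performance."""
    base = predict_latency(baseline)
    impact = {}
    for name, value in baseline.items():
        perturbed = dict(baseline, **{name: value * (1 + bump)})
        impact[name] = predict_latency(perturbed) - base
    return impact

# Example: rank factors by their effect on a critical query's latency.
workload = {"rows_scanned": 2.4e8, "index_count": 12,
            "concurrent_queries": 35, "buffer_cache_gb": 64}
# ranked = sorted(sensitivity(model.predict, workload).items(),
#                 key=lambda kv: -abs(kv[1]))
```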

Deep learning is a subset of machine learning that creates high-level models and representations of complex structures. The techniques are often associated with image, text, and video processing. Deep learning techniques can symbolically represent the data models, queries, indices, constraints, views, and other elements that form a database (as well as modeling storage and CPU performance, and other workloads like replication or mirroring for disaster recovery). Deep learning techniques typically involve training some form of neural network: data is fed through the network, layers of cells each perform operations on the data and pass the results to adjacent layers, and the connections between cells are adjusted until the output begins to match test data.
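To make this concrete, here is a minimal sketch of the training step, assuming workload telemetry has already been encoded as numeric feature vectors paired with observed latencies. The feature encoding is the hard part and is glossed over here; the random tensors below are stand-ins, not real data.

```python
# Minimal sketch: train a small neural network to predict query latency from
# encoded workload features. The random tensors are placeholders for real
# telemetry; in practice the encoding of queries, indices, and workloads is
# the central design problem.
import torch
import torch.nn as nn

features = torch.randn(10_000, 32)          # stand-in for encoded workload features
latencies = torch.rand(10_000, 1) * 500.0   # stand-in for observed latency (ms)

model = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),                       # predicted latency
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(features), latencies)
    loss.backward()
    optimizer.step()
```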

NoSQL systems and databases generate logs that can be used to continuously train the neural network. Higher-level abstractions of cause and effect for a given database, such as SQL Server or MySQL, could also be analyzed through a cloud service across many instances and users (which would involve few privacy concerns, since the system would only be analyzing performance metadata).
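A sketch of the log-ingestion step follows. The regular expression matches the header lines of MySQL’s slow query log; formats differ across engines and versions, so treat the parsing and the feature choices as illustrative only.

```python
# Sketch of turning slow-query-log entries into training examples. The regex
# matches MySQL slow log header lines ("# Query_time: ..."); other engines
# and versions use different formats, so this is illustrative only.
import re

PATTERN = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)

def training_examples(log_path: str):
    """Yield (features, target) pairs: rows examined/sent and lock time as
    inputs, observed query time as the latency the model learns to predict."""
    with open(log_path) as log:
        for line in log:
            m = PATTERN.match(line)
            if m:
                yield (
                    [float(m["rows_examined"]), float(m["rows_sent"]),
                     float(m["lock_time"])],
                    float(m["query_time"]),
                )
```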

Part of the process of generating product ideas should be to look at the state of the market, and study competitors. I have not found any credible solutions. Many products show what is happening within a database, but none show why and how a change has or will affect performance.

A product should perhaps first target Oracle Exadata because it is used for larger databases, and the nature of big, monolithic devices makes testing and staging environments expensive. Sure, Exadata devices can be carved into multiple logical databases, and one of these could be used for testing, but in my experience this can introduce other issues (even when using “caging” to limit the resources testing consumes), contend for I/O when accessing the data, and defeat the point of using a comparatively inexpensive model to predict performance.

The product’s value to a customer should roughly follow data volume multiplied by query complexity multiplied by data-model complexity. This is hard to translate into licensing terms; pricing could simply follow data size.
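A toy comparison of the two pricing bases, with every number invented purely for illustration:

```python
# Toy comparison of the two pricing bases discussed above; all rates and
# inputs are invented for illustration only.
def value_based_price(data_tb: float, query_complexity: float,
                      model_complexity: float, rate: float = 0.50) -> float:
    """Value proxy: data volume x query complexity x data-model complexity."""
    return rate * data_tb * query_complexity * model_complexity

def size_based_price(data_tb: float, rate_per_tb: float = 40.0) -> float:
    """Simpler licensing term: price follows data size alone."""
    return rate_per_tb * data_tb

print(value_based_price(data_tb=80, query_complexity=3.5, model_complexity=2.0))
print(size_based_price(data_tb=80))
```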

What are your thoughts? Would you buy this product?

Jeremy Lehman is a principal with Third Derivative advising companies on product management and technical execution of analytic software. Previously, he led global product development and operations for Experian Marketing Services; led sales technology and CRM, in addition to investment banking and wealth management IT, at Barclays; and served as CTO for equities at Citigroup.


Data Inventory and Predictive Power

Concept: manage analytic applications’ cost and risk by visually showing the types of data, growth rates, predictive power, and time horizons for statistical significance within a given data store.

Inexpensive storage and increasing analytic power prompt many architects to conclude that all of a company’s data should be retained in lakes or reservoirs. However, the data for a specific application may be limited by server and storage capital expenditures, software licensing terms, latency requirements, or operational risk. This is particularly true for information and analytic products on multi-tenant infrastructure.

Providers of analytic platforms can see surprising and unpredictable growth rates in clients’ data. Users’ behavior patterns change over time, particularly after new features are deployed. Clients can make fast and surprising leaps in data science sophistication. High-quality data generated for non-revenue purposes like compliance can be discovered and repurposed for sales and marketing optimization. The effects of large spikes in data include increased server and storage costs, longer-running queries with potentially less effective results, and increased risk as architectures involve more moving parts. Frequently, contract terms for software products also do not match how customers’ usage evolves.

Consider digital marketing campaigns: large marketers can generate hundreds of millions of digital consumer actions each day, such as viewing or clicking on an ad. Each event can be associated with thousands of related data points such as details on the marketing campaigns targeting a given consumer, their attributes and behaviors, and, increasingly, third party data like weather or fuel prices. Many marketers have somewhat fragmented organization structures where data for web marketing, email marketing, point of sale, warranty, and service may not be managed together. My friend Mark Frisch, a data scientist, invented a great term while working with a large consumer packaged goods company: their intended data lake is more of a swamp.

Data from consumers’ abandoned ecommerce shopping carts provides an example. Abandoned items, and their context, are a strong predictor of the categories of a consumer’s interests for several weeks, and can have a “thick tail” of secondary predictive power for several months. However, after roughly six months the data is just not that valuable for analyzing a given consumer.
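A minimal sketch of how that decay could be measured, assuming a hypothetical extract with an abandoned-cart score, each event’s age, and a later purchase outcome as the target (the file and column names are invented):

```python
# Sketch of measuring how predictive power decays with data age. The dataset,
# columns, and score are hypothetical; the point is that AUC falling toward
# 0.5 in older buckets flags data that is safe to age out.
import pandas as pd
from sklearn.metrics import roc_auc_score

events = pd.read_parquet("cart_events.parquet")   # hypothetical extract
events["age_bucket"] = pd.cut(events["age_days"],
                              bins=[0, 30, 90, 180, 365],
                              labels=["<30d", "30-90d", "90-180d", "180-365d"])

decay = (events.groupby("age_bucket", observed=True)
               .apply(lambda g: roc_auc_score(g["purchased"],
                                              g["abandoned_cart_score"])))
print(decay)   # AUC near 0.5 in a bucket => little predictive value left
```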

A solution can start with visual dashboards that show a user the contents of their data stores, relative growth rates, and potentially how often the fields are used in queries. In my experience in both banking and digital marketing, clients can be quite surprised to realize the size and range of their data, and how fast it grows.
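As a sketch of that inventory step, the query below reads PostgreSQL’s system catalogs as a stand-in; SQL Server and Oracle expose similar information through different catalog views, and growth rates would come from snapshotting these numbers over time. Connection details are placeholders.

```python
# Minimal inventory sketch against PostgreSQL system catalogs: table sizes and
# a rough proxy for how often each table is queried. Catalog views differ for
# SQL Server and Oracle; the connection string is a placeholder.
import psycopg2

INVENTORY_SQL = """
SELECT s.relname                        AS table_name,
       pg_total_relation_size(s.relid)  AS total_bytes,
       s.seq_scan + s.idx_scan          AS scans
FROM   pg_stat_user_tables s
ORDER  BY total_bytes DESC;
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(INVENTORY_SQL)
    for table_name, total_bytes, scans in cur.fetchall():
        print(f"{table_name:<40} {total_bytes / 1e9:8.1f} GB  {scans:>10} scans")
```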

The second step is to show which data has predictive value and how long this lasts. This could be achieved by analyzing data against a target variable such as advertising views, social/viral referrals, or transactions such as purchases. Data scientists will correctly point out this is not mathematically optimal. That’s true, but much capex, processing time, and risk can be reduced by removing just the evidently less valuable data. For example, there is little reason to retain abandoned shopping cart data over a year old when performing complex calculations across hundreds or thousands of variables for segmentation and targeting.
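A minimal sketch of that screening step, assuming a hypothetical feature extract and a binary conversion target; mutual information is a crude but cheap first pass compared with a full modeling exercise:

```python
# Sketch of the screening step: rank fields by mutual information against a
# target such as purchase conversion, and flag low-value fields as candidates
# for archival. Dataset and column names are hypothetical.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_parquet("marketing_features.parquet")
target = df.pop("purchased")
features = df.select_dtypes("number").fillna(0)

scores = pd.Series(mutual_info_classif(features, target),
                   index=features.columns).sort_values()
print(scores.head(20))        # lowest-scoring fields: candidates to age out
```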

Informatica, Microsoft, and Oracle are doing interesting work in the related field of data discovery. Oracle has a particularly clever approach, repurposing the Endeca tool for searching unstructured data to show relationships within data in numerous formats across an enterprise. This is broader and less immediately actionable than the tool I am proposing. Data discovery is a worthwhile process, surfacing insights and opportunities that mostly larger organizations may or may not be ready to act upon. I’m proposing a simpler tool for DBAs and SAs to immediately manage costs and risks in large and small organizations.

The product’s design should abstract the logic to categorize and measure data from any specific store. In other words, the design should plug the visualization and analytics into multiple sources. It should be targeted initially at databases, particularly SQL Server and Oracle, where licensing true-ups combined with recent price increases can create cost-driven pressure to optimize data management. The product’s road map should be tightly focused initially on making a specific database instance run fast, inexpensively, and without production incidents. Later versions may look across an enterprise’s data or dive into formats like HDFS, but my sense would be to get a minimum viable product out fast for databases and, if customers buy it, refine the functionality deeply to drive financial and operational value, while establishing a brand distinct from discovery tools.
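The abstraction might look something like the sketch below: each store gets an adapter that answers the same few questions, so the visualization and analytics never depend on a specific database. The class and method names are hypothetical.

```python
# Illustrative shape of the source-abstraction layer. Each adapter answers the
# same questions for its store; the visualization and analytics code depends
# only on this interface. Names are hypothetical.
from abc import ABC, abstractmethod

class DataSourceAdapter(ABC):
    @abstractmethod
    def list_tables(self) -> list[str]: ...

    @abstractmethod
    def size_bytes(self, table: str) -> int: ...

    @abstractmethod
    def growth_rate(self, table: str, days: int = 30) -> float:
        """Average daily growth over the window, in bytes per day."""

class SqlServerAdapter(DataSourceAdapter):
    ...   # would query sys.dm_db_partition_stats and related DMVs

class OracleAdapter(DataSourceAdapter):
    ...   # would query DBA_SEGMENTS and related views
```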

Database operations teams often adopt products in response to immediate needs rather than RFP processes or long sales cycles. Frequently, they need tools outside of recurring budget processes. The business model for this product could use a “freemium” approach, where operations teams could immediately use the product when faced with incidents. After showing they can deliver the same business value with less data/time/cost/risk, or at a later point when budget cycles enable purchasing, they can then purchase the premium product.

My sense is the product should emphasize elegant design and user experience to help users comprehend large data sets.

What are your thoughts? I’d welcome your views.
Jeremy Lehman is a principal with Third Derivative advising companies on product management and technical execution of analytic software. Previously, he led global product development and operations for Experian Marketing Services; led sales technology and CRM, in addition to investment banking and wealth management IT, at Barclays; and served as CTO for equities at Citigroup.





Hi, I’d like to share product ideas for analytic software to learn from others, refine concepts, and meet people who may want to collaborate.

CIO and product development leadership roles can provide broad exposure to see patterns or intuitively sense emerging needs. However, leaders are often balancing too many demands to master the practical specifics that make new concepts successful. I’ve been fortunate to have many truly brilliant colleagues at Microsoft, Thomson Reuters, Citigroup, Barclays, and Experian Marketing Services. Listening to them taught me much, and I hope to encourage similar conversation here.

At Microsoft, we designed high performance transaction systems for capital markets trading. Our team at Thomson Reuters brought together several acquired businesses providing data for investment management, first adding analytics and then workflows within an integrated suite. At Citi, we pioneered NoSQL design patterns while creating trading and risk management systems. At Barclays, we enabled an integrated view across all global transactions, then applied data science to optimize sales and marketing. At Experian Marketing Services, we created a massively scaled cross-channel digital marketing platform integrated with analytics to identify, understand, target, and reach consumers.

The ideas here are conceptual, not yet actionable. People I’ve worked with as an investor and operator know that I encourage a high standard of fact-based rigor. This is different, and intended as a safe place to discuss creative ideas. While these posts are not specifically intended to be business plans, they will be guided by Sequoia Capital’s very effective business plan outline.

Please share your thinking. Would you buy these products? Will these concepts solve your needs? What people and companies are doing interesting work in these areas? Thank you.

Jeremy Lehman is a principal with Third Derivative advising companies executing large-scale product and software development projects. Previously, he led global product development and operations for Experian Marketing Services; led sales technology and CRM, in addition to investment banking and wealth management IT, at Barclays; and served as CTO for equities at Citigroup.