The Financial Services Industry is among the most data-driven of industries. The regulatory environment that commercial banks and insurance companies operate within requires these institutions to store and analyze many years of transaction data, and the pervasiveness of electronic trading means that Capital Markets firms both generate and act upon hundreds of millions of market-related messages every day. For the most part, financial services firms have relied on relational technologies coupled with business intelligence tools to handle this ever-increasing data and analytics burden. It is, however, increasingly clear that while such technologies will continue to play an integral role, newer technologies, many of them developed in response to the data analytics challenges first faced in e-commerce, internet search and other industries, have a transformative role to play in enterprise data management.
Consider a problem faced by every top-tier global bank: in response to new regulations, banks need a 'horizontal view' of risk within their trading arms. Providing this view requires banks to integrate data from different trade capture systems, each with its own data schema, into a central repository for positions, counterparty information and trades. It is not uncommon for traditional ETL-based approaches to take several days to extract, transform, cleanse and integrate such data. Regulatory pressure, however, dictates that this entire process be run many times every day. Moreover, various risk scenarios need to be simulated, and it is not uncommon for the simulations themselves to generate terabytes of additional data every day. The challenge outlined is not only one of sheer data volume but also of data variety, and of the timeliness with which such varied data needs to be aggregated and analyzed.
Now consider an opportunity that has largely remained unexploited: as data-driven as financial services companies are, analysts estimate that somewhere between 80 and 90 percent of the data that banks hold is unstructured, i.e., in documents and in text form. Technologies that enable businesses to marry this data with structured content present an enormous opportunity for improving business insight at financial institutions. Take, for example, information stored in insurance claim systems. Much valuable information is captured in text form. The ability to parse that text and combine the extracted information with structured data in the claims database will not only enable a firm to provide a better customer experience, it may also enhance its fraud detection capabilities.
The above scenarios illustrate a few of the challenges and potential opportunities in building a comprehensive data management vision. These and other data management challenges and opportunities have been succinctly captured and classified by others under the 'Four Vs' of data: Volume, Velocity, Variety and Value.
The visionary bank needs to deliver business insights in context, on demand, and at the point of interaction by analyzing every bit of data available. Big Data technologies comprise the set of technologies that enable banks to deliver on that vision. To a large extent, these technologies are made feasible by the rising capabilities of commodity hardware, vast improvements in storage technologies, and the corresponding fall in the price of computing resources. Given that most literature on Big Data relegates established technologies such as the RDBMS to the 'has-been' heap, it is important to stress that relational technologies continue to play a central role in data management for banks, and that Big Data technologies augment the current set of data management technologies used in banks. Later sections of this paper expand on this thought and explain how relational technology is positioned in the Big Data technology continuum.
This paper broadly outlines Oracle's perspective on Big Data in Financial Services, starting with the key industry drivers for Big Data. Big Data comprises several individual technologies; the paper outlines a framework to uncover these component technologies, maps those technologies to specific Oracle offerings, and concludes by outlining how Oracle solutions may address Big Data patterns in Financial Services.
What is Driving Big Data Technology Adoption in Financial Services?
There are several use cases for Big Data technologies in the financial services industry, and they are referred to throughout the paper to illustrate practical applications. In this section we highlight three broad industry drivers that accelerate the need for Big Data technology in the Financial Services Industry.
Customer Insight
Up until a decade or so ago, it could be said that banks, more than any other commercial enterprise, owned the relationship with consumers. A consumer's bank was the primary source of the consumer's identity for all financial, and many non-financial, transactions. Banks were in firm control of the customer relationship, and the relationship was for all practical purposes as long-term as the bank wanted it to be. Fast forward to today, and the relationship is reversed. Consumers now have transient relationships with multiple banks: a current account at one that charges no fees, a savings account with a bank that offers high interest, a mortgage with one offering the best rate, and a brokerage account at a discount brokerage. Moreover, even collectively, financial institutions no longer monopolize a consumer's financial transactions. New entrants, from peer-to-peer services to the PayPals, Amazons, Googles and Walmarts of the world, have had the effect of disintermediating the banks. Banks no longer have a complete view of their customers' preferences, buying patterns and behaviors. This problem is exacerbated by the fact that social networks now capture very valuable psychographic information: the consumer's interests, activities and opinions.
The implication is that even if banks manage to integrate information from their own disparate systems, which is in itself a gargantuan task, a fully customer-centric view may not be attained. Gaining a fuller understanding of a customer's preferences and interests is a prerequisite for addressing customer satisfaction and for building more extensive and complete propensity models. Banks must therefore bring in external sources of information, information that is often unstructured. Valuable customer insight may also be gleaned from customer call records, customer emails and claims data, all of which are in textual form. Bringing together transactional data in CRM and payments systems with unstructured data from both within and outside the firm requires new technologies for data integration and business intelligence to augment the traditional data warehousing and analytics approach. Big Data technologies therefore play a pivotal role in enabling customer centricity in this new reality.
Regulatory Environment
The
spate of recent regulations is unprecedented for any industry. Dodd-Frank alone
adds hundreds of new regulations that affect banking and securities industries.
For example, these demands require liquidity planning and overall asset and
liability management functions to be fundamentally rethought. Point-in-time liquidity positions, currently derived from static analysis of relevant financial ratios, are no longer sufficient; a near real-time view is increasingly required. Efficient allocation of capital is now seen as a major competitive
advantage, and risk-adjusted performance calculations require new
points of integration between risk and finance subject areas.
Additionally, complex stress tests, which put enormous pressure on the underlying
IT architecture, are required with increasing frequency and complexity. On the
Capital Markets side, regulatory efforts are focused on getting a more accurate
view of risk exposures across asset classes, lines of business and firms in
order to better predict and manage systemic interplay. Many firms are also moving to real-time monitoring of counterparty exposure, limits and other risk controls. From the front office all the way to the boardroom, everyone is
keen on getting holistic views of exposures and positions and of risk-adjusted
performance.
Explosive Data Growth
Perhaps the most obvious driver is that financial transaction volumes are growing, leading to explosive data growth in financial services firms. In Capital Markets, the pervasiveness of electronic trading has led to a decrease in the value of individual trades and an increase in the number of trades. The advent
of high turnover, low latency trading strategies generates considerable order
flow and an even larger stream of price quotes. Complex derivatives are
complicated to value and require several data points to help determine, among
other things, the probability of default, the value of LIBOR in the future, and
the expected date of the next ‘Black Swan’ event. In addition, new market rules
are forcing the OTC derivative market – the largest market by notional value –
toward an electronic environment.
Data growth is not limited to capital markets businesses. The Capgemini/RBS Global Payments study for 2011 estimates the global volume of electronic payments at about 260 billion transactions, growing at between 15 and 22% in developing countries. As the devices that consumers can use to initiate core transactions proliferate, so too does the number of transactions they make. Not only is the transaction volume increasing, the data points stored for each transaction are also expanding. In order to combat fraud and to detect security breaches, weblog data from banks' Internet channels, geospatial data from smartphone applications, and so on have to be stored and analyzed alongside core operations
data. Up until the recent past, fraud analysis was usually performed over a
small sample of transactions, but increasingly banks are analyzing entire
transaction history data sets. Similarly, the number of data points for loan
portfolio evaluation is also increasing in order to accommodate better
predictive modeling.
Technology Implications
The technology ramifications of the broad industry trends outlined above are:

More data and more different data types: Rapid growth in structured and unstructured data, from both internal and external sources, requires better utilization of existing technologies as well as new technologies to acquire, organize, integrate and analyze data.

More change and uncertainty: Pre-defined, fixed schemas may be too restrictive when combining data from many different sources, and rapidly changing needs mean that schema changes must be accommodated more easily.

More unanticipated questions: Traditional BI systems work extremely well when the questions to be asked are known in advance, but business analysts frequently do not know all the questions they need to ask. The self-service ability to explore data, add new data, and construct analyses as required is an essential need for analytics-driven banks.

More real-time analytical decisions: Whether it is a front office trader or a back office customer service representative, business users demand real-time delivery of information. Event processors, real-time decision-making engines and in-memory analytical engines are crucial to meeting these demands.
The Big Data Technology Continuum
So how do we address the technology implications summarized in the previous section? The two-dimensional matrix below provides a convenient, albeit incomplete, starting framework for decomposing the high-level technology requirements for managing Big Data. The figure depicts, along the vertical dimension, the degree to which data is structured: data can be unstructured, semi-structured or structured. The second dimension is the lifecycle of data: data is first acquired and stored, then organized, and finally analyzed for business insight. But before we dive into the technologies, a basic understanding of key terminology is in order.
We define the structure in 'structured data' in alignment with what is expected in relational technologies: the data may be organized into records identified by a unique key, with each record having the same number of attributes, in the same order. Because each record has the same attributes, the structure or schema needs to be defined only once, as metadata for the table, and the data itself need not have metadata embedded in it.
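A minimal sketch of this idea follows, using the embedded SQLite database purely for illustration; the table and column names are hypothetical, not taken from any real banking system. The point is simply that the schema lives once in the catalog, while the rows carry only values.

```python
import sqlite3

# Minimal sketch: the schema is declared once as table metadata, and every
# record carries the same attributes in the same order. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment (
        payment_id   TEXT PRIMARY KEY,   -- unique key identifying the record
        account_id   TEXT NOT NULL,
        amount       REAL NOT NULL,
        currency     TEXT NOT NULL,
        booked_at    TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO payment VALUES (?, ?, ?, ?, ?)",
    ("P-0001", "ACC-42", 1250.00, "USD", "2012-06-01T10:15:00"),
)

# The data itself carries no embedded metadata; structure lives in the catalog.
for row in conn.execute("SELECT account_id, amount, currency FROM payment"):
    print(row)
```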
Semi-structured data also has structure, but the structure can vary from record to record. Records in semi-structured data are sometimes referred to as jagged records, because each record may have a variable number of attributes and because the attributes themselves may be compound constructs, i.e., made up of sub-attributes, as in an XML document. Because of this variability in structure, the metadata for semi-structured data has to be embedded within the data itself, e.g., in the form of an XML schema or as name-value pairs that describe the names of attributes and their respective values within the record. If the data contains tags or other markers to identify the names and positions of attributes, the data can be parsed to extract these name-value pairs.
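As a sketch of that parsing step, the snippet below uses an invented trade-confirmation fragment (the tag names are assumptions, not any standard message format) and flattens the jagged record into name-value pairs without any externally defined schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical trade confirmation fragment; tag names are illustrative only.
record = """
<trade id="T-1001">
    <counterparty>ACME Capital</counterparty>
    <notional currency="USD">5000000</notional>
    <maturity>2017-06-30</maturity>
</trade>
"""

root = ET.fromstring(record)

# Flatten the jagged record into name-value pairs; nested attributes
# (e.g. the currency on <notional>) become compound keys.
pairs = {"trade.id": root.get("id")}
for child in root:
    pairs[child.tag] = child.text
    for attr, value in child.attrib.items():
        pairs[f"{child.tag}.{attr}"] = value

print(pairs)
# {'trade.id': 'T-1001', 'counterparty': 'ACME Capital',
#  'notional': '5000000', 'notional.currency': 'USD', 'maturity': '2017-06-30'}
```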
By unstructured data, we mean data whose structure does not conform to either of the two classifications discussed above. Strictly speaking, unstructured text data usually does have some structure, e.g., the text in a call center conversation record has grammatical structure, but the structure does not follow a record layout, nor are there any embedded metadata tags describing attributes. Of course, before unstructured data can be used to yield business insight, it has to be transformed into some form of structured data. One way to extract entities and relationships from unstructured text is natural language processing (NLP). NLP extracts parts of speech such as nouns, adjectives and subject-verb-object relationships; commonly identifiable things such as places, company names, countries, phone numbers and products; and it can also identify and score sentiments about products, people and so on. It is also possible to augment these processors by supplying a list of significant entities to the parser for named entity extraction.
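As a hedged sketch of what such extraction looks like, the snippet below uses the open-source spaCy library on an invented claim note. The model name and the text are assumptions, and a production pipeline would add domain-specific entity lists and sentiment scoring on top of this.

```python
import spacy

# Assumes the small English model has been installed, e.g. via
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Invented claim-note text, purely for illustration.
note = ("Policyholder John Smith reported a rear-end collision in Chicago "
        "on 12 March and was contacted by Acme Insurance the next day.")

doc = nlp(note)

# Named entities (people, places, organizations, dates) extracted from free text.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Subject-verb-object style relations can be approximated from the dependency parse.
for token in doc:
    if token.dep_ in ("nsubj", "dobj"):
        print(token.text, token.dep_, token.head.text)
```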
The diagram above might justifiably suggest that a myriad of disparate technologies is needed to comprehensively handle Big Data requirements for the enterprise. However, these are not 'either/or' technologies. They should be viewed as part of a data management continuum: each technology enjoys a set of distinct advantages depending on the phase in the data management lifecycle and on the degree of structure in the data it needs to handle, and these technologies therefore work together within the scope of an enterprise architecture.
The two points below are expanded on later in this section, but they are called out here for emphasis:
The
diagram does not imply that all data should end up in a relational data
warehouse before analysis may be performed. Data needs to be organized for
analysis, but the organized data may reside on any suitable technology for
analysis.
As
the diagram only uses two dimensions for decomposing the requirements, it does
not provide a complete picture. For example, the diagram may imply that
structured data is always best handled in a relational database. That’s not
always the case, and the section on handling structured data explains what
other technologies may come into play when we consider additional dimensions
for analysis.
Handling Unstructured Data
Unstructured data within the bank may take the form of claims notes, customer call records, content in content management systems, emails and other documents. Content from external sources such as Facebook, Twitter, etc., is also unstructured. Often, it is necessary to capture such unstructured data first, before processing it to extract meaningful content. File systems, of course, can handle any type of data, as they simply store it. Distributed file systems are file systems architected for high performance and scalability. They exploit the parallelism made possible by spreading the file system over many physical computers (from tens to a few thousand nodes). Data captured in distributed file systems must later be organized (reduced, aggregated, enriched, and converted into semi-structured or structured data) as part of the data lifecycle.
Dynamically indexing engines are a relatively new class of database in which no particular schema is enforced or defined. Instead, a 'schema' is built dynamically as data is ingested. In general, they work much like web search engines: they crawl over the data sources they are pointed at, extracting significant entities and establishing relationships between those entities using natural language processing or other text mining techniques. The extracted entities and relationships are stored as a graph structure within the database. These engines therefore acquire and organize unstructured data simultaneously.
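A toy sketch of the idea, with no particular product in mind: entities extracted from each document become graph nodes, and co-occurrence within a document becomes an edge, so the 'schema' emerges from the data itself rather than being declared up front. The documents and entities below are invented.

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus; in practice entities would come from an NLP extractor.
documents = {
    "email-1": ["ACME Capital", "John Smith", "credit default swap"],
    "email-2": ["John Smith", "Chicago", "mortgage"],
    "call-3":  ["ACME Capital", "mortgage"],
}

# Adjacency map built as data is ingested -- no schema defined up front.
graph = defaultdict(set)
for doc_id, entities in documents.items():
    for a, b in combinations(sorted(set(entities)), 2):
        graph[a].add(b)
        graph[b].add(a)

# Entities directly related to "John Smith" in the emerging graph.
print(sorted(graph["John Smith"]))
```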
Handling Semi-Structured Data
Semi-structured data within the bank may exist as loan contracts, in derivatives trading systems, as XML documents and HTML files, and so on. Unlike unstructured data, semi-structured data contains tags that mark the significant entity values contained within it. These tags and their corresponding values are key-value pairs. If the data is in a format from which these key-value pairs still need to be extracted, it may first be stored on a distributed file system for later parsing and extraction into a key-value database. Key-value stores are one member of the family of NoSQL database technologies (others include graph databases and document databases) that are well suited to storing semi-structured data. Key-value stores do not generally support complex querying (joins and other such constructs) and may only support retrieval by the primary key and, in some implementations, by an optional secondary key. Key-value stores, like the file systems described in the previous section, are also often partitioned, enabling extremely high read and write performance. But unlike distributed file systems, where data can be written and read in large blocks, key-value stores deliver high performance for single-record reads and writes only.
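The access pattern can be illustrated with a toy partitioned key-value store; this is a sketch of the concept only, not the API of any particular NoSQL product, and the keys and records are invented.

```python
import hashlib

class ToyKeyValueStore:
    """Toy partitioned key-value store: fast get/put by primary key,
    no joins and no ad-hoc querying."""

    def __init__(self, partitions=4):
        self.partitions = [dict() for _ in range(partitions)]

    def _partition(self, key):
        # Hash the key to pick a partition (a real store hashes to nodes).
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.partitions[int(digest, 16) % len(self.partitions)]

    def put(self, key, value):
        self._partition(key)[key] = value

    def get(self, key):
        return self._partition(key).get(key)

store = ToyKeyValueStore()
# Each value is a jagged record: attributes can vary from key to key.
store.put("trade:T-1001", {"counterparty": "ACME Capital", "notional": 5_000_000})
store.put("trade:T-1002", {"counterparty": "Globex", "product": "IRS", "tenor": "5Y"})
print(store.get("trade:T-1001"))
```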
It is generally accepted that these newer non-relational systems offer extreme scale and/or performance. But this advantage comes at a price. As data is spread across multiple nodes for parallelism, the likelihood of node failure increases, especially when cheaper commodity servers are used to reduce the overall system cost. To mitigate this increased risk of node failure, these systems replicate data on two, or often three, nodes. The CAP Theorem, put forward by Prof. Eric Brewer, states that such systems can provide only two of the three properties of Consistency, Availability and Partition tolerance. Most implementations choose to relax consistency, trading the strict consistency guarantees associated with ACID for a BASE model (Basically Available, Soft state, Eventually consistent).
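The trade-off can be illustrated with a toy model of asynchronous replication, a conceptual sketch rather than any product's actual behavior: a write is acknowledged after reaching one replica, so a read from another replica may briefly return stale data until replication catches up.

```python
class Replica:
    def __init__(self):
        self.data = {}

replicas = [Replica(), Replica(), Replica()]
pending = []  # replication backlog

def write(key, value):
    # Acknowledge after writing to a single replica (availability first).
    replicas[0].data[key] = value
    pending.append((key, value))

def replicate():
    # Replication happens later; until then, replicas may disagree.
    while pending:
        key, value = pending.pop(0)
        for replica in replicas[1:]:
            replica.data[key] = value

write("balance:ACC-42", 900)
print(replicas[2].data.get("balance:ACC-42"))  # None -> stale read
replicate()
print(replicas[2].data.get("balance:ACC-42"))  # 900 -> eventually consistent
```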
Handling Structured Data
Banks
have applications that generate many terabytes of structured data and have so
far relied almost exclusively on relational technologies for managing this
data. However, the Big Data technology movement has arisen partly from the limitations of relational technology, the most serious of which may be summed up as follows:
· Relational technologies were engineered to provide capabilities that are not always required. For example, relational systems can handle complex querying needs and they adhere to strict ACID properties. These capabilities are not always needed, but because they are always "on", relational systems carry an overhead that sometimes constrains other, more desirable properties such as performance and scalability. To make the argument concrete, consider an example scenario: it would not be unusual for a medium-to-large bank to generate 5-6 terabytes of structured data when modeling the exposure profiles of its counterparties using Monte Carlo simulations (assuming 500,000 trades and 5,000 scenarios); much more data would be generated if stress tests were also performed. What is needed is a database technology that can handle huge data volumes with extremely fast read (by key) and write speeds. There is no need for strict ACID compliance; availability needs are less stringent than in, say, a payments system; there are no complex queries to be executed against this data; and it would be more efficient for the application that generates the data (the Monte Carlo runs) to have local data storage. Although the data is structured, a relational database may not be the optimal technology here. A NoSQL database, a distributed file system or even a data grid (or some combination of these technologies) may be faster and more cost effective in this scenario; a simplified sketch of such a simulation run appears after this list.
· The rigidity enforced by the relational schema model requires more upfront planning and may make the schema more brittle in the face of changing requirements. This is often felt when data from multiple sources needs to be aggregated into one database, and especially when the source systems use different formats.
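Below is the much-simplified sketch referred to above, using toy sizes and a made-up linear valuation model simply to show why the output of such a run is naturally keyed by trade and scenario, and why fast bulk writes and reads by key matter far more than complex queries.

```python
import numpy as np

# Toy sizes; the scenario in the text assumes 500,000 trades and 5,000 scenarios.
n_trades, n_scenarios, n_timesteps = 1_000, 100, 12
rng = np.random.default_rng(7)

# Toy market factor paths (e.g. rates), one per scenario and time step.
factor_paths = rng.normal(0.0, 0.01, size=(n_scenarios, n_timesteps)).cumsum(axis=1)

# Toy linear sensitivities per trade; a real engine would reprice each trade.
sensitivities = rng.normal(0.0, 1_000_000.0, size=n_trades)

results = {}  # keyed by (trade_id, scenario_id) -> exposure profile over time
for trade_id in range(n_trades):
    exposures = np.maximum(sensitivities[trade_id] * factor_paths, 0.0)
    for scenario_id in range(n_scenarios):
        results[(trade_id, scenario_id)] = exposures[scenario_id]

# Access is by key only -- no joins, no ad-hoc queries -- which is why a
# key-value store or distributed file system can be a better fit than an RDBMS.
print(results[(0, 0)][:3])
```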
While relational technologies may be challenged in meeting some of these demands, the relational model benefits tremendously from its structure. These technologies remain the best way to organize data in order to quickly and precisely answer complex business questions, especially when the universe of such questions is known, and they remain the preferred technology for systems with complex reporting needs. Also, where ACID properties and reliability are must-haves, as in core banking and payments, few other technologies meet the demands of running these mission-critical core systems.
Moreover, many limitations of relational technology, such as scale and performance, are addressed in specific implementations of the technology. The next section discusses the Oracle approach to extending the capabilities of the Oracle Database, both in its ability to scale and in its ability to handle different types of data.
Adding New Dimensions to the Decomposition Framework
In
the previous sections we used the two dimensions shown in Figure 1 to uncover
the technologies required to handle big data needs. In this section we outline
two additional dimensions that can be used for further decomposition of
technology requirements.
Handling Real-Time Needs
Real-time
risk management, in-process transactional fraud prevention, security breach
analytics, real-time customer care feedback and cross-selling analytics, all
necessitate the acquisition, organization and analysis of large amounts of data
in real time. The rate at which data arrives, and the timeliness with which it must be acted upon, is another dimension we may apply to the technology requirements decomposition framework. Acquiring data at extremely fast rates and analyzing it in real time requires a different set of technologies than those previously discussed.
The three most common technologies for handling real-time analytics are Complex Event Processors, in-memory distributed data grids and in-memory databases. Complex Event Processing (CEP) engines provide a container for analytical applications that work on streaming data (market data, for example). Unlike in databases, queries are evaluated against the data continuously, in memory, as it arrives in the CEP engine. CEP engines are therefore an essential component of event-driven architectures. They have found acceptance in the front office for algorithmic trading and similar uses, but they have wide applicability across all lines of business and even in retail banking, for example for detecting events in real time as payment or core banking transactions are generated.
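A minimal sketch of the continuous-query idea, independent of any CEP product: each incoming payment event is evaluated against an in-memory sliding window, and an alert fires the moment a per-account threshold is breached. The accounts, amounts and threshold are invented for illustration.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 10_000  # per-account amount within the window that triggers an alert

windows = defaultdict(deque)  # account_id -> deque of (timestamp, amount)

def on_event(account_id, amount, ts):
    """Continuously evaluated as each payment event arrives (no store-then-query)."""
    window = windows[account_id]
    window.append((ts, amount))
    # Evict events that have fallen out of the sliding window.
    while window and ts - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    total = sum(a for _, a in window)
    if total > THRESHOLD:
        print(f"ALERT {account_id}: {total} within {WINDOW_SECONDS}s at t={ts}")

# Simulated stream of payment events: (account, amount, timestamp in seconds).
stream = [("ACC-42", 4_000, 0), ("ACC-42", 3_500, 20),
          ("ACC-7", 900, 25), ("ACC-42", 4_000, 45), ("ACC-42", 500, 130)]
for account, amount, ts in stream:
    on_event(account, amount, ts)
```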
Distributed data grid technologies play a complementary role to CEP engines. Data grids not only store data in memory (usually as objects, i.e., in semi-structured form), they also allow distributed processing of that data in memory. More sophisticated implementations support event-based triggers for processing and MapReduce-style processing across nodes. Using a distributed data grid, transactions can be collected and aggregated in real time from different systems, and processing can be done on the aggregated set in real time. For example, a bank could integrate transactions from different channels in a distributed data grid and run real-time analytics on the collected set for a superior multi-channel experience.
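A conceptual sketch of that pattern, not tied to any data grid product: transactions from several channels are held in in-memory partitions, a map step runs locally against each partition, and a reduce step combines the partial results. The partition contents are invented.

```python
from collections import Counter

# In-memory partitions of the grid, each holding transactions from several channels.
partitions = [
    [{"customer": "C1", "channel": "mobile", "amount": 40},
     {"customer": "C2", "channel": "branch", "amount": 250}],
    [{"customer": "C1", "channel": "online", "amount": 90},
     {"customer": "C3", "channel": "mobile", "amount": 15}],
]

def map_partition(entries):
    # Runs where the data lives: per-customer spend within one partition.
    local = Counter()
    for entry in entries:
        local[entry["customer"]] += entry["amount"]
    return local

def reduce_results(partials):
    # Combine the partial aggregates from every partition.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

spend_per_customer = reduce_results(map_partition(p) for p in partitions)
print(dict(spend_per_customer))  # e.g. {'C1': 130, 'C2': 250, 'C3': 15}
```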
Reducing Data for Analysis
A guiding principle at the heart of the Big Data movement is that all data is valuable and that all data must be analyzed to extract business insight. But not all data sets contain an equal amount of value, which is to say that the value density, or "signal-to-noise ratio", of data sets within the bank differs. For example, data from social media feeds may be less value-dense than data from the bank's CRM system. Value density provides another dimension to apply to the framework for decomposing technology requirements for data management. For example, it may be far more cost effective to first store less value-dense data, such as social media streams, on a distributed file system rather than in a relational database. As you move from left to right in the framework, low-value-density data should be aggregated, cleansed and reduced to high-value-density data that is ready to be analyzed. This is not to say that all data should eventually be stored in a relational data warehouse for analysis and visualization. Data can be organized in place: the output of a MapReduce process on Hadoop may remain on the HDFS file system itself, and analytical tools for Big Data should be able to present outputs from analysis of data residing in both non-relational and relational systems.
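As a hedged illustration of organizing low-value-density data in place, the script below follows the Hadoop Streaming convention of a mapper and reducer reading from standard input and emitting tab-separated key-value pairs. The input format, the notion of a product "mention" and the product list are invented for the example; run under Hadoop Streaming, the reduced per-product counts would remain on HDFS.

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming style map/reduce pair that reduces raw, low
# value-density social media lines to per-product mention counts.
import sys

PRODUCTS = {"mortgage", "credit card", "savings"}  # hypothetical watch list

def mapper(lines):
    # map: raw post line -> (product, 1) for each product mentioned
    for line in lines:
        text = line.lower()
        for product in PRODUCTS:
            if product in text:
                yield f"{product}\t1"

def reducer(pairs):
    # reduce: pairs arrive grouped/sorted by key -> (product, total mentions)
    current, count = None, 0
    for pair in pairs:
        key, value = pair.rstrip("\n").split("\t")
        if key != current and current is not None:
            yield f"{current}\t{count}"
            count = 0
        current = key
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    # In Hadoop Streaming the two stages run as separate processes reading stdin;
    # chaining them here (with a sort in between) just demonstrates the flow,
    # e.g. `cat posts.txt | python reduce_mentions.py`.
    mapped = sorted(mapper(sys.stdin))
    for line in reducer(mapped):
        print(line)
```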