In the financial services industry, risk management and Big Data
-- the popular buzzword for enormous sets of structured and
unstructured data that institutions are facing as their enterprises
go digital -- go hand in hand. The two cannot be separated; the
opportunities and challenges of each must be addressed in
tandem.
Fueled by the financial crisis of 2008 and ongoing uncertainty
in Europe, regulatory bodies and, by extension, the industry are
focused like never before on identifying, measuring, and managing
risk exposure across asset classes, lines of business, and
enterprises. Managing large amounts of data (including positions,
reference data, and market data) is a critical component in the
accurate assessment of risk, and is one of the reasons Big Data
management has recently ascended to top-of-mind status among
C-level executives and regulators. While prudent financial services
organizations recognize this strategic shift, many are left
wondering how to leverage the value of growing amounts of data.
To date, most Big Data discussions have focused on web-based
companies such as Google and Facebook and the large amounts of
unstructured data they generate. Much attention has gone to
harnessing that data for commercial goals, and the banking
industry is certainly examining these possibilities. One could argue,
however, that the more urgent task is harnessing the value of the
data institutions themselves generate and collect, and applying the
resulting insights to critical business concerns such as risk management.
This article will focus on how Big Data is transforming the
industry, the components that make up Big Data, and the
technology strategies financial organizations can use
to manage this transformation efficiently and with a focus on
innovation.
Big Data has many definitions, but key components can be
categorized around the four Vs: volume, velocity, variety and
value.
Handling Large Volumes
The web is becoming the world's central data store, and as such
provides a rich source of information on everything from public
sentiment to customer behavior and market intelligence. The web is
not the only place seeing explosive growth in data volumes. Our
industry has witnessed exponential growth in trade data, beginning
with electronic markets and skyrocketing with market fragmentation
and the widespread use of algorithmic, program and high-frequency
trading. Increased volumes also mean there are much larger amounts
of historical tick and positions data that need to be analyzed. New
regulations require ever more extensive data retention and
analysis, and sophisticated strategy development requires growing
amounts of historical data for back testing.
Many systems are struggling to keep up with these volumes of
data while still performing primary or business-critical tasks. The
challenge financial services organizations now face is how to
keep up with the sheer quantity of data generated on a
continuous basis.
The most relevant technical strategy to manage growing data
volumes is parallelism. While much effort has gone into
parallelizing computation, data parallelism remains a challenge and
is the focal point of most current IT projects. It is also
increasingly apparent that compute grids often become
bottlenecks for data access. As a result, the pattern of moving
computing tasks to the data, rather than moving large amounts of
data over the network, is becoming increasingly prevalent.
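The pattern of moving computation to the data can be illustrated with a minimal Python sketch. It is not taken from any particular product: the partitions, the asset-class names, and the functions `local_exposure` and `total_exposure` are hypothetical, and a thread pool merely stands in for a cluster of data nodes. Each partition is reduced where it lives, and only the small per-partition summaries are brought together.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the positions
# stored on one data node, as (asset_class, notional) pairs.
PARTITIONS = [
    [("equities", 120.0), ("rates", 80.0)],
    [("equities", 30.0), ("fx", 50.0)],
    [("rates", 20.0), ("fx", 10.0)],
]


def local_exposure(partition):
    """Runs next to the data: reduce one partition to a small summary."""
    totals = Counter()
    for asset_class, notional in partition:
        totals[asset_class] += notional
    return totals


def total_exposure(partitions):
    """Send the task to every partition in parallel, then merge summaries."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        summaries = list(pool.map(local_exposure, partitions))
    merged = Counter()
    for summary in summaries:
        merged.update(summary)  # adds values key by key
    return dict(merged)


if __name__ == "__main__":
    print(total_exposure(PARTITIONS))
```

In a real deployment the summaries, not the raw positions, would be what crosses the network, which is the point of the pattern.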
Several technical approaches combine these strategies,
parallelizing both data management and computation, while bringing
compute tasks close to the data:
Engineered machines integrate software
and hardware mechanisms, combining data and compute parallelization
with partitioning, compression, and a high-bandwidth backplane to
provide very high throughput for data processing while minimizing
data movement.
Integrated analytics also involves
moving computation to the data rather than the other way around.
Whether it's OLAP (online analytical processing), predictive or
statistical analytics, modern databases are capable of doing a lot
of computation right where the data is stored.
Data grids focus on maximizing data
parallelism by distributing in-memory data objects across a large,
horizontally scaled cluster, and some even provide the ability to
ship compute tasks to the nodes holding the data in memory, rather
than sending data to compute nodes as most grids do.
NoSQL, or schema-less data management,
has been gaining momentum. At its core is the notion that
developers can be more productive by circumventing the need for
complex schema design during the development lifecycle of
data-intensive applications, especially when the data lends itself
to key-value modeling (e.g. time series data).
Hadoop is a complete open-source stack
for storing and analyzing massive amounts of data, and is quickly
becoming a de facto standard, with multiple distributions
available. Like the technologies mentioned above, the Hadoop
framework achieves massive scalability by sending compute tasks to
the nodes storing the data, and a rich ecosystem of analytical
tools offers high-level functionality on top of that.
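To make the key-value modeling of time series data mentioned above more concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the `TickStore` class, its methods, and the symbol "ABC" are hypothetical, and an in-memory dictionary stands in for a real schema-less store. Ticks are grouped under a composite key of symbol and trading day, so no table schema has to be designed up front.

```python
from collections import defaultdict
from datetime import datetime


class TickStore:
    """Toy in-memory key-value store (hypothetical, for illustration).

    Key:   (symbol, 'YYYY-MM-DD')
    Value: list of (timestamp, price) tuples for that symbol and day
    """

    def __init__(self):
        self._buckets = defaultdict(list)

    def put(self, symbol, timestamp, price):
        # Derive the composite key from the tick itself.
        key = (symbol, timestamp.strftime("%Y-%m-%d"))
        self._buckets[key].append((timestamp, price))

    def get_day(self, symbol, day):
        # One key lookup returns a full day of ticks, in time order.
        return sorted(self._buckets.get((symbol, day), []))


if __name__ == "__main__":
    store = TickStore()
    store.put("ABC", datetime(2012, 1, 3, 9, 30, 5), 10.52)
    store.put("ABC", datetime(2012, 1, 3, 9, 30, 1), 10.50)
    for timestamp, price in store.get_day("ABC", "2012-01-03"):
        print(timestamp.isoformat(), price)
```

Bucketing by day is one possible design choice; a coarser or finer bucket would trade lookup granularity against the number of keys.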