How MNCs Manage BIG DATA

Abhijeet Bakale
5 min read · Jan 17, 2022

Big data refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around for a long time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:

Volume. Organizations collect data from a variety of sources, including transactions, smart (IoT) devices, industrial equipment, videos, images, audio, social media and more. In the past, storing all that data would have been too costly — but cheaper storage using data lakes, Hadoop and the cloud have eased the burden.

Velocity. With the growth in the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.

Variety. Data comes in all types of formats — from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.

SAS considers two additional dimensions when it comes to big data:

Variability

In addition to the increasing velocities and varieties of data, data flows are unpredictable — changing often and varying greatly. It’s challenging, but businesses need to know when something is trending in social media, and how to manage daily, seasonal and event-triggered peak data loads.

Veracity

Veracity refers to the quality of data. Because data comes from so many different sources, it’s difficult to link, match, cleanse and transform data across systems. Businesses need to connect and correlate relationships, hierarchies and multiple data linkages. Otherwise, their data can quickly spiral out of control.
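
As a small illustration of the cleanse-and-match work that veracity implies, here is a sketch that normalizes customer records from two hypothetical sources before linking them. The data sources and column names are invented for the example; this is not any particular company's pipeline.

```python
# A small sketch of the cleanse-and-match work that veracity implies:
# normalize records from two hypothetical sources, then link them on
# a cleaned-up email key. Column names are invented for the example.
import pandas as pd

crm = pd.DataFrame({"email": ["Ana@Example.com ", "bob@example.com"],
                    "name": ["Ana", "Bob"]})
web = pd.DataFrame({"email": ["ana@example.com", "BOB@EXAMPLE.COM"],
                    "last_visit": ["2022-01-10", "2022-01-12"]})

# Cleanse: trim whitespace and normalize case so the keys agree.
for df in (crm, web):
    df["email"] = df["email"].str.strip().str.lower()

# Match: link the two systems on the cleaned key.
linked = crm.merge(web, on="email", how="inner")
print(linked)
```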

What is Big Data?

Most people think that big data is a technology used for managing high volumes of data, but big data is actually a problem faced by many multinational companies like Facebook, Google, and Amazon. The data these companies generate is beyond their storage capacity, and it is so large and complex that no traditional data management tool can store or process it efficiently. That is why big data is a problem for many MNCs.

Now let's see how Facebook, Google, and Amazon manage their data.

Facebook

Facebook generates 4 petabytes (four million gigabytes) of data per day, and its systems produce around 2.5 billion pieces of content every day. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data. This enormous amount of content generation is without a doubt connected to the fact that Facebook users spend more time on the site than on any other social network, putting in about an hour a day.

For big data management, Facebook designs and builds its own servers, networking, and data centers. Its staff writes most of its own applications and creates virtually all of its own middleware. Everything about its operational IT comes together in one extremely large system used by internal and external users alike.

Google

Google is the world's most data-oriented company and one of the largest implementers of big data technologies.

Map all the data on the internet; identify what is used most, clicked most, and interacted with most; and work out what is most useful. These are Google's main data tasks.

Building on Search, its first product, Google has created many other data products over time: Google Apps, Google Docs, Google Maps, YouTube, Google Translate, and so on.

Some estimates put Google's database size at about 10 exabytes, which is 10 million terabytes.

So the question is: how does Google manage such a huge amount of data?

The answer is building computing tools and technologies on the scale of Hadoop, such as BigQuery (Google's large-scale data analytics service).
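
As an illustration of what querying data at this scale can look like, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, and table names are hypothetical placeholders, and credentials are assumed to be configured; this is not a claim about how Google runs its internal systems.

```python
# A minimal sketch of running an analytical query on BigQuery.
# Assumes the google-cloud-bigquery package is installed and that
# credentials are configured (e.g. via GOOGLE_APPLICATION_CREDENTIALS).
# The project and table names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-example-project")

# Standard SQL runs over tables that may hold terabytes of data;
# BigQuery parallelizes the scan across many machines internally.
query = """
    SELECT page, COUNT(*) AS hits
    FROM `my-example-project.analytics.page_views`
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.page, row.hits)
```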

Amazon

Amazon operates one of the largest server fleets in the world for hosting its data: around 1,000,000,000 gigabytes spread across more than 1,400,000 servers.

Amazon generates data in two ways. The major retailer collects and processes data about its regular retail business, including customer preferences and shopping habits. But it is also important to remember that Amazon offers cloud storage to the enterprise world.

Amazon S3 — on top of everything else the company handles — offers a comprehensive cloud storage solution that naturally facilitates the transfer and storage of massive data troves. Because of this, it’s difficult to truly pinpoint just how much data Amazon is generating in total.
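
For a sense of what storing data in S3 looks like in practice, here is a minimal sketch using the boto3 client. The bucket name and file paths are hypothetical placeholders, and AWS credentials are assumed to be configured.

```python
# A minimal sketch of storing and retrieving an object in Amazon S3
# with boto3. Assumes AWS credentials are configured; the bucket name
# "example-data-lake" and file names are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# Upload a local file as an object.
s3.upload_file("events.json", "example-data-lake", "raw/events.json")

# Download it back.
s3.download_file("example-data-lake", "raw/events.json", "events_copy.json")
```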

Instead, it's better to look at the revenue flowing into the company that is directly tied to data handling and storage. Amazon generates more than $258,751.90 in sales and service fees per minute.

Problems faced in managing big data

In this article I will discuss two problems of big data:

  1. Volume: Storing big data requires a large amount of storage. We might think that buying a bigger hard disk or data server would solve the problem, but companies don't know in advance how much data they will be storing. Large companies generate so much data that a single hard disk or data server can never hold it; they may end up buying many hard disks and still be unable to store all of their data.
  2. Velocity: Data is generally stored on a hard disk, but storing and retrieving data from a hard disk takes a lot of time. This storing and retrieving of data is known as input/output (I/O) operations. Storing lots of data on one single hard disk makes I/O very slow, so it cannot be the solution for big data (see the back-of-the-envelope sketch after this list).
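
To see why a single disk cannot keep up, here is a back-of-the-envelope sketch. The 1-petabyte dataset size and 200 MB/s disk throughput are assumptions chosen only for illustration.

```python
# Back-of-the-envelope: how long does it take one disk to read a big
# dataset? The dataset size and disk throughput are assumptions chosen
# only for illustration.
DATASET_BYTES = 1 * 10**15        # 1 petabyte
DISK_MB_PER_S = 200               # a typical sequential HDD throughput

seconds = DATASET_BYTES / (DISK_MB_PER_S * 10**6)
print(f"single disk: {seconds / 86400:.1f} days")   # ~57.9 days

# Spread the same read across 1,000 disks working in parallel and the
# wall-clock time drops by roughly that factor.
print(f"1,000 disks: {seconds / 1000 / 3600:.1f} hours")  # ~1.4 hours
```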

How to solve the big data problem?

One way to solve big data is to use a distributed storage system. This system follows a master-slave topology: many storage resources are connected to one main master node. The master node is known as the name node, and the slave storage nodes are known as data nodes. All these data nodes contribute their storage to the single master node, and there can be thousands of data nodes providing their resources to the name node.
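
To make the topology concrete, here is a toy sketch of the idea in Python. It models only the metadata (which data node holds which block) and round-robin block placement; it is an illustrative sketch, not a real distributed file system such as HDFS.

```python
# A toy model of master-slave distributed storage: the name node keeps
# only metadata (which data node holds which block), while data nodes
# hold the actual blocks. Illustrative sketch only.
from itertools import cycle

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

class NameNode:
    def __init__(self, data_nodes):
        self.block_map = {}                 # filename -> [(node, block_id)]
        self._rr = cycle(data_nodes)        # round-robin block placement

    def store(self, filename, data, block_size=4):
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        placements = []
        for block_id, block in enumerate(blocks):
            node = next(self._rr)           # pick a data node for this block
            node.blocks[(filename, block_id)] = block
            placements.append((node, block_id))
        self.block_map[filename] = placements

    def read(self, filename):
        # Each block could be fetched from its node in parallel.
        return "".join(node.blocks[(filename, block_id)]
                       for node, block_id in self.block_map[filename])

nodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(nodes)
nn.store("log.txt", "hello big data world")
print(nn.read("log.txt"))                   # -> "hello big data world"
```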

With this system we can solve both the volume and the velocity problem. We no longer need an unlimited number of hard disks for data storage; instead, we use thousands of data nodes, which can be added to the name node as storage requirements grow. The volume problem is thus easily solved.

The velocity problem can also be solved with this system, because input/output operations can be divided among the different nodes. With thousands of data nodes storing and retrieving data in parallel, the speed of I/O operations automatically increases.
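
Here is a small sketch of that parallel-read idea using Python threads. The per-node read is simulated with a fixed delay, since the point is only how fanning out across nodes cuts wall-clock time.

```python
# A small sketch of the parallel I/O idea: fetching blocks from many
# data nodes concurrently instead of reading them one after another.
# The per-node "read" is simulated with a sleep.
import time
from concurrent.futures import ThreadPoolExecutor

def read_block(node_id):
    time.sleep(0.1)                 # simulated disk/network latency
    return f"block from node {node_id}"

nodes = range(20)

start = time.perf_counter()
blocks = [read_block(n) for n in nodes]             # sequential: ~2.0 s
print(f"sequential: {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    blocks = list(pool.map(read_block, nodes))      # parallel: ~0.1 s
print(f"parallel:   {time.perf_counter() - start:.1f} s")
```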
