
How to build a network using company data?

22/09/08 admin

Corporate data assets keep growing, which makes exploiting them a key issue. However, data owners need to be aware that not all of their data is equally usable.

Let's start with the basics!

 

There are two main types of data in databases: (i) numeric data, and (ii) textual data. Let us now focus on the latter! In the table below, I have grouped the text data according to their length and the distribution of possible variable values. 

 

 

| | Categorical Variables | Personal and Contact Details | Text Data |
| --- | --- | --- | --- |
| Example | Type of residence, gender ("male"/"female"), what package the customer has | Name, address, phone number, etc. | Email text, audio transcript, blog post, etc. |
| Properties of the Variables | Typically 2-10 distinct values (but certainly not 100 or more) | Can take thousands or millions (or even more) of different values; in general, even the most common value occurs relatively rarely | In practice, there is no repetition, i.e. every value is different |
| Typical Length of Texts | Short (often only a few characters) | Consists of a few words | A coherent text, which may consist of several sentences |
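To make the grouping above a bit more tangible, here is a minimal Python sketch that guesses which of the three groups a text column belongs to, using only the number of distinct values and the average length in words. The function name and both thresholds (`max_categories=10`, `min_words_free_text=8`) are assumptions chosen for illustration, not fixed rules, and the guess is only meaningful for columns with many rows.

```python
def classify_text_column(values, max_categories=10, min_words_free_text=8):
    """Roughly guess the group of a text column (illustrative heuristic only)."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "empty"

    distinct = len(set(non_empty))
    avg_words = sum(len(v.split()) for v in non_empty) / len(non_empty)

    # Long, coherent texts: emails, transcripts, blog posts, ...
    if avg_words >= min_words_free_text:
        return "text"
    # Short values drawn from a handful of possibilities.
    if distinct <= max_categories:
        return "categorical"
    # Short values with (almost) no repetition: names, addresses, phone numbers.
    return "personal_or_contact"


genders = ["male", "female"] * 50
names = [f"Customer {i}" for i in range(100)]

print(classify_text_column(genders))
print(classify_text_column(names))
```

In a real database the thresholds would need tuning, and a column with only a few rows can easily be misclassified.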

For a long time, only categorical variables could be used for data analysis; the other two data types were considered "waste" because we did not have the tools to process them. Later, text mining/NLP technology made it possible to exploit text data.

But what about personal and contact details? Obviously, there is no point in testing hypotheses like these:

  • Customers called "Smith" buy more of a product than customers called "Harrison".
  • Customers whose phone number ends in 2 are more likely to cancel their contract.
  • People living on "York" street have 10% more accidents than people living on "Rock" street (although this may be relevant).

 

Even if the result were statistically significant, such analyses are not usually performed. Certain information can of course be extracted from these data:

 

  • Name: What is the gender of the client?
  • Address: What type of settlement does the client live in?
  • Mobile phone number: What is the provider?
  • Phone number: Does the client have a secondary phone?
  • Email: What is the provider?
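As a small illustration of this kind of extraction, the sketch below pulls the domain out of an email address. The function name `email_domain` is hypothetical, the validation is deliberately crude, and the domain is only a proxy for the provider (a company or personal domain says little about who hosts the mailbox).

```python
def email_domain(address):
    """Return the lower-cased domain part of an email address, or None."""
    if not address:
        return None
    address = address.strip()
    # Very rough sanity check: exactly one "@" and a dot in the domain.
    if address.count("@") != 1:
        return None
    local, domain = address.split("@")
    if not local or "." not in domain:
        return None
    return domain.lower()


print(email_domain(" Jane.Doe@Example.COM "))
print(email_domain("not-an-email"))
```

The other rows (gender from the name, provider from the mobile prefix) would need country-specific lookup tables, so they are not sketched here.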

In most cases, however, even this basic information extraction is not carried out. Yet personal and contact data can provide a whole new type of information, namely the customer's network of contacts.

The idea is very simple: connect all customers with the same attribute with an edge. Let's look at the example above:

  • A and B are linked because they have the same address.
  • A and G are linked because they have the same phone number.
  • B and G are linked because they have the same email address.
  • D and F are linked because they have the same address.

In addition to the above, a "brother" link can be introduced when:

  • Two individuals have the same surname AND the same mother's name.
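The linking rules above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the customer records are invented so that they reproduce the four links listed above, the field names (`address`, `phone`, `email`, `surname`, `mother_name`) and the `LINK_RULES` table are hypothetical, and matching is done on a simple normalized string (lower case, collapsed whitespace), whereas real address matching usually needs far more cleaning.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records, invented to reproduce the links listed above.
CUSTOMERS = [
    {"id": "A", "surname": "Smith", "mother_name": "Mary Brown",
     "address": "1 York Street", "phone": "555-0101", "email": "a@example.com"},
    {"id": "B", "surname": "Harrison", "mother_name": "Anna Clark",
     "address": "1 york street", "phone": "555-0102", "email": "shared@example.com"},
    {"id": "D", "surname": "Taylor", "mother_name": "Eve Hall",
     "address": "7 Rock Street", "phone": "555-0104", "email": "d@example.com"},
    {"id": "F", "surname": "Walker", "mother_name": "Joan King",
     "address": "7 Rock Street", "phone": "555-0106", "email": "f@example.com"},
    {"id": "G", "surname": "Young", "mother_name": "Rose Lee",
     "address": "3 Hill Road", "phone": "555-0101", "email": "shared@example.com"},
]

# Link type -> fields that must ALL match (assumed rule table).
LINK_RULES = {
    "address": ("address",),
    "phone": ("phone",),
    "email": ("email",),
    "brother": ("surname", "mother_name"),
}


def normalize(value):
    """Lower-case and collapse whitespace; missing values become ''."""
    return " ".join(str(value).lower().split()) if value else ""


def build_contact_network(customers, rules=LINK_RULES):
    """Return {(id1, id2): set of link types} for customers sharing attributes."""
    edges = defaultdict(set)
    for link_type, fields in rules.items():
        groups = defaultdict(list)
        for customer in customers:
            key = tuple(normalize(customer.get(field)) for field in fields)
            if all(key):  # skip records where any required field is missing
                groups[key].append(customer["id"])
        for ids in groups.values():
            for a, b in combinations(sorted(ids), 2):
                edges[(a, b)].add(link_type)
    return dict(edges)


network = build_contact_network(CUSTOMERS)
for (a, b), reasons in sorted(network.items()):
    print(a, b, sorted(reasons))
```

Grouping customers by attribute value first, and only then pairing them within each group, avoids comparing every customer with every other one, which matters once the customer base is large.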

We have previously established that basic data (name, address, telephone number, ...) cannot be used with traditional analysis methods. However, the resulting graph can. For example, such a contact network is useful for detecting insurance fraud: if a customer is found to have committed fraud, the fraudster's connections in the network can be used to detect and prevent further fraud.
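As a rough sketch of how the fraudster's connections could be looked up, the Python snippet below runs a breadth-first search over the four links from the example and returns every customer within a given number of hops. The function name `customers_near` and the `max_hops` parameter are assumptions for illustration; being connected to a fraudster is of course only a reason for a closer look, not evidence of fraud.

```python
from collections import defaultdict, deque

# The four links from the example, as undirected edges.
EDGES = [("A", "B"), ("A", "G"), ("B", "G"), ("D", "F")]


def customers_near(edges, start, max_hops=1):
    """Return {customer: hops} for everyone within max_hops of start."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for other in neighbours[current]:
            if other not in distance:
                distance[other] = distance[current] + 1
                queue.append(other)

    del distance[start]  # the known fraudster is not a new suspect
    return distance


# If A is a known fraudster, B and G are the direct contacts to review.
print(sorted(customers_near(EDGES, "A").items()))
```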

 

If you are interested, please feel free to contact us to learn more!