Today, corporate data assets are growing constantly, which makes exploiting them a key issue. However, data owners need to be aware that not all of their data is equally usable.
Let's start with the basics!
There are two main types of data in databases: (i) numeric data, and (ii) textual data. Let us now focus on the latter! In the table below, I have grouped the text data according to their length and the distribution of possible variable values.
| | Categorical Variables | Personal and Contact Details | Text Data |
|---|---|---|---|
| Example | Type of residence, gender ("male"/"female"), what package the customer has | Name, address, phone number, etc. | Email text, audio transcript, blog post, etc. |
| Properties of the Variables | Typically between 2 and 10 distinct values (but certainly not 100 or more) | Can take thousands or millions (or even more) of different values. In general, even the most common value does not occur often | In practice, there is no repetition, i.e. every value is different |
| Typical Length of Texts | Short (often only a few characters) | Consists of a few words | A coherent text, which may consist of several sentences |
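
To make the grouping in the table more tangible, here is a minimal sketch of how a text column could be assigned to one of the three groups. The function name `classify_text_column` and both thresholds (`max_categories`, `max_contact_words`) are my own assumptions for illustration; the table only says that categorical variables have far fewer than 100 distinct values and that contact details consist of a few words.

```python
def classify_text_column(values, max_categories=100, max_contact_words=6):
    """Heuristically assign a text column to one of the three groups.

    Both thresholds are illustrative assumptions, not fixed rules.
    """
    values = [v for v in values if v]  # ignore empty cells
    if not values:
        return "unknown"

    distinct = len(set(values))
    avg_words = sum(len(v.split()) for v in values) / len(values)

    # Few distinct values that repeat across rows: categorical variable.
    if distinct < max_categories and distinct < len(values):
        return "categorical"
    # Mostly unique values made of only a few words: personal/contact detail.
    if avg_words <= max_contact_words:
        return "personal/contact"
    # Mostly unique, longer coherent text.
    return "text"
```

On a small sample the heuristic can easily be fooled (two customers with the same name would make a name column look categorical), so it only makes sense on a reasonably large column.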
For a long time, only categorical variables could be used for data analysis. The other two data types were considered "waste" because we did not have the tools to process them. Later, text mining/NLP technology made it possible to exploit text data.
But what about personal and contact details? Obviously, there is little point in analyzing them directly: even if a result were statistically significant, such analyses are not customary. Certain information can, of course, be extracted from these data:
| Data | Information that can be extracted |
|---|---|
| Name | What is the gender of the client? |
| Address | What type of settlement does the client live in? |
| Mobile phone number | What is the provider? |
| Phone number | Does the client have a secondary phone? |
| Email address | What is the provider? |
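
Two of the questions above can be sketched in a few lines of code. This is only an illustration: the helper names `email_provider` and `mobile_provider` are hypothetical, and the prefix table is made up, since real number prefixes depend on the country and can be wrong for customers who have ported their number.

```python
# Made-up prefix table for illustration only; real prefixes are country-specific.
PREFIX_TO_PROVIDER = {"9901": "Provider A", "9902": "Provider B"}


def email_provider(email):
    """Return the lower-cased domain of an email address, or None if absent."""
    _, separator, domain = email.strip().rpartition("@")
    return domain.lower() if separator and domain else None


def mobile_provider(phone, prefix_to_provider=PREFIX_TO_PROVIDER):
    """Look up the provider from the leading digits of a phone number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    for prefix, provider in prefix_to_provider.items():
        if digits.startswith(prefix):
            return provider
    return None  # unknown prefix
```

The domain is used here as a stand-in for the email provider, which is a simplification: a company domain, for example, says nothing about who actually hosts the mailbox.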
However, in the majority of cases, not even this kind of basic extraction takes place. Yet personal and contact data can provide a whole new type of information, namely the customer's network of contacts.
The idea is very simple: connect all customers who share the same attribute value with an edge. Let's look at the example:
· A and B are linked because they have the same address
· A and G are linked because they have the same phone number
· B and G are linked because they have the same email address
· D and F are linked because they have the same address.
In addition to the above, a "brother" (sibling) link can also be introduced when:
· Two individuals have the same surname AND the same mother's name.
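
The linking rules above can be sketched as follows. This is a minimal illustration, not a production implementation: the field names (`address`, `phone`, `email`, `surname`, `mother_name`), the `LINK_RULES` table and the sample records are all assumptions of mine, and values are compared as exact strings, whereas real data would first need cleaning and normalisation.

```python
from collections import defaultdict
from itertools import combinations

# Each rule links customers who agree on ALL of the listed fields.
LINK_RULES = {
    "address": ("address",),
    "phone": ("phone",),
    "email": ("email",),
    "brother": ("surname", "mother_name"),
}


def build_contact_edges(customers, link_rules=LINK_RULES):
    """Return a set of (customer_a, customer_b, link_type) edges."""
    edges = set()
    for label, fields in link_rules.items():
        groups = defaultdict(list)
        for customer_id, record in customers.items():
            key = tuple(record.get(field) for field in fields)
            if all(key):  # skip records where any required field is missing
                groups[key].append(customer_id)
        # Connect every pair of customers that share the same key.
        for members in groups.values():
            for a, b in combinations(sorted(members), 2):
                edges.add((a, b, label))
    return edges


# Invented sample data mirroring the example links listed above.
customers = {
    "A": {"address": "1 Main St", "phone": "555-0101", "email": "a@example.com"},
    "B": {"address": "1 Main St", "phone": "555-0102", "email": "shared@example.com"},
    "G": {"address": "9 Hill Rd", "phone": "555-0101", "email": "shared@example.com"},
    "D": {"address": "5 Oak Ave", "phone": "555-0104", "email": "d@example.com"},
    "F": {"address": "5 Oak Ave", "phone": "555-0106", "email": "f@example.com"},
}
edges = build_contact_edges(customers)
```

Keeping the link type on each edge is a deliberate choice: a shared phone number and a shared address are not equally strong signals, and an analyst will usually want to tell them apart.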
We have previously established that basic data (name, address, telephone number, ...) cannot be used with traditional analysis methods. However, the resulting graph can. Such a contact network is useful for detecting insurance fraud: if a customer is found to have committed fraud, the network can be used to detect and prevent further fraud by examining the fraudster's connections.
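One simple way to follow a fraudster's connections is a breadth-first search over the contact graph, limited to a few hops. The sketch below assumes the graph is given as a list of customer pairs; the function name `customers_near` and the default of two hops are my own choices for illustration, not a description of any particular fraud detection system.

```python
from collections import defaultdict, deque


def customers_near(edges, start, max_hops=2):
    """Return {customer: hop distance} for everyone within max_hops of start."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours[current]:
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)

    del distance[start]  # the known fraudster is not a new suspect
    return distance


# Example: the links from the contact network, without link types.
example_edges = [("A", "B"), ("A", "G"), ("B", "G"), ("D", "F")]
suspects = customers_near(example_edges, "A")
```

The result is only a list of customers worth a closer look: sharing an address or a phone number with a fraudster is not, in itself, evidence of fraud.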
If you are interested, please feel free to contact us to learn more!