4  Formal Concept Analysis

Formal Concept Analysis (FCA) is a data analysis technique rooted in set theory, mathematical logic and lattice theory. Its main goal is to discover and represent conceptual structures within data sets, especially those containing hierarchical or taxonomic information.

The main applications of FCA include knowledge extraction, clustering and classification, machine learning, ontology construction, and the discovery of association rules and attribute implications.

In FCA, the data are split into objects and attributes. In our dataset, the objects are the user accounts and the attributes are the columns, such as “Has a profile picture”, “Is not fake”, …

library(fcaR)
Warning: package 'fcaR' was built under R version 4.3.3
library(readr)
datos <- read_csv("Data/train.csv") 
Rows: 576 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (12): profile pic, nums/length username, fullname words, nums/length ful...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
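# Work on a copy so that the raw data stay untouched
datos_refinados <- datos

# Columns that only take two distinct values
columnas_binarias <- c("profile pic","name==username","external URL","fake","private")

for (columna in columnas_binarias) {
  # factor() sorts the two distinct values: the smaller one is labelled "No", the larger one "Si"
  datos_refinados[[columna]] <- factor(datos_refinados[[columna]], labels = c("No", "Si"))
}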

fc_datos <- FormalContext$new(datos_refinados)
fc_datos
FormalContext with 576 objects and 12 attributes.
# A tibble: 576 × 12
   `profile pic` `nums/length username` `fullname words` `nums/length fullname`
   <fct>                          <dbl>            <dbl>                  <dbl>
 1 Si                              0.27                0                      0
 2 Si                              0                   2                      0
 3 Si                              0.1                 2                      0
 4 Si                              0                   1                      0
 5 Si                              0                   2                      0
 6 Si                              0                   4                      0
 7 Si                              0                   2                      0
 8 Si                              0                   2                      0
 9 Si                              0                   0                      0
10 Si                              0                   2                      0
# ℹ 566 more rows
# ℹ 8 more variables: `name==username` <fct>, `description length` <dbl>,
#   `external URL` <fct>, private <fct>, `#posts` <dbl>, `#followers` <dbl>,
#   `#follows` <dbl>, fake <fct>

4.1 Scaling

FCA requires a binary dataset, so we need to apply techniques such as scaling to obtain one:

4.1.1 Nominal scaling

Nominal scaling is used for attributes whose values are mutually exclusive, such as the attributes that take the values “Si” (yes) and “No”.

fc_datos$scale("profile pic",type = "nominal",c("Si","No"))
fc_datos$scale("name==username",type = "nominal",c("Si","No"))
fc_datos$scale("fake",type = "nominal",c("Si","No"))
fc_datos$scale("private",type = "nominal",c("Si","No"))
fc_datos$scale("external URL",type = "nominal",c("Si","No"))
fc_datos
FormalContext with 576 objects and 17 attributes.
# A tibble: 576 × 17
   `profile pic = Si` `profile pic = No` `nums/length username` `fullname words`
                <dbl>              <dbl>                  <dbl>            <dbl>
 1                  1                  0                   0.27                0
 2                  1                  0                   0                   2
 3                  1                  0                   0.1                 2
 4                  1                  0                   0                   1
 5                  1                  0                   0                   2
 6                  1                  0                   0                   4
 7                  1                  0                   0                   2
 8                  1                  0                   0                   2
 9                  1                  0                   0                   0
10                  1                  0                   0                   2
# ℹ 566 more rows
# ℹ 13 more variables: `nums/length fullname` <dbl>,
#   `name==username = Si` <dbl>, `name==username = No` <dbl>,
#   `description length` <dbl>, `external URL = Si` <dbl>,
#   `external URL = No` <dbl>, `private = Si` <dbl>, `private = No` <dbl>,
#   `#posts` <dbl>, `#followers` <dbl>, `#follows` <dbl>, `fake = Si` <dbl>,
#   `fake = No` <dbl>
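The effect of nominal scaling can be mimicked in base R to make the transformation explicit. The following is a minimal sketch on illustrative data, not the fcaR implementation; the helper nominal_scale is a hypothetical name introduced only for this example.

```r
# Toy column standing in for a two-level factor such as `fake` (illustrative data).
fake <- factor(c("Si", "No", "No", "Si"), levels = c("No", "Si"))

# Nominal scaling: one 0/1 column per level, named "<attribute> = <level>".
# nominal_scale is a hypothetical helper, not part of fcaR.
nominal_scale <- function(x, name) {
  lv <- levels(x)
  m <- sapply(lv, function(l) as.integer(x == l))
  colnames(m) <- paste(name, "=", lv)
  m
}

scaled <- nominal_scale(fake, "fake")
print(scaled)
```

Each row has exactly one 1, which reflects the mutual exclusivity of the original values.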

4.1.2 Interval scaling

The remaining columns hold numeric values, so they need a different type of scaling. Ordinal scaling is one option, but it would produce overly long concepts. We therefore use interval scaling for these columns. Judging by the attribute names printed later (for example “nums/length username is (0, 0.2]”), the resulting intervals are left-open, so values of exactly 0 presumably fall in none of them.

fc_datos$scale("nums/length username", 
         type = "interval", 
         values =c(0, 0.2, 0.4, 0.6, 0.8, 1)
         )

fc_datos$scale("nums/length fullname", 
         type = "interval", 
         values = c(0, 0.2, 0.4, 0.6, 0.8, 1) 
        )

fc_datos$scale("fullname words", 
         type = "interval", 
         values = c(0, 1, 3, 5, Inf) 
         )

fc_datos$scale("description length", 
         type = "interval", 
         values =c(0, 15, 25, 80, 150)
         )

fc_datos$scale("#posts", 
         type = "interval", 
         values =  c(0,1, 5, 10, 50, Inf)
         )

fc_datos$scale("#followers", 
         type = "interval", 
         values = c(0, 10, 60, 200, Inf)
         )

fc_datos$scale("#follows", 
         type = "interval", 
         values = c(0, 10, 60, 200, Inf)
         )
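A base R analogue of interval scaling helps to see how the break points turn into binary attributes. This is only a sketch on illustrative values using cut(); the internals of fcaR may differ, and the variable names are assumptions made for the example.

```r
# Illustrative values; 0 is included on purpose.
x <- c(0, 0.1, 0.2, 0.27, 1)
breaks <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

# cut() builds left-open, right-closed intervals by default.
bins <- cut(x, breaks = breaks)
print(bins)

# One 0/1 column per interval; a value that falls in no interval gives an all-zero row.
scaled <- sapply(levels(bins), function(l) as.integer(!is.na(bins) & bins == l))
print(rowSums(scaled))

# include.lowest = TRUE closes the first interval on the left, so 0 is kept.
bins_closed <- cut(x, breaks = breaks, include.lowest = TRUE)
print(bins_closed)
```

With left-open intervals the value 0 is not assigned to any bin, which is worth checking whenever a column contains many zeros.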

4.2 Concepts

Once the data are in the required form, we can use the fcaR package to generate concepts. Concepts are the fundamental components of FCA: each one groups a set of objects with the set of attributes they share.

Formally, a concept (𝐴,𝐵) is a pair where:

  • 𝐴 is the set of all objects (the extent) that have every attribute in 𝐵.

  • 𝐵 is the set of all attributes (the intent) shared by every object in 𝐴.
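The two conditions of the definition can be checked directly on a small binary table. The sketch below uses a toy context and three hypothetical helpers (extent, intent, is_concept) written in base R; it illustrates the definition and is not how fcaR computes concepts.

```r
# Toy formal context (illustrative): rows are objects, columns are attributes.
I <- matrix(c(1, 1, 0,
              1, 0, 1,
              1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("o1", "o2", "o3"), c("a", "b", "c")))

# Objects that have every attribute in B.
extent <- function(I, B) {
  rownames(I)[rowSums(I[, B, drop = FALSE] == 1) == length(B)]
}

# Attributes shared by every object in A.
intent <- function(I, A) {
  colnames(I)[colSums(I[A, , drop = FALSE] == 1) == length(A)]
}

# (A, B) is a formal concept when each set is the derivation of the other.
is_concept <- function(I, A, B) {
  setequal(extent(I, B), A) && setequal(intent(I, A), B)
}

ok  <- is_concept(I, c("o1", "o3"), c("a", "b"))  # o1 and o3 are exactly the objects with a and b
bad <- is_concept(I, c("o1"), c("a", "b"))        # o3 also has a and b, so the extent is incomplete
print(c(ok, bad))
```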

4.2.0.1 Computing the concepts of the context

To compute the concepts of our data, we use the find_concepts function.

fc_datos$find_concepts()

fc_datos$concepts$size()
[1] 7008

We have obtained a large number of concepts; let us look at the first few:

head(fc_datos$concepts)
A set of 6 concepts:
1: ({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 
421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576}, {})
2: ({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288}, {fake = No})
3: ({1, 2, 4, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 27, 31, 32, 34, 35, 36, 37, 38, 42, 43, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 65, 66, 67, 68, 69, 70, 71, 72, 74, 75, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 106, 107, 108, 109, 111, 112, 113, 115, 116, 117, 118, 119, 120, 122, 123, 126, 127, 128, 130, 131, 132, 133, 134, 136, 137, 138, 139, 141, 142, 143, 144, 146, 147, 148, 149, 151, 152, 153, 154, 155, 156, 157, 158, 159, 161, 162, 164, 165, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 184, 185, 186, 187, 188, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 218, 219, 220, 221, 223, 224, 226, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 239, 241, 244, 246, 248, 250, 251, 254, 255, 256, 257, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 275, 276, 277, 278, 279, 280, 282, 284, 285, 286, 289, 291, 293, 296, 303, 310, 317, 320, 322, 324, 325, 328, 332, 338, 339, 343, 344, 349, 354, 356, 359, 360, 361, 378, 380, 382, 385, 399, 401, 403, 415, 420, 421, 422, 427, 428, 434, 436, 443, 445, 447, 451, 454, 460, 461, 462, 467, 471, 476, 482, 483, 491, 508, 509, 511, 512, 518, 523, 527, 531, 532, 534, 535, 536, 538, 539, 541, 542, 547, 549, 553, 554, 558, 561, 563, 566, 567, 568, 570, 572, 574, 576}, {#follows is (200, Inf]})
4: ({3, 5, 6, 7, 8, 25, 26, 28, 29, 30, 33, 39, 40, 47, 61, 63, 64, 73, 105, 110, 114, 121, 124, 125, 129, 135, 140, 145, 150, 160, 163, 182, 183, 189, 190, 191, 192, 204, 216, 217, 222, 225, 227, 238, 240, 243, 249, 252, 253, 258, 274, 283, 288, 292, 294, 306, 318, 323, 334, 335, 345, 347, 348, 352, 358, 362, 363, 364, 365, 368, 370, 372, 376, 381, 383, 391, 393, 395, 396, 397, 400, 402, 404, 405, 406, 407, 410, 411, 413, 416, 417, 419, 424, 429, 431, 433, 435, 438, 439, 440, 444, 446, 448, 450, 453, 457, 458, 463, 474, 477, 484, 488, 528, 530, 546, 548, 550, 555, 560, 562, 564, 569, 571, 573, 575}, {#follows is (60, 200]})
5: ({41, 44, 76, 102, 104, 205, 242, 245, 247, 259, 273, 281, 287, 290, 295, 297, 299, 300, 305, 307, 311, 312, 313, 314, 315, 319, 321, 326, 327, 329, 330, 331, 333, 336, 342, 350, 351, 353, 355, 357, 366, 367, 369, 371, 373, 374, 375, 379, 384, 386, 387, 388, 389, 390, 394, 398, 408, 414, 423, 425, 426, 432, 437, 441, 442, 449, 452, 455, 459, 464, 465, 466, 473, 475, 478, 479, 481, 485, 487, 492, 494, 495, 496, 498, 500, 501, 502, 503, 505, 506, 507, 510, 513, 515, 521, 522, 526, 529, 533, 540, 543, 544, 545, 551, 552, 556, 557, 565}, {#follows is (10, 60]})
6: ({45, 166, 298, 301, 304, 308, 309, 316, 337, 341, 377, 409, 412, 418, 430, 456, 468, 469, 470, 472, 480, 486, 490, 504, 514, 516, 519, 524, 525, 537, 559}, {#follows is (0, 10]})

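The output is dominated by long lists of numbers: these are the indices of the accounts (objects) that share the listed attributes. In this raw form the information is hard to use. Next we compute the extent of the attribute “fake = Si”, which should return the indices of all the fake accounts. Note, however, that the result printed below contains all 576 objects. Concept 2 above shows that “fake = No” covers objects 1 to 288, so the fake accounts should presumably be the remaining objects, 289 to 576. A likely cause is that assign(fake = "Si") refers to an attribute named “fake”, which no longer exists after scaling, so the set stays empty, and the extent of the empty attribute set is the whole set of objects.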

s1 <- Set$new(fc_datos$attributes)
s1$assign(fake = "Si")
fc_datos$extent(s1)
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
  23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
  42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
  61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
  80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
  115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
  130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
  145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
  160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174,
  175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189,
  190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204,
  205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219,
  220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
  235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249,
  250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264,
  265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279,
  280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294,
  295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309,
  310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,
  325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339,
  340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354,
  355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369,
  370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384,
  385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399,
  400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414,
  415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429,
  430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444,
  445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459,
  460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474,
  475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489,
  490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504,
  505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,
  520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534,
  535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549,
  550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564,
  565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576}
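The extent printed above lists every object from 1 to 576. A toy example in base R shows when an extent comes out like that: an empty attribute set imposes no condition, so all objects satisfy it, whereas a named attribute selects only the matching rows. The context and the extent helper below are illustrative assumptions. In fcaR, selecting the scaled attribute by its full name “fake = Si” (for instance through the attributes and values arguments of assign) would presumably give the intended set; that call has not been run here.

```r
# Toy scaled context (illustrative): 4 accounts, 2 attributes from nominal scaling.
I <- matrix(c(0, 1,
              0, 1,
              1, 0,
              1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("fake = Si", "fake = No")))

# Indices of the objects that have every attribute in B (hypothetical helper).
extent <- function(I, B) {
  which(rowSums(I[, B, drop = FALSE] == 1) == length(B))
}

all_objects  <- extent(I, character(0))  # empty attribute set: every object qualifies
fake_objects <- extent(I, "fake = Si")   # only the objects flagged as fake
print(all_objects)
print(fake_objects)
```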

4.3 Implications

Implications are rules derived from the data that describe logical relationships between sets of attributes. In FCA, implications are extracted from the concepts and describe the dependencies between attributes formally: an implication 𝐴 → 𝐵 holds when every object that has all the attributes in 𝐴 also has all the attributes in 𝐵.

They can be read like the association rules we obtained earlier, with the difference that an implication admits no exceptions in the data.
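Whether an implication holds in a context can be verified row by row. The following base R sketch uses a toy context and a hypothetical helper named holds; it only illustrates the definition and is unrelated to the algorithm used by fcaR.

```r
# Toy context (illustrative): 4 accounts, 4 binary attributes.
I <- matrix(c(1, 0, 1, 0,
              1, 0, 1, 1,
              0, 1, 1, 0,
              0, 1, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("fake = Si", "fake = No",
                                    "external URL = No", "private = Si")))

# A -> B holds when every object with all attributes of A also has all of B.
holds <- function(I, A, B) {
  has_A <- rowSums(I[, A, drop = FALSE] == 1) == length(A)
  has_B <- rowSums(I[, B, drop = FALSE] == 1) == length(B)
  all(has_B[has_A])
}

r1 <- holds(I, "fake = Si", "external URL = No")              # rows 1 and 2 both have the consequent
r2 <- holds(I, "external URL = No", "fake = Si")              # row 3 is a counterexample
r3 <- holds(I, c("fake = Si", "fake = No"), "private = Si")   # no object matches the premise
print(c(r1, r2, r3))
```

The last call returns TRUE because no object satisfies the premise: an implication whose premise never occurs holds vacuously, whatever its consequent.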

4.3.1 Computing the implications of the context

To compute the implications of our data, we use the find_implications function.

fc_datos$find_implications()

How many implications have been extracted?

fc_datos$implications$cardinality()
[1] 1905

We have obtained a large number of implications; let us look at the first few:

head(fc_datos$implications)
Implication set with 6 implications.
Rule 1: {fake = Si} -> {external URL = No}
Rule 2: {#follows is (200, Inf], fake = No} -> {name==username = No}
Rule 3: {#follows is (60, 200], fake = No} -> {profile pic = Si}
Rule 4: {#follows is (60, 200], #follows is (200, Inf]} -> {profile pic = Si,
  profile pic = No, nums/length username is (0, 0.2], nums/length username
  is (0.2, 0.4], nums/length username is (0.4, 0.6], nums/length username
  is (0.6, 0.8], nums/length username is (0.8, 1], fullname words is (0, 1],
  fullname words is (1, 3], fullname words is (3, 5], fullname words is (5,
  Inf], nums/length fullname is (0, 0.2], nums/length fullname is (0.2, 0.4],
  nums/length fullname is (0.4, 0.6], nums/length fullname is (0.6, 0.8],
  nums/length fullname is (0.8, 1], name==username = Si, name==username = No,
  description length is (0, 15], description length is (15, 25], description
  length is (25, 80], description length is (80, 150], external URL = Si,
  external URL = No, private = Si, private = No, #posts is (0, 1], #posts is (1,
  5], #posts is (5, 10], #posts is (10, 50], #posts is (50, Inf], #followers is
  (0, 10], #followers is (10, 60], #followers is (60, 200], #followers is (200,
  Inf], #follows is (0, 10], #follows is (10, 60], fake = Si, fake = No}
Rule 5: {#follows is (10, 60], fake = No} -> {profile pic = Si, name==username =
  No}
Rule 6: {#follows is (10, 60], #follows is (200, Inf]} -> {profile pic = Si,
  profile pic = No, nums/length username is (0, 0.2], nums/length username
  is (0.2, 0.4], nums/length username is (0.4, 0.6], nums/length username
  is (0.6, 0.8], nums/length username is (0.8, 1], fullname words is (0, 1],
  fullname words is (1, 3], fullname words is (3, 5], fullname words is (5,
  Inf], nums/length fullname is (0, 0.2], nums/length fullname is (0.2, 0.4],
  nums/length fullname is (0.4, 0.6], nums/length fullname is (0.6, 0.8],
  nums/length fullname is (0.8, 1], name==username = Si, name==username = No,
  description length is (0, 15], description length is (15, 25], description
  length is (25, 80], description length is (80, 150], external URL = Si,
  external URL = No, private = Si, private = No, #posts is (0, 1], #posts is (1,
  5], #posts is (5, 10], #posts is (10, 50], #posts is (50, Inf], #followers is
  (0, 10], #followers is (10, 60], #followers is (60, 200], #followers is (200,
  Inf], #follows is (0, 10], #follows is (60, 200], fake = Si, fake = No}

Since there are so many implications, we will try to reduce them and keep the most important ones by applying simplification techniques.

4.3.2 Mean size of the left-hand and right-hand sides of the implications

This calculation gives a quantitative summary of the implications. The size of an implication 𝐴 → 𝐵 refers to the number of attributes in its premise 𝐴 and in its consequent 𝐵. Averaging these two counts over all implications gives a general overview.

colMeans(fc_datos$implications$size())
     LHS      RHS 
5.809974 4.205249 

These values show that the left-hand side of a rule contains 5.8 attributes on average, while the right-hand side contains 4.2 on average.
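The same summary can be reproduced by hand on a few made-up implications, which makes clear what the two means measure. The list structure below (with lhs and rhs fields) is an assumption made for this sketch and is not the internal representation used by fcaR.

```r
# Three illustrative implications, each stored as a premise and a consequent.
imps <- list(
  list(lhs = c("a"),           rhs = c("b")),
  list(lhs = c("a", "c"),      rhs = c("d", "e", "f")),
  list(lhs = c("b", "c", "d"), rhs = c("e"))
)

# One row per implication: number of attributes on each side.
sizes <- cbind(
  LHS = sapply(imps, function(r) length(r$lhs)),
  RHS = sapply(imps, function(r) length(r$rhs))
)

# Column means give the average premise and consequent sizes.
means <- colMeans(sizes)
print(means)
```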

4.3.3 Simplification logic

We will try to simplify our implications so as to keep the most important and significant ones.

fc_datos$implications$apply_rules(rules = c("simplification"))
Processing batch
--> Simplification: from 1905 to 1905.
head(fc_datos$implications)
Implication set with 6 implications.
Rule 1: {fake = Si} -> {external URL = No}
Rule 2: {#follows is (200, Inf], fake = No} -> {name==username = No}
Rule 3: {#follows is (60, 200], fake = No} -> {profile pic = Si}
Rule 4: {#follows is (60, 200], #follows is (200, Inf]} -> {profile pic = Si,
  profile pic = No, nums/length username is (0, 0.2], nums/length username
  is (0.2, 0.4], nums/length username is (0.4, 0.6], nums/length username
  is (0.6, 0.8], nums/length username is (0.8, 1], fullname words is (0, 1],
  fullname words is (1, 3], fullname words is (3, 5], fullname words is (5,
  Inf], nums/length fullname is (0, 0.2], nums/length fullname is (0.2, 0.4],
  nums/length fullname is (0.4, 0.6], nums/length fullname is (0.6, 0.8],
  nums/length fullname is (0.8, 1], name==username = Si, name==username = No,
  description length is (0, 15], description length is (15, 25], description
  length is (25, 80], description length is (80, 150], external URL = Si,
  external URL = No, private = Si, private = No, #posts is (0, 1], #posts is (1,
  5], #posts is (5, 10], #posts is (10, 50], #posts is (50, Inf], #followers is
  (0, 10], #followers is (10, 60], #followers is (60, 200], #followers is (200,
  Inf], #follows is (0, 10], #follows is (10, 60], fake = Si, fake = No}
Rule 5: {#follows is (10, 60], fake = No} -> {profile pic = Si, name==username =
  No}
Rule 6: {#follows is (10, 60], #follows is (200, Inf]} -> {profile pic = Si,
  profile pic = No, nums/length username is (0, 0.2], nums/length username
  is (0.2, 0.4], nums/length username is (0.4, 0.6], nums/length username
  is (0.6, 0.8], nums/length username is (0.8, 1], fullname words is (0, 1],
  fullname words is (1, 3], fullname words is (3, 5], fullname words is (5,
  Inf], nums/length fullname is (0, 0.2], nums/length fullname is (0.2, 0.4],
  nums/length fullname is (0.4, 0.6], nums/length fullname is (0.6, 0.8],
  nums/length fullname is (0.8, 1], name==username = Si, name==username = No,
  description length is (0, 15], description length is (15, 25], description
  length is (25, 80], description length is (80, 150], external URL = Si,
  external URL = No, private = Si, private = No, #posts is (0, 1], #posts is (1,
  5], #posts is (5, 10], #posts is (10, 50], #posts is (50, Inf], #followers is
  (0, 10], #followers is (10, 60], #followers is (60, 200], #followers is (200,
  Inf], #follows is (0, 10], #follows is (60, 200], fake = Si, fake = No}
fc_datos$implications$cardinality()
[1] 1905

The number of implications has not decreased as one might have expected. Simplification does not remove implications; it rewrites them, aiming to drop redundant attributes from their left-hand and right-hand sides.
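One rule of simplification logic, as it is usually stated, shows why the count can stay the same while the rules get shorter: given 𝐴 → 𝐵 and 𝐶 → 𝐷 with 𝐴 contained in 𝐶 and 𝐴, 𝐵 disjoint, the second implication can be replaced by (𝐶 minus 𝐵) → (𝐷 minus 𝐵). The sketch below applies that single step in base R; simplify_with is a hypothetical helper, and whether fcaR applies the rule in exactly this form is an assumption.

```r
# Apply one simplification step to `other` using `rule` (hypothetical helper).
# Precondition checked inside: lhs of `rule` is contained in lhs of `other`,
# and the two sides of `rule` are disjoint.
simplify_with <- function(rule, other) {
  A <- rule$lhs
  B <- rule$rhs
  if (all(A %in% other$lhs) && length(intersect(A, B)) == 0) {
    other$lhs <- setdiff(other$lhs, B)
    other$rhs <- setdiff(other$rhs, B)
  }
  other
}

r1 <- list(lhs = "a", rhs = "b")                         # a -> b
r2 <- list(lhs = c("a", "b", "c"), rhs = c("b", "d"))    # a, b, c -> b, d

# b follows from a, so it is redundant on both sides of r2.
r2s <- simplify_with(r1, r2)
print(r2s)
```

There are still two implications after the step, but the second one has shrunk from five attributes to three.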

4.3.4 Removing redundancy

We also apply composition, generalization, simplification and rsimplification to remove redundancy within the implications.

fc_datos$implications$apply_rules(rules = c("composition",
                                              "generalization",
                                             "simplification",
                                             "rsimplification"))
Processing batch
--> Composition: from 1905 to 1905.
--> Generalization: from 1905 to 1905.
--> Simplification: from 1905 to 1905.
--> Right Simplification: from 1905 to 1905.
head(fc_datos$implications)
Implication set with 6 implications.
Rule 1: {fake = Si} -> {external URL = No}
Rule 2: {#follows is (200, Inf], fake = No} -> {name==username = No}
Rule 3: {#follows is (60, 200], fake = No} -> {profile pic = Si}
Rule 4: {#follows is (60, 200], #follows is (200, Inf]} -> {#follows is (0, 10]}
Rule 5: {#follows is (10, 60], fake = No} -> {profile pic = Si, name==username =
  No}
Rule 6: {#follows is (10, 60], #follows is (200, Inf]} -> {#follows is (0, 10]}
fc_datos$implications$cardinality()
[1] 1905

As before, the number of implications stays at 1905. What changes is their content: rules 4 and 6, whose right-hand sides previously listed almost every attribute, now have the single consequent “#follows is (0, 10]”.

colMeans(fc_datos$implications$size())
     LHS      RHS 
3.317060 1.247769 

After simplifying the implications, the mean number of attributes has dropped considerably, from about 5.8 to 3.3 on the left-hand side and from about 4.2 to 1.2 on the right-hand side.

4.3.5 Analysis of important implications

As with the association rules, we are interested in the implications whose right-hand side contains the attribute stating whether the account is fake or not, since our goal is to detect fake accounts.

head(fc_datos$implications$filter(rhs="fake = Si"))
Implication set with 6 implications.
Rule 1: {#followers is (10, 60], #follows is (200, Inf]} -> {fake = Si}
Rule 2: {#followers is (10, 60], #follows is (0, 10]} -> {name==username = No,
  private = No, fake = Si}
Rule 3: {#followers is (0, 10], #follows is (60, 200]} -> {name==username = No,
  private = Si, fake = Si}
Rule 4: {#followers is (0, 10], #follows is (0, 10]} -> {fake = Si}
Rule 5: {private = No, #followers is (0, 10]} -> {fake = Si}
Rule 6: {description length is (25, 80], #posts is (1, 5], #follows is (10, 60]}
  -> {fake = Si}
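Filtering by consequent amounts to keeping the implications whose right-hand side contains the attribute of interest. The base R sketch below does this on three implications copied from the outputs in this section; the list representation and the helper filter_rhs are assumptions made for illustration, not the fcaR API.

```r
# Three implications taken from the printed rules, stored as premise/consequent pairs.
imps <- list(
  list(lhs = c("#followers is (0, 10]", "#follows is (0, 10]"), rhs = "fake = Si"),
  list(lhs = "external URL = Si", rhs = c("profile pic = Si", "fake = No")),
  list(lhs = c("private = No", "#followers is (0, 10]"), rhs = "fake = Si")
)

# Keep the implications whose right-hand side contains `attribute` (hypothetical helper).
filter_rhs <- function(imps, attribute) {
  Filter(function(r) attribute %in% r$rhs, imps)
}

fake_rules <- filter_rhs(imps, "fake = Si")
real_rules <- filter_rhs(imps, "fake = No")
print(length(fake_rules))
print(length(real_rules))
```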

These rules provide a good deal of information for telling fake accounts from real ones. For example, a fairly intuitive one is rule 1: an account that follows many people (more than 200) but has few followers (more than 10 and at most 60) is fake.

Let us also look at the real accounts:

head(fc_datos$implications$filter(rhs="fake = No"))
Implication set with 6 implications.
Rule 1: {#followers is (200, Inf], #follows is (10, 60]} -> {private = No, fake
  = No}
Rule 2: {external URL = Si} -> {profile pic = Si, fake = No}
Rule 3: {#posts is (50, Inf], #follows is (0, 10]} -> {private = No, #followers
  is (200, Inf], fake = No}
Rule 4: {#posts is (0, 1], #followers is (200, Inf]} -> {#follows is (200, Inf],
  fake = No}
Rule 5: {description length is (25, 80], #posts is (50, Inf]} -> {profile pic =
  Si, fake = No}
Rule 6: {description length is (25, 80], #posts is (5, 10]} -> {fake = No}

Conversely, rule 1 here states that an account that follows few people but has many followers is real.

Both patterns make sense because following someone does not require that person's consent, so it can be automated. Gaining followers, in contrast, requires other people to choose to follow the account, usually after looking at it, which is harder for fake accounts to achieve.

4.4 Useful functions

The fcaR package includes useful functions for exporting to LaTeX, to arules, …

reglas <- fc_datos$implications$to_arules()
#latex <- fc_datos$implications$to_latex()

We can also plot our concepts:

#fc_datos$concepts$plot()