In [1]:
Sensor Data Analysis

The data is converted from xlsx to csv and then read with Pandas.
/home/vsc/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3062: DtypeWarning: Columns (23,24,29,30,31,32,33,34,35) have mixed types. Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
In [ ]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import pandas as pd
import matplotlib

def cvtime(x):
    # Parse "YY.MM.DD HH:MM:SS" timestamps by fixed character positions.
    t = []
    for i in x:
        year = 2000 + int(i[0:2])
        month = int(i[3:5])
        day = int(i[6:8])
        hour = int(i[9:11])
        minute = int(i[12:14])
        second = int(i[15:17])
        t.append(pd.to_datetime(datetime(year, month, day, hour, minute, second)))
    return t

df = pd.read_csv("HISTORICOALARMES_IN.csv", sep=";", parse_dates=["DATAHORA"],date_parser=cvtime)
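As a side note, the character-slicing loop above can be replaced by a vectorised `pd.to_datetime` call with an explicit format string; a minimal sketch, assuming the timestamps look like `20.03.13 23:26:04` (YY.MM.DD HH:MM:SS, as in the csv rows further below):

```python
import pandas as pd

# Two sample timestamps in the csv's "YY.MM.DD HH:MM:SS" layout.
raw = pd.Series(["13.01.02 11:29:10", "20.03.25 22:20:23"])

# %y is the two-digit year, so "13" parses as 2013 and "20" as 2020.
parsed = pd.to_datetime(raw, format="%y.%m.%d %H:%M:%S")
print(parsed)
```

This avoids the Python-level loop and raises a clear error on malformed rows instead of silently slicing the wrong characters.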

The initial data follows; note some tags from 2013. What comes next is an introduction to the data.

In [2]:
df        
Out[2]:
TAG DATAHORA TIPOALARME PRIORIDADE DESCRICAO ENTIDADE QUALIDADE URGENTE VALOR CODIGOALARME ... ID_CONDUTOR LOGALARMEUMREQID COMPLEMENTO_DESCRICAO_2057 COMPLEMENTO_DESCRICAO_1044 COMPLEMENTO_DESCRICAO_1046 COMPLEMENTO_DESCRICAO_1036 COMPLEMENTO_DESCRICAO_2052 COMPLEMENTO_DESCRICAO_1030 Unnamed: 34 Unnamed: 35
0 117BVCA01COMESTA 2013-01-02 11:29:10 0 0 3 Validator 1.01 out-of-service alarm 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 117BVCA01COMMEMO 2013-01-02 11:29:10 0 0 3 Validator 1.01 out-of-memory alarm 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 117BVCA01COMUNIC 2013-01-02 11:29:10 0 0 3 Communication failure with the validator 1.01 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 117BVCA02COMESTA 2013-01-02 11:29:10 0 0 3 Validator 2.01 out-of-service alarm 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 117BVCA02COMMEMO 2013-01-02 11:29:10 0 0 3 Validator 2.01 out-of-memory alarm 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
752506 PCCSSV126ATRMANB 2020-03-25 20:39:20 588000000 1 2 Non recoverable delay due to maneuvers in a te... 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
752507 PCCSSV113ATRMANB 2020-03-25 21:09:11 930000000 1 2 Non recoverable delay due to maneuvers in a te... 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
752508 PCCSSV119ATRMANB 2020-03-25 22:03:20 946000000 1 2 Non recoverable delay due to maneuvers in a te... 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
752509 PCCSSV119ATRMANB 2020-03-25 22:08:06 891000000 1 2 Non recoverable delay due to maneuvers in a te... 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
752510 PCCSSV112ATRMANB 2020-03-25 22:20:23 911000000 1 2 Non recoverable delay due to maneuvers in a te... 1 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

752511 rows × 36 columns

Remove records with unreachable dates. Pandas `describe` shows the main statistics.

In [3]:
df1=df[(df['DATAHORA'] > '2019-01-01') & (df['DATAHORA'] < '2020-07-01')]
df1.describe()
Out[3]:
TIPOALARME PRIORIDADE DESCRICAO QUALIDADE URGENTE VALOR CODIGOALARME ID_VEICULO ID_LINHA NR_TURNO ... ID_VEICULO2 HIS_TIMESTAMP DATAHORAACK REGISTOMANUAL COMENTARIO COMPLEMENTO_DESCRICAO ID_CONDUTOR2 ID_CONDUTOR LOGALARMEUMREQID COMPLEMENTO_DESCRICAO_2057
count 7.506160e+05 750616.000000 750616.000000 750616.000000 750616.0 750616.0 750616.000000 750616.000000 750616.000000 0.0 ... 0.0 0.0 246148.000000 246148.000000 750616.0 0.0 2403.000000 2461.000000 11128.000000 15233.000000
mean 5.345073e+08 0.965928 2.943850 1.021870 0.0 0.0 182.329405 11629.862381 213.680664 NaN ... NaN NaN -0.128025 -0.128118 0.0 NaN 381.078652 7.196262 272.087257 507699.083437
std 2.709792e+08 0.198874 0.232423 0.167427 0.0 0.0 6180.487802 4950.120754 9.429559 NaN ... NaN NaN 6.767613 6.767480 0.0 NaN 1327.565465 53.517394 147.457457 209639.283639
min 0.000000e+00 0.000000 1.000000 1.000000 0.0 0.0 0.000000 11101.000000 0.000000 NaN ... NaN NaN -1.000000 -2.000000 0.0 NaN 0.000000 -1.000000 0.000000 8.000000
25% 3.050000e+08 1.000000 3.000000 1.000000 0.0 0.0 0.000000 11118.000000 207.000000 NaN ... NaN NaN -1.000000 -1.000000 0.0 NaN 0.000000 -1.000000 221.000000 587689.000000
50% 5.390000e+08 1.000000 3.000000 1.000000 0.0 0.0 0.000000 11118.000000 213.000000 NaN ... NaN NaN -1.000000 -1.000000 0.0 NaN 0.000000 -1.000000 330.000000 592324.000000
75% 7.660000e+08 1.000000 3.000000 1.000000 0.0 0.0 0.000000 11118.000000 221.000000 NaN ... NaN NaN -1.000000 -1.000000 0.0 NaN 0.000000 -1.000000 388.000000 598267.000000
max 9.990000e+08 2.000000 3.000000 3.000000 0.0 0.0 618824.000000 91102.000000 227.000000 NaN ... NaN NaN 78.000000 78.000000 0.0 NaN 5004.000000 418.000000 418.000000 604013.000000

8 rows × 21 columns
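The date filter above is a plain boolean mask; a toy sketch of the same idea using `Series.between` (which is inclusive at both ends, unlike the strict comparisons above):

```python
import pandas as pd

# Small frame with two dates inside the window and two outside it.
df = pd.DataFrame({"DATAHORA": pd.to_datetime(
    ["2013-01-02", "2019-06-15", "2020-03-25", "2021-01-01"])})

# Keep only rows inside [2019-01-01, 2020-07-01]; the 2013 and 2021
# rows (the "unreachable" dates) are dropped.
df1 = df[df["DATAHORA"].between("2019-01-01", "2020-07-01")]
print(len(df1))  # 2 rows survive
```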

Distribution of events by day of the week. There is a reduction on Friday and Saturday, as expected.

In [4]:
# gbyd = df1.groupby(df1["DATAHORA"].dt.month).TAG.count()
ax = sns.countplot(df1['DATAHORA'].dt.weekday)
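What `countplot` tallies here can be reproduced directly in pandas; a small sketch with made-up dates (pandas numbers weekdays Monday=0 through Sunday=6):

```python
import pandas as pd

# Three Mondays-and-friends: two Mondays, one Tuesday, one Saturday.
dates = pd.to_datetime(pd.Series(
    ["2019-01-07", "2019-01-14", "2019-01-08", "2019-01-12"]))

# value_counts on dt.weekday gives the per-weekday event counts
# that countplot draws as bars.
counts = dates.dt.weekday.value_counts().sort_index()
print(counts)
```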

Aggregation by hour of day. Note that the Norwegians do not seem very active before 5 a.m.

In [5]:
# gbyd = df1.groupby(df1["DATAHORA"].dt.month).TAG.count()
ax = sns.countplot(df1['DATAHORA'].dt.hour)
In [ ]:
Aggregating by line would require understanding the system's topology.
In [ ]:
Entities: there is a large number of different entities; one appears dominant.
In [6]:
ax = sns.countplot(df1['ID_LINHA'])
In [7]:
ax = sns.countplot(df1['ENTIDADE'])

Discretization and counting of events per entity
In [9]:
df1['Y'] = df1.loc[:,'ENTIDADE'].astype('category')
df1.Y.describe()
df1.Y.value_counts()

# failure=df1.loc[df1['ENTIDADE'].str.lower().str.contains("failure")]
# comms=df1.loc[df1['ENTIDADE'].str.lower().str.find("communication")>=0]
# total=df1.loc[df1['ENTIDADE'].str.lower().str.find("total number of")>=0]
# memory=df1.loc[df1['ENTIDADE'].str.lower().str.find("memory")>=0]
# validator=df1.loc[df1['ENTIDADE'].str.lower().str.find("validator")>=0]
# print(failure.ENTIDADE.describe())
# print(comms.ENTIDADE.describe())
# print(total.ENTIDADE.describe())
# print(validator.ENTIDADE.describe())
# print(memory.ENTIDADE.describe())
<ipython-input-9-c6a12bd133a3>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Y'] = df1.loc[:,'ENTIDADE'].astype('category')
Out[9]:
Delay of a driver in logging in the badging workst    660347
Non recoverable delay due to maneuvers in a termin     21757
Terminus departure schedule not respected              17542
Vehicle stopped for a long time on the line            10800
Data communication with vehicle failure                 9212
Wifi radio failure                                      6389
Service started without presentation at the badgin      5178
Communication fault - CAPSYS 1                          4006
Vehicle prepared                                        3025
Driver change                                           2278
Communication failure with the MPU                      1612
Late Driver change                                      1368
Total number of daily operating hours                   1246
Total number of Kms                                     1246
Early Driver change                                      852
Service start with a vehicle different from the sc       623
Delay at the end of servic                               577
Communication failure with the driver's console #        512
Beacon reader failure                                    439
Communication with the AVLS's workstations               380
Failure of a PCE module                                  314
TAG READER CAB 2                                         285
Driver change failure                                    183
Odometer failure                                          97
Communication fault - CAPSYS 2                            73
Vehicle localisation impossible                           63
Delay at the start of service                             55
Passenger Emergency Alarm                                 50
Public announcement system failure                        36
ACTIVE DRIVER CABIN FAILURE                               28
Urgent call                                               20
Delay leaving Depot                                       14
Communication failure with workstation 5                   8
Emergency call                                             1
Name: Y, dtype: int64
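The `SettingWithCopyWarning` above comes from writing a new column into a slice of the original DataFrame. A minimal sketch of the usual fix, taking an explicit copy before mutating:

```python
import pandas as pd

# Toy frame standing in for the alarm data.
df = pd.DataFrame({"ENTIDADE": ["Urgent call", "Driver change", "Urgent call"]})

# .copy() makes df1 its own frame, so adding a column is unambiguous
# and the warning disappears.
df1 = df[df["ENTIDADE"] != ""].copy()
df1["Y"] = df1["ENTIDADE"].astype("category")
print(df1["Y"].value_counts())
```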

The plotall function plots the distribution of each event type by:

  • days since the start of measurement
  • day of the week
  • hour of the day

Some event types have decreased over time (delays due to maneuvers), others have remained stable, and so on.

Most events decrease on the weekend, but stopped vehicles and mechanical failures are more common on weekdays 5 and 6 (Saturday and Sunday).

The variation over the day is more complex than expected. Note that many communication failures occur at night.

In [11]:

def plotall(df1):
    vcs = df1.Y.value_counts()
    vs = vcs.index.to_list()
    for ev in vs:
        fig, (ax1,ax2,ax3) =plt.subplots(1,3,figsize=(17,7))
        dat = df1[df1['Y']==ev]
        cases={'Weeks_Past':(dat.DATAHORA-datetime(year=2019,month=10,day=1)).dt.days,'y':dat.TAG,
               'Week_Day':dat.DATAHORA.dt.weekday,
               'Hour_Day':dat.DATAHORA.dt.hour,
               'Quarter':dat.DATAHORA.dt.quarter
              }
        mind = cases['Weeks_Past'].min()
        cases['Weeks_Past'] -= mind
        maxd = cases['Weeks_Past'].max()
        # ax0.set_title(ev+" ("+str(vcs[ev])+")")
        ax1.set_title(ev+" ("+str(vcs[ev])+")")
        ax2.set_title(ev+" ("+str(vcs[ev])+")")
        ax3.set_title(ev+" ("+str(vcs[ev])+")")
        # sns.distplot(cases['Quarter'],ax=ax0)
        sns.distplot(cases['Weeks_Past'],ax=ax1,bins=maxd//30+1)
        sns.distplot(cases['Week_Day'],ax=ax2,bins=7)
        sns.distplot(cases['Hour_Day'],ax=ax3,bins=24)
        # ax0.set_xlabel("Quarter")
        ax1.set_xlabel("Days Past")
        ax2.set_xlabel("Day of Week")
        ax3.set_xlabel("Hour of Day")
        # ax1.set_xlim(-1,5)
        ax1.set_xlim(-10,maxd+1)
        ax2.set_xlim(-1,7)
        ax3.set_xlim(-1,25)
        plt.show()
        
plotall(df1)
/home/vsc/.local/lib/python3.8/site-packages/seaborn/distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)

Why are there so many driver-not-logging-in events?

In [13]:
%%bash
grep "Delay of a driver in logging in the badging workst"   *csv|head -20
PCCSSV127ATRAPRE;20.03.13 23:26:04;188000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;227;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV113ATRAPRE;20.03.13 23:26:04;188000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;213;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV108ATRAPRE;20.03.13 23:26:04;188000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;208;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV111ATRAPRE;20.03.13 23:26:04;188000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;211;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV122ATRAPRE;20.03.13 23:26:04;188000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;222;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV102ATRAPRE;20.03.13 23:26:04;188000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;202;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV121ATRAPRE;20.03.13 23:26:04;188000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;221;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV112ATRAPRE;20.03.13 23:27:07;680000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;212;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV119ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;219;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV104ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;204;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV114ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;214;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV101ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;201;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV115ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;215;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV124ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;224;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV105ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;205;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV110ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;210;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV112ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;212;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV106ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;206;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV127ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;227;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
PCCSSV113ATRAPRE;20.03.13 23:31:16;315000000;1;3;Delay of a driver in logging in the badging workst;1;0;0;0;11118;213;;;;;20.03.13;;;;;0;;;;;;;;;;;;;;
In [ ]:
Duplicates! Let's remove them.
In [14]:
sdata=df1.drop_duplicates(['Y','DATAHORA'])
sdata.describe()
Out[14]:
TIPOALARME PRIORIDADE DESCRICAO QUALIDADE URGENTE VALOR CODIGOALARME ID_VEICULO ID_LINHA NR_TURNO ... ID_VEICULO2 HIS_TIMESTAMP DATAHORAACK REGISTOMANUAL COMENTARIO COMPLEMENTO_DESCRICAO ID_CONDUTOR2 ID_CONDUTOR LOGALARMEUMREQID COMPLEMENTO_DESCRICAO_2057
count 1.454700e+05 145470.000000 145470.000000 145470.000000 145470.0 145470.0 145470.000000 145470.000000 145470.000000 0.0 ... 0.0 0.0 73087.000000 73087.000000 145470.0 0.0 2398.000000 2458.000000 11113.000000 15209.000000
mean 4.753731e+08 0.825675 2.712669 1.111521 0.0 0.0 940.807692 13758.728453 212.948553 NaN ... NaN NaN -0.229494 -0.229808 0.0 NaN 379.786489 7.206265 272.162962 507711.633441
std 3.053546e+08 0.422135 0.458315 0.365166 0.0 0.0 14013.885458 10991.393471 13.788839 NaN ... NaN NaN 6.207397 6.206904 0.0 NaN 1325.498579 53.549290 147.452486 209616.650150
min 0.000000e+00 0.000000 1.000000 1.000000 0.0 0.0 0.000000 11101.000000 0.000000 NaN ... NaN NaN -1.000000 -2.000000 0.0 NaN 0.000000 -1.000000 0.000000 8.000000
25% 2.200000e+08 1.000000 2.000000 1.000000 0.0 0.0 0.000000 11104.000000 206.000000 NaN ... NaN NaN -1.000000 -1.000000 0.0 NaN 0.000000 -1.000000 221.000000 587687.000000
50% 4.800000e+08 1.000000 3.000000 1.000000 0.0 0.0 0.000000 11118.000000 213.000000 NaN ... NaN NaN -1.000000 -1.000000 0.0 NaN 0.000000 -1.000000 330.000000 592311.000000
75% 7.390000e+08 1.000000 3.000000 1.000000 0.0 0.0 0.000000 11118.000000 221.000000 NaN ... NaN NaN -1.000000 -1.000000 0.0 NaN 0.000000 -1.000000 388.000000 598252.000000
max 9.990000e+08 2.000000 3.000000 3.000000 0.0 0.0 618824.000000 91102.000000 227.000000 NaN ... NaN NaN 78.000000 78.000000 0.0 NaN 5004.000000 418.000000 418.000000 604013.000000

8 rows × 21 columns
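A toy sketch of the deduplication above: rows sharing the same (event, timestamp) pair collapse to one record, which is why the row count drops from 750 616 to 145 470.

```python
import pandas as pd

# Three alarm rows: the first two differ only in TAG and share the
# same event type and timestamp, so they count as duplicates.
df = pd.DataFrame({
    "Y": ["Delay", "Delay", "Delay"],
    "DATAHORA": pd.to_datetime(
        ["2020-03-13 23:26:04", "2020-03-13 23:26:04", "2020-03-13 23:31:16"]),
    "TAG": ["PCCSSV127ATRAPRE", "PCCSSV113ATRAPRE", "PCCSSV108ATRAPRE"],
})

# Keep the first row of each (Y, DATAHORA) group.
sdata = df.drop_duplicates(["Y", "DATAHORA"])
print(len(sdata))  # 2 rows remain
```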

In [ ]:
Redoing the plots gives the same results, but more balanced.
In [15]:
plotall(sdata)
/home/vsc/.local/lib/python3.8/site-packages/seaborn/distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)